|0.1.7||Feb 8, 2022|
|0.1.6||Feb 4, 2022|
pithy 0.1.0 - an absurdly fast, strangely accurate summariser
Something important to note is that pithy is more of a highlighter than a summariser. It just so happens that the most important sentences in a text are often good summaries. You can control this via the --density flag.
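To make the "highlighter" idea concrete, here is a minimal sketch of extractive highlighting in Python. It is not pithy's actual algorithm, which this README does not describe; the function name `highlight`, the word-frequency scoring, and the tokenising regex are all assumptions chosen for illustration. It scores each sentence by the average frequency of its words across the whole text and returns the top few in their original order.

```python
import re
from collections import Counter


def highlight(text, n_sentences=3, separator=". "):
    """Return the n highest-scoring sentences, in their original order.

    Hypothetical illustration only: the scoring below is a plain
    word-frequency heuristic, not pithy's real scoring.
    """
    sentences = [s.strip() for s in text.split(separator) if s.strip()]
    # Tokenise each sentence into lowercase words.
    tokenised = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    # Count how often each word occurs across the whole text.
    counts = Counter(word for words in tokenised for word in words)

    def score(words):
        # Average word frequency, so a sentence is not favoured
        # merely for containing more words.
        return sum(counts[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, best first.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(tokenised[i]), reverse=True)
    # Keep the top n, then restore document order.
    keep = sorted(ranked[:n_sentences])
    return [sentences[i] for i in keep]
```

Nothing is rewritten or paraphrased: the output is always a subset of the input sentences, which is why the result reads as a set of highlights rather than a generated summary.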
Here are some examples of what it outputs:
https://plato.stanford.edu/entries/chinese-room/, The Chinese Room Argument
- The narrow conclusion of the argument is that programming a digital computer may make it appear to understand language but could not produce real understanding.
https://www.gutenberg.org/files/55/55-0.txt, The Wonderful Wizard of Oz
- Dorothy did not know what to say to this, for all the people seemed to think her a witch, and she knew very well she was only an ordinary little girl who had come by the chance of a cyclone into a strange land.
https://archive.org/stream/ProgrammingRust1stEdition1491927283/Programming%20Rust%201st%20Edition%201491927283_djvu.txt, Programming Rust 1st Edition
- It is ironic that the dominant systems programming languages, C and C++, are not type safe, while most other popular languages are. Given that C and C++ are meant to be used to implement the foundations of a system, entrusted with implementing security boundaries and placed in contact with untrusted data, type safety would seem like an especially valuable quality for them to have. This is the decades-old tension Rust aims to resolve: it is both type safe and a systems programming language
https://www.gutenberg.org/cache/epub/5827/pg5827.txt, The Problems of Philosophy by Bertrand Russell
- It is chiefly in this sense that Berkeley denies matter; that is to say, he does not deny that the sense-data which we commonly take as signs of the existence of the table are really signs of the existence of something independent of us, but he does deny that this something is non-mental, that it is neither mind nor ideas entertained by some mind. He admits that there must be something which continues to exist when we go out of the room or shut our eyes, and that what we call seeing the table does really give us reason for believing in something which persists even when we are not seeing it. But he thinks that this something cannot be radically different in nature from what we see, and cannot be independent of seeing altogether, though it must be independent of our seeing
Quick example: pithy -f your_file_here.txt --sentences 4
Print this help message
The file pithy will read from. Required.
The number of sentences for pithy to return. Defaults to 3.
Will return a decent approximation of the summary. Good for extremely long texts where you don't care about precision.
Slash-separated (i.e. "/") list of words to bias the summary towards. If you are using pithy on a large text, increase the chunk_size to 2500-5000 to get relevant results. Note that this doesn't work in approximate mode.
The strength of the bias, must be an integer. Defaults to 6.
If set, pithy splits the text into sections, and each section is summarized separately. Defaults to false.
The number of sentences to read at a time. Defaults to 500 if unspecified.
If set, pithy reads the text all at once. Can be quite slow once you go past the 7k mark. Defaults to false.
If set, regardless of how large the text is, pithy splits it into chunks. Should be used in combination with chunk_size and by_section.
If set, pithy uses ngrams rather than words. It's usually crap, but you might use it as a last resort for non-spaced languages that you can't pre-tokenise. Defaults to false.
The minimum sentence length before filtering. Defaults to 30.
The maximum sentence length before filtering. Defaults to 1500.
The separator used to split the text into sentences. Defaults to '. '. You can type newline to separate by newlines.
If set, removes sentences with excessive whitespace. Useful for pdfs and copy-pastes from websites.
If set, removes sentences with too many non-alphabetic characters.
If set, removes sentences with too many capital letters. Useful if the text contains a lot of references or indices.
The length penalty. Defaults to 1.5. Decrease to make pithy favour longer sentences, increase to favour shorter ones.
Experimental setting. Defaults to 3. Setting it lower seems to bias pithy's summaries towards more common words; setting it higher seems to bias them towards rarer but more informative words.
If set, the context surrounding sentences isn't provided. Defaults to false.
If set, the sentences are sorted by their relevance rather than their order in the original text. Defaults to false.
If set, the progress bar is not printed. Defaults to false because progress bars are cool.
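The splitting, filtering and chunking options described above can be pictured with a short Python sketch. This is only an illustration of the general idea, not pithy's implementation. The function names are hypothetical, lengths are assumed to be measured in characters, and the three ratio thresholds are made-up values, since the README does not state what counts as "excessive" or "too many". The defaults for the separator ('. '), minimum length (30), maximum length (1500) and chunk size (500) are the ones listed above.

```python
def split_sentences(text, separator=". "):
    """Split on the separator; the word 'newline' stands for a line break."""
    if separator == "newline":
        separator = "\n"
    return [s.strip() for s in text.split(separator) if s.strip()]


def ratio(sentence, predicate):
    """Fraction of characters in the sentence that satisfy the predicate."""
    if not sentence:
        return 0.0
    return sum(1 for ch in sentence if predicate(ch)) / len(sentence)


def keep_sentence(sentence, min_len=30, max_len=1500,
                  max_space=0.3, max_non_alpha=0.3, max_upper=0.3):
    """Return True if the sentence survives every filter."""
    # min_len and max_len mirror the documented defaults (assumed to be
    # character counts); the three ratio thresholds are invented here.
    if not (min_len <= len(sentence) <= max_len):
        return False
    # Excessive whitespace, e.g. from PDFs or copy-pasted web pages.
    if ratio(sentence, str.isspace) > max_space:
        return False
    # Too many characters that are neither letters nor whitespace.
    if ratio(sentence, lambda ch: not (ch.isalpha() or ch.isspace())) > max_non_alpha:
        return False
    # Too many capital letters, e.g. references or index entries.
    if ratio(sentence, str.isupper) > max_upper:
        return False
    return True


def chunks(sentences, chunk_size=500):
    """Yield consecutive groups of at most chunk_size sentences."""
    for start in range(0, len(sentences), chunk_size):
        yield sentences[start:start + chunk_size]
```

A typical pipeline under these assumptions would be `[s for s in split_sentences(text) if keep_sentence(s)]`, followed by `chunks(...)` so that each group of sentences can be summarised on its own.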