#nlp #text #summarize #cli #summarization

bin+lib pithy

Ultra-fast, spookily accurate text summarizer that works on any language

7 releases

0.1.7 Feb 8, 2022
0.1.6 Feb 4, 2022

MIT license

pithy 0.1.0 - an absurdly fast, strangely accurate summariser

Note that pithy is more of a highlighter than a summariser: it picks out the most important sentences in a text, and those often happen to make a good summary. You can influence which sentences it favours via the --density flag.

Here are some examples of what it outputs:

https://plato.stanford.edu/entries/chinese-room/, The Chinese Room Argument

  • The narrow conclusion of the argument is that programming a digital computer may make it appear to understand language but could not produce real understanding.

https://www.gutenberg.org/files/55/55-0.txt, The Wonderful Wizard of Oz

  • Dorothy did not know what to say to this, for all the people seemed to think her a witch, and she knew very well she was only an ordinary little girl who had come by the chance of a cyclone into a strange land.

https://archive.org/stream/ProgrammingRust1stEdition1491927283/Programming%20Rust%201st%20Edition%201491927283_djvu.txt, "Programming Rust 1st Edition"

  • It is ironic that the dominant systems programming languages, C and C++, are not type safe, while most other popular languages are. Given that C and C++ are meant to be used to implement the foundations of a system, entrusted with implementing security boundaries and placed in contact with untrusted data, type safety would seem like an especially valuable quality for them to have. This is the decades-old tension Rust aims to resolve: it is both type safe and a systems programming language

https://www.gutenberg.org/cache/epub/5827/pg5827.txt, The Problems of Philosophy by Bertrand Russell

  • It is chiefly in this sense that Berkeley denies matter; that is to say, he does not deny that the sense-data which we commonly take as signs of the existence of the table are really signs of the existence of something independent of us, but he does deny that this something is non-mental, that it is neither mind nor ideas entertained by some mind. He admits that there must be something which continues to exist when we go out of the room or shut our eyes, and that what we call seeing the table does really give us reason for believing in something which persists even when we are not seeing it. But he thinks that this something cannot be radically different in nature from what we see, and cannot be independent of seeing altogether, though it must be independent of our seeing

Quick example:

pithy -f your_file_here.txt --sentences 4
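
If you would rather drive the CLI from a Rust program than from a shell, the same invocation can be assembled with std::process::Command. This is only a sketch: it assumes a pithy binary is on your PATH, and the helper name pithy_command is made up for illustration. Only -f and --sentences are taken from the help text.

```rust
use std::process::Command;

/// Hypothetical helper: builds (but does not run)
/// `pithy -f <path> --sentences <n>`.
/// Assumes a `pithy` binary is available on PATH.
fn pithy_command(path: &str, sentences: usize) -> Command {
    let mut cmd = Command::new("pithy");
    cmd.arg("-f")
        .arg(path)
        .arg("--sentences")
        .arg(sentences.to_string());
    cmd
}
```

Calling .output() on the returned command would run pithy and capture whatever it prints.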

--help:

Print this help message

-f:

The file pithy will read from. Required.

--sentences:

The number of sentences for pithy to return. Defaults to 3.

--approximate:

Will return a decent approximation of the summary. Good
for extremely long texts where you don't care about precision.

--bias:

Slash-separated (i.e. "/") list of words to bias the summary towards.
If you are using pithy on a large text, increase --chunk_size to
2500-5000 to get relevant results. Note that this doesn't work in
approximate mode.
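
To make the idea of a bias list concrete, here is a rough sketch of how a slash-separated argument could be parsed and folded into a sentence score. The function names and the additive weighting are assumptions made for illustration; how pithy actually applies the bias is not described here.

```rust
/// Splits a `--bias`-style argument such as "rust/safety" into lowercase words.
fn parse_bias(arg: &str) -> Vec<String> {
    arg.split('/')
        .map(|w| w.trim().to_lowercase())
        .filter(|w| !w.is_empty())
        .collect()
}

/// Hypothetical scoring tweak: adds `strength` to the base score for every
/// bias word found in the sentence. Uses plain substring matching, so
/// "rust" would also match "trust"; a real implementation would tokenise.
fn biased_score(base: f64, sentence: &str, bias: &[String], strength: i32) -> f64 {
    let lower = sentence.to_lowercase();
    let hits = bias.iter().filter(|w| lower.contains(w.as_str())).count();
    base + (hits as f64) * f64::from(strength)
}
```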

--bias_strength:

The strength of the bias, must be an integer. Defaults to 6.

--by_section:

If set, pithy splits the text into sections, and each section is
summarized separately. Defaults to false.

--chunk_size:

The number of sentences to read at a time. Defaults to 500 
if unspecified.
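
Reading a fixed number of sentences at a time amounts to slicing the sentence list into groups. A minimal sketch, assuming the sentences are already split; chunk_sentences is a hypothetical name, not part of pithy's API.

```rust
/// Groups sentences into chunks of at most `chunk_size` sentences.
/// The last chunk may be shorter than the others.
fn chunk_sentences<'a>(sentences: &'a [&'a str], chunk_size: usize) -> Vec<&'a [&'a str]> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    sentences.chunks(chunk_size).collect()
}
```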

--force_all:

If set, pithy reads the text all at once. Can be quite 
slow once you go past the 7k mark. Defaults to false.

--force_chunk:

If set, regardless of how large the text is, pithy splits it
into chunks. Should be used in combination with --chunk_size
and --by_section.

--ngrams:

If set, pithy uses ngrams rather than words. 
It's usually crap, but you might use it as a last resort 
for non-spaced languages that you can't pre-tokenise. 
Defaults to false.
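
For a feel of what n-grams look like on unspaced text, the sketch below produces overlapping character n-grams. It assumes character-level n-grams, which this help text does not confirm, and char_ngrams is a made-up name.

```rust
/// Returns every run of `n` consecutive characters (not bytes) in `text`.
/// Yields nothing when `n` is zero or the text is shorter than `n`.
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

With n = 2, the text "日本語の文" becomes "日本", "本語", "語の", "の文", which gives a scorer something to count even though there are no spaces to split on.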

--min_length:

The minimum sentence length before filtering. Defaults to 30.

--max_length:

The maximum sentence length before filtering. Defaults to 1500.
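
The two length flags act as a simple band-pass filter over sentences. The sketch below assumes lengths are counted in characters, which the help text does not state; filter_by_length is a hypothetical helper.

```rust
/// Keeps sentences whose character count lies within `min_len..=max_len`.
/// Counting characters rather than bytes is an assumption of this sketch.
fn filter_by_length<'a>(sentences: &[&'a str], min_len: usize, max_len: usize) -> Vec<&'a str> {
    sentences
        .iter()
        .copied()
        .filter(|s| (min_len..=max_len).contains(&s.chars().count()))
        .collect()
}
```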

--separator:

The separator used to split the text into sentences.
Defaults to '. '. Pass the word newline to split on newlines instead.
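
Splitting on a literal separator can be sketched in a few lines. This is an illustration under assumptions rather than pithy's code: split_sentences is a made-up name, and trimming and dropping empty pieces are choices of this sketch.

```rust
/// Splits text into trimmed, non-empty sentences on `separator`.
/// The literal word "newline" is treated as "\n", mirroring the flag description.
fn split_sentences<'a>(text: &'a str, separator: &str) -> Vec<&'a str> {
    let sep = if separator == "newline" { "\n" } else { separator };
    text.split(sep)
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}
```

Note that splitting on '. ' consumes the separator itself, so the pieces come back without their trailing full stop.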

--clean_whitespace:

If set, removes sentences with excessive whitespace. Useful for 
pdfs and copy-pastes from websites.

--clean_nonalphabetic:

If set, removes sentences with too many non-alphabetic characters.

--clean_caps:

If set, removes sentences with too many capital letters. Useful 
if the text contains a lot of references or indices.
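
The three clean_* flags can all be thought of as ratio checks over the characters of a sentence. The sketch below uses one shared helper and a 0.3 cut-off for each check; both the helper names and the thresholds are invented for illustration, since pithy's actual cut-offs are not stated here.

```rust
/// Fraction of characters in `s` for which `pred` is true (0.0 for an empty string).
fn ratio(s: &str, pred: impl Fn(char) -> bool) -> f64 {
    let total = s.chars().count();
    if total == 0 {
        return 0.0;
    }
    s.chars().filter(|&c| pred(c)).count() as f64 / total as f64
}

/// Hypothetical combined filter: rejects sentences where whitespace,
/// non-alphabetic characters, or capitals make up more than 30% of the text.
fn looks_clean(s: &str) -> bool {
    ratio(s, char::is_whitespace) <= 0.3
        && ratio(s, |c| !c.is_alphabetic() && !c.is_whitespace()) <= 0.3
        && ratio(s, char::is_uppercase) <= 0.3
}
```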

--length_penalty:

The length penalty. Defaults to 1.5. Decrease to make pithy favour longer
sentences, increase to favour shorter sentences.

--density:

Experimental setting. Defaults to 3. Setting it lower 
seems to bias pithy's summaries towards more common words, 
setting it higher seems to bias summaries towards rarer 
but more informative words.

--no_context:

If set, the context surrounding sentences isn't provided. 
Defaults to false.

--relevance:

If set, the sentences are sorted by their relevance rather 
than their order in the original text. Defaults to false.
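
The difference between the two orderings is easiest to see in code. A minimal sketch, assuming each sentence already has a score; pick_sentences and the (position, score) representation are hypothetical.

```rust
/// `scored` holds (position in the original text, score) pairs.
/// Returns the positions of the `n` best-scoring sentences.
fn pick_sentences(scored: &[(usize, f64)], n: usize, by_relevance: bool) -> Vec<usize> {
    let mut top: Vec<(usize, f64)> = scored.to_vec();
    // Highest score first; total_cmp gives a total order on f64.
    top.sort_by(|a, b| b.1.total_cmp(&a.1));
    top.truncate(n);
    if !by_relevance {
        // Default behaviour: put the chosen sentences back in text order.
        top.sort_by_key(|&(pos, _)| pos);
    }
    top.into_iter().map(|(pos, _)| pos).collect()
}
```

The same sentences are selected either way; only the order in which they are printed changes.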

--nobar:

If set, the progress bar is not printed. Defaults to false because
progress bars are cool.
