17 releases
Uses old Rust 2015
0.8.1 | Mar 4, 2017 |
---|---|
0.7.4 | Oct 25, 2016 |
0.7.1 | Dec 10, 2015 |
0.7.0 | Nov 12, 2015 |
0.6.2 | Jul 14, 2015 |
#1608 in Text processing
135 downloads per month
Used in crowbook
84KB
1.5K
SLoC
Caribon
A repetition detector written in Rust.
Demo server
If you want to have an idea of what Caribon does without downloading anything, you can have a look at this instance that runs Caribon as a web service.
Downloading
Either use git
to get the latest version:
$ git clone https://github.com/lise-henry/caribon.git
or just download one of the stable(ish) releases.
(If you only plan to use Caribon as a library for your rust program,
you don't need to worry too much about downloading and building, just
add caribon = "*"
in your Cargo.toml
file.)
Build
You'll need Rust and Cargo, see their install instructions. Then
$ cargo build
should do the job, though you can also run cargo build --release
directly. You can then run caribon either with:
$ cargo run --release
or by directly executing the binary (in target/debug
or
target/release
):
$ target/release/caribon
Installing
If you have a recent version of Cargo
, you can use:
$ cargo install caribon
(for the latest version on crates.io, or
$ cargo install --git https://github.com/lise-henry/caribon
for the latest version on GitHub)
which will download, build and install Caribon.
Cargo run
If you don't want to install Caribon, cargo run
might be the
simplest option. Note, though, that command-line arguments must
be prefixed by --
so cargo gives them to the binary:
$ cargo run -- --input=some_text.txt
Also note that, by default, cargo run
builds and runs the program in
debug mode, which is slower. This isn't a problem for tiny files, but
if you plan to detect a repetitions in, say, a novel, using
cpu-extensive features (such as fuzzy string matching, see below), you might want
to run with --release
:
$ cargo run --release -- --input=big_file.html --output=output_big_file.html
Examples
Here is an example of Caribon using the HTML output of a (previous) version of this README, obtained with the following command:
$ caribon --language=english --input=README.html --output=example.html --fuzzy=0.5
(Note that --fuzzy=0.5
, while useful to show that fuzzy string
matching does indeed work, is not a very sensible parameter as is it
quite high (words only needs to be 50% similar to be considered the
same, matching e.g. just
and rust
). For real life usage, a lower value
would be recommended.)
Here is another example, displaying repetitions in
README.md
to the terminal, using the following command:
$ caribon --language=english --input=README.md --fuzzy=0.5 | more
While outputting to the terminal might be useful for small files, HTML outputs gives a more useful result, as higlighting a word will show you the other occurrences of it.
Options
Caribon provides a list of options. Here's the explanations to a few ones, from the most commons the the pretty advanced ones:
Language
--language=[english|french|spanish|...]
specifies the language of the input file. It is important for two reasons. The first one is that Caribon internally uses a stemming library, which will detect when words are derived from the same stem, e.g. "eats", "eat" and "eating" will be considered the same word. (More information on how this stemming library works can be found on the Snowball project website.) The second reason is that for some languages (currently only french and english), Caribon provides a default list of words to ignore for repetition counting (e.g. in english "it", "a" and so on are on it) to avoid cluttering the result file. It is possible to disable stemming by using "no_stemmer" instead of a language. This isn't really advised, but it might be useful if you want to try Caribon on a language that isn't implemented.--list-languages
prints the list of languages supported by the stemming library.
Input and output
--input=[file]
specifies the input file. By default it isstdin
, which means you'll have directly to type your text and end it withcontrol-D
. Iffile
is a non-existing file, the program aborts.--output=[file]
specifies the output file. It defaults tostdout
, printing the result to the terminal.
The input and output filenames extension determine the input and
output format, e.g. if you pass --input=text.html --output=result.html
, Caribon will
infer that the content is in HTML and that it must also output HTML
(so $ caribon < input.html > output.html
is NOT equivalent to $ caribon --input=input.html --output=output.html
: in the first case,
Caribon will consider the input as raw text and will output in
terminal
format (see below), while in the latter one it will
understand that both files are HTML).
It is possible to override this behaviour by specifying
--input-format=[text|html]
or--output-format=[terminal|html|markdown]
.
A note on the terminal
output format: it is designed to print text
to the terminal, by underlining and colouring some words with UNIX
terminal special characters (see screenshot above). It is, thus, only activated when no
output file name is given and Caribon prints on the standard output,
HTML output being the default for most of the cases.
Text statistics
--print-stats
, if passed to Caribon, will also display some statistics on the input text on the standard output.
Threshold and max-distance
The most useful algorithm of Caribon is local repetition detections. It detects when a word is repeated in a given interval of words. This interval is determined by
--max-distance=[value]
(default is currently 50).
So basically, if max-distance
is 50 and the word 'foo' occurs twice in this
interval, each occurrence will have a "repetition value" of 2. If
'foo' is repeated a third time in a 50-words interval after the second
occurence, then each of these occurrences will have a repetition
value of 3. (If there is then more than 50 words without apparition of
'foo', and 'foo' appears again, the value of the latest apparition
will be reset to 1).
Words are underlined when their "repetition value" is higher than a threshold, which can be set by:
--threshold=[value]
. The default is1.9
, so a word will be underlined as soon that is is repeated two times locally. If you change the threshold to, say,2.5
, a word will have to be repeated three times (locally) to be underlined.
(Why a float value for the threshold, instead of an integer one?
Because the local repetition detector will underline words in
different colors: green, orange and red according to the "severity" of
the repetitions. So setting the threshold to 1.01
or 1.99
will not
change which words are underlined, but they will be in orange or red
more quickly in the first case.)
Fuzzy string matching
Caribon uses a stemming library to detect words that are part of the same 'family'. It turns out that this algorithm is not always enough, and particularly it doesn't detect repetitions when there is a typo (e.g. "higlight" and "highlight" should probably be considered a repetition, even if it is mispelled in the first case). To solve this issue, there is the option of activating fuzzy string matching:
fuzzy=[value]
, where the value is a number between 0.0 and 1.0 which represents the maximal 'difference' between two words until they are no more identical: a value of 0.2 means that two words must be at most "20% different" until they are no more considered the same.
Internally, this algorithm uses the
Levenshtein distance
(and more specifically the
Rust implementation by Florian Ebelling)
which computes a distance between two strings by estimating the number
of insertions, deletions and permutations it require to go from one to
another. E.g., "dog" and "dogs" have a distance of 1, while "dog" and
"cat" have a distance of 3. This value is then divided by the length
of the string to match, and two string are considered "identical" (or,
at least, a repetition) when this value is less than the value given
to --fuzzy=
.
E.g., with --fuzzy=0.2
, "highlight" and "higlight" will have a
"difference" estimated to 1/9 (Levenshtein distance of 1, it only needs
one deletion to go from the first to the second, divided by the length
of "highlight", 9), so it will be a repetition. "Just" and "Rust" will
have a "difference" of 1/4, so won't be considered a repetition.
Fuzzy matching is practical, but you should not set it to a too high
value, else you will have a lot of false positives. Empirically, 0.2
or 0.25
is a good choice.
Fuzzy matching has a drawback: it requires a lot more of CPU. Caribon still manages to run reasonably fast (e.g., less than a second to detect repetitions on a whole novel, with fuzzy string matching activated) but it only uses fuzzy string matching for local repetitions, and not for global ones (see below).
Global repetitions
By default, Caribon only detects repetitions at a local level (if they
are separated by less than max-distance
words). It is,
however, possible to activate global repetition detecting with:
--global-threshold=[value]
, value being (again) a number between 0.0 and 1.0.
In this case, a word will be considered a repetition (even if it is
never repeated in a max-distance
range of words) if the relative
number of occurence is higher than the global threshold. I.e., if
global-threshold
is set to 0.01, a word will be highlighted (in
blue) if it represents more than 1% of the total number of words in
the document.
Ignored words
Some words, like "a" or "the", are unavoidably repeated a
lot and it doesn't make much sense to consider them a repetition. It
is thus useful to ignore some words. Caribon
provides a
default list for english and french, but it is in all cases possible
to provide your own with:
--ignore="list of common words"
.
This list must be separated by either spaces or commas (or, actually, anything that isn't a letter), and must be encircled by quotes. This list replaces the default one provided by Caribon (for english and french, at least). If you want to add words to these list instead of replace it, use:
--add-ignored="list of more ignored words"
Another option for ignoring words is:
--ignore-proper=[true|false]
(default is to false)
If sets to true, Caribon will try to ignore proper nouns". That is, a word will not count for repetition if it starts with a capital letter and is not at the beginning of a sentence.
Library
It is possible to use Caribon as a Rust library. The documentation is
available here; in order to
be certain to have the documentation version corresponding to the code
you downloaded, you can also generate it with
cargo doc
.
Caribon library is also available on Crates.io, allowing you to easily use it in any Cargo project: just add
caribon = "0.7"
in the dependencies section of your
Cargo.toml
file.
Caribon-server
If you are not a big fan of command-line interface, you can have a look at Caribon-server that runs Caribon as a web service (using Iron framework). See here for an instance running it.
Current features
- Built-in list of ignored words (common words whose repetitions don't matter) for french and english, though they are not complete.
- Stemming support for languages supported by the Snowball (http://snowball.tartarus.org/) project.
- Additionally (because stemming algorithms aren't always perfect, and sometimes you make typos), support for fuzzy string matching (based on Levenhstein distance).
- Count repetitions locally and globally.
- Detects HTML tags in input. Normally works both for HTML fragments or full HTML pages.
- Outputs the detected repetitions either in an HTML file (the most useful option), directly to the terminal, or to a Markdown file (with less useful information).
ChangeLog
License
Caribon is licensed under the GNU Lesser General Public License, version 2.1 or (at your convenience) any ulterior version.
Credits
Caribon is written by Élisabeth Henry <liz.henry at ouvaton.org>
.
This software uses (rust bindings to) the C Stemming library written by Dr Martin Porter, licensed under the BSD License.
It also uses the Rust implementation of Levenshtein distance written by Florian Ebelling, licensed under the Apache 2.0 License.
ToDo
Library
- Make colour highlighting more configurable
- Complete builtin lists of ignored words and provide them for other languages (currently, only french, and english);
- Provide algorithm to detect repetitions of expressions, not just single words;
- Make library callable from C (and other languages than Rust);
- Enhance documentation and add tests.
Program
- Add options to select highligting colours
- Find better default values?
- Make different repositories for program and library?
- Add a variant with GUI (Gtk+?)?
Dependencies
~640KB
~17K SLoC