textdistance.rs
[ github.com ] [ docs.rs ] [ crates.io ]
A Rust library with a large collection of algorithms for measuring how similar two sequences are.
Features:

- Based on the popular and battle-tested textdistance Python library (and written by the same author).
- Contains 20+ algorithms for all purposes.
- Includes state-of-the-art algorithms like EntropyNCD and Sift4.
- Zero dependencies.
- Works with any iterators, including bytes, code points, Unicode grapheme clusters, words, and numbers.
- Friendly and consistent API for all algorithms.
- Optional normalization of the result to the 0.0-1.0 interval.
- No unsafe code.
- Pure Rust.
Available algorithms
Edit-based:

- DamerauLevenshtein, both optimal string alignment and restricted
- Hamming
- Jaro
- JaroWinkler
- Levenshtein
- Sift4Common
- Sift4Simple
- SmithWaterman

Token-based:

- Bag
- Cosine (aka Orchini, Tucker, Otsuka-Ochiai)
- EntropyNCD (entropy-based Normalized Compression Distance)
- Jaccard (aka Tanimoto, Critical Success Index)
- Overlap (aka Szymkiewicz-Simpson)
- Roberts
- SorensenDice (aka F1, Czekanowski, Zijdenbos)
- Tversky

Sequence-based:

- LCSSeq (Longest Common SubSequence)
- LCSStr (Longest Common SubString)
- RatcliffObershelp (aka Gestalt pattern matching)

Naive:

- Prefix
- Suffix
- Length

Normalization for other metrics:

- LIG3 normalization for Hamming by Levenshtein
- MLIPNS normalization for Hamming
- YujianBo normalization for Levenshtein
Installation
cargo add textdistance
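Equivalently, you can add the dependency to Cargo.toml by hand (1.0.2 was the latest release at the time of writing; adjust the version as needed):

```toml
[dependencies]
textdistance = "1.0.2"
```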
Usage
The textdistance::str module provides shortcut functions for each algorithm for calculating the distance/similarity between two strings:
use textdistance::str::damerau_levenshtein;
assert!(damerau_levenshtein("abc", "acbd") == 2);
The textdistance::nstr module is the same, but all algorithms return a normalized value (between 0.0 and 1.0):
use textdistance::nstr::damerau_levenshtein;
assert!(damerau_levenshtein("abc", "acbd") == 2./4.);
For more advanced usage, each algorithm is provided as a struct implementing the Algorithm trait:
use textdistance::{Algorithm, DamerauLevenshtein};
let a = DamerauLevenshtein::default();
let r = a.for_str("abc", "acbd");
assert!(r.val() == 2);
assert!(r.nval() == 2./4.);
- The Algorithm trait provides for_str, for_vec, and for_iter to calculate the result for two strings, vectors (slices), or iterators respectively. In addition, there are for_words and for_bigrams methods that split the text into words or bigrams respectively before calculating the distance.
- Each method returns a textdistance::Result that provides methods to get the absolute (val) or normalized (nval) value of the metric, the distance (dist and ndist), or the similarity (sim and nsim).
Unicode support
The for_str method (and hence all functions in the str and nstr modules) uses String::chars to split the string and then runs it through the for_iter method. So a decomposed é will be considered two distinct characters ("latin small letter e" and "combining acute accent"). Usually that's OK, and this is how Python works. You can read more in the official Rust documentation.
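You can see the distinction with nothing but the standard library; here é is written in its decomposed (NFD) form:

```rust
fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as "é"
    let s = "e\u{301}";
    // chars() yields Unicode scalar values, so the accent is a separate item
    assert_eq!(s.chars().count(), 2);
    // in UTF-8 bytes it is longer still: 1 byte for 'e' + 2 for the accent
    assert_eq!(s.len(), 3);
}
```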
If you want é to be considered a single symbol, use the unicode-segmentation crate:
use textdistance::{Algorithm, DamerauLevenshtein};
use unicode_segmentation::UnicodeSegmentation;
let s1 = "a̐éö̲\r\n";
let s2 = "éa̐ö̲\r\n";
let g1 = s1.graphemes(true);
let g2 = s2.graphemes(true);
let a = DamerauLevenshtein::default();
let r = a.for_iter(g1, g2);
assert!(r.val() == 1);
Choosing the algorithm
The algorithm to use depends on your use case. First, you need to decide on the algorithm category:
- Edit-based algorithms work well on short sequences for detecting typos and minor changes.
- Token-based algorithms work well on longer sequences for comparing long texts with noticeable modifications.
- Sequence-based algorithms work well for calculating diff size between the original and the changed version of the sequence.

If you go with edit-based, the next thing is to decide what kind of changes you need to detect:

- Substitution. One character is replaced by another.
- Addition. A new character is added.
- Deletion. A character is removed.
- Transposition. Two sequential characters are swapped.
| alg | sub | add | del | trans |
|-----|-----|-----|-----|-------|
| Hamming | ✅ | ❌ | ❌ | ❌ |
| Jaro | ✅ | ✅ | ✅ | ❌ |
| JaroWinkler | ✅ | ✅ | ✅ | ❌ |
| Sift4 | ✅ | ✅ | ✅ | ✅ |
| Levenshtein | ✅ | ✅ | ✅ | ❌ |
| DamerauLevenshtein | ✅ | ✅ | ✅ | ✅ |
- Hamming is the fastest one but detects only substitutions.
- Sift4 is very fast but not as well-known and battle-tested as the other algorithms.
- Jaro is slower than Sift4 but well-known and battle-tested.
- JaroWinkler is like Jaro but gives more weight to strings with a matching prefix.
- Levenshtein detects everything except transpositions and is faster than DamerauLevenshtein (but slower than the other algorithms).
- DamerauLevenshtein ticks all the boxes but is very slow.
Some real-world use cases:

- DamerauLevenshtein, with some optimizations, is used in cargo to correct typos in command names.
- Jaro is included in the Elixir standard library (String.jaro_distance). It is used by the compiler and by mix (cargo for Elixir) to provide "did you mean?" functionality for typos in module or command names.
- A RatcliffObershelp variation is included in the Python standard library (difflib.SequenceMatcher).
Benchmarks
Legend:

- very slow: > 5 ms
- slow: > 1 ms
- fast: > 500 µs
- very fast: < 500 µs
Edit-based (and their normalizations):

| algorithm | time |
|-----------|------|
| hamming | 19.203 µs (very fast) |
| mlipns | 20.625 µs (very fast) |
| sift4_simple | 143.69 µs (very fast) |
| sift4_common | 238.86 µs (very fast) |
| jaro | 1.7148 ms (slow) |
| jaro_winkler | 1.7174 ms (slow) |
| levenshtein | 4.5999 ms (slow) |
| yujian_bo | 4.6044 ms (slow) |
| lig3 | 6.5563 ms (very slow) |
| smith_waterman | 9.5146 ms (very slow) |
| damerau_levenshtein_restricted | 10.301 ms (very slow) |
| damerau_levenshtein | 41.938 ms (very slow) |
Token-based:

| algorithm | time |
|-----------|------|
| cosine | 508.59 µs (fast) |
| sorensen_dice | 510.75 µs (fast) |
| tversky | 512.41 µs (fast) |
| overlap | 513.76 µs (fast) |
| bag | 523.06 µs (fast) |
| jaccard | 580.79 µs (fast) |
| roberts | 714.79 µs (fast) |
| entropy_ncd | 731.68 µs (fast) |
Sequence-based:

| algorithm | time |
|-----------|------|
| lcsstr | 3.2658 ms (slow) |
| lcsseq | 7.4349 ms (very slow) |
| ratcliff_obershelp | 36.308 ms (very slow) |
Naive:

| algorithm | time |
|-----------|------|
| length | 2.5300 µs (very fast) |
| prefix | 22.473 µs (very fast) |
| suffix | 38.821 µs (very fast) |
The benchmarks are powered by criterion and live in the benches directory. They are quite simple: grab 10 open-source licenses, take a 200-character prefix of each, and cross-compare these prefixes. The numbers may look very different for a different kind of input, a different input length, when comparing words rather than characters, or when running the benchmarks on a different machine. The goal of these benchmarks is to provide basic guidance rather than a definitive answer. If performance is critical for your application, I encourage you to run benchmarks on the real data you have.
Versioning
We stick to SemVer:
- The patch number is for bug fixes. The results of an algorithm may change in some corner cases if we find that the previous implementation doesn't match the algorithm described in the original paper.
- The minor number is for new algorithms and features.
- The major number is for big changes in the API. We try to avoid breaking things, but we prefer providing a friendly and convenient API over maintaining backward compatibility.
Limitations
- In the original textdistance, most of the algorithms are adjusted to work on any number of input sequences. However, Rust doesn't support variadic arguments, so all algorithms are currently implemented for exactly two inputs.
- All algorithms in the crate implement the same Algorithm trait. Hence, metrics that place additional constraints on the input sequence elements beyond Eq (like Editex and MRA, which work only with ASCII letters) currently cannot be implemented.
- Most of the implemented algorithms have certain properties (like the commutative property) that make their behavior closer to what you would expect and make normalization simple. So, I haven't yet implemented NeedlemanWunsch and Gotoh, mostly because they are tricky to normalize and I'm still not 100% sure that I did it correctly in the original textdistance.
Acknowledgments
These are the libraries I used as reference implementations and as sources of test cases:

- Python: textdistance, abydos, jellyfish.
- JS: talisman.
- Rust: strsim, distance, levenshtein-rs.

Special thanks to Trevor Gross for transferring to me the ownership of the textdistance name on crates.io.
Testing locally
To run everything locally, all you need is Rust, Python, and task. Execute task all to run all code formatters, linters, and tests.

Thank you ❤️