1 unstable release
0.1.0 | Feb 25, 2020 |
---|
#2154 in Algorithms
543 downloads per month
76KB
1.5K
SLoC
str-distance
A crate to evaluate distances between strings (and others).
Heavily inspired by the julia StringDistances
Distance Metrics
-
Q-gram distances compare the set of all slices of length
q
in each str, whereq > 0
- QGram Distance
Qgram::new(usize)
- Cosine Distance
Cosine::new(usize)
- Jaccard Distance
Jaccard::new(usize)
- Sorensen-Dice Distance
SorensenDice::new(usize)
- Overlap Distance
Overlap::new(usize)
- QGram Distance
-
The crate includes distance "modifiers", that can be applied to any distance.
- Winkler diminishes the distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but this package defines it for any string distance.
- TokenSort adjusts for differences in word orders by reording words alphabetically.
- TokenSet adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
Usage
The str_distance::str_distance*
convenience functions.
str_distance
and str_distance_normalized
take the two string inputs for which the distance is determined using the passed 'DistanceMetric.
str_distance_normalized` evaluates the normalized distance between two strings. A value of '0.0' corresponds to the "zero distance", both strings are considered equal by means of the metric, whereas a value of '1.0' corresponds to the maximum distance that can exist between the strings.
Calling the str_distance::str_distance*
is just convenience for DistanceMetric.str_distance*("", "")
Example
Levenshtein metrics offer the possibility to define a maximum distance at which the further calculation of the exact distance is aborted early.
Distance
use str_distance::*;
// calculate the exact distance
assert_eq!(str_distance("kitten", "sitting", Levenshtein::default()), DistanceValue::Exact(3));
// short circuit if distance exceeds 10
let s1 = "Wisdom is easily acquired when hiding under the bed with a saucepan on your head.";
let s2 = "The quick brown fox jumped over the angry dog.";
assert_eq!(str_distance(s1, s2, Levenshtein::with_max_distance(10)), DistanceValue::Exceeded(10));
Normalized Distance
use str_distance::*;
assert_eq!(str_distance_normalized("" , "", Levenshtein::default()), 0.0);
assert_eq!(str_distance_normalized("nacht", "nacht", Levenshtein::default()), 0.0);
assert_eq!(str_distance_normalized("abc", "def", Levenshtein::default()), 1.0);
The DistanceMetric
trait
use str_distance::{DistanceMetric, SorensenDice};
// QGram metrics require the length of the underlying fragment length to use for comparison.
// For `SorensenDice` default is 2.
assert_eq!(SorensenDice::new(2).str_distance("nacht", "night"), 0.75);
DistanceMetric
was designed for str
types, but is not limited to. Calculating distance is possible for all data types which are comparable and are passed as 'IntoIterator', e.g. as Vec
use str_distance::{DistanceMetric, Levenshtein, DistanceValue};
assert_eq!(*Levenshtein::default().distance(&[1,2,3], &[1,2,3,4,5,6]),3);
Documentation
Full docs available at docs.rs
References
- StringDistances
- The stringdist Package for Approximate String Matching Mark P.J. van der Loo
- fuzzywuzzy
License
Licensed under either of these:
- Apache License, Version 2.0, (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)