eddie

Fast and well-tested implementations of edit distance/string similarity metrics: Levenshtein, Damerau-Levenshtein, Hamming, Jaro, and Jaro-Winkler

13 unstable releases (3 breaking)

0.4.2 Jan 18, 2020
0.4.1 Dec 12, 2019
0.3.2 Dec 5, 2019
0.2.5 Nov 25, 2019
0.1.0 Nov 22, 2019

3,281 downloads per month
Used in 9 crates (7 directly)

MIT license

130KB
2.5K SLoC

Eddie

Fast and well-tested implementations of edit distance/string similarity metrics:

  • Levenshtein,
  • Damerau-Levenshtein,
  • Hamming,
  • Jaro,
  • Jaro-Winkler.

Documentation

See the API reference.

Installation

Add this to your Cargo.toml:

[dependencies]
eddie = "0.4"

Basic usage

Levenshtein:

use eddie::Levenshtein;
let lev = Levenshtein::new();
let dist = lev.distance("martha", "marhta");
assert_eq!(dist, 2);

Damerau-Levenshtein:

use eddie::DamerauLevenshtein;
let damlev = DamerauLevenshtein::new();
let dist = damlev.distance("martha", "marhta");
assert_eq!(dist, 1);

Hamming:

use eddie::Hamming;
let hamming = Hamming::new();
let dist = hamming.distance("martha", "marhta");
assert_eq!(dist, Some(2));

Jaro:

use eddie::Jaro;
let jaro = Jaro::new();
let sim = jaro.similarity("martha", "marhta");
assert!((sim - 0.94).abs() < 0.01);

Jaro-Winkler:

use eddie::JaroWinkler;
let jarwin = JaroWinkler::new();
let sim = jarwin.similarity("martha", "marhta");
assert!((sim - 0.96).abs() < 0.01);
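The 0.94 and 0.96 figures above can be reproduced by hand. The sketch below applies the textbook Jaro and Jaro-Winkler formulas to counts worked out manually for "martha" and "marhta": 6 matching characters, 1 transposition (the swapped "t" and "h"), and a common prefix of 3 ("mar"). The helper names are hypothetical and not part of eddie, and the scaling factor 0.1 with a prefix cap of 4 are the conventional values; that eddie's defaults match them is an assumption, although the result agrees with the assertions above.

```rust
/// Jaro similarity from precomputed counts (textbook formula).
/// Hypothetical helper for illustration; not eddie's API.
fn jaro_from_counts(len1: usize, len2: usize, matches: usize, transpositions: usize) -> f64 {
    if matches == 0 {
        return 0.0;
    }
    let m = matches as f64;
    (m / len1 as f64 + m / len2 as f64 + (m - transpositions as f64) / m) / 3.0
}

/// Winkler's prefix boost: the common prefix is capped at 4 characters
/// (conventional value, assumed here).
fn jaro_winkler_from_jaro(jaro: f64, common_prefix: usize, scaling: f64) -> f64 {
    let prefix = common_prefix.min(4) as f64;
    jaro + prefix * scaling * (1.0 - jaro)
}

fn main() {
    // "martha" vs "marhta": 6 matches, 1 transposition, common prefix "mar".
    let jaro = jaro_from_counts(6, 6, 6, 1);
    let jarwin = jaro_winkler_from_jaro(jaro, 3, 0.1);
    println!("jaro = {:.4}, jaro-winkler = {:.4}", jaro, jarwin);
}
```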

Strings vs slices

The crate exposes two modules containing two sets of implementations:

  • eddie::str for comparing UTF-8 encoded &str and &String values. Implementations are reexported in the root module.
  • eddie::slice for comparing generic slices &[T]. Implementations in this module are significantly faster than those from eddie::str, but will produce incorrect results for UTF-8 and other variable-width character encodings, because they compare individual elements (such as bytes) rather than characters.

Usage example:

use eddie::slice::Levenshtein;

let lev = Levenshtein::new();
let dist = lev.distance(&[1, 2, 3], &[1, 3]);
assert_eq!(dist, 1);
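To see why an element-wise comparison can disagree with a character-wise one on UTF-8 input, here is a minimal sketch using only the standard library. The `hamming` helper is hypothetical and is not eddie's code; it only mirrors the idea of comparing two slices position by position, and returning `None` for slices of unequal length is an assumption made for this example.

```rust
/// Hamming distance over generic slices; `None` when the lengths differ.
/// Hypothetical helper for illustration; not eddie's implementation.
fn hamming<T: PartialEq>(a: &[T], b: &[T]) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b.iter()).filter(|(x, y)| x != y).count())
}

fn main() {
    let (s1, s2) = ("日", "月");

    // Raw UTF-8 bytes: each character is three bytes, and two of them differ.
    let by_bytes = hamming(s1.as_bytes(), s2.as_bytes());

    // Decoded characters: exactly one substitution.
    let c1: Vec<char> = s1.chars().collect();
    let c2: Vec<char> = s2.chars().collect();
    let by_chars = hamming(&c1, &c2);

    println!("bytes: {:?}, chars: {:?}", by_bytes, by_chars);
}
```

The byte-level result counts differing bytes, not differing characters, which is the kind of mismatch the warning above refers to. For ASCII-only data the two views coincide.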

Complementary metrics

The main metric methods are complemented with inverted and/or relative versions. The naming convention across the crate is as follows:

  • distance — the number of edits required to transform one string into the other;
  • rel_dist — the distance between two strings relative to string length (the inverse of similarity);
  • similarity — the similarity between two strings (the inverse of relative distance).
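The relationship between the three metrics can be sketched with a plain Levenshtein implementation. This is an illustration under stated assumptions, not eddie's code: the functions below are free-standing stand-ins that reuse the crate's method names, the relative distance is assumed to be normalized by the length of the longer string in characters, and "inversion" is assumed to mean `1.0 - value`.

```rust
/// Dynamic-programming Levenshtein distance over chars.
/// Illustrative stand-in; not eddie's implementation.
fn distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the chars of `a` seen so far
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1)   // deletion
                .min(curr[j] + 1);      // insertion
            curr.push(best);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Assumed normalization: distance divided by the longer string's length in chars.
fn rel_dist(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 0.0;
    }
    distance(a, b) as f64 / max_len as f64
}

/// Assumed inversion: similarity = 1 - relative distance.
fn similarity(a: &str, b: &str) -> f64 {
    1.0 - rel_dist(a, b)
}

fn main() {
    let (a, b) = ("martha", "marhta");
    println!("distance   = {}", distance(a, b));
    println!("rel_dist   = {:.3}", rel_dist(a, b));
    println!("similarity = {:.3}", similarity(a, b));
}
```

Under these assumptions, rel_dist and similarity always sum to 1.0, so either can be derived from the other.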

Performance

At the moment, Eddie has the fastest implementations among the alternatives on crates.io that support Unicode.

For example, when comparing common English words, you can expect at least a 1.5-2x speedup for any given algorithm except Hamming.

For detailed measurement tables, see the Benchmarks page.

No runtime deps