3 stable releases
1.1.1 | Nov 2, 2019 |
---|---|
1.1.0 | Oct 28, 2019 |
1.0.0 | Oct 22, 2019 |
0.1.1 |
|
0.1.0 |
|
#1559 in Text processing
51KB
982 lines
ed_join
This is an implementation of Ed-join Algorithm proposed by Xiao et al. (2008) in Rust programming language. There're two major deviation from the original paper:
- In this implementation I used parallel iteratiors instead of matching strings with a single thread.
- I created an inverted index before the actual matching, while the paper generates an inverted index on-the-fly.
On a system with Intel i7-8700k and 32 GB ram, the program performed a self-match on a file with 100,000 lines in 35 mins, with tau = 5
.
Note that, this algorithm excels at matching long records, and the aforementioned testfile has rather short lines.
There's a faster algorithm is called PassJoin, and I'll implement it in the future. It adaptively use different heuristics for short and long strings respectively. It's performance on the same aforementioned testfile is more than 30 times faster. And therefore, this crate serves as an example implementation of Ed-join Algorithm, rather than a solution for real world scenarios.
Nevertheless, when the input file is not too large, or when the threshold tau
is not too large, the performance of this algorithm is still reasonably fast. For example, for the 100,000 testfile, when tau = 5
, there are more than 878,000 matched pairs of records. But when tau = 2
, there are only 136,000 matched paris and it only takes about 3 minutes to finish the matching.
Installation
To add this crate as a dependency, add it into your Cargo.toml
or execute cargo add ed_join
.
This crate also comes with an binary ed-join
, which could be installed with cargo install ed_join --features cli
.
Reference
- Xiao, Chuan, Wei Wang, and Xuemin Lin. "Ed-join: an efficient algorithm for similarity joins with edit distance constraints." Proceedings of the VLDB Endowment 1.1 (2008): 933-944.
License: Apache-2.0 AND MIT
Dependencies
~4–14MB
~194K SLoC