2 releases

0.1.1	Sep 26, 2022
0.1.0	Sep 25, 2022

#5 in #lsh

594 downloads per month

MIT/Apache

73KB
1.5K SLoC

find-simdoc

Time- and memory-efficient all-pairs similarity search in documents. A detailed description is available on the project page.

API documentation

https://docs.rs/find-simdoc

`lib.rs`:

Problem definition

Input
- List of documents
- Distance function
- Radius threshold
Output
- All pairs of similar document ids
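
To make the problem statement concrete, here is a brute-force baseline in plain Rust. It is only an illustration of the input and output, not how this crate searches: it compares every pair, so it is quadratic in the number of documents. The helper names (`char_ngrams`, `jaccard_distance`, `all_pairs_naive`), the choice of character n-grams with the Jaccard distance, and the inclusive `<= radius` comparison are all assumptions made for the example.

```rust
use std::collections::HashSet;

/// Character n-grams of a document as a set (hypothetical helper; `n` must be >= 1).
fn char_ngrams(text: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(n).map(|w| w.iter().collect::<String>()).collect()
}

/// Jaccard distance: 1 - |A and B| / |A or B|.
fn jaccard_distance(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0; // two empty sets are treated as identical
    }
    let inter = a.intersection(b).count();
    1.0 - inter as f64 / union as f64
}

/// Brute-force all-pairs search: returns (id_i, id_j, distance) with i < j.
fn all_pairs_naive(docs: &[&str], n: usize, radius: f64) -> Vec<(usize, usize, f64)> {
    let feats: Vec<HashSet<String>> = docs.iter().map(|d| char_ngrams(d, n)).collect();
    let mut out = Vec::new();
    for i in 0..feats.len() {
        for j in (i + 1)..feats.len() {
            let d = jaccard_distance(&feats[i], &feats[j]);
            if d <= radius {
                out.push((i, j, d));
            }
        }
    }
    out
}
```

For example, `all_pairs_naive(&["abcd", "abcde", "wxyz"], 2, 0.3)` reports only the pair of document ids `(0, 1)`, since the third document shares no bigram with the others.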

Features

Easy to use

This software supports all essential steps of document similarity search, from feature extraction to output of similar pairs, so you can try fast all-pairs similarity search on your own document files right away.

Flexible tokenization

You can specify any delimiter for splitting text into words during tokenization for feature extraction. This is useful for languages such as Japanese or Chinese, where word boundaries can be defined in more than one way.
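
A minimal sketch of delimiter-based tokenization, assuming the text has already been segmented with the delimiter of your choice. The function names `tokenize` and `ngrams`, the `Option<char>` delimiter, and the fallback to single characters when no delimiter is given are hypothetical choices for this example, not this crate's API.

```rust
/// Split `text` into tokens: by `delimiter` if given, otherwise into single characters.
fn tokenize(text: &str, delimiter: Option<char>) -> Vec<String> {
    match delimiter {
        Some(d) => text
            .split(d)
            .filter(|t| !t.is_empty()) // drop empty tokens from repeated delimiters
            .map(str::to_string)
            .collect(),
        None => text.chars().map(|c| c.to_string()).collect(),
    }
}

/// Join each window of `n` consecutive tokens into one n-gram feature (`n` must be >= 1).
fn ngrams(tokens: &[String], n: usize, sep: &str) -> Vec<String> {
    tokens.windows(n).map(|w| w.join(sep)).collect()
}
```

With a space delimiter, `"the quick brown fox"` yields the word bigrams `"the quick"`, `"quick brown"`, and `"brown fox"`. The same sentence segmented differently (for instance by another Japanese morphological analyzer) simply produces a different token list, and the n-gram step is unchanged.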

Time and memory efficiency

Time and memory complexity are linear in the number of input documents and output results, building on the ideas behind locality-sensitive hashing (LSH) and the sketch sorting approach.
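
One idea commonly used by methods of this kind is the pigeonhole principle: if two sketches differ in at most k bits and each sketch is cut into k + 1 blocks, at least one block must be identical. The sketch below illustrates that principle with hash-map buckets on 64-bit sketches. It is an assumption-laden illustration, not this crate's algorithm (which is described as a modified sketch sorting variant), and the function name `similar_pairs_by_blocks` is hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

/// Find all pairs (i, j), i < j, of 64-bit sketches within Hamming distance `k`.
/// Only sketches sharing at least one identical block are compared.
fn similar_pairs_by_blocks(sketches: &[u64], k: u32) -> Vec<(usize, usize)> {
    assert!(k < 64, "this illustration needs at least one bit per block");
    let blocks = (k + 1) as usize;
    let mut found = BTreeSet::new();
    for b in 0..blocks {
        // Block b covers the bit range [lo, hi).
        let lo = b * 64 / blocks;
        let hi = (b + 1) * 64 / blocks;
        let width = hi - lo;
        let mask = if width == 64 { u64::MAX } else { (1u64 << width) - 1 };
        // Group document ids by the value of this block.
        let mut buckets: HashMap<u64, Vec<usize>> = HashMap::new();
        for (id, &s) in sketches.iter().enumerate() {
            buckets.entry((s >> lo) & mask).or_default().push(id);
        }
        // Verify candidates inside each bucket with the full Hamming distance.
        for ids in buckets.values() {
            for (x, &i) in ids.iter().enumerate() {
                for &j in &ids[x + 1..] {
                    if (sketches[i] ^ sketches[j]).count_ones() <= k {
                        found.insert((i, j));
                    }
                }
            }
        }
    }
    found.into_iter().collect()
}
```

When buckets stay small, the work is dominated by bucketing the inputs and emitting the results, which is the intuition behind the linear behavior described above. A large bucket of near-identical sketches still costs one verification per candidate pair.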

Tunable search performance

LSH lets you trade off accuracy, time, and memory through a user-specified parameter: the number of search dimensions. You can adjust the search to suit your dataset and machine environment.

- Fewer dimensions give faster but rougher searches that use less memory.
- More dimensions give more accurate searches that use more memory.
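
The trade-off can be quantified for an idealized 1-bit minwise hashing, assuming every sketch bit is produced by an independent hash function. Under that assumption, a bit differs between two documents with probability d / 2, where d is their Jaccard distance, so the estimate from a sketch is 2 * hamming / bits and its standard error shrinks with the square root of the number of bits. The function names below are hypothetical and are not part of this crate.

```rust
/// Estimate the Jaccard distance from the Hamming distance between two sketches.
fn estimate_jaccard_distance(hamming: u32, bits: u32) -> f64 {
    (2.0 * hamming as f64 / bits as f64).min(1.0)
}

/// Approximate standard error of that estimate for a true Jaccard distance `d`,
/// treating the Hamming distance as a binomial count over `bits` trials.
fn standard_error(d: f64, bits: u32) -> f64 {
    let p = d / 2.0; // per-bit mismatch probability
    2.0 * (p * (1.0 - p) / bits as f64).sqrt()
}
```

Quadrupling the number of bits, for example from 64 to 256, halves the standard error while quadrupling the sketch memory per document.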

Search steps

Extract features from documents
- Set representation of character or word ngrams
- TF-IDF-weighted vector representation of character or word ngrams
Convert the features into binary sketches through locality sensitive hashing
- 1-bit minwise hashing for the Jaccard similarity
- Simplified simhash for the cosine similarity
Search for similar sketches in the Hamming space using a modified variant of the sketch sorting approach
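
The sketching step above can be illustrated with the standard library alone. This is a sketch under stated assumptions, not this crate's implementation: the seeded use of `DefaultHasher`, the fixed 64-bit sketch width, and the function names are all choices made for the example, and the simhash variant shown is a plain random-sign projection.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash `item` under the hash function identified by `seed`.
fn seeded_hash(seed: u64, item: &str) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    item.hash(&mut h);
    h.finish()
}

/// 1-bit minwise hashing: bit i is the lowest bit of the minimum hash value
/// of the feature set under the i-th hash function.
fn minhash_1bit_sketch(features: &HashSet<String>) -> u64 {
    let mut sketch = 0u64;
    for i in 0..64u64 {
        let min = features.iter().map(|f| seeded_hash(i, f)).min().unwrap_or(0);
        sketch |= (min & 1) << i;
    }
    sketch
}

/// Simplified simhash: bit i is the sign of the weighted sum after giving
/// every feature a pseudo-random +1 or -1 from the i-th hash function.
fn simhash_sketch(weighted: &[(String, f64)]) -> u64 {
    let mut sketch = 0u64;
    for i in 0..64u64 {
        let sum: f64 = weighted
            .iter()
            .map(|(f, w)| if (seeded_hash(i, f) & 1) == 1 { *w } else { -*w })
            .sum();
        if sum >= 0.0 {
            sketch |= 1u64 << i;
        }
    }
    sketch
}

/// Hamming distance between two 64-bit sketches.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Identical feature sets always produce identical sketches, and similar ones tend to differ in few bits, so the final search step only needs to compare bit strings in the Hamming space. Scaling all weights of a vector by a positive constant leaves its simhash sketch unchanged, which matches the scale invariance of the cosine similarity.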

Dependencies

~3.5MB
~60K SLoC