
find-simdoc

Time- and memory-efficient all pairs similarity searches in documents

2 releases

0.1.1 Sep 26, 2022
0.1.0 Sep 25, 2022


MIT/Apache

73KB
1.5K SLoC


API documentation

https://docs.rs/find-simdoc


lib.rs:

Time- and memory-efficient all pairs similarity searches in documents. A more detailed description can be found on the project page.

Problem definition

  • Input
    • List of documents
    • Distance function
    • Radius threshold
  • Output
    • All pairs of similar document ids
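
As a point of reference, the problem can be solved naively in quadratic time. The following minimal Rust sketch (not the crate's API or algorithm) compares every pair of documents using the Jaccard distance over character 3-gram sets and reports the pairs within a radius threshold.

```rust
use std::collections::HashSet;

// Jaccard distance between two feature sets: 1 - |A ∩ B| / |A ∪ B|.
fn jaccard_distance(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let inter = a.intersection(b).count() as f64;
    let union = a.union(b).count() as f64;
    if union == 0.0 { 0.0 } else { 1.0 - inter / union }
}

// Character n-gram set of a document.
fn char_ngrams(doc: &str, n: usize) -> HashSet<String> {
    let chars: Vec<char> = doc.chars().collect();
    if chars.len() < n {
        // A document shorter than n contributes its whole text as one feature.
        return HashSet::from([chars.iter().collect()]);
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

fn main() {
    // Input: a list of documents, a distance function, and a radius threshold.
    let documents = [
        "apple orange banana grape",
        "apple orange banana melon",
        "completely different text here",
    ];
    // Radius threshold on the Jaccard distance.
    let radius = 0.5;

    let features: Vec<_> = documents.iter().map(|d| char_ngrams(d, 3)).collect();

    // Output: all pairs of similar document ids. This brute-force loop is the
    // quadratic blow-up that the crate avoids with LSH and sketch sorting.
    for i in 0..features.len() {
        for j in (i + 1)..features.len() {
            let dist = jaccard_distance(&features[i], &features[j]);
            if dist <= radius {
                println!("({i}, {j}): Jaccard distance = {dist:.3}");
            }
        }
    }
}
```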

Features

Easy to use

This software supports every essential step of document similarity search, from feature extraction to the output of similar pairs, so you can immediately run a fast all-pairs similarity search on your own document files.

Flexible tokenization

You can specify any delimiter for splitting text into words during tokenization for feature extraction. This is useful for languages such as Japanese or Chinese, where word segmentation is not clear-cut.
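
As an illustration of the idea (a hypothetical tokenize helper, not the crate's tokenizer): split on a caller-supplied delimiter when one exists, and fall back to character-level tokens otherwise.

```rust
// Hypothetical tokenization sketch: split on an optional delimiter, or fall
// back to per-character tokens for unsegmented text (e.g., Japanese or Chinese).
fn tokenize(doc: &str, delimiter: Option<char>) -> Vec<String> {
    match delimiter {
        Some(d) => doc
            .split(d)
            .filter(|t| !t.is_empty())
            .map(str::to_owned)
            .collect(),
        None => doc.chars().map(|c| c.to_string()).collect(),
    }
}

fn main() {
    // Word tokens for space-delimited text.
    println!("{:?}", tokenize("time and memory efficient", Some(' ')));
    // Character tokens when no delimiter applies.
    println!("{:?}", tokenize("本を探す", None));
}
```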

Time and memory efficiency

Both the time and memory complexity are linear in the number of input documents and output results, building on the ideas behind locality-sensitive hashing (LSH) and the sketch sorting approach.

Tunable search performance

LSH allows accuracy, time, and memory to be tuned through a single manual parameter that specifies the number of sketch dimensions, so you can adapt searches to your dataset and machine environment; a rough illustration of the trade-off follows the list below.

  • Fewer dimensions give faster but rougher searches with less memory usage.
  • More dimensions give more accurate searches at the cost of more memory.
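
To give a feel for this trade-off, here is a rough back-of-the-envelope sketch. It relies on a generic property of binary LSH sketches (not a measurement of this crate): the similarity estimated from a K-bit sketch has a standard error on the order of 1/sqrt(K), while the memory per document grows linearly in K.

```rust
// Rough illustration of the accuracy/memory trade-off for a K-bit sketch:
// the standard error of the estimated similarity shrinks like 1/sqrt(K),
// while the sketch size grows linearly in K.
fn main() {
    for k in [64u32, 256, 1024, 4096] {
        let std_err = 1.0 / (k as f64).sqrt();
        let bytes_per_doc = k / 8;
        println!("{k:5} dims: ~{std_err:.3} standard error, {bytes_per_doc} bytes per document");
    }
}
```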

Search steps

  1. Extract features from documents
    • Set representation of character or word n-grams
    • TF-IDF-weighted vector representation of character or word n-grams
  2. Convert the features into binary sketches through locality sensitive hashing
  3. Search for similar sketches in the Hamming space using a modified variant of the sketch sorting approach
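
The sketch below is a compact, self-contained illustration of these steps under simplifying assumptions (word 1-gram features, a single 64-bit sketch built with 1-bit minwise hashing, and a pigeonhole chunk filter standing in for the crate's modified sketch sorting); it is not the crate's implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

// Step 2 (simplified): 1-bit minwise hashing. Each of the 64 sketch bits is
// the lowest bit of the minimum hash of the feature set under one seed.
// A seeded DefaultHasher stands in for a proper family of hash functions.
fn sketch64(features: &HashSet<String>, seeds: &[u64; 64]) -> u64 {
    let mut sketch = 0u64;
    for (i, &seed) in seeds.iter().enumerate() {
        let min = features
            .iter()
            .map(|f| {
                let mut h = DefaultHasher::new();
                seed.hash(&mut h);
                f.hash(&mut h);
                h.finish()
            })
            .min()
            .unwrap_or(0);
        sketch |= (min & 1) << i;
    }
    sketch
}

// Step 3 (simplified): split each sketch into `blocks` chunks. If two sketches
// are within Hamming distance `radius` and `blocks > radius`, they must agree
// exactly on at least one chunk (pigeonhole), so grouping by chunk value yields
// all candidate pairs, which are then verified with the full Hamming distance.
fn similar_pairs(sketches: &[u64], radius: u32, blocks: u32) -> Vec<(usize, usize)> {
    assert!(64 % blocks == 0 && blocks > radius);
    let width = 64 / blocks;
    let mut pairs = HashSet::new();
    for b in 0..blocks {
        let mut buckets: HashMap<u64, Vec<usize>> = HashMap::new();
        for (id, &s) in sketches.iter().enumerate() {
            let chunk = (s >> (b * width)) & ((1u64 << width) - 1);
            buckets.entry(chunk).or_default().push(id);
        }
        for ids in buckets.values() {
            for (x, &i) in ids.iter().enumerate() {
                for &j in &ids[x + 1..] {
                    if (sketches[i] ^ sketches[j]).count_ones() <= radius {
                        pairs.insert((i, j));
                    }
                }
            }
        }
    }
    let mut out: Vec<_> = pairs.into_iter().collect();
    out.sort_unstable();
    out
}

fn main() {
    let documents = [
        "time and memory efficient all pairs similarity search in documents",
        "time and memory efficient all pairs similarity search for documents",
        "a completely unrelated sentence about cooking pasta at home",
    ];
    // Step 1 (simplified): word 1-gram sets as features.
    let features: Vec<HashSet<String>> = documents
        .iter()
        .map(|d| d.split_whitespace().map(str::to_owned).collect())
        .collect();

    let seeds: [u64; 64] =
        std::array::from_fn(|i| (i as u64 + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15));
    let sketches: Vec<u64> = features.iter().map(|f| sketch64(f, &seeds)).collect();

    // Under 1-bit minwise hashing, the normalized Hamming distance is about
    // half the Jaccard distance, so a radius of 12/64 bits targets pairs
    // within a Jaccard distance of roughly 0.375.
    for (i, j) in similar_pairs(&sketches, 12, 16) {
        println!("similar pair: ({i}, {j})");
    }
}
```

Grouping sketches by chunk value (here with a hash map; sorting by each chunk achieves the same grouping, which is where sketch sorting gets its name) keeps the work close to linear in the number of documents plus the number of candidate pairs that are verified.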

Dependencies

~3.5MB
~59K SLoC