12 releases (stable)
Version | Released |
---|---|
2.0.1 | Jul 3, 2024 |
2.0.0 | Feb 8, 2024 |
2.0.0-alpha-2 | Jan 5, 2023 |
2.0.0-alpha-1 | Aug 12, 2022 |
0.1.1 | Jun 6, 2021 |
#153 in Text processing
686 downloads per month
Used in 7 crates (via sos-sdk)
71KB, 1.5K SLoC
probly-search
A full-text search library, written in Rust, optimized for insertion speed, that provides full control over the scoring calculations.
This started as a port of the Node library NDX.
Demo
Recipe (title) search with 50k documents.
https://quantleaf.github.io/probly-search-demo/
Features
- Three ways to do scoring:
  - BM25 ranking function to rank matching documents. This is the same ranking function that Lucene >= 6.0.0 uses by default.
  - zero-to-one, a scoring function unique to this library that provides a normalized score bounded by 0 and 1. Well suited to matching titles/labels against queries.
  - Ability to fully customize your own scoring function by implementing the ScoreCalculator trait.
- Trie-based dynamic inverted index.
- Multiple-field full-text indexing and searching.
- Per-field score boosting.
- Configurable tokenizer.
- Free-text queries with query expansion.
- Fast allocation, but latent deletion.
- WASM compatible.
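To make the BM25 bullet concrete, here is a minimal, standalone sketch of the textbook BM25 formula for a single query term, using the non-negative IDF variant associated with Lucene. It does not use probly-search at all: the function name, parameter names, and the defaults k1 = 1.2 and b = 0.75 are assumptions chosen for illustration, and the library's own implementation may differ in detail.

```rust
/// Textbook BM25 score of one query term against one document field.
/// Hypothetical helper for illustration; not part of the probly-search API.
fn bm25_term_score(
    term_freq: f64,     // occurrences of the term in this field
    doc_freq: f64,      // number of documents containing the term
    doc_count: f64,     // total number of indexed documents
    field_len: f64,     // token count of this field
    avg_field_len: f64, // average token count of the field across documents
    k1: f64,            // term-frequency saturation (assumed 1.2 below)
    b: f64,             // length-normalization strength (assumed 0.75 below)
) -> f64 {
    // Non-negative inverse document frequency: rarer terms weigh more.
    let idf = (1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5)).ln();
    // Fields longer than average are penalized, shorter ones are rewarded.
    let norm = 1.0 - b + b * field_len / avg_field_len;
    idf * term_freq * (k1 + 1.0) / (term_freq + k1 * norm)
}

fn main() {
    // A term found in 1 of 2 documents versus a term found in both.
    let rare = bm25_term_score(1.0, 1.0, 2.0, 1.0, 1.0, 1.2, 0.75);
    let common = bm25_term_score(1.0, 2.0, 2.0, 1.0, 1.0, 1.2, 0.75);
    println!("rare = {rare:.4}, common = {common:.4}");
}
```

Under these assumptions, a term that occurs in fewer documents scores higher, and the same match in a longer-than-average field scores lower.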
Documentation
Adding, Removing and Searching documents
See Integration tests.
Use this library with WASM
See recipe search demo project
A basic example
Create an index for documents with 2 fields, query it, and then remove a document.
use std::borrow::Cow;

use probly_search::{
    index::Index,
    query::{
        score::default::{bm25, zero_to_one},
        QueryResult,
    },
};

// The custom document type used in this example
struct Doc {
    id: usize,
    title: String,
    description: String,
}

// A white space tokenizer
fn tokenizer(s: &str) -> Vec<Cow<str>> {
    s.split(' ').map(Cow::from).collect::<Vec<_>>()
}

// We have to provide extraction functions for the fields we want to index

// Title
fn title_extract(d: &Doc) -> Vec<&str> {
    vec![d.title.as_str()]
}

// Description
fn description_extract(d: &Doc) -> Vec<&str> {
    vec![d.description.as_str()]
}

fn main() {
    // Create index with 2 fields
    let mut index = Index::<usize>::new(2);

    // Create docs from a custom Doc struct
    let doc_1 = Doc {
        id: 0,
        title: "abc".to_string(),
        description: "dfg".to_string(),
    };
    let doc_2 = Doc {
        id: 1,
        title: "dfgh".to_string(),
        description: "abcd".to_string(),
    };

    // Add documents to index
    index.add_document(
        &[title_extract, description_extract],
        tokenizer,
        doc_1.id,
        &doc_1,
    );
    index.add_document(
        &[title_extract, description_extract],
        tokenizer,
        doc_2.id,
        &doc_2,
    );

    // Search, expect 2 results
    let mut result = index.query(
        &"abc",
        &mut bm25::new(),
        tokenizer,
        &[1., 1.],
    );
    assert_eq!(result.len(), 2);
    assert_eq!(
        result[0],
        QueryResult {
            key: 0,
            score: 0.6931471805599453
        }
    );
    assert_eq!(
        result[1],
        QueryResult {
            key: 1,
            score: 0.28104699650060755
        }
    );

    // Remove a document from the index
    index.remove_document(doc_1.id);

    // Vacuum to remove it completely
    index.vacuum();

    // Search, expect 1 result
    result = index.query(
        &"abc",
        &mut bm25::new(),
        tokenizer,
        &[1., 1.],
    );
    assert_eq!(result.len(), 1);
    assert_eq!(
        result[0],
        QueryResult {
            key: 1,
            score: 0.1166450426074421
        }
    );
}
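The remove-then-vacuum sequence in the example reflects the "latent deletion" feature. One common way to get this behavior is to tombstone removed keys and clean up posting lists later. The toy index below is a sketch of that general idea only: ToyIndex and all of its methods are hypothetical, it uses a plain HashMap rather than a trie, and it is not a description of how probly-search is implemented internally.

```rust
use std::collections::{HashMap, HashSet};

/// Toy inverted index with deferred deletion.
/// Hypothetical type for illustration; not probly-search code.
struct ToyIndex {
    postings: HashMap<String, Vec<usize>>, // term -> document keys
    removed: HashSet<usize>,               // tombstoned keys awaiting vacuum
}

impl ToyIndex {
    fn new() -> Self {
        ToyIndex {
            postings: HashMap::new(),
            removed: HashSet::new(),
        }
    }

    /// Indexes whitespace-separated terms of `text` under `key`.
    fn add(&mut self, key: usize, text: &str) {
        for term in text.split_whitespace() {
            let keys = self.postings.entry(term.to_string()).or_default();
            if !keys.contains(&key) {
                keys.push(key);
            }
        }
    }

    /// Cheap removal: only records the key, posting lists stay untouched.
    fn remove(&mut self, key: usize) {
        self.removed.insert(key);
    }

    /// Returns matching keys, skipping tombstoned ones.
    fn query(&self, term: &str) -> Vec<usize> {
        self.postings
            .get(term)
            .map(|keys| {
                keys.iter()
                    .copied()
                    .filter(|k| !self.removed.contains(k))
                    .collect::<Vec<usize>>()
            })
            .unwrap_or_default()
    }

    /// Physically drops tombstoned keys and any posting lists left empty.
    fn vacuum(&mut self) {
        let removed = std::mem::take(&mut self.removed);
        self.postings.retain(|_, keys| {
            keys.retain(|k| !removed.contains(k));
            !keys.is_empty()
        });
    }
}

fn main() {
    let mut index = ToyIndex::new();
    index.add(0, "abc dfg");
    index.add(1, "dfgh abc");
    index.remove(0);
    println!("matches after remove: {:?}", index.query("abc"));
    index.vacuum();
    println!("terms after vacuum: {}", index.postings.len());
}
```

In this sketch, removal is constant-time and the cost of rewriting posting lists is paid once, in vacuum, which is the trade-off the feature list hints at.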
See the source tests for the BM25 and zero-to-one implementations for more query examples.
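Because the tokenizer is just a function from a string slice to a vector of Cow<str>, the whitespace tokenizer in the example can be swapped for something stricter. The sketch below is one possible variant, assuming the same signature as the example's tokenizer: it splits on any non-alphanumeric character and lowercases tokens. The name normalizing_tokenizer is made up for illustration and only the standard library is used.

```rust
use std::borrow::Cow;

/// Splits on any non-alphanumeric character and lowercases each token.
/// Tokens with no uppercase characters are borrowed; others are allocated.
/// Hypothetical helper for illustration; not part of probly-search.
fn normalizing_tokenizer(s: &str) -> Vec<Cow<'_, str>> {
    s.split(|c: char| !c.is_alphanumeric())
        // Consecutive separators produce empty pieces; drop them.
        .filter(|t| !t.is_empty())
        .map(|t| {
            if t.chars().any(char::is_uppercase) {
                Cow::Owned(t.to_lowercase())
            } else {
                Cow::Borrowed(t)
            }
        })
        .collect()
}

fn main() {
    let tokens = normalizing_tokenizer("Slow-cooked Beef, 2 ways");
    println!("{tokens:?}");
}
```

Returning Cow lets the tokenizer avoid allocating for tokens that need no normalization. The same tokenizer should be used for both indexing and querying, otherwise terms will not line up.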
Testing
Run all tests with
cargo test
Benchmark
Run all benchmarks with
cargo bench
License
Dependencies
~2MB, ~30K SLoC