12 releases (stable)

2.0.1	Jul 3, 2024
2.0.0	Feb 8, 2024
2.0.0-alpha-2	Jan 5, 2023
2.0.0-alpha-1	Aug 12, 2022
0.1.1	Jun 6, 2021

#249 in Text processing

653 downloads per month
Used in 9 crates (via sos-search)

MIT license

71KB
1.5K SLoC

probly-search ·

A full-text search library, written in Rust, optimized for insertion speed, that provides full control over the scoring calculations.

This start initially as a port of the Node library NDX.

Demo

Recipe (title) search with 50k documents.

https://quantleaf.github.io/probly-search-demo/

Features

Three ways to do scoring
- BM25 ranking function to rank matching documents. The same ranking function that is used by default in Lucene >= 6.0.0.
- zero-to-one, a library unique scoring function that provides a normalized score that is bounded by 0 and 1. Perfect for matching titles/labels with queries.
- Ability to fully customize your own scoring function by implenting the ScoreCalculator trait.
Trie based dynamic Inverted Index.
Multiple fields full-text indexing and searching.
Per-field score boosting.
Configurable tokenizer.
Free text queries with query expansion.
Fast allocation, but latent deletion.
WASM compatible

Documentation

Adding, Removing and Searching documents

See Integration tests.

Use this library with WASM

See recipe search demo project

A basic example

Creating an index with a document that has 2 fields. Query documents, and remove a document.

use std::collections::HashSet;
use probly_search::{
    index::Index,
    query::{
        score::default::{bm25, zero_to_one},
        QueryResult,
    },
};

// A white space tokenizer
fn tokenizer(s: &str) -> Vec<Cow<str>> {
     s.split(' ').map(Cow::from).collect::<Vec<_>>()
}

// We have to provide extraction functions for the fields we want to index

// Title
fn title_extract(d: &Doc) -> Vec<&str> {
    vec![d.title.as_str()]
}

// Description
fn description_extract(d: &Doc) -> Vec<&str> {
    vec![d.description.as_str()]
}

// Create index with 2 fields
let mut index = Index::<usize>::new(2);

// Create docs from a custom Doc struct
let doc_1 = Doc {
    id: 0,
    title: "abc".to_string(),
    description: "dfg".to_string(),
};

let doc_2 = Doc {
    id: 1,
    title: "dfgh".to_string(),
    description: "abcd".to_string(),
};

// Add documents to index
index.add_document(
    &[title_extract, description_extract],
    tokenizer,
    doc_1.id,
    &doc_1,
);

index.add_document(
    &[title_extract, description_extract],
    tokenizer,
    doc_2.id,
    &doc_2,
);

// Search, expected 2 results
let mut result = index.query(
    &"abc",
    &mut bm25::new(),
    tokenizer,
    &[1., 1.],
);
assert_eq!(result.len(), 2);
assert_eq!(
    result[0],
    QueryResult {
        key: 0,
        score: 0.6931471805599453
    }
);
assert_eq!(
    result[1],
    QueryResult {
        key: 1,
        score: 0.28104699650060755
    }
);

// Remove documents from index
index.remove_document(doc_1.id);

// Vacuum to remove completely
index.vacuum();

// Search, expect 1 result
result = index.query(
    &"abc",
    &mut bm25::new(),
    tokenizer,
    &[1., 1.],
);
assert_eq!(result.len(), 1);
assert_eq!(
    result[0],
    QueryResult {
        key: 1,
        score: 0.1166450426074421
    }
);

Go through source tests in for the BM25 implementation and zero-to-one implementation for more query examples.

Testing

Run all tests with

cargo test

Benchmark

Run all benchmarks with

cargo bench

License

MIT

Dependencies

~2MB
~30K SLoC