#search-engine #search #search-query #index #full-text-search #query #text-search

probly-search

A lightweight full-text search engine with a fully customizable scoring function

11 releases (7 stable)

2.0.0 Feb 8, 2024
2.0.0-alpha-2 Jan 5, 2023
2.0.0-alpha-1 Aug 12, 2022
1.2.4 Aug 7, 2021
0.1.1 Jun 6, 2021

#88 in Text processing

419 downloads per month
Used in 2 crates (via sos-sdk)

MIT license

71KB
1.5K SLoC

probly-search

A full-text search library, written in Rust, optimized for insertion speed, that provides full control over the scoring calculations.

This started as a port of the Node library NDX.

Demo

Recipe (title) search with 50k documents.

https://quantleaf.github.io/probly-search-demo/

Features

  • Three ways to do scoring

    • BM25 ranking function to rank matching documents. This is the same ranking function that Lucene >= 6.0.0 uses by default.
    • zero-to-one, a scoring function unique to this library that provides a normalized score bounded by 0 and 1. Well suited to matching titles/labels with queries.
    • Ability to fully customize your own scoring function by implementing the ScoreCalculator trait.
  • Trie-based dynamic inverted index.

  • Multi-field full-text indexing and searching.

  • Per-field score boosting.

  • Configurable tokenizer.

  • Free text queries with query expansion.

  • Fast allocation, but latent deletion: removed documents are only purged completely when the index is vacuumed.

  • WASM compatible
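
The "latent deletion" trade-off can be illustrated with a toy inverted index. This is only a sketch of the general idea (mark on removal, purge on vacuum); `ToyIndex` and all of its methods are hypothetical names and do not reflect how probly-search is implemented internally.

```rust
use std::collections::{HashMap, HashSet};

/// Toy inverted index (hypothetical; not probly-search's internals).
#[derive(Default)]
struct ToyIndex {
    postings: HashMap<String, Vec<usize>>, // term -> document keys
    removed: HashSet<usize>,               // keys marked as deleted
}

impl ToyIndex {
    /// Index every whitespace-separated, lowercased term of `text`.
    fn add_document(&mut self, key: usize, text: &str) {
        for term in text.split_whitespace() {
            let keys = self.postings.entry(term.to_lowercase()).or_default();
            if !keys.contains(&key) {
                keys.push(key);
            }
        }
    }

    /// Cheap: only records the key; the postings are left untouched.
    fn remove_document(&mut self, key: usize) {
        self.removed.insert(key);
    }

    /// Queries skip keys that are marked as removed.
    fn query(&self, term: &str) -> Vec<usize> {
        self.postings
            .get(&term.to_lowercase())
            .map(|keys| {
                keys.iter()
                    .copied()
                    .filter(|k| !self.removed.contains(k))
                    .collect()
            })
            .unwrap_or_default()
    }

    /// Expensive: physically drops removed keys and terms left empty.
    fn vacuum(&mut self) {
        let removed = std::mem::take(&mut self.removed);
        self.postings.retain(|_, keys| {
            keys.retain(|k| !removed.contains(k));
            !keys.is_empty()
        });
    }

    /// Total number of (term, key) entries currently stored.
    fn posting_count(&self) -> usize {
        self.postings.values().map(Vec::len).sum()
    }
}
```

In this sketch, `remove_document` makes a document disappear from query results immediately, while `posting_count` only shrinks after `vacuum` runs.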

Documentation

Adding, Removing and Searching documents

See Integration tests.

Use this library with WASM

See recipe search demo project

A basic example

Create an index over documents that have 2 fields, query it, then remove a document.

use std::borrow::Cow;
use probly_search::{
    index::Index,
    query::{
        score::default::{bm25, zero_to_one},
        QueryResult,
    },
};

// The document type used in this example
struct Doc {
    id: usize,
    title: String,
    description: String,
}

// A whitespace tokenizer
fn tokenizer(s: &str) -> Vec<Cow<str>> {
    s.split(' ').map(Cow::from).collect::<Vec<_>>()
}

// We have to provide extraction functions for the fields we want to index

// Title
fn title_extract(d: &Doc) -> Vec<&str> {
    vec![d.title.as_str()]
}

// Description
fn description_extract(d: &Doc) -> Vec<&str> {
    vec![d.description.as_str()]
}

// Create index with 2 fields
let mut index = Index::<usize>::new(2);

// Create docs from a custom Doc struct
let doc_1 = Doc {
    id: 0,
    title: "abc".to_string(),
    description: "dfg".to_string(),
};

let doc_2 = Doc {
    id: 1,
    title: "dfgh".to_string(),
    description: "abcd".to_string(),
};

// Add documents to index
index.add_document(
    &[title_extract, description_extract],
    tokenizer,
    doc_1.id,
    &doc_1,
);

index.add_document(
    &[title_extract, description_extract],
    tokenizer,
    doc_2.id,
    &doc_2,
);

// Search, expected 2 results
let mut result = index.query(
    &"abc",
    &mut bm25::new(),
    tokenizer,
    &[1., 1.],
);
assert_eq!(result.len(), 2);
assert_eq!(
    result[0],
    QueryResult {
        key: 0,
        score: 0.6931471805599453
    }
);
assert_eq!(
    result[1],
    QueryResult {
        key: 1,
        score: 0.28104699650060755
    }
);

// Remove a document from the index
index.remove_document(doc_1.id);

// Vacuum to purge the removed document completely
index.vacuum();

// Search, expect 1 result
result = index.query(
    &"abc",
    &mut bm25::new(),
    tokenizer,
    &[1., 1.],
);
assert_eq!(result.len(), 1);
assert_eq!(
    result[0],
    QueryResult {
        key: 1,
        score: 0.1166450426074421
    }
);
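
Because the tokenizer is an ordinary function, it can be swapped for a different one. The tokenizer in the example above splits on a single space, so repeated spaces produce empty tokens and case is preserved. A possible alternative is sketched below; `lowercase_tokenizer` is a hypothetical name, and it is only assumed that a function with this shape can be passed wherever `tokenizer` is used.

```rust
use std::borrow::Cow;

// Hypothetical alternative tokenizer: splits on any run of whitespace
// and lowercases tokens, allocating only when a token actually changes.
fn lowercase_tokenizer(s: &str) -> Vec<Cow<'_, str>> {
    s.split_whitespace()
        .map(|token| {
            if token.chars().any(char::is_uppercase) {
                // The token must be rewritten, so an owned String is needed.
                Cow::Owned(token.to_lowercase())
            } else {
                // Already lowercase: borrow directly from the input.
                Cow::Borrowed(token)
            }
        })
        .collect()
}
```

Returning `Cow` lets tokens that need no normalization borrow from the input, which keeps indexing allocation-light.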

See the source tests for the BM25 implementation and the zero-to-one implementation for more query examples.
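
The first score asserted in the example, 0.6931471805599453, equals ln 2. As a rough sanity check, the textbook Lucene-style BM25 term formula yields the same value when a term occurs once in a one-token field and appears in 1 of 2 documents. The sketch below is standalone; the function name `bm25_term_score` and the parameters k1 = 1.2 and b = 0.75 are assumptions (common defaults), and the library's exact computation may differ.

```rust
/// Textbook BM25 contribution of one term in one field.
/// A sketch under stated assumptions, not the library's code.
fn bm25_term_score(
    tf: f64,             // occurrences of the term in the field
    field_len: f64,      // number of tokens in the field
    avg_field_len: f64,  // average field length across documents
    doc_count: f64,      // total number of documents
    docs_with_term: f64, // documents containing the term
    k1: f64,             // term-frequency saturation (assumed 1.2)
    b: f64,              // length normalization (assumed 0.75)
) -> f64 {
    // Lucene-style inverse document frequency.
    let idf = (1.0 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5)).ln();
    // Length normalization relative to the average field length.
    let norm = 1.0 - b + b * field_len / avg_field_len;
    // Saturating term-frequency factor.
    let tf_factor = tf * (k1 + 1.0) / (tf + k1 * norm);
    idf * tf_factor
}
```

Calling `bm25_term_score(1.0, 1.0, 1.0, 2.0, 1.0, 1.2, 0.75)` gives ln 2, since the idf term is ln(1 + 1.5/1.5) and the term-frequency factor reduces to 1.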

Testing

Run all tests with

cargo test

Benchmark

Run all benchmarks with

cargo bench

License

MIT

Dependencies

~4MB
~71K SLoC