#search #query #index #bm25

probly-search

A lightweight and thread-safe, full-text search engine with a fully customizable scoring function

5 releases (3 stable)

1.1.1 Jun 11, 2021
1.1.0 Jun 10, 2021
1.0.0 Jun 9, 2021
0.1.1 Jun 6, 2021
0.1.0 Jun 6, 2021

#144 in Text processing

22 downloads per month

MIT license

64KB
1.5K SLoC

probly-search

A lightweight and thread-safe, full-text search library that provides full control over the scoring calculations.

This library started as a port of the Node library NDX.

Features

  • Three ways to do scoring

    • BM25 ranking function to rank matching documents. This is the same ranking function that Lucene uses by default from version 6.0.0 onward.
    • zero-to-one, a scoring function unique to this library that produces a normalized score between 0 and 1. It is well suited to matching titles or labels against queries.
    • Ability to fully customize your own scoring function by implementing the ScoreCalculator trait.
  • Trie-based dynamic inverted index.

  • Small memory footprint, optimized for mobile devices.

  • Multiple fields full-text indexing and searching.

  • Per-field score boosting.

  • Configurable tokenizer and term filter.

  • Free text queries with query expansion.
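
To make the BM25 bullet above concrete, here is a minimal sketch of the classic BM25 per-term score with the non-negative IDF variant that Lucene uses. It is an illustration only, not this crate's implementation: `Bm25Params` and `bm25_term_score` are hypothetical names, and the defaults `k1 = 1.2` and `b = 0.75` are the commonly used values, assumed here.

```rust
/// Hypothetical parameter bundle for the sketch; not a type from probly-search.
#[derive(Debug, Clone, Copy)]
pub struct Bm25Params {
    /// Term-frequency saturation.
    pub k1: f64,
    /// Strength of field-length normalization (0 disables it).
    pub b: f64,
}

impl Default for Bm25Params {
    fn default() -> Self {
        // Commonly used defaults, assumed for this sketch.
        Bm25Params { k1: 1.2, b: 0.75 }
    }
}

/// Score contribution of one query term for one field of one document.
pub fn bm25_term_score(
    params: Bm25Params,
    term_freq: f64,     // occurrences of the term in the field
    field_len: f64,     // number of terms in the field
    avg_field_len: f64, // average field length over all documents
    doc_count: f64,     // total number of documents
    doc_freq: f64,      // number of documents containing the term
) -> f64 {
    // IDF with the "+1" inside the logarithm, so it is never negative.
    let idf = (1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5)).ln();
    // Fields longer than average are penalized, shorter ones are favored.
    let length_norm = 1.0 - params.b + params.b * field_len / avg_field_len;
    // Saturating term-frequency component.
    let tf = term_freq * (params.k1 + 1.0) / (term_freq + params.k1 * length_norm);
    idf * tf
}
```

For a term that occurs once in an average-length field and appears in one of two documents, the length normalization and term-frequency factors are both 1, so the score reduces to the IDF, ln 2 ≈ 0.693.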

Documentation

Documentation is under development. For now, read the source tests.

Example

Create an index for documents with two fields, add two documents, and query them.

use std::collections::HashSet;
use probly_search::{
    index::{add_document_to_index, create_index, remove_document_from_index, Index},
    query::{
        query,
        score::default::{bm25, zero_to_one},
        QueryResult,
    },
};


// Create index with two fields
let mut idx: Index<usize> = create_index(2);

// Create docs from a custom Doc struct.
// Clone is derived because doc_1 is cloned when it is added to the index below.
#[derive(Clone)]
struct Doc {
    id: usize,
    title: String,
    description: String,
}

let doc_1 = Doc {
    id: 0,
    title: "abc".to_string(),
    description: "dfg".to_string(),
};

let doc_2 = Doc {
    id: 1,
    title: "dfgh".to_string(),
    description: "abcd".to_string(),
};

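// Tokenizer: split the field text on single spaces
fn tokenizer(s: &str) -> Vec<String> {
    s.split(' ')
        .map(|slice| slice.to_owned())
        .collect::<Vec<String>>()
}

// Field accessors, one per indexed field
fn title_extract(d: &Doc) -> Option<&str> {
    Some(d.title.as_str())
}

fn description_extract(d: &Doc) -> Option<&str> {
    Some(d.description.as_str())
}

// Term filter: applied to every token; here it returns the term unchanged
fn filter(s: &String) -> String {
    s.to_owned()
}

// Add documents to index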
add_document_to_index(
    &mut idx,
    &[title_extract, description_extract],
    tokenizer,
    filter,
    doc_1.id,
    doc_1.clone(),
);

add_document_to_index(
    &mut idx,
    &[title_extract, description_extract],
    tokenizer,
    filter,
    doc_2.id,
    doc_2,
);

// Search, expect 2 results
let mut result = query(
    &mut idx,
    &"abc",
    &mut bm25::new(),
    tokenizer,
    filter,
    &[1., 1.],
    None,
);
assert_eq!(result.len(), 2);
assert_eq!(
    result[0],
    QueryResult {
        key: 0,
        score: 0.6931471805599453
    }
);
assert_eq!(
    result[1],
    QueryResult {
        key: 1,
        score: 0.28104699650060755
    }
);
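
The second result has no term equal to the query `abc`; the description of doc_2 is `abcd`. That is consistent with the query expansion listed under Features, where a query term also matches indexed terms that extend it. The sketch below shows one way prefix expansion over a trie could work. It is an assumption-laden illustration, not the crate's index: `TrieNode`, `insert`, and `expand` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Hypothetical trie node: children keyed by character, plus the keys of
/// the documents that contain the term ending at this node.
#[derive(Default)]
pub struct TrieNode {
    children: BTreeMap<char, TrieNode>,
    doc_keys: Vec<usize>,
}

impl TrieNode {
    /// Record that `term` occurs in the document identified by `key`.
    pub fn insert(&mut self, term: &str, key: usize) {
        let mut node = self;
        for ch in term.chars() {
            node = node.children.entry(ch).or_default();
        }
        if !node.doc_keys.contains(&key) {
            node.doc_keys.push(key);
        }
    }

    /// Return every indexed term that starts with `prefix`, in
    /// lexicographic order, paired with the documents that contain it.
    pub fn expand(&self, prefix: &str) -> Vec<(String, Vec<usize>)> {
        // Walk down to the node that represents the prefix.
        let mut node = self;
        for ch in prefix.chars() {
            match node.children.get(&ch) {
                Some(next) => node = next,
                None => return Vec::new(),
            }
        }
        // Depth-first traversal of the subtree below the prefix.
        let mut out = Vec::new();
        let mut stack = vec![(prefix.to_string(), node)];
        while let Some((term, current)) = stack.pop() {
            if !current.doc_keys.is_empty() {
                out.push((term.clone(), current.doc_keys.clone()));
            }
            // Push children in reverse so the smallest character pops first.
            for (ch, child) in current.children.iter().rev() {
                let mut next = term.clone();
                next.push(*ch);
                stack.push((next, child));
            }
        }
        out
    }
}
```

With the four terms from the example indexed this way, expanding `abc` yields `abc` (document 0) followed by `abcd` (document 1). A real engine would typically weight expanded terms lower than exact matches.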

See the source tests for the BM25 and zero-to-one implementations for more query examples.
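
The `tokenizer` and `filter` in the example are plain functions, so they can be swapped for stricter ones. Splitting on `' '` produces empty tokens when the text contains repeated spaces, and the identity filter keeps case and punctuation. A minimal sketch of alternatives follows, assuming the same function signatures as in the example; `whitespace_tokenizer` and `normalizing_filter` are hypothetical names, not part of the crate.

```rust
/// Split on any run of whitespace; never yields empty tokens.
fn whitespace_tokenizer(s: &str) -> Vec<String> {
    s.split_whitespace().map(str::to_owned).collect()
}

/// Strip leading and trailing non-alphanumeric characters, then lowercase.
/// Takes `&String` only to mirror the filter signature used in the example.
fn normalizing_filter(s: &String) -> String {
    s.trim_matches(|c: char| !c.is_alphanumeric())
        .to_lowercase()
}
```

The same tokenizer and filter should be used for both indexing and querying, so that query terms are normalized the same way as indexed terms.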

License

MIT

No runtime deps