anda_db_tfs

A full-text search library using the BM25 ranking algorithm in Rust

4 releases (2 breaking)

Uses new Rust 2024

new 0.4.0 Apr 24, 2025
0.2.1 Apr 6, 2025
0.2.0 Apr 5, 2025
0.1.0 Apr 1, 2025

Used in anda_db

MIT license

80KB
1.5K SLoC

Anda-DB BM25 Full-Text Search Library

anda_db_tfs is a full-text search library implementing the BM25 ranking algorithm in Rust. BM25 (Best Matching 25) is a ranking function used by search engines to estimate the relevance of segments to a given search query. It's an extension of the TF-IDF model.
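For reference, the textbook form of the BM25 score of a segment S for a query Q is given below. This is the standard formulation with one common IDF variant; the exact variant implemented by anda_db_tfs is not stated in this README and may differ. Here f(t, S) is the number of occurrences of term t in S, |S| is the segment length in tokens, avgsl is the average segment length in the index, N is the number of segments, and n(t) is the number of segments containing t.

```latex
\mathrm{score}(S, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot
  \frac{f(t, S)\,(k_1 + 1)}
       {f(t, S) + k_1 \left(1 - b + b \cdot \frac{|S|}{\mathrm{avgsl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)
```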

Features

  • High Performance: Optimized for speed with parallel processing using Rayon.
  • Customizable Tokenization: Support for various tokenizers including Chinese text via jieba.
  • BM25 Ranking: Industry-standard relevance scoring algorithm.
  • Serialization: Save and load indices in CBOR format with optional compression.
  • Incremental Persistence: Supports persisting incremental index updates (insertions and deletions).
  • Thread-safe Concurrent Access: Safely use the index from multiple threads.

Installation

Add this to your Cargo.toml:

[dependencies]
anda_db_tfs = "0.4"

For full features including tantivy tokenizers and jieba support:

[dependencies]
anda_db_tfs = { version = "0.4", features = ["full"] }

Quick Start

use anda_db_tfs::{BM25Index, SimpleTokenizer};
use std::io::{Read, Write}; // needed for write_all and read_to_end below
use tokio_util::compat::{TokioAsyncReadCompatExt, TokioAsyncWriteCompatExt};

// Create a new index with a simple tokenizer
let index = BM25Index::new("my_bm25_index".to_string(), SimpleTokenizer::default(), None);

// Add segments to the index
index.insert(1, "The quick brown fox jumps over the lazy dog", now_ms).unwrap();
index.insert(2, "A fast brown fox runs past the lazy dog", now_ms).unwrap();
index.insert(3, "The lazy dog sleeps all day", now_ms).unwrap();

// Search for segments containing "fox"
let results = index.search("fox", 10);
for (seg_id, score) in results {
    println!("Segment {}: score {}", seg_id, score);
}

// Remove a segment
index.remove(3, "The lazy dog sleeps all day", now_ms);

// Store the index
{
    let metadata = std::fs::File::create("tfs_demo/metadata.cbor")?;
    index
        .flush(
            metadata,
            0,
            async |id, data| {
                let mut node = std::fs::File::create(format!("tfs_demo/seg_{id}.cbor"))?;
                node.write_all(data)?;
                Ok(true)
            },
            async |id, data| {
                let mut node =
                    std::fs::File::create(format!("tfs_demo/posting_{id}.cbor"))?;
                node.write_all(data)?;
                Ok(true)
            },
        )
        .await?;
}

// Load the index from a file
let metadata = std::fs::File::open("tfs_demo/metadata.cbor")?;
let loaded_index = BM25Index::load_all(
    SimpleTokenizer::default(), // same tokenizer the index was built with
    metadata,
    async |id| {
        let mut node = std::fs::File::open(format!("tfs_demo/seg_{id}.cbor"))?;
        let mut buf = Vec::new();
        node.read_to_end(&mut buf)?;
        Ok(Some(buf))
    },
    async |id| {
        let mut node = std::fs::File::open(format!("tfs_demo/posting_{id}.cbor"))?;
        let mut buf = Vec::new();
        node.read_to_end(&mut buf)?;
        Ok(Some(buf))
    },
)
.await?;
println!("Loaded index with {} documents", loaded_index.len());
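The snippets above pass `now_ms` without defining it. A minimal sketch of one way to produce it, assuming the argument is a Unix timestamp in milliseconds held in a `u64` (check the crate documentation for the exact type). The helper name `unix_ms` is hypothetical and not part of anda_db_tfs.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Returns the current Unix time in milliseconds.
/// Assumption: the `now_ms` argument of `insert` and `remove` is a
/// millisecond timestamp that fits in a `u64`.
fn unix_ms() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the Unix epoch")
        .as_millis() as u64
}
```

With this in scope, `let now_ms = unix_ms();` before the first `insert` call supplies the value.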

Chinese Text Support

With the tantivy-jieba feature enabled, you can use the jieba tokenizer for Chinese text:

use anda_db_tfs::{BM25Index, jieba_tokenizer};

// Create an index with jieba tokenizer
let index = BM25Index::new("my_bm25_index".to_string(), jieba_tokenizer(), None);

// Add segments with Chinese text
index.insert(1, "Rust 是一种系统编程语言", now_ms).unwrap();
index.insert(2, "Rust 快速且内存高效,安全、并发、实用", now_ms).unwrap();

// Search for segments
let results = index.search("安全", 10);

Advanced Usage

Custom Tokenizer and BM25 Parameters

use anda_db_tfs::{BM25Config, BM25Index, TokenizerChain}; // TokenizerChain is assumed to be exported by this crate
use tantivy::tokenizer::{LowerCaser, RemoveLongFilter, SimpleTokenizer, Stemmer};

// Create an index with custom BM25 parameters
let params = BM25Config { k1: 1.5, b: 0.75 };
let index_name = "my_custom_index".to_string();
let tokenizer = TokenizerChain::builder(SimpleTokenizer::default())
  .filter(RemoveLongFilter::limit(32))
  .filter(LowerCaser)
  .filter(Stemmer::default())
  .build();
let index = BM25Index::new(index_name, tokenizer, Some(params));
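To make the effect of such a filter chain concrete, here is a standard-library-only sketch of its first stages: splitting, a length filter, and lowercasing. It is not tantivy's implementation. The function `analyze` and its `max_len` parameter are hypothetical, splitting on non-alphanumeric characters only approximates `SimpleTokenizer`, the exact length cutoff of `RemoveLongFilter` may differ, and stemming is omitted.

```rust
/// Hypothetical stand-in for the first stages of a tokenizer chain:
/// split on non-alphanumeric characters, drop tokens longer than
/// `max_len` bytes, then lowercase. Stemming is intentionally omitted.
fn analyze(text: &str, max_len: usize) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        // Splitting on punctuation leaves empty pieces; skip them.
        .filter(|tok| !tok.is_empty() && tok.len() <= max_len)
        .map(|tok| tok.to_lowercase())
        .collect()
}
```

The order matters in the same way as in the chain above: the length filter runs on the raw token before it is lowercased.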

API Documentation

👉 https://docs.rs/anda_db_tfs

BM25Config

Parameters for the BM25 ranking algorithm.

pub struct BM25Config {
    // Controls term frequency saturation
    pub k1: f32,
    // Controls segment length normalization
    pub b: f32,
}

Default values: k1 = 1.2, b = 0.75
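The sketch below shows how the two parameters act in the textbook BM25 term weight (the per-term factor without IDF). It illustrates the standard formula and is not the library's internal code; `bm25_term_weight` and its arguments are hypothetical names.

```rust
/// Textbook BM25 weight of one term in one segment, IDF factor excluded.
/// `tf` is the number of occurrences of the term in the segment;
/// `seg_len` and `avg_len` are token counts.
/// Hypothetical helper, not part of anda_db_tfs.
fn bm25_term_weight(tf: f32, seg_len: f32, avg_len: f32, k1: f32, b: f32) -> f32 {
    // b = 0 disables length normalization; b = 1 applies it fully.
    let norm = 1.0 - b + b * (seg_len / avg_len);
    // The weight grows with tf but never reaches k1 + 1 (saturation).
    tf * (k1 + 1.0) / (tf + k1 * norm)
}
```

With the default values, a term that occurs once in a segment of average length has weight 1, and the weight approaches k1 + 1 = 2.2 as the term frequency grows. A segment longer than average lowers the weight for the same term frequency.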

Error Handling

The library uses a custom error type BM25Error for various error conditions:

  • BM25Error::Generic: Index-related errors.
  • BM25Error::Serialization: CBOR serialization or deserialization errors.
  • BM25Error::NotFound: A token was not found.
  • BM25Error::AlreadyExists: A segment with the given ID already exists.
  • BM25Error::TokenizeFailed: Tokenization produced no tokens for a segment.

Performance Considerations

  • For large segments, the library automatically uses parallel processing for tokenization.
  • The search function uses parallel processing for query terms.
  • For best performance with large indices, consider using SSD storage for serialized indices.
  • Memory usage scales with the number of segments and unique terms.
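As an illustration of processing query terms in parallel, the sketch below looks up each term on its own thread and then sums the per-segment scores. It uses `std::thread::scope` instead of Rayon so that it has no dependencies, and one thread per term is only sensible for a demonstration. The `postings` layout (term mapped to precomputed per-segment scores) and the name `search_parallel` are assumptions made for this example, not the library's data structures.

```rust
use std::collections::HashMap;
use std::thread;

/// Looks up each query term on its own scoped thread, then sums the
/// per-segment scores. `postings` maps a term to a list of
/// (segment id, precomputed term score) pairs.
/// Hypothetical sketch, not the anda_db_tfs implementation.
fn search_parallel(
    postings: &HashMap<String, Vec<(u64, f32)>>,
    terms: &[&str],
) -> HashMap<u64, f32> {
    let partials: Vec<Vec<(u64, f32)>> = thread::scope(|s| {
        let handles: Vec<_> = terms
            .iter()
            .map(|term| s.spawn(move || postings.get(*term).cloned().unwrap_or_default()))
            .collect();
        // Join in term order so the merge below is deterministic.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    let mut totals = HashMap::new();
    for (id, score) in partials.into_iter().flatten() {
        *totals.entry(id).or_insert(0.0) += score;
    }
    totals
}
```

Terms that are missing from `postings` contribute nothing, so an unknown query term does not cause an error in this sketch.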

License

Copyright © 2025 LDC Labs.

ldclabs/anda-db is licensed under the MIT License. See LICENSE for the full license text.
