#single-file #index #vector #search-query #parallel #similarity #processing

diskann-rs

A Rust implementation of DiskANN (Disk-based Approximate Nearest Neighbor search) featuring a 3-layer index architecture and parallel query processing. This project provides an efficient and scalable solution for large-scale vector similarity search with single-file storage.

1 unstable release

0.1.0 Dec 28, 2024

#211 in Concurrency

Download history 103/week @ 2024-12-24 2/week @ 2024-12-31

105 downloads per month

MIT license

39KB
724 lines

DiskANN Implementation in Rust

Rust

A Rust implementation of DiskANN (Disk-based Approximate Nearest Neighbor search) featuring a 3-layer index architecture and parallel query processing. This project provides an efficient and scalable solution for large-scale vector similarity search with single-file storage.

Overview

This implementation provides a memory-efficient approach to similarity search by:

  • Using a 3-layer hierarchical index structure for faster search
  • Storing all data in a single file using memory mapping
  • Supporting both Euclidean distance and Cosine similarity
  • Managing adjacency lists for graph-based search
  • Implementing parallel query processing
  • Supporting large-scale datasets that don't fit in RAM

Features

  • Three-layer hierarchical index structure:
    • Top layer (L0): Smallest, most selective layer
    • Middle layer (L1): Intermediate connectivity
    • Base layer (L2): Complete dataset
  • Single-file storage format for simplified deployment
  • Choice of distance metrics:
    • Euclidean distance
    • Cosine similarity
  • Cluster-based graph construction for meaningful adjacency
  • Parallel query processing using rayon
  • Memory-mapped file access for handling large datasets
  • Comprehensive error handling with custom error types

Usage

Building a New Index

use diskannrs::{SingleFileDiskANN, DistanceMetric};

let index = SingleFileDiskANN::build_index_singlefile(
    1_000_000,        // number of vectors
    128,              // dimension
    32,               // max neighbors per node
    0.01,             // fraction of vectors in top layer
    0.1,              // fraction of vectors in middle layer
    DistanceMetric::Euclidean,  // or DistanceMetric::Cosine
    "index.db"        // single file to store everything
)?;

Opening an Existing Index

let index = SingleFileDiskANN::open_index_singlefile("index.db")?;

Searching the Index

// Prepare your query vector
let query = vec![0.1, 0.2, ...; 128];  // must match index dimension

// Search for nearest neighbors
let k = 10;  // number of neighbors to return
let beam_width = 64;  // search beam width
let neighbors = index.search(&query, k, beam_width);
use rayon::prelude::*;
use std::sync::Arc;

// Create shared index reference
let index = Arc::new(index);

// Perform parallel queries
let results: Vec<Vec<u32>> = query_batch
    .par_iter()
    .map(|query| index.search(query, k, beam_width))
    .collect();

Performance Characteristics

  • Memory Usage: O(1) for vector storage due to memory mapping
  • Disk Space: Single file containing:
    • Vectors: num_vectors * dimension * 4 bytes
    • Adjacency Lists: Varies by layer size and max_degree
    • Metadata: Small overhead
  • Search Time: Logarithmic due to hierarchical structure
  • Parallel Processing: Scales with available CPU cores

Building and Testing

# Build the library
cargo build --release

# Run tests
cargo test

# Run with example
cargo run --release --example simple_search

Current Status

This implementation features:

  • Single-file storage format
  • 3-layer hierarchical index structure
  • Multiple distance metrics support
  • Cluster-based graph construction
  • Parallel query processing
  • Memory-mapped I/O
  • Comprehensive test suite

Future Improvements

  1. Add more distance metrics
  2. Implement dynamic index updates
  3. Add parameter auto-tuning
  4. Expand benchmarking suite
  5. Add more examples
  6. Improve documentation

Contributing

Contributions are welcome! Please feel free to:

  • Open issues for bugs or feature requests
  • Submit PRs for improvements
  • Share ideas for optimization

License

This project is licensed under the MIT License - see the LICENSE file for details.

References

Dependencies

~2.2–3MB
~62K SLoC