2 unstable releases
| 0.2.0 | Sep 3, 2025 |
|---|---|
| 0.1.0 | Dec 28, 2024 |
#899 in Algorithms
38KB
669 lines
DiskANN Implementation in Rust
A Rust implementation of DiskANN (Disk-based Approximate Nearest Neighbor search) using the Vamana graph algorithm. This project provides an efficient and scalable solution for large-scale vector similarity search with minimal memory footprint.
Overview
This implementation follows the DiskANN paper's approach by:
- Using the Vamana graph algorithm for index construction
- Memory-mapping the index file for efficient disk-based access
- Implementing beam search with medoid entry points
- Supporting both Euclidean distance and Cosine similarity
- Maintaining minimal memory footprint during search operations
Key Features
- Single-file storage: All index data stored in one memory-mapped file
- Vamana graph construction: Efficient graph building with α-pruning
- Memory-efficient search: Uses beam search that visits < 1% of vectors
- Distance metrics: Support for both Euclidean and Cosine similarity
- Medoid-based entry points: Smart starting points for search
- Parallel query processing: Using rayon for concurrent searches
- Minimal memory footprint: ~330MB RAM for 2GB index (16% of file size)
Usage
Building a New Index
use diskann_rs::{DiskANN, DistanceMetric};
// Your vectors to index
let vectors = vec![
vec![0.1, 0.2, 0.3, ...],
vec![0.4, 0.5, 0.6, ...],
// ...
];
let index = DiskANN::build_index(
&vectors,
32, // max neighbors per node (max_degree)
128, // beam width for construction
1.2, // alpha parameter for pruning
DistanceMetric::Euclidean,
"index.db"
)?;
Opening an Existing Index
let index = DiskANN::open_index("index.db")?;
Searching the Index
// Prepare your query vector
let query = vec![0.1, 0.2, ...; 128]; // must match index dimension
// Search for nearest neighbors
let k = 10; // number of neighbors to return
let beam_width = 64; // search beam width (higher = better recall, slower)
let neighbors = index.search(&query, k, beam_width);
// neighbors contains the IDs of the k nearest vectors
Parallel Search
use rayon::prelude::*;
use std::sync::Arc;
// Create shared index reference
let index = Arc::new(index);
// Perform parallel queries
let results: Vec<Vec<u32>> = query_batch
.par_iter()
.map(|query| index.search(query, k, beam_width))
.collect();
Algorithm Details
Vamana Graph Construction
- Initialize with random graph connectivity
- Iteratively improve edges using greedy search
- Apply α-pruning to maintain diversity in neighbors
- Ensure graph remains undirected
Beam Search Algorithm
- Start from medoid (vector closest to dataset centroid)
- Maintain beam of promising candidates
- Explore neighbors of best candidates
- Terminate when no improvement found
- Return top-k results
Memory Management
- Vectors: Memory-mapped, loaded on-demand during distance calculations
- Graph structure: Adjacency lists stored contiguously in file
- Search memory: Only beam_width vectors in memory at once
- Typical usage: 10-100MB RAM for billion-scale indices
Performance Characteristics
- Index Build Time: O(n * max_degree * beam_width)
- Search Time: O(beam_width * log n) - typically visits < 1% of dataset
- Memory Usage: O(beam_width) during search
- Disk Space: n * (dimension * 4 + max_degree * 4) bytes
- Query Throughput: Scales linearly with CPU cores
Parameters Tuning
Build Parameters
max_degree: 32-64 for most datasetsbuild_beam_width: 128-256 for good graph qualityalpha: 1.2-2.0 (higher = more diverse neighbors)
Search Parameters
beam_width: 32-128 (trade-off between speed and recall)- Higher beam_width = better recall but slower search
Building and Testing
# Build the library
cargo build --release
# Run tests
cargo test
# Run demo
cargo run --release --example demo
# Run performance test
cargo run --release --example perf_test
Examples
See the examples/ directory for:
demo.rs: Demo with 100k vectorsperf_test.rs: Performance benchmarking with 1M vectors
Current Implementation
✅ Completed Features:
- Single-file storage format with memory mapping
- Vamana graph construction with α-pruning
- Beam search with medoid entry points
- Multiple distance metrics (Euclidean, Cosine)
- Parallel query processing
- Comprehensive test suite
- Memory-efficient design (< 100MB for large indices)
Future Improvements
- Support for incremental index updates
- Additional distance metrics (Manhattan, Hamming)
- Compressed vector storage
- Distributed index support
- GPU acceleration for distance calculations
- Auto-tuning of parameters
Contributing
Contributions are welcome! Please feel free to:
- Open issues for bugs or feature requests
- Submit PRs for improvements
- Share performance benchmarks
- Suggest optimizations
License
This project is licensed under the MIT License - see the LICENSE file for details.
References
- DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
- Microsoft DiskANN Repository
Acknowledgments
This implementation is based on the DiskANN paper and the official Microsoft implementation, adapted for Rust with focus on simplicity and memory efficiency.
Dependencies
~2.2–3MB
~63K SLoC