4 releases
| Version | Released |
|---|---|
| 0.4.0 | Jan 30, 2026 |
| 0.3.3 | Jan 2, 2026 |
| 0.3.2 | Jan 1, 2026 |
| 0.3.1 | Jan 1, 2026 |
#640 in Text processing
497 downloads per month
245KB, 4.5K SLoC
NanoFTS
A high-performance full-text search engine with a Rust core, featuring efficient indexing and searching for both English and Chinese text.
Features
- High Performance: Rust-powered core with sub-millisecond search latency
- LSM-Tree Architecture: Scalable to billions of documents
- Incremental Updates: Real-time document add/update/delete
- Fuzzy Search: Intelligent fuzzy matching with configurable thresholds
- Full CRUD: Complete document management operations
- Result Handle: Zero-copy result with set operations (AND/OR/NOT)
- NumPy Support: Direct numpy array output
- Multilingual: Support for both English and Chinese text
- Persistence: Disk-based storage with WAL recovery
- LRU Cache: Built-in caching for frequently accessed terms
- Data Import: Import from pandas, polars, arrow, parquet, CSV, JSON
Installation
pip install nanofts
Quick Start
from nanofts import create_engine
# Create a search engine
engine = create_engine(
index_file="./index.nfts",
track_doc_terms=True, # Enable update/delete operations
)
# Add documents (field values must be strings)
engine.add_document(1, {"title": "Python教程", "content": "学习Python编程"})
engine.add_document(2, {"title": "数据分析", "content": "使用pandas进行数据处理"})
engine.flush()
# Search - returns ResultHandle object
result = engine.search("Python")
print(f"Found {result.total_hits} documents")
print(f"Document IDs: {result.to_list()}")
# Update document
engine.update_document(1, {"title": "高级Python教程", "content": "深入学习Python"})
# Delete document
engine.remove_document(2)
# Compact to persist deletions
engine.compact()
Rust Usage (Rust Core)
The Rust crate name is nanofts (minimum Rust version: rustc >= 1.75). If you are building a Rust service, you can use it directly as a pure Rust full-text search library.
Add as a dependency
Add this to your project Cargo.toml:
[dependencies]
nanofts = "0.4.0"
Optional features:
- mimalloc: enabled by default; lower latency / more stable allocation performance
- python: enable PyO3/Numpy bindings (only needed if you build the Python extension)
- simd: enable SIMD acceleration (requires nightly and packed_simd_2)
Minimal example: in-memory indexing and searching
use nanofts::{UnifiedEngine, EngineConfig, EngineResult};
use std::collections::HashMap;
fn main() -> EngineResult<()> {
// 1) Create an in-memory engine
let engine = UnifiedEngine::new(EngineConfig::memory_only())?;
// 2) Add a document (field values must be String)
let mut fields = HashMap::new();
fields.insert("title".to_string(), "Rust Tutorial".to_string());
fields.insert("content".to_string(), "Build a high-performance full-text search engine in Rust".to_string());
engine.add_document(1, fields)?;
// 3) Search
let result = engine.search("Rust")?;
println!("hits={}, ids={:?}", result.total_hits(), result.to_list());
Ok(())
}
Persistence: single-file index + WAL recovery
use nanofts::{UnifiedEngine, EngineConfig, EngineResult};
fn main() -> EngineResult<()> {
let config = EngineConfig::persistent("./index.nfts")
.with_lazy_load(true)
.with_cache_size(10_000);
let engine = UnifiedEngine::new(config)?;
// ... add/update/remove ...
// Flush new documents to disk
engine.flush()?;
// Deletions become permanent only after compaction
engine.compact()?;
Ok(())
}
Run the built-in Rust example in this repo
cargo run --example basic_usage --release
Performance Tuning (Rust Developer Perspective)
Build and runtime knobs
- Use release builds: cargo build --release / cargo run --release (this repo already configures lto=fat, codegen-units=1, panic=abort, strip=true for release).
- Optimize for your CPU (optional): set RUSTFLAGS="-C target-cpu=native" when building/running on a specific machine.
- SIMD (optional): if you enable --features simd, use nightly and validate the benefit for your workload.
Fastest ingestion formats and APIs
- Prefer batch ingestion: it reduces per-document overhead and lets the engine use its optimized parallel paths.
- Fastest Rust API: UnifiedEngine::add_documents_texts(doc_ids, texts) is the fastest ingestion path when you can pre-concatenate all searchable fields into a single String per document.
- Columnar ingestion: UnifiedEngine::add_documents_columnar(doc_ids, columns) avoids constructing a HashMap per document and is a good fit for Arrow/DataFrame-style input.
- Arrow zero-copy ingestion: if your data is already in Arrow (or can be represented as borrowed &str slices), use UnifiedEngine::add_documents_arrow_str(doc_ids, columns) (multi-column) or UnifiedEngine::add_documents_arrow_texts(doc_ids, texts) (single merged text column) to avoid String allocation/copy.
- Batch HashMap ingestion: UnifiedEngine::add_documents(docs) is still much faster than calling add_document in a loop.
Arrow Zero-Copy API Examples
Multi-column zero-copy ingestion
use nanofts::{UnifiedEngine, EngineConfig};
let engine = UnifiedEngine::new(EngineConfig::memory_only())?;
// Simulate Arrow StringArray data (in real use, extract from Arrow)
let doc_ids = vec![1, 2, 3];
let titles = vec!["Title 1", "Title 2", "Title 3"];
let contents = vec!["Content 1", "Content 2", "Content 3"];
// Zero-copy columnar ingestion
let columns = vec![
("title".to_string(), titles),
("content".to_string(), contents),
];
engine.add_documents_arrow_str(&doc_ids, columns)?;
Single-column zero-copy ingestion (fastest for Arrow)
// Pre-merged text from Arrow (single column)
let doc_ids = vec![1, 2, 3];
let merged_texts = vec![
"Title 1 Content 1",
"Title 2 Content 2",
"Title 3 Content 3",
];
// Zero-copy single column ingestion
engine.add_documents_arrow_texts(&doc_ids, &merged_texts)?;
Real Arrow StringArray integration
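// Example with real Arrow StringArray (reuses `engine` from the examples above)
use arrow_array::StringArray;
// One document ID per row of the arrays below
let doc_ids = vec![1, 2, 3];
let title_array = StringArray::from(vec!["Title 1", "Title 2", "Title 3"]);
let content_array = StringArray::from(vec!["Content 1", "Content 2", "Content 3"]);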
// Extract zero-copy string slices from Arrow
let title_slices: Vec<&str> = title_array.iter()
.map(|s| s.unwrap_or(""))
.collect();
let content_slices: Vec<&str> = content_array.iter()
.map(|s| s.unwrap_or(""))
.collect();
let columns = vec![
("title".to_string(), title_slices),
("content".to_string(), content_slices),
];
engine.add_documents_arrow_str(&doc_ids, columns)?;
Flush/compact strategy
- flush() frequency: flushing periodically bounds WAL/memory usage, but flushing too often may increase IO amplification.
- Deletion persistence: deletes/updates are logical until compact().
  - If you delete a lot, compact in bigger batches rather than after every small delete wave.
- Track doc terms only when you need updates/deletes: enable it only if you need update/delete support (Python: track_doc_terms=True). It adds extra bookkeeping on ingestion.
Large indexes and memory footprint
- Use lazy_load when the index is large and you don't want to map everything into memory: with_lazy_load(true) / Python lazy_load=True.
- Tune cache_size: in lazy_load mode, cache hit rate is a major driver for latency. Iterate using engine.stats() (e.g., cache hit rate).
Query-side optimization
- Use boolean/batch APIs and set operations: prefer search_and / search_or or ResultHandle::{intersect, union, difference} to avoid repeated work.
- Fuzzy search is more expensive: fuzzy_search introduces extra candidate generation and edit-distance checks. Use it only when needed and tune thresholds/distances.
Benchmarking and profiling
- Benchmarks: use cargo bench (or your own fixed dataset) and compare A/B with realistic data scale, term distribution, and query sets.
- CPU profiling: profile release binaries to find hot spots (tokenization, bitmap ops, IO, compression/decompression). On macOS, Instruments is usually the easiest.
- Measure first: use engine.stats() to track search counts, cumulative time, and cache hit rate before tuning.
API Reference
Creating Engine
from nanofts import create_engine
engine = create_engine(
index_file="./index.nfts", # Index file path (empty string for memory-only)
max_chinese_length=4, # Max Chinese n-gram length
min_term_length=2, # Minimum term length to index
fuzzy_threshold=0.7, # Fuzzy search similarity threshold (0.0-1.0)
fuzzy_max_distance=2, # Maximum edit distance for fuzzy search
track_doc_terms=False, # Enable for update/delete support
drop_if_exists=False, # Drop existing index on creation
lazy_load=False, # Lazy load mode (memory efficient)
cache_size=10000, # LRU cache size for lazy load mode
)
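For intuition about how fuzzy_threshold and fuzzy_max_distance might interact, the sketch below computes a Levenshtein edit distance and a normalized similarity in plain Python. It is an illustration only: this README does not say which metric NanoFTS uses internally, so the helpers levenshtein, similarity, and is_fuzzy_match, as well as the formula 1 - distance / max(len), are assumptions and not part of the nanofts API.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from `a`
                curr[j - 1] + 1,     # insert a character into `a`
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, lower as edits grow."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_fuzzy_match(query: str, term: str,
                   threshold: float = 0.7, max_distance: int = 2) -> bool:
    """Hypothetical acceptance rule that applies both knobs together."""
    return (levenshtein(query, term) <= max_distance
            and similarity(query, term) >= threshold)


print(is_fuzzy_match("pythn", "python"))
print(is_fuzzy_match("java", "rust"))
```

Under these assumptions, a stricter threshold or a smaller max_distance shrinks the candidate set, which is the trade-off to keep in mind when tuning the two parameters.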
Document Operations
# Add single document
engine.add_document(doc_id=1, fields={"title": "Hello", "content": "World"})
# Add multiple documents
docs = [
(1, {"title": "Doc 1", "content": "Content 1"}),
(2, {"title": "Doc 2", "content": "Content 2"}),
]
engine.add_documents(docs)
# Update document (requires track_doc_terms=True)
engine.update_document(1, {"title": "Updated", "content": "New content"})
# Delete single document
engine.remove_document(1)
# Delete multiple documents
engine.remove_documents([1, 2, 3])
# Flush buffer to disk
engine.flush()
# Compact index (applies deletions permanently)
engine.compact()
Search Operations
# Basic search - returns ResultHandle
result = engine.search("python programming")
# Get results
doc_ids = result.to_list() # List[int]
doc_ids = result.to_numpy() # numpy array
top_10 = result.top(10) # Top N results
page_2 = result.page(page=2, size=10) # Pagination
# Result properties
print(result.total_hits) # Total match count
print(result.is_empty) # Check if empty
print(1 in result) # Check if doc_id in results
# Fuzzy search (for typo tolerance)
result = engine.fuzzy_search("pythn", min_results=5)
print(result.fuzzy_used) # True if fuzzy matching was applied
# Batch search
results = engine.search_batch(["python", "rust", "java"])
# AND search (intersection)
result = engine.search_and(["python", "tutorial"])
# OR search (union)
result = engine.search_or(["python", "rust"])
# Filter by document IDs
result = engine.filter_by_ids([1, 2, 3, 4, 5])
# Exclude specific IDs
result = engine.exclude_ids([1, 2])
Result Set Operations
# Search for different terms
python_docs = engine.search("python")
rust_docs = engine.search("rust")
# Intersection (AND)
both = python_docs.intersect(rust_docs)
# Union (OR)
either = python_docs.union(rust_docs)
# Difference (NOT)
python_only = python_docs.difference(rust_docs)
# Chained operations
result = engine.search("python").intersect(
engine.search("tutorial")
).difference(
engine.search("beginner")
)
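The semantics of these operations can be modeled with ordinary Python sets of document IDs. The sketch below does not use ResultHandle at all; the document IDs are made up, and it only shows which IDs each operation would be expected to keep.

```python
# Made-up document IDs standing in for three search results
python_docs = {1, 2, 3, 5}
rust_docs = {2, 3, 4}
tutorial_docs = {2, 3, 5}
beginner_docs = {3}

both = python_docs & rust_docs          # intersect: IDs in both results
either = python_docs | rust_docs        # union: IDs in at least one result
python_only = python_docs - rust_docs   # difference: IDs only in the first

# Chained: python AND tutorial, NOT beginner
chained = (python_docs & tutorial_docs) - beginner_docs

print(sorted(both), sorted(either), sorted(python_only), sorted(chained))
```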
Statistics
stats = engine.stats()
print(stats)
# {
# 'term_count': 1234,
# 'search_count': 100,
# 'fuzzy_search_count': 10,
# 'total_search_ns': 1234567,
# ...
# }
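One way to turn these counters into a single number to track while tuning is the mean latency per search. The sketch below works on a plain dict shaped like the output above, so it runs without an engine; the helper name average_search_latency_us is hypothetical, and it assumes total_search_ns is the cumulative time of the searches counted by search_count, which this README does not spell out.

```python
def average_search_latency_us(stats: dict) -> float:
    """Mean latency per search in microseconds; 0.0 if nothing was searched."""
    searches = stats.get("search_count", 0)
    if searches == 0:
        return 0.0
    return stats.get("total_search_ns", 0) / searches / 1_000


# Sample values copied from the stats output shown above
sample_stats = {
    "term_count": 1234,
    "search_count": 100,
    "fuzzy_search_count": 10,
    "total_search_ns": 1234567,
}

avg_us = average_search_latency_us(sample_stats)
print(f"{avg_us:.2f} us per search")
```

In real use you would pass engine.stats() instead of sample_stats and compare the value before and after a configuration change.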
Data Import
NanoFTS supports importing data from various sources:
from nanofts import create_engine
engine = create_engine("./index.nfts")
# Import from pandas DataFrame
import pandas as pd
df = pd.DataFrame({
'id': [1, 2, 3],
'title': ['Hello World', '全文搜索', 'Test Document'],
'content': ['This is a test', '支持多语言', 'Another test']
})
engine.from_pandas(df, id_column='id')
# Import from Polars DataFrame
import polars as pl
df = pl.DataFrame({
'id': [1, 2, 3],
'title': ['Doc 1', 'Doc 2', 'Doc 3']
})
engine.from_polars(df, id_column='id')
# Import from PyArrow Table
import pyarrow as pa
table = pa.Table.from_pydict({
'id': [1, 2, 3],
'title': ['Arrow 1', 'Arrow 2', 'Arrow 3']
})
engine.from_arrow(table, id_column='id')
# Import from Parquet file
engine.from_parquet("documents.parquet", id_column='id')
# Import from CSV file
engine.from_csv("documents.csv", id_column='id')
# Import from JSON file
engine.from_json("documents.json", id_column='id')
# Import from JSON Lines file
engine.from_json("documents.jsonl", id_column='id', lines=True)
# Import from Python dict list
data = [
{'id': 1, 'title': 'Hello', 'content': 'World'},
{'id': 2, 'title': 'Test', 'content': 'Document'}
]
engine.from_dict(data, id_column='id')
Specifying Text Columns
By default, all columns except the ID column are indexed. You can specify which columns to index:
# Only index 'title' and 'content' columns, ignore 'metadata'
engine.from_pandas(df, id_column='id', text_columns=['title', 'content'])
# Same for other import methods
engine.from_csv("data.csv", id_column='id', text_columns=['title', 'content'])
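The column-selection rule can be pictured as a small transformation from rows to the (doc_id, fields) tuples that add_documents accepts. The sketch below is not how the import methods are implemented; rows_to_docs is a hypothetical helper, and both the str() conversion (field values must be strings) and the skipping of None values are assumptions made for illustration.

```python
def rows_to_docs(rows, id_column="id", text_columns=None):
    """Convert dict rows into (doc_id, fields) tuples with string values."""
    docs = []
    for row in rows:
        # Default: every column except the ID column
        if text_columns is None:
            columns = [name for name in row if name != id_column]
        else:
            columns = text_columns
        # Assumption: missing/None values are skipped instead of indexed
        fields = {
            name: str(row[name])
            for name in columns
            if row.get(name) is not None
        }
        docs.append((row[id_column], fields))
    return docs


rows = [
    {"id": 1, "title": "Hello", "content": "World", "metadata": 42},
    {"id": 2, "title": "Test", "content": "Document", "metadata": None},
]

all_columns = rows_to_docs(rows)
selected = rows_to_docs(rows, text_columns=["title", "content"])
print(all_columns)
print(selected)
```

The resulting list has the same shape as the docs argument shown under Document Operations, so it could be handed to engine.add_documents(...) directly.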
CSV and JSON Options
You can pass additional options to the underlying pandas readers:
# CSV with custom delimiter
engine.from_csv("data.csv", id_column='id', sep=';', encoding='utf-8')
# JSON Lines format
engine.from_json("data.jsonl", id_column='id', lines=True)
Chinese Text Support
NanoFTS handles Chinese text using n-gram tokenization:
engine = create_engine(
index_file="./chinese_index.nfts",
max_chinese_length=4, # Generate 2,3,4-gram for Chinese
)
engine.add_document(1, {"content": "全文搜索引擎"})
engine.flush()
# Search Chinese text
result = engine.search("搜索")
print(result.to_list()) # [1]
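To see why a query such as "搜索" can match, it helps to enumerate the n-grams that a 2-to-4-gram scheme would produce for the indexed text. The sketch below is a plain-Python illustration of character n-grams; char_ngrams is a hypothetical helper, and the engine's actual tokenizer (written in Rust) may differ in details such as mixed-language handling or term filtering.

```python
def char_ngrams(text: str, min_n: int = 2, max_n: int = 4) -> list:
    """Return every contiguous substring of length min_n..max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # A window of length n fits at len(text) - n + 1 start positions
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


grams = char_ngrams("全文搜索引擎", max_n=4)
print(len(grams))
print("搜索" in grams)
```

A larger max_chinese_length would produce more and longer terms per document, so, at least in this simplified model, it trades index size for the ability to match longer phrases as single terms.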
Persistence and Recovery
# Create persistent index
engine = create_engine(index_file="./data.nfts")
engine.add_document(1, {"title": "Test"})
engine.flush()
# Close and reopen
del engine
engine = create_engine(index_file="./data.nfts")
# Data is automatically recovered
result = engine.search("Test")
print(result.to_list()) # [1]
# Important: Use compact() to persist deletions
engine.remove_document(1)
engine.compact() # Deletions are now permanent
Memory-Only Mode
# Create in-memory engine (no persistence)
engine = create_engine(index_file="")
engine.add_document(1, {"content": "temporary data"})
# No flush needed for in-memory mode
result = engine.search("temporary")
Best Practices
For Production Use
- Always call compact() after bulk deletions: deletions are only persisted after compaction
- Use track_doc_terms=True if you need update/delete operations
- Call flush() periodically to persist new documents
- Use lazy_load=True for large indexes that don't fit in memory
Performance Tips
# Batch operations are faster
docs = [(i, {"content": f"doc {i}"}) for i in range(10000)]
engine.add_documents(docs) # Much faster than individual add_document calls
engine.flush()
# Use batch search for multiple queries
results = engine.search_batch(["query1", "query2", "query3"])
# Prefer one combined boolean query over intersecting separate searches
# Good:
result = engine.search_and(["python", "tutorial"])
# Instead of:
# result = engine.search("python").intersect(engine.search("tutorial"))
Migration from Old API
If you're upgrading from the old FullTextSearch API:
# Old API (deprecated)
# from nanofts import FullTextSearch
# fts = FullTextSearch(index_dir="./index")
# fts.add_document(1, {"title": "Test"})
# results = fts.search("Test") # Returns List[int]
# New API
from nanofts import create_engine
engine = create_engine(index_file="./index.nfts")
engine.add_document(1, {"title": "Test"})
result = engine.search("Test")
results = result.to_list() # Returns List[int]
Key differences:
- FullTextSearch → create_engine() function
- index_dir → index_file (file path, not directory)
- Search returns ResultHandle instead of List[int]
- Call .to_list() to get document IDs
- Use compact() to persist deletions
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Dependencies
~10–17MB
~293K SLoC