3 releases

Uses new Rust 2024

0.1.2 Feb 12, 2026
0.1.1 Feb 11, 2026
0.1.0 Feb 10, 2026

#239 in Cryptography

MIT license

135KB
2.5K SLoC

sketches

crates.io

Probabilistic data structures for scalable approximate analytics in Rust.

This crate gives you memory-efficient sketches for:

  • frequency estimation,
  • distinct counting,
  • set membership,
  • set similarity (Jaccard),
  • heavy hitter detection,
  • quantiles,
  • and stream sampling.

All sketches are designed for streaming workloads where exact data structures (HashMap, full sorted buffers, exact sets) are too expensive in memory or throughput.

Note

This crate was designed by humans, but coded with AI.

Add To A Project

This repository is currently consumed as a local crate:

[dependencies]
sketches = { path = "../sketches" }

What Is Included

Sketch Module Use it when Notes
Bloom Filter bloom_filter You need very fast membership checks and can tolerate false positives No deletions
Cuckoo Filter cuckoo_filter You need membership checks and deletions Can fail inserts at high load
HyperLogLog hyperloglog You need approximate distinct counts (COUNT(DISTINCT ...)) Mergeable, tiny memory footprint
MinMax Sketch minmax_sketch You need approximate non-negative frequency counts Conservative updates reduce overestimation
Count Sketch count_sketch You need approximate signed frequency updates Good for turnstile streams (+/- updates)
Space-Saving space_saving You need top-k / heavy hitters from a stream Tracks only bounded number of candidates
KLL Sketch kll You need general quantiles (median, p90, p99) Good default quantile sketch
t-digest tdigest You care most about tail quantiles (p95/p99/p999) Typically stronger tail behavior
MinHash minhash You need Jaccard similarity between sets Best default for similarity tasks
MinHash LSH lsh_minhash You need fast near-duplicate/candidate lookup before reranking Uses banding over MinHash signatures
Reservoir Sampling reservoir_sampling You need a uniform sample from an unbounded stream Fixed-size unbiased sample
Jaccard trait/helpers jacard You want a shared Jaccard API across sketches Provides JacardIndex trait

Which Sketch Should I Use?

If your primary goal is:

  • Distinct counting: use HyperLogLog.
  • Jaccard similarity: use MinHash first.
  • Candidate retrieval for similarity search: use MinHashLshIndex, then rerank with MinHash Jaccard.
  • Jaccard from existing cardinality pipelines: use HyperLogLog + jacard helpers.
  • Membership without delete: use BloomFilter.
  • Membership with delete: use CuckooFilter.
  • Approximate frequency (non-negative): use MinMaxSketch.
  • Approximate frequency (signed +/- updates): use CountSketch.
  • Heavy hitters / top-k: use SpaceSaving.
  • General quantiles: use KllSketch.
  • Tail-sensitive quantiles: use TDigest.
  • Keep a representative stream sample: use ReservoirSampling.

Quick Examples

Approximate distinct counting:

use sketches::hyperloglog::HyperLogLog;

let mut hll = HyperLogLog::new(12)?;
for i in 0_u64..100_000 {
    hll.add(&i);
}
println!("distinct ~ {}", hll.count());
# Ok::<(), Box<dyn std::error::Error>>(())

Approximate Jaccard similarity (recommended via MinHash):

use sketches::jacard::JacardIndex;
use sketches::minhash::MinHash;

let mut left = MinHash::new(256)?;
let mut right = MinHash::new(256)?;

for v in 0_u64..10_000 {
    left.add(&v);
}
for v in 5_000_u64..15_000 {
    right.add(&v);
}

println!("jaccard ~ {:.4}", left.jaccard_index(&right)?);
# Ok::<(), Box<dyn std::error::Error>>(())

Run Examples

cargo run --example bloom_filter
cargo run --example cuckoo_filter
cargo run --example hyperloglog
cargo run --example jacard
cargo run --example minhash
cargo run --example lsh_minhash
cargo run --example minmax_sketch
cargo run --example count_sketch
cargo run --example space_saving
cargo run --example kll
cargo run --example tdigest
cargo run --example reservoir_sampling

Validate

cargo test
cargo check --examples

No runtime deps