#bloom-filter #collection #probabilistic #approximate #big-data #count-min-sketch

probabilistic-collections

Various implementations of collections that use approximations to improve on running time or memory, but introduce a certain amount of error

7 releases (breaking)

0.7.0 May 11, 2020
0.6.0 Apr 24, 2020
0.5.0 Nov 4, 2018
0.4.0 Sep 9, 2018
0.1.0 Sep 6, 2018

#800 in Data structures

Download history 1059/week @ 2024-06-17 785/week @ 2024-06-24 783/week @ 2024-07-01 566/week @ 2024-07-08 979/week @ 2024-07-15 587/week @ 2024-07-22 945/week @ 2024-07-29 566/week @ 2024-08-05 832/week @ 2024-08-12 1212/week @ 2024-08-19 1032/week @ 2024-08-26 1299/week @ 2024-09-02 1019/week @ 2024-09-09 925/week @ 2024-09-16 1241/week @ 2024-09-23 716/week @ 2024-09-30

3,985 downloads per month
Used in 8 crates (5 directly)

MIT/Apache

270KB
4.5K SLoC

probabilistic-collections-rs

Documentation License: MIT License: Apache 2.0 Pipeline Status Coverage Report

probabilistic-collections contains various implementations of collections that use approximations to improve on running time or memory, but introduce a certain amount of error. The error can be controlled under a certain threshold which makes these data structures extremely useful for big data and streaming applications.

The following types of collections are implemented:

  • Approximate Membership in Set: BloomFilter, PartitionedBloomFilter, CuckooFilter, QuotientFilter
  • Scalable Approximate Membership in Set: ScalableBloomFilter, ScalableCuckooFilter
  • Approximate Membership in Stream: BSBloomFilter, BSSDBloomFilter, RLBSBloomFilter
  • Approximate Item Count: CountMinSketch
  • Approximate Distinct Item Count: HyperLogLog
  • Set similarity: MinHash, SimHash

Usage

Add this to your Cargo.toml:

[dependencies]
probabilistic-collections = "*"

For serde support, include the serde feature:

[dependencies]
probabilistic-collections = { version = "*", features = ["serde"] }

Add this to your crate root if you are using Rust 2015:

extern crate probabilistic_collections;

Caveats

If you are using this crate to create collections to be used across different platforms, you must be careful not to use keys of type [T]. The Rust standard library implementation of Hash for [T] first hashes the length of the slice, then the contents of the slice. The length is platform specific because it is of type usize. Therefore, a collection with keys of type [T] will have unexpected results when used across platforms. For example, if you were to generate and serilize a BloomFilter on i686 compiled code where the keys are of type [u8], the filter will not have the correct results when deserialized on x86_64 compiled code.

The recommended work-around to this problem is to a define a wrapper struct around a [T] with a Hash implementation that only hashes the contents of the slice.

Changelog

See CHANGELOG for more details.

References

License

probabilistic-collections-rs is dual-licensed under the terms of either the MIT License or the Apache License (Version 2.0).

See LICENSE-APACHE and LICENSE-MIT for more details.

Dependencies

~1.8–2.6MB
~49K SLoC