3 releases

0.0.3 Sep 28, 2024
0.0.2 Sep 28, 2024
0.0.1 Feb 4, 2024

#283 in Algorithms

Download history 1061/week @ 2024-09-08 749/week @ 2024-09-15 1069/week @ 2024-09-22 424/week @ 2024-09-29 955/week @ 2024-10-06 801/week @ 2024-10-13 1000/week @ 2024-10-20 1358/week @ 2024-10-27 890/week @ 2024-11-03 1173/week @ 2024-11-10 1089/week @ 2024-11-17 1402/week @ 2024-11-24 770/week @ 2024-12-01 857/week @ 2024-12-08 979/week @ 2024-12-15 868/week @ 2024-12-22

3,529 downloads per month

Apache-2.0

19KB
403 lines

simple_hll   Build Status Latest Version

simple_hll is a simple HyperLogLog implementation in rust. It is designed to be simple to use and less bytes to store (with Sparse HyperLogLog).

Quick Start

use simple_hll::HyperLogLog;

let mut hll = HyperLogLog::<14>::new();
hll.add_object("hello");
hll.add_object("world");
hll.add_object("simple_hll");

println!("cardinality: {}", hll.count());

Serde

simple_hll supports serde and borsh with feature serde_borsh enabled, so you can serialize and deserialize the HyperLogLog instance.

   let val = serde_json::to_vec(hll)?;

Notice that in order to reduce the serialized size, we introduce a sparse intermediate struct for the HyperLogLog instance. When the non-zero registers are less than a threshold, we will use the sparse mode to serialize the HyperLogLog instance.

enum HyperLogLogVariant<const P: usize> {
    Empty,
    Sparse { data: Vec<(u16, u8)> },
    Full(Vec<u8>),
}

None-Fixed type

Different from other hyperloglog implementation, we don't use fixed type HyperLogLog<T> for the HyperLogLog instance, but we use a const generic parameter to specify the precision. The precision P is the number of bits to use for the register index. The number of registers is 2^P. The precision P is a trade-off between the accuracy and the memory usage. The default precision is 14, which means the memory usage is about 16KB.

The reason is that in databend or other dbms, we will store the HyperLogLog inside the metadata. We don't want to use HyperLogLog<Datum> for simplicity and less overhead to hash the enum.

Contributing

Check out the CONTRIBUTING.md guide for more details on getting started with contributing to this project.

Acknowledgements

Some codes and tests are borrowed and inspired from:

Reference papers:

Thanks for the great work of the authors and contributors.

License

Licensed under Apache License, Version 2.0.

Dependencies

~0.7–1.3MB
~19K SLoC