### 2 unstable releases

0.2.0 | Jul 10, 2019 |
---|---|

0.1.1 | Apr 24, 2019 |

**MIT**license

35KB

208 lines

# golomb-set

A Golomb Coded Set implementation in Rust

A Golomb Coded Set is a probabilistic data structure which is typically smaller than a bloom filter, at the cost of query performance.. A sorted list of differences between samples of a random distribution will roughly have a geometric distribution, which makes for an optimal prefix for Golomb-Rice coding. Since a good hash algorithm will be randomly distributed, this encoding makes for efficient storage of hashed values.

Giovanni Bajo's blog post as well as their Python and C++ implementations were a huge help when writing and testing this library, found here and here respectively.

## Usage and Behaviour

There are 3 main parameters to select when creating a Golomb Coded Set: the hash algorithm,

and `N`

. `P`

is the desired maximum number of elements that will be inserted into the set, and `N`

is the desired probability of a false positive when the set is full. If fewer items have been inserted the real probability will be significantly lower.`1` `/` `2` `^` P

The chosen hashing algorithm must have a uniform distribution (which is not the same as being cryptograpically secure) and the output length of the hash in bits must be greater than

bits. This is not currently enforced by the library and failing to do so could result in far more false positives than expected. Beyond meeting those requirements, selecting an algorithm for speed would be appropriate. If the hardware acceleration is present, CRC32 would be a good choice for up to a million elements and a false positive probability of 0.001%. For larger sets and/or lower probabilities a hashing algorithm with a longer output is needed.`log2``(`N `*` `2` `^` P`)`

## Example

`use` `{``golomb_set``::`UnpackedGcs`,` `md5``::`Md5`}``;`
`//` Create a GCS where when 3 items are stored in the set, the
`//` probability of a false positive will be `1/(2^5)`, or 3.1%
`let` `mut` gcs `=` `UnpackedGcs``::``<`Md5`>``::`new`(``3``,` `5``)``;`
`//` Insert the MD5 hashes of "alpha" and "bravo"
gcs`.``insert``(``b``"`alpha`"``)``;`
gcs`.``insert``(``b``"`bravo`"``)``;`
`assert!``(`gcs`.``contains``(``b``"`alpha`"``)``)``;`
`assert!``(`gcs`.``contains``(``b``"`bravo`"``)``)``;`
`assert!``(``!`gcs`.``contains``(``b``"`charlie`"``)``)``;`
`//` Reduces memory footprint in exchange for slower querying
`let` gcs `=` gcs`.``pack``(``)``;`
`assert!``(`gcs`.``contains``(``b``"`alpha`"``)``)``;`
`assert!``(`gcs`.``contains``(``b``"`bravo`"``)``)``;`
`assert!``(``!`gcs`.``contains``(``b``"`charlie`"``)``)``;`

#### Dependencies

~3.5MB

~76K SLoC