9 releases

| Version | Date |
|---|---|
| 0.2.0 | Jul 29, 2024 |
| 0.1.8 | Apr 30, 2024 |
Poppy is a Rust crate offering an efficient implementation of Bloom filters. It also includes a command-line utility (also called poppy) allowing users to effortlessly create filters with their desired capacity and false positive probability. Values can be added to the filters via standard input, facilitating the use of this tool in a pipeline workflow.
Poppy ensures cross-compatibility with the bloom filter format used by DCSO bloom software but also provides its own Bloom filter implementation and format.
FAQ
Which format to choose?
It depends on what you want to achieve. If you need to be cross-compatible with DCSO tools and library, you must choose the DCSO format. In any other scenario, we advise using the Poppy format (the default), as it is more robust, faster, and leaves room for customization. A comparison between the two formats and implementations can be found in this blog post. By default, both the library and the CLI choose the Poppy format. To select the DCSO format when creating a filter from the CLI, use `poppy create --version 1`.
How to build the project?
Regular building
cargo build --release --bins
Building with MUSL (static binary)
# You can skip this step if you already have musl installed
rustup target add x86_64-unknown-linux-musl
# Build poppy with musl target
cargo build --release --target=x86_64-unknown-linux-musl --bins
How to use Poppy in other languages?
In Python
Poppy comes with Python bindings, using the great PyO3 crate.
Please take a look at Poppy Bindings for further details.
Command Line Interface
Installation
To install the `poppy` command-line utility, run: `cargo install poppy-filters`
Alternatively, clone this repository and compile from source using `cargo`.
Usage
Usage: poppy [OPTIONS] <COMMAND>

Commands:
  create  Create a new bloom filter
  insert  Insert data into an existing bloom filter
  check   Checks entries against an existing bloom filter
  bench   Benchmark the bloom filter. If the bloom filter behaves in an unexpected way, the benchmark fails. Input data is read from stdin
  show    Show information about an existing bloom filter
  help    Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose      Verbose output
  -j, --jobs <JOBS>  The number of jobs to use when parallelization is possible. For write operations the original filter is copied into the memory of each job so you can expect the memory of the whole process to be N times the size of the filter [default: 2]
  -h, --help         Print help
Every command has its own arguments and help information. For example, to get help for the `create` command, run: `poppy help create`.
Examples
Creating an empty Bloom filter
# creating a filter with a desired capacity `-c` and false positive probability `-p`
poppy create -c 1000 -p 0.001 /path/to/output/filter.pop
# showing information about the filter we just created
poppy show /path/to/output/filter.pop
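To give a feel for what a capacity and a false positive probability imply in terms of filter size, here is a minimal sketch based on the textbook Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. It is not taken from Poppy's source: the function names `optimal_bits` and `optimal_hashes` are hypothetical, and both the Poppy and DCSO formats may size or round their filters differently.

```rust
/// Textbook estimate of the number of bits needed to hold `capacity`
/// entries at false positive probability `fpp`:
/// m = -n * ln(p) / (ln 2)^2.
/// Hypothetical helper, not part of Poppy's API.
fn optimal_bits(capacity: u64, fpp: f64) -> u64 {
    let ln2 = std::f64::consts::LN_2;
    (-(capacity as f64) * fpp.ln() / (ln2 * ln2)).ceil() as u64
}

/// Textbook estimate of the number of hash functions:
/// k = (m / n) * ln 2, rounded and never below 1.
/// Hypothetical helper, not part of Poppy's API.
fn optimal_hashes(capacity: u64, bits: u64) -> u32 {
    let k = (bits as f64 / capacity as f64) * std::f64::consts::LN_2;
    (k.round() as u32).max(1)
}

fn main() {
    // Same parameters as `poppy create -c 1000 -p 0.001`.
    let (capacity, fpp) = (1000u64, 0.001f64);
    let bits = optimal_bits(capacity, fpp);
    let hashes = optimal_hashes(capacity, bits);
    println!(
        "bits = {bits}, bytes = {}, hash functions = {hashes}",
        (bits + 7) / 8
    );
}
```

The size reported by `poppy show` is the authoritative value; this estimate only indicates the expected order of magnitude.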
Inserting data into the filter
Data can be inserted into the filter in two ways: by reading from stdin or from files.
Reading from stdin cannot be parallelized, so to speed up the insertion of a large amount of data, insert from files and set the number of CPUs to use with the `-j` option.
# insertion from stdin
cat data-1.txt data-2.txt | poppy insert filter.pop
# verify the number of elements in the filter
poppy show filter.pop
# insertion from files
poppy insert filter.pop data-1.txt data-2.txt
# verify the number of elements in the filter
poppy show filter.pop
# insertion from several files in parallel
poppy -j 0 insert filter.pop data-1.txt data-2.txt
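The `--jobs` help text states that, for write operations, the original filter is copied into the memory of each job. The sketch below illustrates one way such a scheme can work, assuming each job inserts into its own copy and the copies are then merged with a bitwise OR. The `ToyBloom` type, its hashing scheme, and `parallel_insert` are hypothetical and written only for this illustration; they are not Poppy's implementation or API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

/// Toy Bloom filter used only for this illustration (not Poppy's format).
#[derive(Clone)]
struct ToyBloom {
    bits: Vec<u64>,
    n_bits: u64,
    n_hashes: u32,
}

/// Bit positions for `item`, derived from two seeded hashes (double hashing).
fn indexes(item: &str, n_bits: u64, n_hashes: u32) -> impl Iterator<Item = u64> {
    let hash_with = |seed: u8| {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        h.finish()
    };
    let (h1, h2) = (hash_with(0), hash_with(1) | 1);
    (0..u64::from(n_hashes)).map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % n_bits)
}

impl ToyBloom {
    fn new(n_bits: u64, n_hashes: u32) -> Self {
        let words = ((n_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], n_bits, n_hashes }
    }

    fn insert(&mut self, item: &str) {
        for i in indexes(item, self.n_bits, self.n_hashes) {
            self.bits[(i / 64) as usize] |= 1u64 << (i % 64);
        }
    }

    fn contains(&self, item: &str) -> bool {
        indexes(item, self.n_bits, self.n_hashes)
            .all(|i| (self.bits[(i / 64) as usize] & (1u64 << (i % 64))) != 0)
    }

    /// Merge another filter with identical parameters using bitwise OR.
    fn union_with(&mut self, other: &ToyBloom) {
        assert_eq!(self.n_bits, other.n_bits);
        assert_eq!(self.n_hashes, other.n_hashes);
        for (a, b) in self.bits.iter_mut().zip(&other.bits) {
            *a |= *b;
        }
    }
}

/// Each job gets its own copy of the filter (hence roughly N times the
/// memory for N jobs); the copies are OR-ed back into the original.
fn parallel_insert(filter: &mut ToyBloom, chunks: Vec<Vec<String>>) {
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            let mut local = filter.clone();
            thread::spawn(move || {
                for item in &chunk {
                    local.insert(item);
                }
                local
            })
        })
        .collect();
    for handle in handles {
        filter.union_with(&handle.join().expect("worker thread panicked"));
    }
}

fn main() {
    let mut filter = ToyBloom::new(14_378, 10);
    let chunks = vec![
        vec!["alpha".to_string(), "beta".to_string()],
        vec!["gamma".to_string()],
    ];
    parallel_insert(&mut filter, chunks);
    println!("contains gamma: {}", filter.contains("gamma"));
}
```

Because setting a bit is idempotent, OR-ing the per-job copies yields the same bit array as inserting everything sequentially, at the cost of one extra filter copy per job.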
Creating and Inserting in one command
A filter can also be created directly from a dataset. In this case, the filter capacity is set to the number of entries in the dataset.
# this creates a new filter saved in filter.pop with all entries (one per line)
# found in .txt files under the dataset directory using available CPUs (-j 0)
poppy -j 0 create -p 0.001 /path/to/output/filter.pop /path/to/dataset/*.txt
Checking if some data is in the filter
The check operation comes in the same variants as insertion: from stdin or from files (when one needs to take advantage of parallelization). By default, every entry found in the filter is printed to stdout.
# check from stdin
cat data-1.txt data-2.txt | poppy check filter.pop
# check from files
poppy check filter.pop data-1.txt data-2.txt
# check from several files in parallel
poppy -j 0 check filter.pop data-1.txt data-2.txt
Benchmarking filter
Benchmarking a filter is an important step, as it lets you verify that the false positive probability you observe matches the one you expected. The benchmark takes as input data already inserted in the filter; it then randomly mutates the entries and checks them against the filter.
# run a benchmark against data known to be in the filter
cat data-1.txt data-2.txt | poppy bench filter.pop
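The idea behind such a benchmark can be sketched as follows: check that every inserted entry is found (no false negatives), then probe the filter with mutated entries that were never inserted and count how many are reported as present. This is only an illustration under stated assumptions: `ToyBloom` and `bench` are hypothetical names, the mutation is a deterministic suffix rather than a random one, and none of this reflects how `poppy bench` is actually implemented.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter used only for this illustration (not Poppy's format).
struct ToyBloom {
    bits: Vec<bool>,
    n_hashes: u64,
}

impl ToyBloom {
    fn new(n_bits: usize, n_hashes: u64) -> Self {
        Self { bits: vec![false; n_bits], n_hashes }
    }

    /// Position of the i-th hash of `item`.
    fn slot(&self, item: &str, i: u64) -> usize {
        let mut h = DefaultHasher::new();
        (i, item).hash(&mut h);
        (h.finish() % self.bits.len() as u64) as usize
    }

    fn insert(&mut self, item: &str) {
        for i in 0..self.n_hashes {
            let s = self.slot(item, i);
            self.bits[s] = true;
        }
    }

    fn contains(&self, item: &str) -> bool {
        (0..self.n_hashes).all(|i| self.bits[self.slot(item, i)])
    }
}

/// Returns (number of false negatives, observed false positive rate).
/// Entries are mutated by appending a suffix; mutated values that happen
/// to be real entries are skipped so they do not count as false positives.
fn bench(filter: &ToyBloom, inserted: &[String]) -> (usize, f64) {
    let known: HashSet<&str> = inserted.iter().map(String::as_str).collect();
    let false_negatives = inserted
        .iter()
        .filter(|e| !filter.contains(e.as_str()))
        .count();
    let (mut probes, mut false_positives) = (0u32, 0u32);
    for (i, entry) in inserted.iter().enumerate() {
        let mutated = format!("{entry}#{i}");
        if known.contains(mutated.as_str()) {
            continue;
        }
        probes += 1;
        if filter.contains(&mutated) {
            false_positives += 1;
        }
    }
    let rate = f64::from(false_positives) / f64::from(probes.max(1));
    (false_negatives, rate)
}

fn main() {
    // Sized for 1000 entries at a target probability of 0.001
    // (textbook sizing, assumed here for the toy filter).
    let mut filter = ToyBloom::new(14_378, 10);
    let data: Vec<String> = (0..1000).map(|i| format!("entry-{i}")).collect();
    for entry in &data {
        filter.insert(entry);
    }
    let (false_negatives, rate) = bench(&filter, &data);
    println!("false negatives: {false_negatives}, observed fp rate: {rate}");
}
```

A Bloom filter must never produce false negatives, so any non-zero count there indicates a broken filter, while the observed rate should stay close to the probability chosen at creation time.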
Funding
The NGSOTI project is dedicated to training the next generation of Security Operation Center (SOC) operators, focusing on the human aspect of cybersecurity. It underscores the significance of providing SOC operators with the necessary skills and open-source tools to address challenges such as detection engineering, incident response, and threat intelligence analysis. Involving key partners such as CIRCL, Restena, Tenzir, and the University of Luxembourg, the project aims to establish a real operational infrastructure for practical training. This initiative integrates academic curricula with industry insights, offering hands-on experience in cyber ranges.
NGSOTI is co-funded under the Digital Europe Programme (DEP) via the ECCC (European cybersecurity competence network and competence centre).