14 releases
0.2.2 | Feb 24, 2024 |
---|---|
0.2.1 | Dec 26, 2023 |
0.2.0 | Oct 26, 2023 |
0.1.10 | Jul 5, 2023 |
0.1.2 | Dec 24, 2022 |
#357 in Profiling
345KB
4.5K
SLoC
mphf_benchmark
is the program (by Piotr Beling) for benchmarking Minimal Perfect Hash Functions.
It can test the algorithms included in the following crates:
- ph,
- boomphf,
- cmph-sys (only if compiled with
cmph-sys
feature, and only CHD algorithm is supported).
Please run the program with the --help
switch to see the available options.
Below you can find instructions for installing mphf_benchmark
and
reproducing experiments performed with it,
which can be found in published or under review papers.
Note that these instructions have been tested under GNU/Linux and may require some modifications for other systems.
Installation
mphf_benchmark
can be compiled and installed from sources. To do this, a Rust compiler is needed.
The easiest way to obtain the compiler along with other necessary tools (like cargo
) is
to use rustup.
Please follow the instructions at https://www.rust-lang.org/tools/install.
Once Rust is installed, just execute the following to install mphf_benchmark
with native optimizations:
RUSTFLAGS="-C target-cpu=native" cargo install mphf_benchmark
Reproducing experiments from the papers
Fingerprinting-based minimal perfect hashing revisited
(Piotr Beling, Fingerprinting-based minimal perfect hashing revisited, ACM Journal of Experimental Algorithmics, 2023, DOI: https://doi.org/10.1145/3596453)
The results for FMPHGO with wide range of parameters and 100,000,000 64-bit integer keys generated uniformly at random, can be calculated by:
mphf_benchmark -d -s xs64 -n 100000000 fmphgo-all
The results for FMPH, FMPHGO and boomphf, for 39,459,925 64-bit integer keys generated uniformly at random, can be calculated by:
mphf_benchmark -d -s xs64 -n 39459925 -b 30 -l 30 fmph
mphf_benchmark -d -s xs64 -n 39459925 -b 30 -l 30 fmph -c 0
mphf_benchmark -d -s xs64 -n 39459925 -b 30 -l 30 fmphgo
mphf_benchmark -d -s xs64 -n 39459925 -b 30 -l 30 fmphgo -c 0
mphf_benchmark -d -s xs64 -n 39459925 -b 30 -l 30 boomphf
Subsequent parts/flags of the above calls have the following meaning:
-d
causes accurate results to be written to a file (alternatively, you may not use this flag when the information printed on the screen is enough),-s xs64
points to pseudo-random number generator XorShift64 as key source,-n
specify the number of keys,-b 30 -l 30
indicate that both build times and average (over keys) lookup (evaluation) times are averaged over 30 runs,-c 0
(given after the method name) disables hash caching in FMPH or FMPHGO.
We use similar calls for 500,000,000 keys, but with -n 500000000
and -b 5
.
Please run
mphf_benchmark --help
to see the options available for all methods or
mphf_benchmark help <method name>
(e.g., mphf_benchmark help fmph
) to see the method-specific options given after the method name.
To perform tests with a collection of .uk URLs collected in 2005 by UbiCrawler, first download the uk-2005-nat.urls.gz file from https://law.di.unimi.it/webdata/uk-2005/ and unpack it:
gzip -d uk-2005-nat.urls.gz
Then run (we use cat to print the unpacked uk-nat-2005.urls
file
to the standard output, which we then redirect to mphf_benchmark
):
cat uk-nat-2005.urls | mphf_benchmark -d -s stdin -n 39459925 -b 30 -l 30 fmph
cat uk-nat-2005.urls | mphf_benchmark -d -s stdin -n 39459925 -b 30 -l 30 fmph -c 0
cat uk-nat-2005.urls | mphf_benchmark -d -s stdin -n 39459925 -b 30 -l 30 fmphgo
cat uk-nat-2005.urls | mphf_benchmark -d -s stdin -n 39459925 -b 30 -l 30 fmphgo -c 0
cat uk-nat-2005.urls | mphf_benchmark -d -s stdin -n 39459925 -b 30 -l 30 boomphf
To measure memory consumption of construction processes, we use the memusage profiler and run methods with each set of parameters separately. For example, we run
memusage mphf_benchmark -s xs64 -n 39459925 -b 1 -l 0 -t single fmphgo -c 0 -s 1 -l 100
to measure memory consumption of single-threaded (-t single
) construction
of FMPHGO s=1 (-s 1
) b=8 (chosen according to s) $\gamma$=1 (-l 100
) without caching hashes (-c 0
).
To use cat
and memusage
simultaneously (which is non-trivial),
we create a bash script mphf_uk2005.sh
with the following content:
cat uk-2005-nat.urls | mphf_benchmark "$@"
And we run it like this, for example:
memusage mphf_uk2005.sh -s stdin -n 39459925 -b 1 -l 0 -t single fmphgo -c 0 -s 1 -l 100
To subtract the memory occupied by the keys and data unrelated to the construction process (which are actually negligible), we also measure the memory consumption for the execution of benchmark programs that terminates as soon as the keys are loaded or generated:
memusage mphf_benchmark -s xs64 -n 39459925 none
memusage mphf_benchmark -s xs64 -n 500000000 none
memusage mphf_uk2005.sh -s stdin -n 39459925 none
To benchmark RecSplit, PTHash, and CHD, we use another program with the same name (mphf_benchmark
),
but written in C++ by Giulio Ermanno Pibiri and Roberto Trani, and available at
https://github.com/roberto-trani/mphf_benchmark (this site also contains compilation instructions; it is built with native optimizations by default).
The authors have accepted our modifications to their program, which, among other things,
ensure that both benchmark programs can generate exactly the same keys for testing MPHFs.
For example, the results for RecSplit, PTHash, and CHD, for 39,459,925 64-bit integer keys generated uniformly at random with XorShift64, and with averaging over 30 runs, can be calculated by:
mphf_benchmark recsplit -n 39459925 --gen xs64 --num_construction_runs 30 --num_lookup_runs 30
mphf_benchmark pthash -n 39459925 --gen xs64 --num_construction_runs 30 --num_lookup_runs 30
mphf_benchmark chd -n 39459925 --gen xs64 --num_construction_runs 30 --num_lookup_runs 30
And similar calls, but using URLs from the uk-2005 collection as keys looks like this:
cat uk-2005-nat.urls | mphf_benchmark recsplit -n 39459925 --gen stdin --num_construction_runs 30 --num_lookup_runs 30
cat uk-2005-nat.urls | mphf_benchmark pthash -n 39459925 --gen stdin --num_construction_runs 30 --num_lookup_runs 30
cat uk-2005-nat.urls | mphf_benchmark chd -n 39459925 --gen stdin --num_construction_runs 30 --num_lookup_runs 30
Again, measuring memory consumption with memusage
requires running one variant at a time,
which can be achieved with the --variant
switch, whose argument is the number of subsequent method variant to test.
For example, measuring the first variant of RecSplit might look like this
(note that for the uk-2005 collection, we use the mphf_uk2005.sh
script again):
memusage mphf_benchmark recsplit --variant 1 -n 39459925 --gen xs64 --num_construction_runs 1 --num_lookup_runs 0
memusage mphf_benchmark recsplit --variant 1 -n 500000000 --gen xs64 --num_construction_runs 1 --num_lookup_runs 0
memusage mphf_uk2005.sh recsplit --variant 1 -n 39459925 --gen stdin --num_construction_runs 1 --num_lookup_runs 0
A non-existent number (e.g., 9) of the variant can be given to measure the memory consumption for the execution of benchmark programs that terminates as soon as the keys are loaded or generated:
memusage mphf_benchmark recsplit --variant 9 -n 39459925 --gen xs64
memusage mphf_benchmark recsplit --variant 9 -n 500000000 --gen xs64
memusage mphf_uk2005.sh recsplit --variant 9 -n 39459925 --gen stdin
Dependencies
~2.4–3.5MB
~65K SLoC