
Harness

Precise and reproducible benchmarking. Inspired by running-ng.


Getting Started

  1. Install the harness CLI: cargo install harness-cli.
  2. Get the example crate: git clone https://github.com/wenyuzhao/harness.git && cd harness/examples/sort.
  3. Start an evaluation: cargo harness run.
  4. View results: cargo harness report.

See the examples for more on how to configure and use harness. The evaluation config for each example crate can be found in its Cargo.toml.

Precise Measurement

Interleaved runs

Unlike most existing Rust benchmarking tools, harness avoids running the same benchmark multiple times in a back-to-back loop.

For an evaluation with benchmark programs $P_1..P_p$ and builds $B_1..B_b$, where each $(P, B)$ pair is run for $I$ invocations, harness uses the following run order, row by row:

$$I_1\ :\ [P_1B_1,\ P_1B_2,\ ..,\ P_1B_b],\ \ \ [P_2B_1,\ P_2B_2,\ ..,\ P_2B_b]\ \ \ ...\ \ \ [P_pB_1,\ P_pB_2,\ ..,\ P_pB_b]$$

$$I_2\ :\ [P_1B_1,\ P_1B_2,\ ..,\ P_1B_b],\ \ \ [P_2B_1,\ P_2B_2,\ ..,\ P_2B_b]\ \ \ ...\ \ \ [P_pB_1,\ P_pB_2,\ ..,\ P_pB_b]$$

$$\dots$$

$$I_I\ :\ [P_1B_1,\ P_1B_2,\ ..,\ P_1B_b],\ \ \ [P_2B_1,\ P_2B_2,\ ..,\ P_2B_b]\ \ \ ...\ \ \ [P_pB_1,\ P_pB_2,\ ..,\ P_pB_b]$$

Any machine can experience performance fluctuations, e.g. the CPU frequency suddenly scaling down, or a background process waking up to do some work. Interleaved runs ensure that such fluctuations do not hit only one build or one benchmark, but are spread across all the benchmarks and builds in a relatively fair way.

When running in a complex environment, you are very likely to see a difference in results between the interleaved order and a simple looped order.

Note: For the same reason, it's recommended to always have at least two different builds in each evaluation; with only one build, interleaving is no different from running that build in a loop.
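The interleaved order above can be sketched in plain Rust (illustrative only; `interleaved_order` is a hypothetical helper, not part of the harness API):

```rust
// For each invocation, every benchmark runs once under every build
// before any (benchmark, build) pair is repeated.
fn interleaved_order(
    benchmarks: &[&str],
    builds: &[&str],
    invocations: usize,
) -> Vec<(usize, String, String)> {
    let mut order = Vec::new();
    for i in 1..=invocations {
        for p in benchmarks {
            for b in builds {
                order.push((i, p.to_string(), b.to_string()));
            }
        }
    }
    order
}

fn main() {
    // Two benchmarks, two builds, two invocations -> 8 runs, row by row.
    for (inv, p, b) in interleaved_order(&["P1", "P2"], &["B1", "B2"], 2) {
        println!("I{inv}: {p}{b}");
    }
}
```

Note that the build index varies fastest, so the same benchmark is compared across builds under near-identical machine conditions.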

Warmup / timing phase separation

harness has a clear notion of warmup and timing iterations, instead of blindly iterating a single benchmark multiple times and reporting the per-iteration time distribution. By default, each invocation of $(P,B)$ will repeat the workload for $5$ iterations. The first $4$ iterations are used for warmup. Only the results from the last timing iteration are reported. This can greatly reduce the noise due to program warmup and precisely measure the peak performance. However, you can also choose to do single-iteration runs to cover the boot time and warmup cost.
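The warmup/timing split can be sketched like this (a minimal illustration, not harness's implementation; the 5-iteration default is taken from the text above):

```rust
use std::time::{Duration, Instant};

// Repeat the workload `iterations` times; discard the first
// `iterations - 1` as warmup and time only the final iteration.
fn run_invocation<F: FnMut()>(mut workload: F, iterations: usize) -> Duration {
    assert!(iterations >= 1);
    for _ in 0..iterations - 1 {
        workload(); // warmup iterations: results are discarded
    }
    let start = Instant::now();
    workload(); // the single timing iteration: the only one reported
    start.elapsed()
}

fn main() {
    let mut v: Vec<u64> = (0..10_000).rev().collect();
    let elapsed = run_invocation(|| v.sort_unstable(), 5);
    println!("timing iteration took {elapsed:?}");
}
```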

Statistical runs and analysis

Similar to other bench tools, harness runs each $(P,B)$ pair multiple times (multiple invocations). However, it uses a fixed number of invocations for all $(P,B)$ pairs for easier reasoning: 10 by default, unless configured otherwise.

After all the $I$ invocations are finished, running cargo harness report will parse the results and report the min/max/mean/geomean for each performance value, as well as the 95% confidence interval per benchmark. You can also use your own script to load the results and analyze them differently. The performance values are stored in target/harness/logs/<RUNID>/results.csv.
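As a sketch of the kind of custom analysis the text mentions, the reported statistics could be computed like this (illustrative only; the hard-coded t-value assumes exactly 10 invocations, and harness's own report may compute these differently):

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

// Geometric mean via the log-average, which is numerically stable.
fn geomean(xs: &[f64]) -> f64 {
    (xs.iter().map(|x| x.ln()).sum::<f64>() / xs.len() as f64).exp()
}

// Half-width of the 95% confidence interval for the mean, using
// Student's t with 9 degrees of freedom (valid for exactly 10 samples).
fn ci95_halfwidth(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let m = mean(xs);
    let var = xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (n - 1.0);
    let t = 2.262; // t(0.975, df = 9)
    t * (var / n).sqrt()
}

fn main() {
    // Ten invocation times (in ms) for one (P, B) pair, made up here.
    let times = [10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 10.0, 9.7, 10.1, 9.9];
    println!("mean    = {:.3}", mean(&times));
    println!("geomean = {:.3}", geomean(&times));
    println!("95% CI  = ±{:.3}", ci95_halfwidth(&times));
}
```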

Probes

harness supports collecting and reporting extra performance data other than execution time, by enabling the following probes:

  • harness-probe-perf: Collect perf-event values for the timing iteration.
  • harness-probe-ebpf (WIP): Extra performance data collected by eBPF programs.

System checks

harness performs a series of strict checks to minimize system noise. It refuses to start benchmarking if any of the following checks fail:

  • (Linux-only) Only one user is logged in
  • (Linux-only) All CPU scaling governors are set to performance
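The governor check could look roughly like this on Linux (a sketch that reads the standard cpufreq sysfs path; this is not harness's actual code):

```rust
use std::fs;

fn is_performance(governor: &str) -> bool {
    governor.trim() == "performance"
}

// Walk /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor and
// require every governor that is readable to be "performance".
fn all_governors_performance() -> std::io::Result<bool> {
    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path();
        // Only look at per-CPU directories such as cpu0, cpu17, ...
        let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
        let is_cpu_dir = name.starts_with("cpu")
            && name.len() > 3
            && name[3..].chars().all(|c| c.is_ascii_digit());
        if !is_cpu_dir {
            continue;
        }
        if let Ok(gov) = fs::read_to_string(path.join("cpufreq/scaling_governor")) {
            if !is_performance(&gov) {
                return Ok(false);
            }
        }
    }
    Ok(true)
}

fn main() {
    match all_governors_performance() {
        Ok(true) => println!("OK: all CPU scaling governors are `performance`"),
        Ok(false) => eprintln!("refusing to benchmark: a governor is not `performance`"),
        Err(e) => eprintln!("could not read sysfs (non-Linux system?): {e}"),
    }
}
```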

Reproducible Evaluation

Tracked evaluation configs

harness refuses to support casual benchmarking. Each evaluation must be properly tracked by Git, including all benchmark configurations and the revisions of all benchmarks and benchmarked programs. Verifying the correctness of any evaluation, or re-running an evaluation from years ago, is as simple as tracing back the git history.

Deterministic builds

harness assigns each individual evaluation a unique RUNID and generates an evaluation summary at target/harness/logs/<RUNID>/config.toml. harness uses this file to record the evaluation info for the current benchmark run, including:

  • Git commit of the evaluation config
  • Git commit, cargo features, and environment variables used for producing each evaluated build
The Cargo.lock file used for producing each evaluated build

Reproducing a previous evaluation is as simple as running cargo harness run --config <RUNID>. harness automatically checks out the corresponding commits, sets up the recorded cargo features or environment variables, and replays the pre-recorded Cargo.lock file, ensuring the codebase and builds are in exactly the same state as when RUNID was generated.

Note: harness cannot check local dependencies right now. For completely deterministic builds, don't use local dependencies.

System environment verification

In the same <RUNID>/config.toml file, harness also records all the environmental info for every benchmark run, including but not limited to:

  • All global system environment variables at the time of the run
  • OS / CPU / Memory / Swap information used for the run

Any change to the system environment can affect reproducibility, so it's recommended to keep the environment variables and the OS / CPU / Memory / Swap configuration as stable as possible. harness automatically verifies the current system info against the recorded values and warns about any differences.
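The environment-variable comparison can be sketched as follows (illustrative; `diff_env` is a hypothetical helper, not harness's API):

```rust
use std::collections::HashMap;
use std::env;

// Diff a recorded set of environment variables against the current one,
// reporting anything that changed, disappeared, or newly appeared.
fn diff_env(
    recorded: &HashMap<String, String>,
    current: &HashMap<String, String>,
) -> Vec<String> {
    let mut warnings = Vec::new();
    for (k, v) in recorded {
        match current.get(k) {
            Some(cur) if cur == v => {}
            Some(cur) => warnings.push(format!("{k} changed: {v:?} -> {cur:?}")),
            None => warnings.push(format!("{k} was removed")),
        }
    }
    for k in current.keys() {
        if !recorded.contains_key(k) {
            warnings.push(format!("{k} was added"));
        }
    }
    warnings
}

fn main() {
    // Pretend the recorded config contained exactly this one variable.
    let recorded: HashMap<String, String> =
        [("RUSTFLAGS".to_string(), "-O".to_string())].into_iter().collect();
    let current: HashMap<String, String> = env::vars().collect();
    for w in diff_env(&recorded, &current) {
        println!("warning: {w}");
    }
}
```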

TODO:

  • Runner
  • Binary runner
  • Result reporting
  • Test runner
  • Scratch folder
  • Default to compare HEAD vs HEAD~1
  • Restore git states after benchmarking
  • Comments for public api
  • Documentation
  • Benchmark subsetting
  • Handle no result cases
  • More examples
  • Add tests
  • Plugin system
  • Plugin: html or markdown report with graphs
  • Plugin: Copy files
  • Plugin: Rsync results
  • Performance evaluation guide / tutorial
