2 releases

0.2.1 Jul 5, 2022
0.2.0 Jul 5, 2022

#410 in Profiling

MIT/Apache

14KB
89 lines

criterion-cust

This crate provides the Measurement CudaTime for benchmarking CUDA kernels using criterion-rs.

See examples/add.rs for a usage example.

How is the CUDA time measured?

The time of GPU execution is measured by recording a CUDA event before and after your benchmark on the CUDA default stream.

Running the Example Benchmark

cargo bench

or with cargo-criterion installed

cargo criterion

This cargo bench should print

add kernel/add kernel/2000
                        time:   [0.0142 ms 0.0142 ms 0.0142 ms]
                        thrpt:  [0.5229 GiB/s 0.5231 GiB/s 0.5232 GiB/s]
                 change:
                        time:   [-1.8762% -1.2732% -0.7326%] (p = 0.00 < 0.05)
                        thrpt:  [+0.7380% +1.2896% +1.9121%]
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild
  8 (8.00%) high severe
add kernel/add kernel/20000
                        time:   [0.1163 ms 0.1163 ms 0.1164 ms]
                        thrpt:  [0.6403 GiB/s 0.6404 GiB/s 0.6404 GiB/s]
                 change:
                        time:   [-1.5252% -1.0335% -0.4522%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4542% +1.0443% +1.5488%]
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

Making an Optimization

Now change the following line in the example

- launch!(module.sum<<<buffer_size, 1, 0, stream>>>(
+ launch!(module.sum<<<256, ((buffer_size + 256 - 1) / 256), 0, stream>>>(

Now the benchmark should run faster:

add kernel/add kernel/2000
                        time:   [0.0041 ms 0.0041 ms 0.0041 ms]
                        thrpt:  [1.8300 GiB/s 1.8311 GiB/s 1.8321 GiB/s]
                 change:
                        time:   [-71.520% -71.397% -71.249%] (p = 0.00 < 0.05)
                        thrpt:  [+247.81% +249.61% +251.13%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
add kernel/add kernel/20000
                        time:   [0.0041 ms 0.0041 ms 0.0041 ms]
                        thrpt:  [18.0229 GiB/s 18.0325 GiB/s 18.0405 GiB/s]
                 change:
                        time:   [-96.459% -96.441% -96.421%] (p = 0.00 < 0.05)
                        thrpt:  [+2694.2% +2709.4% +2724.3%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

Troubleshooting

This project uses cust for running CUDA programs. Please refer to their README in case of any build problems! Use the "0.1.0" version of this crate to use RustaCUDA.

Dependencies

~16–26MB
~435K SLoC