2 releases
0.2.1 | Jul 5, 2022 |
---|---|
0.2.0 | Jul 5, 2022 |
#410 in Profiling
14KB
89 lines
criterion-cust
This crate provides the Measurement
CudaTime
for benchmarking CUDA kernels using
criterion-rs.
See examples/add.rs for a usage example.
How is the CUDA time measured?
The time of GPU execution is measured by recording a CUDA event before and after your benchmark on the CUDA default stream.
Running the Example Benchmark
cargo bench
or with cargo-criterion installed
cargo criterion
This cargo bench
should print
add kernel/add kernel/2000
time: [0.0142 ms 0.0142 ms 0.0142 ms]
thrpt: [0.5229 GiB/s 0.5231 GiB/s 0.5232 GiB/s]
change:
time: [-1.8762% -1.2732% -0.7326%] (p = 0.00 < 0.05)
thrpt: [+0.7380% +1.2896% +1.9121%]
Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
1 (1.00%) low severe
3 (3.00%) high mild
8 (8.00%) high severe
add kernel/add kernel/20000
time: [0.1163 ms 0.1163 ms 0.1164 ms]
thrpt: [0.6403 GiB/s 0.6404 GiB/s 0.6404 GiB/s]
change:
time: [-1.5252% -1.0335% -0.4522%] (p = 0.00 < 0.05)
thrpt: [+0.4542% +1.0443% +1.5488%]
Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
2 (2.00%) low severe
4 (4.00%) low mild
5 (5.00%) high mild
4 (4.00%) high severe
Making an Optimization
Now change the following line in the example
- launch!(module.sum<<<buffer_size, 1, 0, stream>>>(
+ launch!(module.sum<<<256, ((buffer_size + 256 - 1) / 256), 0, stream>>>(
Now the benchmark should run faster:
add kernel/add kernel/2000
time: [0.0041 ms 0.0041 ms 0.0041 ms]
thrpt: [1.8300 GiB/s 1.8311 GiB/s 1.8321 GiB/s]
change:
time: [-71.520% -71.397% -71.249%] (p = 0.00 < 0.05)
thrpt: [+247.81% +249.61% +251.13%]
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
4 (4.00%) high mild
10 (10.00%) high severe
add kernel/add kernel/20000
time: [0.0041 ms 0.0041 ms 0.0041 ms]
thrpt: [18.0229 GiB/s 18.0325 GiB/s 18.0405 GiB/s]
change:
time: [-96.459% -96.441% -96.421%] (p = 0.00 < 0.05)
thrpt: [+2694.2% +2709.4% +2724.3%]
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
4 (4.00%) high mild
8 (8.00%) high severe
Troubleshooting
This project uses cust for running CUDA programs. Please refer to their README in case of any build problems! Use the "0.1.0" version of this crate to use RustaCUDA.
Dependencies
~16–26MB
~435K SLoC