4 releases

0.1.3	Aug 4, 2022
0.1.2	May 24, 2022
0.1.1	Oct 30, 2021
0.1.0	Jul 6, 2021

#651 in Math

24 downloads per month

MIT/Apache

13KB
81 lines

`[sd]gemm` benchmark

Introduction

This is a small [sd]gemm benchmark based, similar to ACES DGEMM, implemented in Rust. It supports the following BLAS libraries:

Accelerate (macOS)
Intel MKL
OpenBLAS

Building

Build with Accelerate (macOS)

$ cargo install gemm-benchmark --features accelerate

Build with BLIS

$ cargo install gemm-benchmark --features blis

Build with Intel MKL

To build the benchmark with Intel MKL statically linked, use:

$ cargo install gemm-benchmark --features intel-mkl

Intel MKL uses Zen-specific [sd]gemmkernels on AMD Zen CPUs. However, these kernels are slower on many Zen CPUs than the AVX2 kernels. You can build the benchmark to override Intel CPU detection, so that MKL uses AVX2 kernels on Zen CPUs as well. This does require dynamic linking, since it is not permitted to modify MKL binaries. To enable this override, use the intel-mkl-amd feature:

$ cargo install gemm-benchmark --features intel-mkl-amd

Build with OpenBLAS

$ cargo install gemm-benchmark --features openblas

Set OPENBLAS_NUM_THREADS=1 before running.

Benchmarking

By default, sgemm is benchmarked using 256 x 256 matrices, for 1,000 iterations and 1 thread. The dimensionality (-d), number of iterations (-i), and the number of threads (-t) can be set with command-line flags. For example:

$ gemm-benchmark -d 1024 -i 2000 -t 4

Runs the benchmark using 1024 x 1024 matrices, for 1,000 iterations, and 4 threads. It is also possible to benchmark dgem, using the --dgemm option:

$ gemm-benchmark -d 1024 -i 2000 -t 4 --dgemm

Example results

1 to 16 threads

The following table shows GFLOPS for various CPUs using 1 to 16 threads on matrix size 768.

Threads	M1 Accelerate	M1 Pro Accelerate	M1 Ultra Accelerate	Ryzen 3700X MKL	Ryzen 5900X MKL
1	1340	2061	2177	134	148
2	1226	2583	3427	262	284
4	1102	2685	3788	513	558
8	1253	2381	4344	924	1106
12	1225	2248	4261	989	1555
16	1217	2254	4376	850	1390

Dependencies

~6–12MB
~218K SLoC