3 releases (breaking)

| Version | Released |
|---|---|
| 0.3.0 | Oct 28, 2024 |
| 0.2.0 | Aug 27, 2024 |
| 0.1.1 | Jul 19, 2024 |
CubeCL Linear Algebra Library.
The crate contains common linear algebra algorithms.
Algorithms
- Tiling 2D Matrix Multiplication.
  The kernel is flexible and runs on almost any hardware.
- Cooperative Matrix Multiplication.
  The kernel uses Automatic Mixed Precision (AMP) to leverage cooperative matrix-multiply-and-accumulate instructions. For `f32` tensors, the inputs are cast to `f16`, but the accumulation is still performed in `f32`. This may cause a small loss in precision, in exchange for much faster execution.
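To give a feel for the tiling idea, here is a minimal CPU-only sketch in plain Rust. It is not the crate's kernel and does not use the CubeCL API; the function name `matmul_tiled` and the `tile` parameter are hypothetical, chosen for illustration. The output is computed block by block so that each block of the inputs is reused while it is still hot, which is the same principle a GPU kernel applies with shared memory and registers.

```rust
/// Multiplies row-major `a` (m x k) by row-major `b` (k x n), one tile at a time.
/// `tile` is a hypothetical tuning parameter for this sketch, not a CubeCL setting.
fn matmul_tiled(a: &[f32], b: &[f32], m: usize, k: usize, n: usize, tile: usize) -> Vec<f32> {
    assert!(tile > 0);
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    // Outer loops walk over tiles of the output and of the shared dimension.
    for i0 in (0..m).step_by(tile) {
        for j0 in (0..n).step_by(tile) {
            for p0 in (0..k).step_by(tile) {
                // Clamp tile bounds so that sizes need not be multiples of `tile`.
                let i_end = (i0 + tile).min(m);
                let j_end = (j0 + tile).min(n);
                let p_end = (p0 + tile).min(k);
                // Inner loops accumulate the contribution of one tile pair.
                for i in i0..i_end {
                    for p in p0..p_end {
                        let a_ip = a[i * k + p];
                        for j in j0..j_end {
                            c[i * n + j] += a_ip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
    c
}

fn main() {
    // 2x3 matrix times 3x2 matrix, with a tile size that does not divide k.
    let a: [f32; 6] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b: [f32; 6] = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0];
    let c = matmul_tiled(&a, &b, 2, 3, 2, 2);
    println!("{:?}", c);
}
```

Because the tile bounds are clamped, the sketch works for any matrix shape, which mirrors why a tiling approach ports easily across hardware.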
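The precision trade-off of casting `f32` inputs to `f16` while accumulating in `f32` can be illustrated with a small scalar emulation. This is only a sketch under stated assumptions: stable Rust has no built-in `f16` type, so the hypothetical helper `round_to_f16_precision` rounds an `f32` to the 11 significant bits of an `f16` and ignores the narrower `f16` exponent range (overflow and subnormals). It does not use cooperative matrix instructions or the CubeCL API.

```rust
/// Rounds an f32 to f16 precision (10 stored mantissa bits), round-to-nearest-even.
/// Simplification for this sketch: the f16 exponent range is not enforced.
fn round_to_f16_precision(x: f32) -> f32 {
    if !x.is_finite() {
        return x;
    }
    let bits = x.to_bits();
    // f32 stores 23 mantissa bits, f16 stores 10, so the low 13 bits are dropped.
    let lsb = (bits >> 13) & 1;
    // Adding (half - 1 + lsb) before masking implements round-to-nearest-even.
    let rounded = (bits + 0x0FFF + lsb) & !0x1FFF;
    f32::from_bits(rounded)
}

/// Dot product with inputs rounded to f16 precision and accumulation in f32.
fn dot_mixed(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| round_to_f16_precision(x) * round_to_f16_precision(y))
        .sum()
}

/// Reference dot product computed entirely in f32.
fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(&x, &y)| x * y).sum()
}

fn main() {
    let a = vec![0.1f32; 1024];
    let b = vec![0.1f32; 1024];
    let full = dot_f32(&a, &b);
    let mixed = dot_mixed(&a, &b);
    println!("f32 only:        {full}");
    println!("mixed precision: {mixed}");
    println!("relative error:  {:e}", ((full - mixed) / full).abs());
}
```

In this example the error comes almost entirely from rounding the inputs, not from the accumulation, which is why keeping the accumulator in `f32` limits the loss.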
Benchmarks
You can run the benchmarks from the workspace with the following:
cargo bench --bench matmul --features wgpu # for wgpu
cargo bench --bench matmul --features cuda # for cuda
On an RTX 3070 we get the following results:
matmul-wgpu-f32-tiling2d
―――――――― Result ―――――――――
Samples 100
Mean 13.289ms
Variance 28.000ns
Median 13.271ms
Min 12.582ms
Max 13.768ms
―――――――――――――――――――――――――
matmul-cuda-f32-tiling2d
―――――――― Result ―――――――――
Samples 100
Mean 12.754ms
Variance 93.000ns
Median 12.647ms
Min 12.393ms
Max 14.501ms
―――――――――――――――――――――――――
matmul-cuda-f32-cmma
―――――――― Result ―――――――――
Samples 100
Mean 4.996ms
Variance 35.000ns
Median 5.084ms
Min 4.304ms
Max 5.155ms
―――――――――――――――――――――――――
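To compare the runs above, the mean times can be turned into speedup ratios. The snippet below is a small illustrative calculation using only the means reported in these results; the `speedup` helper is a hypothetical name introduced here.

```rust
/// Ratio of a baseline mean time to a candidate mean time (both in milliseconds).
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

fn main() {
    // Mean times copied from the benchmark results above.
    let wgpu_tiling2d = 13.289;
    let cuda_tiling2d = 12.754;
    let cuda_cmma = 4.996;

    println!("cuda tiling2d vs wgpu tiling2d: {:.2}x", speedup(wgpu_tiling2d, cuda_tiling2d));
    println!("cuda cmma vs cuda tiling2d:     {:.2}x", speedup(cuda_tiling2d, cuda_cmma));
}
```

On these numbers the cooperative kernel is roughly 2.55 times faster than the tiling kernel on CUDA, while the two tiling runs differ by only about 4%.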