5 releases (breaking)

Uses new Rust 2024

0.5.0 Apr 23, 2025
0.4.0 Jan 14, 2025
0.3.0 Oct 28, 2024
0.2.0 Aug 27, 2024
0.1.1 Jul 19, 2024

#348 in Algorithms


18,056 downloads per month
Used in 33 crates (4 directly)

MIT/Apache

1.5MB
41K SLoC

CubeCL Linear Algebra Library.

The crate contains common linear algebra algorithms.

Algorithms

  • Tiling 2D Matrix Multiplication.

    The kernel is flexible and runs on nearly any hardware.

  • Cooperative Matrix Multiplication.

    The kernel uses Automatic Mixed Precision (AMP) to leverage cooperative matrix-multiply-and-accumulate instructions. For f32 tensors, the inputs are cast to f16, but the accumulation is still performed in f32. This may cause a small loss of precision in exchange for much faster execution.
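The two ideas above can be illustrated on the CPU in plain Rust. This is only a minimal sketch, not the crate's kernels or API: `matmul_tiled`, `round_to_f16_precision`, and `TILE` are hypothetical names introduced here, and the f16 cast is approximated by rounding the f32 mantissa to f16 precision without modelling the f16 exponent range. The output is walked in small tiles, and the optional low-precision path rounds each input while keeping the accumulator in f32.

```rust
// Hypothetical tile edge; real GPU kernels pick this per hardware.
const TILE: usize = 2;

/// Round an f32 to the 11 significant bits an f16 keeps (round-to-nearest-even).
/// Assumption: the f16 exponent range is NOT modelled (no overflow, no subnormals).
fn round_to_f16_precision(x: f32) -> f32 {
    let bits = x.to_bits();
    // Drop the 13 lowest mantissa bits, rounding ties to even.
    let round = 0x0FFF + ((bits >> 13) & 1);
    f32::from_bits(bits.wrapping_add(round) & !0x1FFF)
}

/// C = A * B for row-major matrices, where A is m x k and B is k x n.
/// When `low_precision_inputs` is true, every input element is rounded to
/// f16 precision before the multiply, while the accumulator stays in f32.
fn matmul_tiled(
    a: &[f32],
    b: &[f32],
    m: usize,
    k: usize,
    n: usize,
    low_precision_inputs: bool,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let load = |x: f32| {
        if low_precision_inputs {
            round_to_f16_precision(x)
        } else {
            x
        }
    };
    let mut c = vec![0.0f32; m * n];
    // Visit the output in TILE x TILE blocks and the shared dimension in
    // TILE-sized chunks, so each block reuses a small window of A and B.
    for i0 in (0..m).step_by(TILE) {
        for j0 in (0..n).step_by(TILE) {
            for p0 in (0..k).step_by(TILE) {
                for i in i0..(i0 + TILE).min(m) {
                    for j in j0..(j0 + TILE).min(n) {
                        let mut acc = c[i * n + j]; // f32 accumulator
                        for p in p0..(p0 + TILE).min(k) {
                            acc += load(a[i * k + p]) * load(b[p * n + j]);
                        }
                        c[i * n + j] = acc;
                    }
                }
            }
        }
    }
    c
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    println!("full precision: {:?}", matmul_tiled(&a, &b, 2, 2, 2, false));
    println!("f16 inputs:     {:?}", matmul_tiled(&a, &b, 2, 2, 2, true));
    println!("0.1 at f16 precision: {}", round_to_f16_precision(0.1));
}
```

Small integers are exactly representable at f16 precision, so both calls give the same matrix here; a value such as 0.1 is not, which is where the small loss of precision comes from.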

Benchmarks

You can run the benchmarks from the workspace with the following:

cargo bench --bench matmul --features wgpu # for wgpu
cargo bench --bench matmul --features cuda # for cuda

On an RTX 3070 we get the following results:

matmul-wgpu-f32-tiling2d

―――――――― Result ―――――――――
  Samples     100
  Mean        13.289ms
  Variance    28.000ns
  Median      13.271ms
  Min         12.582ms
  Max         13.768ms
―――――――――――――――――――――――――
matmul-cuda-f32-tiling2d

―――――――― Result ―――――――――
  Samples     100
  Mean        12.754ms
  Variance    93.000ns
  Median      12.647ms
  Min         12.393ms
  Max         14.501ms
―――――――――――――――――――――――――
matmul-cuda-f32-cmma

―――――――― Result ―――――――――
  Samples     100
  Mean        4.996ms
  Variance    35.000ns
  Median      5.084ms
  Min         4.304ms
  Max         5.155ms
―――――――――――――――――――――――――
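To compare the runs, the mean times above can be turned into relative speedups. The snippet below is a small sketch that only divides the reported means; `speedup` is a helper name introduced here, and the ratios apply to this one GPU and to whatever problem size the benchmark uses, which is not stated above.

```rust
/// Ratio of a baseline mean time to a candidate mean time (same unit).
/// Values above 1.0 mean the candidate is faster.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

fn main() {
    // Mean times in milliseconds, copied from the results above.
    let wgpu_tiling2d = 13.289;
    let cuda_tiling2d = 12.754;
    let cuda_cmma = 4.996;

    println!(
        "cuda tiling2d vs wgpu tiling2d: {:.2}x",
        speedup(wgpu_tiling2d, cuda_tiling2d)
    );
    println!(
        "cuda cmma vs cuda tiling2d:     {:.2}x",
        speedup(cuda_tiling2d, cuda_cmma)
    );
}
```

On these numbers the cooperative kernel is roughly 2.5 times faster than tiling 2D on CUDA, while the wgpu and CUDA tiling 2D runs are within a few percent of each other.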

Dependencies

~6–20MB
~220K SLoC