1 unstable release

new 0.3.0 Oct 28, 2024

#94 in Science

Download history 329/week @ 2024-10-24

329 downloads per month
Used in 12 crates (2 directly)

MIT/Apache

750KB
19K SLoC



Discord Current Crates.io Version Minimum Supported Rust Version Test Status license
NVIDIA AMD WGPU


Multi-platform high-performance compute language extension for Rust.

TL;DR

With CubeCL, you can program your GPU using Rust, taking advantage of zero-cost abstractions to develop maintainable, flexible, and efficient compute kernels. CubeCL currently fully supports functions, generics, and structs, with partial support for traits, methods and type inference. As the project evolves, we anticipate even broader support for Rust language primitives, all while maintaining optimal performance.

Example

Simply annotate functions with the cube attribute to indicate that they should run on the GPU.

use cubecl::prelude::*;

#[cube(launch_unchecked)]
/// A [Line] represents a contiguous series of elements where SIMD operations may be available.
/// The runtime will automatically use SIMD instructions when possible for improved performance.
fn gelu_array<F: Float>(input: &Array<Line<F>>, output: &mut Array<Line<F>>) {
    if ABSOLUTE_POS < input.len() {
        output[ABSOLUTE_POS] = gelu_scalar(input[ABSOLUTE_POS]);
    }
}

#[cube]
fn gelu_scalar<F: Float>(x: Line<F>) -> Line<F> {
    // Execute the sqrt function at comptime.
    let sqrt2 = F::new(comptime!(2.0f32.sqrt()));
    let tmp = x / Line::new(sqrt2);

    x * (Line::erf(tmp) + 1.0) / 2.0
}

You can then launch the kernel using the autogenerated gelu_array::launch_unchecked function.

pub fn launch<R: Runtime>(device: &R::Device) {
    let client = R::client(device);
    let input = &[-1., 0., 1., 5.];
    let vectorization = 4;
    let output_handle = client.empty(input.len() * core::mem::size_of::<f32>());
    let input_handle = client.create(f32::as_bytes(input));

    unsafe {
        gelu_array::launch_unchecked::<f32, R>(
            &client,
            CubeCount::Static(1, 1, 1),
            CubeDim::new(input.len() as u32 / vectorization, 1, 1),
            ArrayArg::from_raw_parts(&input_handle, input.len(), vectorization as u8),
            ArrayArg::from_raw_parts(&output_handle, input.len(), vectorization as u8),
        )
    };

    let bytes = client.read(output_handle.binding());
    let output = f32::from_bytes(&bytes);

    // Should be [-0.1587,  0.0000,  0.8413,  5.0000]
    println!("Executed gelu with runtime {:?} => {output:?}", R::name());
}

To see it in action, run the working GELU example with the following command:

cargo run --example gelu --features cuda # cuda runtime
cargo run --example gelu --features wgpu # wgpu runtime

Runtime

We support the following GPU runtimes:

  • WGPU for cross-platform GPU support (Vulkan, Metal, DirectX, WebGPU)
  • CUDA for NVIDIA GPU support
  • ROCm/HIP for AMD GPU support (WIP)

We also plan to develop an optimized JIT CPU runtime with SIMD instructions, leveraging Cranelift.

Motivation

The goal of CubeCL is to ease the pain of writing highly optimized compute kernels that are portable across hardware. There is currently no adequate solution when you want optimal performance while still being multi-platform. You either have to write custom kernels for different hardware, often with different languages such as CUDA, Metal, or ROCm. To fix this, we created a Just-in-Time compiler with three core features: automatic vectorization, comptime, and autotune!

These features are extremely useful for anyone writing high-performance kernels, even when portability is not a concern. They improve code composability, reusability, testability, and maintainability, all while staying optimal. CubeCL also ships with a memory management strategy optimized for throughput with heavy buffer reuse to avoid allocations.

Our goal extends beyond providing an optimized compute language; we aim to develop an ecosystem of high-performance and scientific computing in Rust. To achieve this, we're developing linear algebra components that you can integrate into your own kernels. We currently have an highly optimized matrix multiplication module, leveraging Tensor Cores on NVIDIA hardware where available, while gracefully falling back to basic instructions on other platforms. While there's room for improvement, particularly in using custom instructions from newer NVIDIA GPUs, our implementation already delivers impressive performance.

This is just the beginning. We plan to include more utilities such as convolutions, random number generation, fast Fourier transforms, and other essential algorithms. We are a small team also building Burn, so don't hesitate to contribute and port algorithms; it can help more than you would imagine!

How it works

CubeCL leverages Rust's proc macro system in a unique two-step process:

  1. Parsing: The proc macro parses the GPU kernel code using the syn crate.
  2. Expansion: Instead of immediately generating an Intermediate Representation (IR), the macro generates a new Rust function.

The generated function, semantically similar to the original, is responsible for creating the IR when called. This approach differs from traditional compilers, which typically generate IR directly after parsing. Our method enables several key features:

  • Comptime: By not transforming the original code, it becomes remarkably easy to integrate compile-time optimizations.
  • Automatic Vectorization: By simply vectorizing the inputs of a CubeCL function, we can determine the vectorization factor of each intermediate variable during the expansion.
  • Rust Integration: The generated code remains valid Rust code, allowing it to be bundled without any dependency on the specific runtime.

Design

CubeCL is designed around - you guessed it - Cubes! More specifically, it's based on cuboids, because not all axes are the same size. Since all compute APIs need to map to the hardware, which are tiles that can be accessed using a 3D representation, our topology can easily be mapped to concepts from other APIs.

CubeCL - Topology



A cube is composed of units, so a 3x3x3 cube has 27 units that can be accessed by their positions along the x, y, and z axes. Similarly, a hyper-cube is composed of cubes, just as a cube is composed of units. Each cube in the hyper-cube can be accessed by its position relative to the hyper-cube along the x, y, and z axes. Hence, a hyper-cube of 3x3x3 will have 27 cubes. In this example, the total number of working units would be 27 x 27 = 729.

Topology Equivalence 👇

Since all topology variables are constant within the kernel entry point, we chose to use the Rust constant syntax with capital letters. Often when creating kernels, we don't always care about the relative position of a unit within a cube along each axis, but often we only care about its position in general. Therefore, each kind of variable also has its own axis-independent variable, which is often not present in other languages, except WebGPU with local_invocation_index.


CubeCL CUDA WebGPU
CUBE_COUNT N/A N/A
CUBE_COUNT_X gridDim.x num_workgroups.x
CUBE_COUNT_Y gridDim.y num_workgroups.y
CUBE_COUNT_Z gridDim.z num_workgroups.z
CUBE_POS N/A N/A
CUBE_POS_X blockIdx.x workgroup.x
CUBE_POS_Y blockIdx.y workgroup.y
CUBE_POS_Z blockIdx.z workgroup.z
CUBE_DIM N/A N/A
CUBE_DIM_X blockDim.x workgroup_size.x
CUBE_DIM_Y blockDim.y workgroup_size.y
CUBE_DIM_Z blockDim.z workgroup_size.z
UNIT_POS N/A local_invocation_index
UNIT_POS_X threadIdx.x local_invocation_id.x
UNIT_POS_Y threadIdx.y local_invocation_id.y
UNIT_POS_Z threadIdx.z local_invocation_id.z
SUBCUBE_DIM warpSize subgroup_size
ABSOLUTE_POS N/A N/A
ABSOLUTE_POS_X N/A global_id.x
ABSOLUTE_POS_Y N/A global_id.y
ABSOLUTE_POS_Z N/A global_id.z

Special Features

Automatic Vectorization

High-performance kernels should rely on SIMD instructions whenever possible, but doing so can quickly get pretty complicated! With CubeCL, you can specify the vectorization factor of each input variable when launching a kernel. Inside the kernel code, you still use only one type, which is dynamically vectorized and supports automatic broadcasting. The runtimes are able to compile kernels and have all the necessary information to use the best instruction! However, since the algorithmic behavior may depend on the vectorization factor, CubeCL allows you to access it directly in the kernel when needed, without any performance loss, using the comptime system!

Comptime

CubeCL isn't just a new compute language: though it feels like you are writing GPU kernels, you are, in fact, writing compiler plugins that you can fully customize! Comptime is a way to modify the compiler IR at runtime when compiling a kernel for the first time.

This enables lots of optimizations and flexibility without having to write many separate variants of the same kernels to ensure maximal performance.

Feature Description
Instruction Specialization Not all instructions are available on all hardware, but when a specialized one exists, it should be enabled with a simple if statement.
Automatic Vectorization When you can use SIMD instructions, you should! But since not all hardware supports the same vectorization factors, it can be injected at runtime!
Loop Unrolling You may want multiple flavors of the same kernel, with loop unrolling for only a certain range of values. This can be configured easily with Comptime.
Shape Specialization For deep learning kernels, it's often crucial to rely on different kernels for different input sizes; you can do it by passing the shape information as Comptime values.
Compile Time Calculation In general, you can calculate a constant using Rust runtime properties and inject it into a kernel during its compilation, to avoid recalculating it during each execution.

Autotuning

Autotuning drastically simplifies kernel selection by running small benchmarks at runtime to figure out the best kernels with the best configurations to run on the current hardware; an essential feature for portability. This feature combines gracefully with comptime to test the effect of different comptime values on performance; sometimes it can be surprising!

Even if the benchmarks may add some overhead when running the application for the first time, the information gets cached on the device and will be reused. It is usually a no-brainer trade-off for throughput-oriented programs such as deep learning models. You can even ship the autotune cache with your program, reducing cold start time when you have more control over the deployment target.

Resource

For now we don't have a lot of resources to learn, but you can look at the linear algebra library to see how CubeCL can be used. If you have any questions or want to contribute, don't hesitate to join the Discord.

Disclaimer & History

CubeCL is currently in alpha.

While CubeCL is used in Burn, there are still a lot of rough edges; it isn't refined yet. The project started as a WebGPU-only backend for Burn. As we optimized it, we realized that we needed an intermediate representation (IR) that could be optimized then compiled to WGSL. Having an IR made it easy to support another compilation target, so we made a CUDA runtime. However, writing kernels directly in that IR wasn't easy, so we created a Rust frontend using the syn crate. Navigating the differences between CUDA and WebGPU, while leveraging both platforms, forced us to come up with general concepts that worked everywhere. Hence, CubeCL was born!

Dependencies

~4–18MB
~184K SLoC