#kernel #compute #vulkan #gpu

krnl

Safe, portable, high performance compute (GPGPU) kernels

3 releases

0.0.3 Oct 17, 2023
0.0.2 Apr 23, 2023
0.0.1 Apr 11, 2023

#166 in Math

Download history 2/week @ 2023-08-12 2/week @ 2023-08-26 6/week @ 2023-09-09 1/week @ 2023-09-16 2/week @ 2023-09-23 2/week @ 2023-09-30 1/week @ 2023-10-07 26/week @ 2023-10-14 19/week @ 2023-10-21 31/week @ 2023-10-28 18/week @ 2023-11-04 25/week @ 2023-11-11 4/week @ 2023-11-18 11/week @ 2023-11-25

58 downloads per month

MIT/Apache

645KB
10K SLoC

DocsBadge Build Status

krnl

Safe, portable, high performance compute (GPGPU) kernels.

Developed for autograph.

  • Similar functionality to CUDA and OpenCL.
  • Supports GPU's and other Vulkan 1.2 capable devices.
  • MacOS / iOS supported via MoltenVK.
  • Kernels are written inline, entirely in Rust.
  • Simple iterator patterns can be implemented without unsafe.
  • Buffers on the host can be accessed natively as Vecs and slices.

krnlc

Kernel compiler for krnl.

  • Built on RustGPU's spirv-builder.
  • Supports dependencies defined in Cargo.toml.
  • Uses spirv-tools to validate and optimize.
  • Compiles to "krnl-cache.rs", so the crate will build on stable Rust.

Example

use krnl::{
    anyhow::Result,
    buffer::{Buffer, Slice, SliceMut},
    device::Device,
    macros::module,
};

#[module]
mod kernels {
    #[cfg(not(target_arch = "spirv"))]
    use krnl::krnl_core;
    use krnl_core::macros::kernel;

    pub fn saxpy_impl(x: f32, alpha: f32, y: &mut f32) {
        *y += alpha * x;
    }

    // Item kernels for iterator patterns.
    #[kernel]
    pub fn saxpy(#[item] x: f32, alpha: f32, #[item] y: &mut f32) {
        saxpy_impl(x, alpha, y);
    }

    // General purpose kernels like CUDA / OpenCL.
    #[kernel]
    pub fn saxpy_global(#[global] x: Slice<f32>, alpha: f32, #[global] y: UnsafeSlice<f32>) {
        use krnl_core::buffer::UnsafeIndex;
        let mut index = kernel.global_id() as usize;
        while index < x.len().min(y.len()) {
            saxpy_impl(x[index], alpha, unsafe { y.unsafe_index_mut(index) });
            index += kernel.global_threads() as usize;
        }
    }
}

fn saxpy(x: Slice<f32>, alpha: f32, mut y: SliceMut<f32>) -> Result<()> {
    if let Some((x, y)) = x.as_host_slice().zip(y.as_host_slice_mut()) {
        for (x, y) in x.iter().copied().zip(y) {
            kernels::saxpy_impl(x, alpha, y);
        }
        return Ok(());
    } 
    kernels::saxpy::builder()?
        .with_threads(256)
        .build(y.device())?
        .dispatch(x, alpha, y) 
}

fn main() -> Result<()> {
    let x = vec![1f32];
    let alpha = 2f32;
    let y = vec![0f32];
    let device = Device::builder().build().ok().unwrap_or(Device::host());
    let x = Buffer::from(x).into_device(device.clone())?;
    let mut y = Buffer::from(y).into_device(device.clone())?;
    saxpy(x.as_slice(), alpha, y.as_slice_mut())?;
    let y = y.into_vec()?;
    println!("{y:?}");
    Ok(())
}

Performance

NVIDIA GeForce GTX 1060 with Max-Q Design

alloc

krnl cuda ocl
1,000,000 319.07 ns (✅ 1.00x) 112.83 us (❌ 353.62x slower) 486.10 ns (❌ 1.52x slower)
10,000,000 318.22 ns (✅ 1.00x) 1.11 ms (❌ 3494.06x slower) 493.02 ns (❌ 1.55x slower)
64,000,000 318.40 ns (✅ 1.00x) 6.31 ms (❌ 19803.98x slower) 493.07 ns (❌ 1.55x slower)

upload

krnl cuda ocl
1,000,000 339.76 us (✅ 1.00x) 363.93 us (✅ 1.07x slower) 789.44 us (❌ 2.32x slower)
10,000,000 4.90 ms (✅ 1.00x) 3.81 ms (✅ 1.29x faster) 8.84 ms (❌ 1.80x slower)
64,000,000 25.92 ms (✅ 1.00x) 24.58 ms (✅ 1.05x faster) 56.74 ms (❌ 2.19x slower)

download

krnl cuda ocl
1,000,000 593.88 us (✅ 1.00x) 461.01 us (✅ 1.29x faster) 20.12 ms (❌ 33.88x slower)
10,000,000 5.66 ms (✅ 1.00x) 4.07 ms (✅ 1.39x faster) 20.13 ms (❌ 3.55x slower)
64,000,000 29.50 ms (✅ 1.00x) 25.71 ms (✅ 1.15x faster) 37.48 ms (❌ 1.27x slower)

zero

krnl cuda ocl
1,000,000 38.49 us (✅ 1.00x) 25.31 us (✅ 1.52x faster) 35.16 us (✅ 1.09x faster)
10,000,000 254.52 us (✅ 1.00x) 243.01 us (✅ 1.05x faster) 252.41 us (✅ 1.01x faster)
64,000,000 1.54 ms (✅ 1.00x) 1.55 ms (✅ 1.01x slower) 1.56 ms (✅ 1.02x slower)

saxpy

krnl cuda ocl
1,000,000 88.59 us (✅ 1.00x) 81.25 us (✅ 1.09x faster) 89.24 us (✅ 1.01x slower)
10,000,000 742.25 us (✅ 1.00x) 770.35 us (✅ 1.04x slower) 780.49 us (✅ 1.05x slower)
64,000,000 4.68 ms (✅ 1.00x) 4.91 ms (✅ 1.05x slower) 4.92 ms (✅ 1.05x slower)

Recent Changes

See Releases.md

License

Dual-licensed to be compatible with the Rust project.

Licensed under the Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0 or the MIT license http://opensource.org/licenses/MIT, at your option. This file may not be copied, modified, or distributed except according to those terms.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions

Dependencies

~5–15MB
~218K SLoC