#automatic-differentiation #cpu-gpu #array #deep-learning #gpu #fixed-size #auto-diff

sys no-std custos

A minimal OpenCL, WGPU, CUDA and host CPU array manipulation engine

15 releases (6 breaking)

0.7.0 Apr 14, 2023
0.6.3 Feb 11, 2023
0.5.0 Sep 10, 2022
0.4.0 Jul 31, 2022

#566 in Machine learning


97 downloads per month
Used in 2 crates

MIT license



A minimal OpenCL, WGPU, CUDA and host CPU array manipulation engine / framework written in Rust. This crate provides the tools for executing custom array and automatic differentiation operations on the CPU, as well as on CUDA, WGPU and OpenCL devices.
The guide implement_operations.md demonstrates how operations can be implemented for the compute devices. To see this at a larger scale, look at custos-math, or at sliced for automatic differentiation examples.


Add "custos" as a dependency in your Cargo.toml:

[dependencies]
custos = "0.7.0"

# To disable the default features (cpu, cuda, opencl, static-api, blas, macro) and use your own set of features:
#custos = {version = "0.7.0", default-features=false, features=["opencl", "blas"]}

Available features:

Feature      Description
cpu          Adds the CPU device.
stack        Adds the Stack device and enables stack-allocated Buffers.
opencl       Adds OpenCL features. (name of the device: OpenCL)
cuda         Adds CUDA features. (name of the device: CUDA)
wgpu         Adds WGPU features. (name of the device: WGPU)
no-std       For no_std environments; activates the stack feature.
static-api   Enables the creation of Buffers without providing a device.
blas         Adds gemm functions from the system's (selected) BLAS library.
opt-cache    Makes the 'cache graph' optimizable, lowering the memory footprint.
macro        Re-exports custos-macro.
realloc      Disables allocation caching for all devices.
autograd     Adds automatic differentiation features.


custos itself implements only four Buffer operations: write, read, copy_slice and clear. In addition, there are unary (device only) operations.
custos-math, on the other hand, implements many more operations, including matrix operations for a custom Matrix struct.
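
To make the four operations concrete, here is a minimal host-only sketch of what write, read, copy_slice and clear could look like on a plain Vec-backed buffer. ToyBuffer and its method signatures are hypothetical and only illustrate the idea; they are not the custos Buffer API, and the exact semantics in custos (for example which value clear resets to) are assumed here.

```rust
use std::ops::Range;

/// Hypothetical host-only stand-in for a device buffer.
/// This is NOT the custos `Buffer` type.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyBuffer<T> {
    data: Vec<T>,
}

impl<T: Copy + Default> ToyBuffer<T> {
    /// Allocates a buffer of `len` default-initialized elements.
    pub fn new(len: usize) -> Self {
        ToyBuffer { data: vec![T::default(); len] }
    }

    /// write: copies `src` into the start of the buffer.
    /// Panics if `src` is longer than the buffer.
    pub fn write(&mut self, src: &[T]) {
        self.data[..src.len()].copy_from_slice(src);
    }

    /// read: returns the contents as an owned `Vec`.
    pub fn read(&self) -> Vec<T> {
        self.data.clone()
    }

    /// copy_slice: copies the elements in `range` into a new buffer.
    pub fn copy_slice(&self, range: Range<usize>) -> ToyBuffer<T> {
        ToyBuffer { data: self.data[range].to_vec() }
    }

    /// clear: resets every element to its default value (assumed semantics);
    /// the length is unchanged.
    pub fn clear(&mut self) {
        self.data.iter_mut().for_each(|x| *x = T::default());
    }
}
```

On a real compute device, read and write would additionally involve a transfer between host and device memory, which this sketch leaves out.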

Implement an operation for CPU:
If you want to implement your own operations for all compute devices, see implement_operations.md.

use std::ops::Mul;
use custos::prelude::*;

// One trait per operation; each device implements it separately.
pub trait MulBuf<T, S: Shape = (), D: Device = Self>: Sized + Device {
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, Self, S>;
}

impl<T, S, D> MulBuf<T, S, D> for CPU
where
    T: Mul<Output = T> + Copy,
    S: Shape,
    D: MainMemory,
{
    fn mul(&self, lhs: &Buffer<T, D, S>, rhs: &Buffer<T, D, S>) -> Buffer<T, CPU, S> {
        // Obtain an output buffer with the same length as the inputs.
        let mut out = self.retrieve(lhs.len(), (lhs, rhs));

        // Element-wise multiplication.
        for ((lhs, rhs), out) in lhs.iter().zip(&*rhs).zip(&mut out) {
            *out = *lhs * *rhs;
        }

        out
    }
}
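
The pattern used here, one trait per operation implemented for each device type, can be tried without custos. The sketch below mirrors it with only the standard library. ToyCpu, MulSlice and square are hypothetical names invented for illustration, and plain slices stand in for Buffers.

```rust
use std::ops::Mul;

/// Hypothetical device marker; stands in for a compute device such as the CPU.
pub struct ToyCpu;

/// One trait per operation, implemented separately for each device type.
pub trait MulSlice<T> {
    fn mul(&self, lhs: &[T], rhs: &[T]) -> Vec<T>;
}

impl<T: Mul<Output = T> + Copy> MulSlice<T> for ToyCpu {
    fn mul(&self, lhs: &[T], rhs: &[T]) -> Vec<T> {
        assert_eq!(lhs.len(), rhs.len(), "operands must have equal length");
        // Element-wise product of the two inputs.
        lhs.iter().zip(rhs).map(|(&l, &r)| l * r).collect()
    }
}

/// Generic caller: works with any device that implements the operation.
pub fn square<T, D: MulSlice<T>>(device: &D, x: &[T]) -> Vec<T> {
    device.mul(x, x)
}
```

Because callers such as square only depend on the trait, supporting another device means adding another impl block, not changing the calling code.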


Many more usage examples can be found in the tests and examples folders, as well as in the unary operation file, custos-math and sliced.


~465K SLoC