#machine-learning #tensor #devices #memory

axonml-core (no-std)

Core abstractions for the Axonml ML framework

20 releases (5 breaking)

Rust 2024 edition

0.6.2 Apr 17, 2026
0.6.1 Apr 10, 2026
0.5.0 Mar 31, 2026
0.4.3 Mar 25, 2026
0.1.0 Jan 19, 2026

#2568 in Machine learning

Download history: roughly 10–60 downloads/week, 2026-01-21 through 2026-04-08

191 downloads per month
Used in 20 crates

MIT/Apache

570KB
14K SLoC

axonml-core


License: Apache-2.0 / MIT · Rust 1.85+ · Version 0.6.1 · Part of AxonML

Overview

axonml-core is the foundational layer of the AxonML machine learning framework. It provides the Device abstraction, the Scalar/Numeric/Float trait hierarchy, reference-counted Storage<T> with pooled GPU allocations, and five compute backends (CPU, CUDA, Vulkan, Metal, WebGPU) that underpin every tensor operation in the framework.

Features

  • Device Abstraction - Device enum (Cpu, Cuda, Vulkan, Metal, Wgpu) with per-variant device index, runtime availability checks, and best_available_backend() selector (CUDA > Metal > Vulkan > WebGPU > CPU).

  • Type-Safe Data Types - DType runtime enum covering F16, F32, F64, I8, I16, I32, I64, U8, U32, U64, Bool with size_of / is_float / is_signed / is_integer queries. Compile-time Scalar / Numeric / Float trait hierarchy for zero-cost generic dispatch.

  • Reference-Counted Storage - Storage<T> wraps either a host Vec<T> or a PooledCudaSlice behind Arc<RwLock<...>>. Supports zero-copy views via offset+len slicing, to_device() for CPU<->GPU transfer, deep copy, and RAII as_slice() / as_slice_mut() guards.

  • Five Compute Backends - CPU (rayon-parallel, matrixmultiply GEMM/GEMV, always available), CUDA (cuBLAS + 15+ custom PTX kernel modules), Vulkan (ash + gpu-allocator, SPIR-V compute), Metal (Apple Silicon, compute pipelines), WebGPU (wgpu for browser/cross-platform).

  • GPU Memory Pool - cuda_pool returns freed CUDA allocations to a size-bucketed free list instead of calling cudaFree, amortising allocator cost across training steps; a simplified sketch of the idea follows this list.

  • Device Capabilities - DeviceCapabilities exposes name, total/available memory, f16/f64 support, max threads per block, and CUDA compute capability.

  • Allocator Trait - Allocator extension point with a DefaultAllocator that performs 64-byte-aligned host allocations and reports system memory via sysinfo.
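
The pooling idea is easiest to see in a host-side sketch: plain Vec<u8> buffers stand in for device memory, and a HashMap keyed by size class plays the free list. This illustrates the technique only; it is not the actual cuda_pool code.

use std::collections::HashMap;

// Size-bucketed free list: freed buffers are parked per size class and
// handed back out on the next request of the same class, so the underlying
// allocator is only hit on cold allocations.
struct BucketPool {
    free: HashMap<usize, Vec<Vec<u8>>>,
}

impl BucketPool {
    fn new() -> Self {
        Self { free: HashMap::new() }
    }

    fn alloc(&mut self, len: usize) -> Vec<u8> {
        let class = len.next_power_of_two();
        self.free
            .get_mut(&class)
            .and_then(|bucket| bucket.pop()) // reuse a parked buffer
            .unwrap_or_else(|| vec![0u8; class]) // cold allocation
    }

    fn dealloc(&mut self, buf: Vec<u8>) {
        let class = buf.len().next_power_of_two();
        self.free.entry(class).or_default().push(buf); // park, don't free
    }
}

Rounding requests up to power-of-two classes keeps the bucket count small at the cost of some internal fragmentation, the usual trade-off for this kind of pool.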

Modules

Module Description
device Device enum (Cpu, Cuda, Vulkan, Metal, Wgpu) + DeviceCapabilities with availability and capability queries
dtype DType runtime enum and Scalar / Numeric / Float trait hierarchy; F16Wrapper and BoolWrapper adapters
storage Reference-counted Storage<T> with zero-copy views, device transfer, and pooled GPU slices
allocator Allocator trait and DefaultAllocator (64-byte-aligned CPU allocator)
backends Backend trait, BackendType, GpuMemory, GpuStream, plus CPU/CUDA/Vulkan/Metal/WGPU implementations
error Error / Result types for shape mismatches, device errors, and allocation failures
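
The error module's Result surfaces in fallible APIs such as Storage::slice. A quick sketch of what that looks like, assuming slice(offset, len) bounds-checks the view, consistent with the Result-returning slice shown in the Basic Example below:

use axonml_core::{Device, Storage};

let storage = Storage::from_vec(vec![1.0f32, 2.0, 3.0, 4.0], Device::Cpu);
// offset 3 + len 2 runs past the 4-element buffer, so the view is
// rejected with an error instead of panicking (sketch).
assert!(storage.slice(3, 2).is_err());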

Backends (under backends/)

Backend File Status
CPU cpu.rs Always compiled; rayon-parallel ops, matrixmultiply GEMM
CUDA cuda.rs + cuda_kernels/ + cuda_pool.rs Feature cuda; cuBLAS + PTX kernels for elementwise, activations, attention, Q4_K/Q6_K dequant-in-shader matmul, softmax, layernorm, RMSNorm, transpose, embedding gather
cuDNN cudnn_ops.rs Feature cudnn; conv2d forward/backward via cuDNN
Vulkan vulkan.rs Feature vulkan; ash + gpu-allocator, full buffer/pipeline/dispatch (~982 lines)
Metal metal.rs Feature metal; full buffer/pipeline/dispatch on Apple Silicon (~769 lines)
WebGPU wgpu_backend.rs Feature wgpu; full buffer/pipeline/dispatch via wgpu (~1710 lines)

Cargo Features

Feature Pulls In Purpose
std (default) — Standard library support
cuda cudarc NVIDIA CUDA backend
cudnn cuda + cudarc's cudnn feature cuDNN conv ops
vulkan ash, gpu-allocator Vulkan compute backend
metal metal, objc (macOS only) Apple Metal backend
wgpu wgpu, pollster WebGPU / cross-platform backend
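
For example, enabling the CUDA backend from Cargo.toml:

[dependencies]
axonml-core = { version = "0.6.1", features = ["cuda"] }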

Usage

Add this to your Cargo.toml:

[dependencies]
axonml-core = "0.6.1"

Basic Example

use axonml_core::{Device, DType, Storage};

// Check device availability
let device = Device::Cpu;
assert!(device.is_available());

// Create storage on CPU
let storage = Storage::<f32>::zeros(1024, device);
assert_eq!(storage.len(), 1024);

// Create storage from data
let data = vec![1.0f32, 2.0, 3.0, 4.0];
let storage = Storage::from_vec(data, Device::Cpu);

// Create a view (zero-copy slice)
let view = storage.slice(1, 2).unwrap();
assert_eq!(view.len(), 2);
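
Storage can also be moved across devices with to_device(). A sketch, assuming the cuda feature is enabled, a CUDA device is present, and that the GPU variants carry a device index as noted under Features (the Result-style return is likewise an assumption):

// Move to the first CUDA device and back (sketch; error handling elided).
let on_gpu = storage.to_device(Device::Cuda(0)).unwrap();
let back_on_cpu = on_gpu.to_device(Device::Cpu).unwrap();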

Device Capabilities

use axonml_core::Device;

let device = Device::Cpu;
let caps = device.capabilities();

println!("Device: {}", caps.name);
println!("Total Memory: {} bytes", caps.total_memory);
println!("Supports f16: {}", caps.supports_f16);
println!("Supports f64: {}", caps.supports_f64);

Data Types

use axonml_core::{DType, Scalar, Numeric, Float};

// Query dtype properties
assert!(DType::F32.is_float());
assert_eq!(DType::F32.size_of(), 4);

// Use type traits
fn process<T: Float>(data: &[T]) -> T {
    data.iter().fold(T::ZERO, |acc, &x| acc + x)
}
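
Calling the generic function with a concrete float type:

let total = process(&[1.0f32, 2.0, 3.0]);
assert_eq!(total, 6.0);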

Picking a Backend

use axonml_core::backends::{best_available_backend, gpu_count, BackendType};

let backend = best_available_backend();
match backend {
    BackendType::Cpu => println!("Falling back to CPU"),
    _ => println!("Using {} GPU(s)", gpu_count()),
}
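
Turning that choice into a concrete Device could look like the following sketch; the non-CPU BackendType variant names and the indexed Device constructors are assumptions inferred from the variant lists above:

use axonml_core::Device;

// Map the selected backend to device 0 of that kind (sketch).
let device = match best_available_backend() {
    BackendType::Cuda => Device::Cuda(0),
    BackendType::Metal => Device::Metal(0),
    BackendType::Vulkan => Device::Vulkan(0),
    BackendType::Wgpu => Device::Wgpu(0),
    BackendType::Cpu => Device::Cpu,
};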

Tests

Run the test suite:

cargo test -p axonml-core

License

Licensed under either of:

  • Apache License, Version 2.0
  • MIT License

at your option.


Last updated: 2026-04-16 (v0.6.1)

Dependencies

~4–12MB
~245K SLoC