#gpu-compute #gpu #vectorization #performance

bin+lib trueno

High-performance SIMD compute library with GPU support for matrix operations

34 releases (11 breaking)

new 0.14.6 Feb 15, 2026
0.14.4 Jan 30, 2026
0.9.0 Dec 30, 2025
0.7.3 Nov 25, 2025

#49 in Math

Download history 38/week @ 2025-11-12 3296/week @ 2025-11-19 1215/week @ 2025-11-26 1758/week @ 2025-12-03 2572/week @ 2025-12-10 1438/week @ 2025-12-17 949/week @ 2025-12-24 4509/week @ 2025-12-31 7394/week @ 2026-01-07 7794/week @ 2026-01-14 8048/week @ 2026-01-21 3967/week @ 2026-01-28 3844/week @ 2026-02-04 3035/week @ 2026-02-11

20,650 downloads per month
Used in 51 crates (21 directly)

MIT license

3.5MB
80K SLoC

trueno

Multi-Target High-Performance Compute Library



trueno (Spanish: "thunder") provides unified compute primitives across CPU SIMD, GPU, and WebAssembly.

Features

  • CPU SIMD: x86 (SSE2/AVX/AVX2/AVX-512), ARM (NEON), WASM (SIMD128)
  • GPU: Pure Rust PTX generation via trueno-gpu (no nvcc required)
  • Cross-platform GPU: Vulkan/Metal/DX12/WebGPU via wgpu
  • Auto-dispatch: Runtime selection of optimal backend
  • Zero unsafe in public API: Safety via type system
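
As an illustration of the auto-dispatch idea, here is a minimal sketch of runtime backend selection using only the standard library's CPU feature detection. The `Backend` enum and `detect_backend` function are hypothetical names chosen for this example and are not taken from trueno's API; the crate's actual selection logic may differ.

```rust
/// Hypothetical backend labels for this sketch (not trueno's own types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Avx512,
    Avx2,
    Sse2,
    Neon,
    Scalar,
}

/// Pick the widest SIMD level the running CPU reports, falling back to scalar.
pub fn detect_backend() -> Backend {
    // On x86 targets, probe from widest to narrowest instruction set.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx512f") {
            return Backend::Avx512;
        }
        if std::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
        if std::is_x86_feature_detected!("sse2") {
            return Backend::Sse2;
        }
    }
    // On 64-bit ARM, NEON is the only level this sketch distinguishes.
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Backend::Neon;
        }
    }
    // Any other target (including WASM in this sketch) uses plain scalar code.
    Backend::Scalar
}
```

A caller would typically run the detection once at startup, cache the result, and branch to the matching kernel on each call.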

Installation

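[dependencies]
trueno = "0.11"

# Optional: GPU support for large matrices (use this instead of the plain entry above;
# a manifest cannot list the same dependency key twice)
# trueno = { version = "0.11", features = ["gpu"] }

# Optional: Pure Rust CUDA PTX generation
trueno-gpu = "0.4"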

Quick Start

use trueno::{Vector, Matrix, SymmetricEigen};

// Vector operations - auto-selects best SIMD backend
let a = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0]);
let b = Vector::from_slice(&[5.0, 6.0, 7.0, 8.0]);

let sum = a.add(&b).unwrap();           // [6.0, 8.0, 10.0, 12.0]
let dot = a.dot(&b).unwrap();           // 70.0
let activated = a.relu().unwrap();      // ReLU activation

// Matrix operations
let m = Matrix::from_vec(2, 2, vec![1.0, 2.0, 3.0, 4.0]).unwrap();
let product = m.matmul(&m).unwrap();    // Matrix multiplication
let transposed = m.transpose();          // Transpose

// Batched matmul for transformers (Q @ K^T pattern)
let batch = 2; let heads = 4; let seq = 8; let dim = 64;
let q: Vec<f32> = vec![0.1; batch * heads * seq * dim];
let kt: Vec<f32> = vec![0.1; batch * heads * dim * seq];
let attn = Matrix::batched_matmul_4d(&q, &kt, batch, heads, seq, dim, seq).unwrap();

// Eigendecomposition (PCA, spectral analysis)
let cov = Matrix::from_vec(2, 2, vec![3.0, 1.0, 1.0, 3.0]).unwrap();
let eigen = SymmetricEigen::new(&cov).unwrap();
let eigenvalues = eigen.eigenvalues();  // [4.0, 2.0]
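
To make the shape convention of the batched Q @ K^T call easier to follow, here is a naive reference sketch in plain Rust. It assumes the five size arguments mean (batch, heads, m, k, n) and that both inputs are contiguous and row-major; the source does not spell this out, so treat the layout as an assumption. The function name `batched_matmul_4d_ref` is hypothetical and this is not the library's implementation.

```rust
/// Naive reference: for each (batch, head) pair, multiply an m x k block of `a`
/// by a k x n block of `b`. All blocks are assumed row-major and contiguous.
/// Returns None when the slice lengths do not match the stated shapes.
pub fn batched_matmul_4d_ref(
    a: &[f32],
    b: &[f32],
    batch: usize,
    heads: usize,
    m: usize,
    k: usize,
    n: usize,
) -> Option<Vec<f32>> {
    let groups = batch * heads;
    if a.len() != groups * m * k || b.len() != groups * k * n {
        return None;
    }
    let mut out = vec![0.0f32; groups * m * n];
    for g in 0..groups {
        // Slice out the g-th independent matrix from each tensor.
        let a_blk = &a[g * m * k..(g + 1) * m * k];
        let b_blk = &b[g * k * n..(g + 1) * k * n];
        let o_blk = &mut out[g * m * n..(g + 1) * m * n];
        for i in 0..m {
            for p in 0..k {
                let a_ip = a_blk[i * k + p];
                // Inner loop walks a row of `b`, which is contiguous in memory.
                for j in 0..n {
                    o_blk[i * n + j] += a_ip * b_blk[p * n + j];
                }
            }
        }
    }
    Some(out)
}
```

With the Quick Start sizes (batch 2, heads 4, seq 8, dim 64), the output holds 2 * 4 * 8 * 8 = 512 attention scores, one seq x seq matrix per head.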

Performance

| Operation | SIMD Speedup | Notes |
|---|---|---|
| Dot product | 6-17x | AVX-512 for compute-bound |
| Matrix multiply | 2-10x | GPU for 500x500+ |
| Reductions (sum, max, min) | 3-12x | AVX-512 optimal |
| Element-wise (add, mul) | 1-2x | Memory-bound |
| Convolution 2D | 5-8x | AVX2/AVX-512 optimized |

Benchmark Results (AMD Ryzen 9 7950X)

| Benchmark | Result (throughput or latency) |
|---|---|
| Vector recip (AVX-512, 10K) | 10.0 Gelem/s |
| Vector recip (AVX2, 10K) | 9.7 Gelem/s |
| PTX module emit | 3.1 µs |
| PTX kernel build | 81 ns |
| Launch config | 1.7 ns |

GPU Note: Only matrix multiply benefits from GPU acceleration. Element-wise operations stay on CPU SIMD because the GPU transfer overhead exceeds the compute time.
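
One way to see why transfer cost dominates element-wise work but not matrix multiply is to compare arithmetic per byte moved. The sketch below uses an idealized counting model that is an assumption of this example, not a measurement from the benchmarks above: each f32 operand is read once, the result is written once, and caching is ignored. The function names are hypothetical.

```rust
/// FLOPs per byte for adding two length-n f32 vectors:
/// n additions, and 3 * n values of 4 bytes each (two inputs, one output).
pub fn elementwise_intensity(n: usize) -> f64 {
    let flops = n as f64;
    let bytes = 3.0 * n as f64 * 4.0;
    flops / bytes
}

/// FLOPs per byte for multiplying two n x n f32 matrices:
/// 2 * n^3 multiply-adds, and 3 * n^2 values of 4 bytes each.
pub fn matmul_intensity(n: usize) -> f64 {
    let n = n as f64;
    let flops = 2.0 * n.powi(3);
    let bytes = 3.0 * n.powi(2) * 4.0;
    flops / bytes
}
```

Under this model, element-wise add stays at 1/12 FLOP per byte for any size, while matrix multiply grows as n/6, which is about 83 FLOPs per byte at n = 500. That gap is consistent with the table's "GPU for 500x500+" note.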

trueno-gpu: Pure Rust CUDA

Generate CUDA PTX kernels without nvcc, LLVM, or external toolchains:

use trueno_gpu::kernels::{GemmKernel, Kernel, SoftmaxKernel};

// Generate optimized GEMM kernel
let gemm = GemmKernel::tensor_core(1024, 1024, 1024);
let ptx = gemm.emit_ptx();  // Pure Rust PTX generation

// Generate softmax with warp shuffle reduction
let softmax = SoftmaxKernel::new(4096);
let ptx = softmax.emit_ptx();

// Available kernels: GEMM, Softmax, LayerNorm, Attention, Quantize (Q4K/Q5K/Q6K)

Operations

Vector: add, sub, mul, div, dot, sum, min, max, argmin, argmax, norm_l1, norm_l2, normalize, recip, sqrt, abs, clamp

Activations: relu, leaky_relu, elu, sigmoid, tanh, gelu, swish, softmax, log_softmax, silu
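
For reference, a few of these activations can be written as plain scalar functions. This is a sketch of the standard textbook definitions, not the crate's SIMD code, and the free-function names here are chosen for illustration only.

```rust
/// ReLU: clamp negative inputs to zero.
pub fn relu(x: f32) -> f32 {
    x.max(0.0)
}

/// Logistic sigmoid: 1 / (1 + e^-x).
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// SiLU: x * sigmoid(x).
pub fn silu(x: f32) -> f32 {
    x * sigmoid(x)
}

/// Softmax with the maximum subtracted first, so large inputs do not overflow `exp`.
pub fn softmax(xs: &[f32]) -> Vec<f32> {
    if xs.is_empty() {
        return Vec::new();
    }
    let max = xs.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = xs.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}
```

Subtracting the maximum does not change the result mathematically, but it keeps every exponent at or below zero, which matters for inputs such as raw logits.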

Matrix: matmul, batched_matmul, batched_matmul_4d, transpose, matvec, convolve2d, pooling (max/avg), topk, gather, pad

Statistics: mean, variance, stddev, covariance, correlation, zscore

Eigen: symmetric eigendecomposition (Jacobi algorithm)
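
The Jacobi method itself can be sketched in a few dozen lines. The version below is a textbook cyclic Jacobi sweep that returns eigenvalues only, written in plain Rust with `f64`; it is not trueno's implementation. The function name, the `max_sweeps` parameter, and the convergence threshold are assumptions of this example.

```rust
/// Eigenvalues of a symmetric n x n matrix (row-major), sorted in descending order.
/// Each rotation zeroes one off-diagonal pair; sweeps repeat until the
/// off-diagonal mass is negligible or `max_sweeps` is reached.
pub fn jacobi_eigenvalues(mut a: Vec<f64>, n: usize, max_sweeps: usize) -> Option<Vec<f64>> {
    if a.len() != n * n {
        return None;
    }
    for _ in 0..max_sweeps {
        // Sum of squared off-diagonal entries, used as the convergence measure.
        let mut off = 0.0;
        for i in 0..n {
            for j in 0..n {
                if i != j {
                    off += a[i * n + j] * a[i * n + j];
                }
            }
        }
        if off < 1e-24 {
            break;
        }
        for p in 0..n {
            for q in (p + 1)..n {
                let apq = a[p * n + q];
                if apq == 0.0 {
                    continue; // already zero, no rotation needed
                }
                // Rotation angle chosen so the (p, q) entry becomes zero.
                let theta = (a[q * n + q] - a[p * n + p]) / (2.0 * apq);
                let t = theta.signum() / (theta.abs() + (theta * theta + 1.0).sqrt());
                let c = 1.0 / (t * t + 1.0).sqrt();
                let s = t * c;
                // Apply the rotation to columns p and q (A <- A * J).
                for k in 0..n {
                    let (akp, akq) = (a[k * n + p], a[k * n + q]);
                    a[k * n + p] = c * akp - s * akq;
                    a[k * n + q] = s * akp + c * akq;
                }
                // Apply the transpose to rows p and q (A <- J^T * A).
                for k in 0..n {
                    let (apk, aqk) = (a[p * n + k], a[q * n + k]);
                    a[p * n + k] = c * apk - s * aqk;
                    a[q * n + k] = s * apk + c * aqk;
                }
            }
        }
    }
    // After convergence the diagonal holds the eigenvalues.
    let mut ev: Vec<f64> = (0..n).map(|i| a[i * n + i]).collect();
    ev.sort_by(|x, y| y.partial_cmp(x).unwrap());
    Some(ev)
}
```

For the 2x2 covariance matrix from the Quick Start, `jacobi_eigenvalues(vec![3.0, 1.0, 1.0, 3.0], 2, 50)` converges after a single rotation to the same pair of eigenvalues, 4 and 2.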

GPU Kernels: GEMM (naive/tiled/tensor core), Softmax, LayerNorm, RMSNorm, Attention, GEMV, Quantization

Development

cargo test                  # Run tests
cargo bench                 # Run benchmarks
make coverage              # Coverage report (requires cargo-llvm-cov)
cargo run --example backend_detection  # Check available backends

Ecosystem

Part of the Pragmatic AI Labs stack.

License

MIT - see LICENSE

Dependencies

~2–34MB
~536K SLoC