
hodu_cuda_kernels

High-performance CUDA kernels for tensor operations on NVIDIA GPUs.

cuBLAS Integration

Supported Operations

  • matmul: Batched matrix multiplication with GEMM
  • dot: 2D matrix multiplication with GEMM
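
As a reference for what these operations compute, here is a naive CPU sketch of the batched `matmul` semantics (illustrative only; `matmul_ref` is not the crate's API, and the real path dispatches to cuBLAS GEMM on the GPU):

```rust
/// Naive CPU reference for batched matmul:
/// for each batch b, C[b] (m×n) = A[b] (m×k) · B[b] (k×n), all row-major.
fn matmul_ref(a: &[f32], b: &[f32], batch: usize, m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; batch * m * n];
    for bi in 0..batch {
        for i in 0..m {
            for j in 0..n {
                let mut acc = 0.0f32;
                for p in 0..k {
                    acc += a[bi * m * k + i * k + p] * b[bi * k * n + p * n + j];
                }
                c[bi * m * n + i * n + j] = acc;
            }
        }
    }
    c
}

fn main() {
    // One batch: 2×2 identity times a 2×2 matrix returns that matrix.
    let a = vec![1.0, 0.0, 0.0, 1.0];
    let b = vec![1.0, 2.0, 3.0, 4.0];
    let c = matmul_ref(&a, &b, 1, 2, 2, 2);
    assert_eq!(c, vec![1.0, 2.0, 3.0, 4.0]);
    println!("{:?}", c);
}
```

`dot` is the same computation with `batch = 1`.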

Supported Data Types

  • bf16: BFloat16 (compute in FP32, I/O in BF16)
  • f16: Float16/Half (compute in FP32, I/O in FP16)
  • f32: Float32 (native precision)
  • f64: Float64 (native precision)
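
The bf16 path ("compute in FP32, I/O in BF16") can be sketched on the CPU, since bfloat16 is simply the upper 16 bits of an IEEE-754 float32. This is a simplification of real conversion (which rounds to nearest-even rather than truncating), and the function names are illustrative, not the crate's API:

```rust
/// Truncate an f32 to its bf16 bit pattern (upper 16 bits).
/// Note: real bf16 conversion rounds to nearest-even; truncation
/// is used here for brevity.
fn f32_to_bf16_bits(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// Widen a bf16 bit pattern back to f32 (always exact, no rounding).
fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

fn main() {
    // BF16 I/O, FP32 compute: round-trip inputs through bf16,
    // accumulate in f32, then store the result back as bf16.
    let inputs = [1.5f32, 2.25, -0.75]; // exactly representable in bf16
    let sum_f32: f32 = inputs
        .iter()
        .map(|&x| bf16_bits_to_f32(f32_to_bf16_bits(x)))
        .sum();
    let out = bf16_bits_to_f32(f32_to_bf16_bits(sum_f32));
    assert_eq!(out, 3.0);
    println!("{out}");
}
```

The f16 path is analogous, except that Float16 uses a 5-bit exponent and 10-bit mantissa, so the conversion is not a simple bit truncation.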

Features

  • Automatic fallback to custom CUDA kernels for data types or layouts that cuBLAS cannot handle
  • Handles strided (non-contiguous) matrices via cuBLAS leading-dimension parameters where possible
  • Transparent row-major to column-major layout conversion
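
The layout conversion relies on the identity (A·B)ᵀ = Bᵀ·Aᵀ: a row-major buffer reinterpreted as column-major is the transpose of the original matrix, so calling a column-major GEMM with the operands swapped yields the row-major product with no data movement. A CPU sketch of the trick, where the naive `gemm_col_major` stands in for cuBLAS:

```rust
/// Naive column-major GEMM: C (m×n) = A (m×k) · B (k×n), all column-major.
/// Stands in here for a cuBLAS SGEMM call.
fn gemm_col_major(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[p * m + i] * b[j * k + p]; // column-major indexing
            }
            c[j * m + i] = acc;
        }
    }
    c
}

/// Row-major C = A·B without transposing any data: a row-major m×k buffer
/// is a column-major k×m matrix (Aᵀ), so compute Cᵀ = Bᵀ·Aᵀ in column-major;
/// the output buffer is then C in row-major order.
fn matmul_row_major(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    gemm_col_major(b, a, n, k, m)
}

fn main() {
    // A = [[1,2],[3,4]], B = [[5,6],[7,8]], both row-major.
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let c = matmul_row_major(&a, &b, 2, 2, 2);
    // A·B = [[19,22],[43,50]]
    assert_eq!(c, vec![19.0, 22.0, 43.0, 50.0]);
    println!("{:?}", c);
}
```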
