3 releases
| Version | Released |
|---|---|
| 0.2.4 | Nov 13, 2025 |
| 0.2.3 | Nov 12, 2025 |
| 0.2.2 | Nov 10, 2025 |
#2502 in Machine learning
Used in 5 crates (via hodu_core)
365KB, 5K SLoC
hodu_cuda_kernels
High-performance CUDA kernels for tensor operations on NVIDIA GPUs.
cuBLAS Integration
Supported Operations
- matmul: Batched matrix multiplication with GEMM
- dot: 2D matrix multiplication with GEMM
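To make the two operations concrete, here is a minimal CPU reference sketch in plain Rust of what a batched row-major matrix product computes. The function names `batched_matmul_ref` and `dot_ref` are hypothetical and are not part of this crate's API. Treating `dot` as a batch of one is an assumption made for illustration only.

```rust
/// CPU reference (hypothetical helper, not the crate's API):
/// for each batch index, C (m x n) = A (m x k) * B (k x n), all row-major
/// and densely packed one batch after another.
pub fn batched_matmul_ref(
    batch: usize,
    m: usize,
    k: usize,
    n: usize,
    a: &[f32],
    b: &[f32],
) -> Vec<f32> {
    assert_eq!(a.len(), batch * m * k);
    assert_eq!(b.len(), batch * k * n);
    let mut c = vec![0.0f32; batch * m * n];
    for bi in 0..batch {
        // Offsets of this batch's matrices inside the flat buffers.
        let (a_off, b_off, c_off) = (bi * m * k, bi * k * n, bi * m * n);
        for i in 0..m {
            for j in 0..n {
                let mut acc = 0.0f32;
                for p in 0..k {
                    acc += a[a_off + i * k + p] * b[b_off + p * n + j];
                }
                c[c_off + i * n + j] = acc;
            }
        }
    }
    c
}

/// 2D product expressed as a batch of one (an assumption for illustration).
pub fn dot_ref(m: usize, k: usize, n: usize, a: &[f32], b: &[f32]) -> Vec<f32> {
    batched_matmul_ref(1, m, k, n, a, b)
}
```

A reference like this is mainly useful for checking GPU results on small inputs: for example, multiplying the row-major 2x2 matrices `[1, 2, 3, 4]` and `[5, 6, 7, 8]` with `dot_ref` gives `[19, 22, 43, 50]`.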
Supported Data Types
- bf16: BFloat16 (compute in FP32, I/O in BF16)
- f16: Float16/Half (compute in FP32, I/O in FP16)
- f32: Float32 (native precision)
- f64: Float64 (native precision)
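The "compute in FP32, I/O in BF16" pattern can be sketched on the CPU as follows. This is an illustration only, not the crate's kernel code: the helper names are hypothetical, BF16 values are represented as raw `u16` bit patterns, and round-to-nearest-even on the final store is an assumption about rounding behavior.

```rust
/// Widen a BF16 bit pattern to f32. BF16 is the upper 16 bits of an f32,
/// so this conversion is exact.
pub fn bf16_bits_to_f32(h: u16) -> f32 {
    f32::from_bits((h as u32) << 16)
}

/// Narrow an f32 to BF16 bits using round-to-nearest, ties-to-even
/// (assumed rounding mode for this sketch).
pub fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep NaN a NaN by forcing a mantissa bit that survives truncation.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add 0x7FFF, plus 1 if the kept low bit is odd, then drop 16 bits.
    let rounding_bias = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding_bias) >> 16) as u16
}

/// Dot product with BF16 inputs and output but an f32 accumulator,
/// so rounding to BF16 happens once at the end instead of per step.
pub fn dot_bf16(a: &[u16], b: &[u16]) -> u16 {
    assert_eq!(a.len(), b.len());
    let mut acc = 0.0f32;
    for (&x, &y) in a.iter().zip(b) {
        acc += bf16_bits_to_f32(x) * bf16_bits_to_f32(y);
    }
    f32_to_bf16_bits(acc)
}
```

Accumulating in f32 matters because BF16 keeps only 8 bits of significand precision; summing many products directly in BF16 would lose small contributions long before the final result is stored.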
Features
- Automatic fallback to custom CUDA kernels for unsupported types or non-contiguous matrices
- Handles non-contiguous matrices via leading dimension parameters
- Transparent row-major to column-major layout conversion
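The layout conversion and leading-dimension handling listed above can be illustrated with a small CPU sketch. cuBLAS GEMM assumes column-major storage, and a common way to serve row-major callers without copying is the identity C^T = B^T * A^T: a row-major m x n buffer is bit-for-bit a column-major n x m buffer, so swapping the operands and dimensions produces the row-major result directly. The code below uses a naive column-major GEMM as a stand-in for the library call. All function names are hypothetical, and whether this crate uses exactly this approach is an assumption.

```rust
/// Naive column-major GEMM with BLAS-style leading dimensions:
/// C (m x n) = A (m x k) * B (k x n), element (i, j) of X stored at x[i + j * ldx].
pub fn gemm_col_major(
    m: usize, n: usize, k: usize,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = acc;
        }
    }
}

/// Row-major C (m x n) = A (m x k) * B (k x n), where lda, ldb, ldc are row strides.
/// Implemented by computing C^T = B^T * A^T in column-major: swap operands and m/n.
pub fn gemm_row_major(
    m: usize, n: usize, k: usize,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    c: &mut [f32], ldc: usize,
) {
    gemm_col_major(n, m, k, b, ldb, a, lda, c, ldc);
}

/// Hypothetical check: a 2D row-major view can be described by a leading
/// dimension only if its inner stride is 1 and its row stride covers a full row.
/// Returning None would be the signal to use a fallback kernel instead.
pub fn leading_dim(row_stride: usize, col_stride: usize, cols: usize) -> Option<usize> {
    if col_stride == 1 && row_stride >= cols.max(1) {
        Some(row_stride)
    } else {
        None
    }
}
```

With A stored as a 2x3 view inside a buffer of row stride 4 (`[1, 2, 3, 99, 4, 5, 6, 99]`, `lda = 4`) and B as the dense 3x2 matrix `[7, 8, 9, 10, 11, 12]`, `gemm_row_major(2, 2, 3, ...)` writes `[58, 64, 139, 154]` and never reads the padding values. One plausible reading of the two non-contiguity bullets is that views like this go through GEMM, while views with a non-unit inner stride take the fallback path.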
Dependencies: ~12MB, ~294K SLoC