#simd #performance #numerical #function

bin+lib simdly

🚀 High-performance Rust library leveraging SIMD and Rayon for fast computations

10 releases

0.1.10 Aug 18, 2025
0.1.9 Aug 16, 2025
0.1.6 Jul 29, 2025
0.1.1 Jun 13, 2025

#507 in Hardware support

MIT license

685KB
11K SLoC

Simdly

⚠️ Development Status: This project is currently under active development. APIs may change and features are still being implemented.

🚀 A high-performance Rust library that leverages SIMD (Single Instruction, Multiple Data) instructions for fast vectorized computations. This library provides efficient implementations of mathematical operations using modern CPU features.


✨ Features

  • 🚀 SIMD Optimized: Leverages AVX2 (256-bit) and NEON (128-bit) instructions for vector operations
  • 🧠 Intelligent Algorithm Selection: Automatic choice between scalar, SIMD, and parallel algorithms based on data size
  • 💾 Memory Efficient: Supports both aligned and unaligned memory access patterns with cache-aware chunking
  • 🔧 Generic Traits: Provides consistent interfaces across different SIMD implementations
  • 🛡️ Safe Abstractions: Wraps unsafe SIMD operations in safe, ergonomic APIs with robust error handling
  • 🧮 Rich Math Library: Extensive mathematical functions (trig, exp, log, sqrt, etc.) with SIMD acceleration
  • ⚡ Performance: Optimized thresholds prevent overhead while maximizing throughput gains

🏗️ Architecture Support

Currently Supported

  • x86/x86_64 with AVX2 (256-bit vectors)
  • ARM/AArch64 with NEON (128-bit vectors)
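At compile time, targets like these are typically gated with `cfg` attributes on target architecture and features. A minimal illustrative sketch of that pattern (this mirrors the general Rust mechanism, not simdly's actual internals; the function name is hypothetical):

```rust
// Illustrative compile-time backend selection via cfg.
// Which branch is compiled depends on the target and enabled target features.
#[allow(unreachable_code)]
fn simd_backend() -> &'static str {
    #[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "avx2"))]
    {
        return "AVX2 (256-bit)";
    }
    #[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
    {
        return "NEON (128-bit)";
    }
    // Compiled only when no SIMD feature above is enabled for this target.
    "scalar fallback"
}

fn main() {
    println!("active backend: {}", simd_backend());
}
```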

Planned Support

  • SSE (128-bit vectors for older x86 processors)

📦 Installation

Add simdly to your Cargo.toml:

```toml
[dependencies]
simdly = "0.1.10"
```

For optimal performance, enable AVX2 support in your build configuration.
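For example, a project-local `.cargo/config.toml` can enable AVX2 for all builds. This uses the standard Cargo `rustflags` mechanism and is not specific to simdly:

```toml
# .cargo/config.toml — enable AVX2 explicitly,
# or use "-C", "target-cpu=native" to match the build machine's full feature set
[build]
rustflags = ["-C", "target-feature=+avx2"]
```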

🚀 Quick Start

The library provides multiple algorithms for vector operations that you can choose based on your data size:

  • Small arrays (< 128 elements): Use scalar operations to avoid SIMD setup overhead
  • Medium arrays (128+ elements): Use SIMD operations for optimal vectorization benefits
  • Large arrays (≥ 262,144 elements): Use parallel SIMD for memory bandwidth and multi-core scaling
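The size thresholds above can be sketched as a simple dispatcher. This is only a sketch of the idea: the constant `PARALLEL_SIMD_THRESHOLD` is named in this README, but the other names here are illustrative and not simdly's public API:

```rust
// Illustrative size-based algorithm dispatch using the thresholds documented above.
const SIMD_THRESHOLD: usize = 128;
const PARALLEL_SIMD_THRESHOLD: usize = 262_144;

#[derive(Debug, PartialEq)]
enum Strategy {
    Scalar,
    Simd,
    ParallelSimd,
}

fn choose_strategy(len: usize) -> Strategy {
    if len >= PARALLEL_SIMD_THRESHOLD {
        Strategy::ParallelSimd // large: multi-core scaling + SIMD
    } else if len >= SIMD_THRESHOLD {
        Strategy::Simd // medium: vectorization pays off
    } else {
        Strategy::Scalar // small: avoid SIMD setup overhead
    }
}

fn main() {
    assert_eq!(choose_strategy(64), Strategy::Scalar);
    assert_eq!(choose_strategy(4_096), Strategy::Simd);
    assert_eq!(choose_strategy(1_000_000), Strategy::ParallelSimd);
    println!("dispatch thresholds verified");
}
```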

The library lets you work with SIMD vectors directly, handles partial (remainder) data efficiently, and accelerates mathematical operations automatically, including trigonometric functions, exponentials, square roots, powers, and distance calculations.

📊 Performance

simdly provides significant performance improvements for numerical computations with multiple algorithm options:

Algorithm Selection

| Array size | Algorithm |
| --- | --- |
| < 128 elements | Scalar |
| 128 to 262,143 elements | SIMD |
| ≥ 262,144 elements | Parallel SIMD |

Performance Characteristics

  • Mathematical Operations: SIMD delivers 4x to 13x speedups for complex operations such as cosine
  • Simple Operations: Intelligent thresholds prevent performance regressions on small arrays
  • Memory Hierarchy: 16 KiB chunk sizes keep working sets within L1 cache
  • Cross-Platform: The same thresholds perform well on both Intel AVX2 and ARM NEON architectures
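The 16 KiB chunking strategy mentioned above can be sketched in plain Rust. The helper name here is illustrative, not simdly's API:

```rust
// Sketch of cache-aware chunking with 16 KiB chunks, the size cited above.
const CHUNK_BYTES: usize = 16 * 1024;

fn sum_in_chunks(data: &[f32]) -> f32 {
    // 16 KiB / 4 bytes per f32 = 4096 elements per chunk, sized to stay in L1 cache.
    let elems = CHUNK_BYTES / std::mem::size_of::<f32>();
    data.chunks(elems).map(|c| c.iter().sum::<f32>()).sum()
}

fn main() {
    let v = vec![1.0f32; 10_000];
    assert_eq!(sum_in_chunks(&v), 10_000.0);
    println!("chunked sum ok");
}
```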

Mathematical Functions Performance

Complex mathematical operations benefit from SIMD across all sizes:

| Function | Array Size | SIMD Speedup | Notes |
| --- | --- | --- | --- |
| cos() | 4 KiB | 4.4x | Immediate benefit |
| cos() | 64 KiB | 11.7x | Peak efficiency |
| cos() | 1 MiB | 13.3x | Best performance |
| cos() | 128 MiB | 9.2x | Memory-bound |

Key Features

  • Manual Optimization: Choose the best algorithm for your specific use case
  • Zero-Cost Abstraction: Direct method calls with no runtime overhead
  • Memory Efficiency: Cache-aware chunking and aligned memory access
  • Scalable Performance: Near-linear scaling with available CPU cores

Compilation Flags

For maximum performance, compile with target-feature flags for AVX2 support, and consider using link-time optimization (LTO) and single codegen unit configuration in your release profile.
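A release profile along those lines might look like this (standard Cargo settings, not specific to simdly):

```toml
# Cargo.toml — release profile tuned for maximum optimization
[profile.release]
lto = true          # link-time optimization across crate boundaries
codegen-units = 1   # single codegen unit for better inlining
```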

🔧 Usage

The library gives you fine-grained control over algorithm selection so you can pick the approach that fits your workload. It supports vectorized mathematical operations with automatic SIMD acceleration, chunked processing of large arrays, and memory-aligned operations for optimal performance on both AVX2 and NEON architectures.

📚 Documentation

Full API documentation is available on docs.rs.

🛠️ Development

Prerequisites

  • Rust 1.77 or later
  • x86/x86_64 processor with AVX2 support, or ARM/AArch64 processor with NEON
  • Linux, macOS, or Windows

Building

Clone the repository and build with cargo build --release.

Testing

Run tests with cargo test.

Performance Benchmarks

The crate includes comprehensive benchmarks showing real-world performance improvements:

Run benchmarks with cargo bench and view detailed reports in the target/criterion/report/ directory.

Key Findings from Benchmarks:

  • Mathematical operations (cos, sin, exp, etc.) show significant SIMD acceleration
  • Parallel methods automatically optimize based on array size using PARALLEL_SIMD_THRESHOLD
  • Performance varies by CPU architecture; run the benchmarks to measure the actual improvements on your hardware

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Areas for Contribution

  • Additional SIMD instruction set support (SSE)
  • Advanced mathematical operations implementation
  • Performance optimizations and micro-benchmarks
  • Documentation improvements and examples
  • Testing coverage and edge case validation
  • WebAssembly SIMD support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with Rust's excellent SIMD intrinsics
  • Inspired by high-performance computing libraries
  • Thanks to the Rust community for their valuable feedback

📈 Roadmap

  • ARM NEON support for ARM/AArch64 - ✅ Complete with full mathematical operations
  • Additional mathematical operations - ✅ Power, 2D/3D/4D hypotenuse, and more
  • SSE support for older x86 processors
  • Automatic SIMD instruction set detection
  • WebAssembly SIMD support
  • Additional mathematical functions (bessel, gamma, etc.)
  • Complex number SIMD operations

Made with ❤️ and ⚡ by Mahdi Tantaoui

Dependencies

~1.5MB
~25K SLoC