Portable SIMD library.
rten-simd is a library for defining operations that are accelerated using SIMD instruction sets such as AVX2, Arm Neon or WebAssembly SIMD. Operations are defined once using safe, portable APIs, then dispatched at runtime to evaluate the operation using the best available SIMD instruction set (ISA) on the current CPU.
The design is inspired by Google's Highway library for C++ and the pulp crate.
Differences from std::simd
In nightly Rust the standard library has a built-in portable SIMD API, std::simd. This library differs in several ways:
- It is available on stable Rust.
- The instruction set is selected at runtime rather than compile time. On x86 an operation may be compiled for AVX-512, AVX2 and generic (SSE) targets. If the binary is run on a system supporting AVX-512, that version will be used. The same binary on an older system may use the generic (SSE) version.
- Operations use the full available SIMD vector width, which varies by instruction set, rather than specifying a fixed width in the code. For example, a SIMD vector with f32 elements has 4 lanes on Arm Neon and 16 lanes under AVX-512. The API is designed to support scalable vector ISAs such as Arm SVE and RVV in future, where the vector length is known only at runtime.
- Semantics are chosen to be "performance portable". This means that behavior is chosen based on what maps well to the hardware, rather than strictly matching Rust's scalar behavior as std::simd generally does. It also means some operations may behave differently in edge cases on different platforms. This is similar to WebAssembly Relaxed SIMD.
Supported architectures
The currently supported SIMD ISAs are:
- AVX2
- AVX-512 (requires nightly Rust and the avx512 feature)
- Arm Neon
- WebAssembly SIMD (including relaxed SIMD)
There is also a generic fallback implemented using 128-bit arrays, which is designed to be autovectorization-friendly (i.e. it compiles on all platforms and should enable the compiler to use SSE or similar instructions).
Example
This code defines an operation which squares each value in a slice and evaluates it on a vector of floats:
```rust
use rten_simd::{Isa, SimdOp};
use rten_simd::ops::NumOps;
use rten_simd::functional::simd_map;

struct Square<'a> {
    xs: &'a mut [f32],
}

impl<'a> SimdOp for Square<'a> {
    type Output = &'a mut [f32];

    #[inline(always)]
    fn eval<I: Isa>(self, isa: I) -> Self::Output {
        let ops = isa.f32();
        simd_map(ops, self.xs, #[inline(always)] |x| ops.mul(x, x))
    }
}

let mut buf: Vec<_> = (0..32).map(|x| x as f32).collect();
let expected: Vec<_> = buf.iter().map(|x| *x * *x).collect();
let squared = Square { xs: &mut buf }.dispatch();
assert_eq!(squared, &expected);
```
This example shows the basic steps to define a vectorized operation:
- Create a struct containing the operation's parameters.
- Implement the SimdOp trait for the struct to define how to evaluate the operation.
- Call SimdOp::dispatch to evaluate the operation using the best available instruction set. Here "best" refers to the ISA with the widest vectors, and thus the maximum amount of parallelism.
Note the use of the #[inline(always)] attribute on closures and functions called within eval. See the section on inlining below for an explanation.
Separation of vector types and operations
SIMD vectors are effectively arrays (like [T; N]) with a larger alignment. A SIMD vector type can be created whether or not the associated instructions are supported on the system.

Performing a SIMD operation, however, requires the caller to first ensure that the instructions are supported on the current system. To enforce this, operations are separated from the vector type, and the types providing access to SIMD operations (Isa) can only be instantiated if the instruction set is supported.
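As a minimal sketch of how this works in practice (not one of the crate's documented examples): the only way for safe code to obtain an Isa is inside SimdOp::eval, which dispatch calls only after verifying CPU support. The splat (broadcast) operation used below is an assumption about the NumOps API; check the trait docs for the exact method name.

```rust
use rten_simd::{Isa, Simd, SimdOp};
use rten_simd::ops::NumOps;

// Squares a single value using SIMD lanes. The `isa` argument acts as a
// proof token: `NumOps` implementations can only be obtained through it,
// and `eval` only receives one after `dispatch` has confirmed that the
// instruction set is supported on the current CPU.
struct SquareOne(f32);

impl SimdOp for SquareOne {
    type Output = f32;

    #[inline(always)]
    fn eval<I: Isa>(self, isa: I) -> Self::Output {
        let ops = isa.f32();
        let v = ops.splat(self.0); // `splat` (broadcast) is assumed here
        ops.mul(v, v).to_array()[0]
    }
}

assert_eq!(SquareOne(3.0).dispatch(), 9.0);
```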
Overview of key traits
The SimdOp trait defines an operation which can be vectorized using different SIMD instruction sets. This trait has a dispatch method to perform the operation.

An implementation of the Isa trait is passed to SimdOp::eval. The Isa is the entry point for operations on SIMD vectors. It provides access to implementations of the NumOps trait and its sub-traits for each element type. For example, Isa::f32 provides operations on SIMD vectors with f32 elements.
The NumOps trait provides operations that are available on all SIMD vectors. The sub-traits FloatOps and IntOps provide operations that are only available on SIMD vectors with float and integer elements respectively. There is also SignedIntOps for signed integer operations. Finally, there are additional traits for operations only available for other subsets of element types. For example, Extend widens each lane to one with twice the bit-width.
SIMD operations (e.g. NumOps::add) take SIMD vectors as arguments. These vectors are either platform-specific types (e.g. float32x4_t on Arm) or transparent wrappers around them. The Simd trait is implemented for all vector types. The Elem trait is implemented for supported element types, providing required numeric operations.
Use with slices
SIMD operations are usually applied to a slice of elements. To support this, the SimdIterable trait provides a way to iterate over SIMD vector-sized chunks of a slice, using padding or masking to handle slice lengths that are not a multiple of the vector size.
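As an informal sketch of this pattern, a sum-of-squares reduction can be written with simd_iter and fold, mirroring the generic sum example later in this document. This assumes the final partial chunk is padded with zeroes, which is harmless for a sum.

```rust
use rten_simd::{Isa, Simd, SimdIterable, SimdOp};
use rten_simd::ops::NumOps;

// Sums the squares of a slice of f32 values.
struct SumSquares<'a>(&'a [f32]);

impl SimdOp for SumSquares<'_> {
    type Output = f32;

    #[inline(always)]
    fn eval<I: Isa>(self, isa: I) -> Self::Output {
        let ops = isa.f32();
        // Accumulate lane-wise partial sums over vector-sized chunks.
        let partial = self
            .0
            .simd_iter(ops)
            .fold(ops.zero(), |acc, x| ops.add(acc, ops.mul(x, x)));
        // Reduce across lanes in scalar code.
        partial.to_array().into_iter().sum()
    }
}

assert_eq!(SumSquares(&[1.0, 2.0, 3.0]).dispatch(), 14.0);
```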
The functional module provides utilities for defining vectorized transforms on slices (e.g. simd_map).
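For instance, here is a sketch of an in-place scaling operation built on simd_map, again assuming a NumOps splat (broadcast) operation. Broadcasting the factor once, outside the closure, avoids repeating the work for every chunk.

```rust
use rten_simd::{Isa, SimdOp};
use rten_simd::functional::simd_map;
use rten_simd::ops::NumOps;

// Multiplies each element of a slice by a constant factor, in place.
struct Scale<'a> {
    xs: &'a mut [f32],
    factor: f32,
}

impl<'a> SimdOp for Scale<'a> {
    type Output = &'a mut [f32];

    #[inline(always)]
    fn eval<I: Isa>(self, isa: I) -> Self::Output {
        let ops = isa.f32();
        let factor = ops.splat(self.factor); // assumed broadcast op
        simd_map(ops, self.xs, #[inline(always)] |x| ops.mul(x, factor))
    }
}

let mut buf = vec![1.0f32, 2.0, 3.0];
Scale { xs: &mut buf, factor: 2.0 }.dispatch();
assert_eq!(buf, [2.0, 4.0, 6.0]);
```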
The SliceWriter utility provides a way to incrementally initialize the contents of a slice with the results of SIMD operations, by writing one SIMD vector at a time.
The SimdUnaryOp trait provides a convenient way to define unary operations (like Iterator::map) on slices.
Importance of inlining
In the above example, #[inline(always)] attributes are used to ensure that the whole eval implementation is compiled to a single function. This is required to ensure that the platform-specific intrinsics (from core::arch) are compiled to direct instructions with no function call overhead.
Failure to inline these intrinsics will significantly harm performance, since most of the runtime will be spent in function call overhead rather than actual computation. This issue affects platforms where the availability of the SIMD instruction set is not guaranteed at compile time. This includes AVX2 and AVX-512 on x86-64, but not Arm Neon or WASM SIMD.
If a vectorized operation performs more slowly than expected, use a profiler such as samply to verify that the intrinsics have been inlined and thus do not appear in the list of called functions.
The need for this forced inlining is expected to change in future with updates to how Rust's target_feature attribute works.
Generic operations
It is possible to define operations which are generic over the element type by using the GetNumOps trait and related traits. These are implemented for supported element types and provide a way to get the NumOps implementation for that element type from an Isa.

This can be used to define SimdOps which are generic over the element type.
This example defines an operation which can sum a slice of any supported element type:
```rust
use std::iter::Sum;

use rten_simd::{Isa, Simd, SimdIterable, SimdOp};
use rten_simd::ops::{GetNumOps, NumOps};

struct SimdSum<'a, T>(&'a [T]);

impl<'a, T: GetNumOps + Sum> SimdOp for SimdSum<'a, T> {
    type Output = T;

    #[inline(always)]
    fn eval<I: Isa>(self, isa: I) -> Self::Output {
        let ops = T::num_ops(isa);
        let partial_sums = self
            .0
            .simd_iter(ops)
            .fold(ops.zero(), |sum, x| ops.add(sum, x));
        partial_sums.to_array().into_iter().sum()
    }
}

assert_eq!(SimdSum(&[1.0f32, 2.0, 3.0]).dispatch(), 6.0);
assert_eq!(SimdSum(&[1i32, 2, 3]).dispatch(), 6);
assert_eq!(SimdSum(&[1u8, 2, 3]).dispatch(), 6u8);
```
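In this example the fold accumulates lane-wise partial sums in a single SIMD vector, and the final to_array().into_iter().sum() reduces across lanes in scalar code. Because partial chunks are handled with padding or masking (see "Use with slices" above), trailing elements that do not fill a full vector do not corrupt the result.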