1 unstable release: 0.0.1 (Apr 16, 2020)
ssimd
This is a fork of the unmaintained ssimd crate with the goal of making it compatible with packed_simd.
Simulated SIMD on the Rust stable channel. This is my effort to make the packed-simd crate work on the stable channel. The work is based on a well-known method: auto-vectorization. In this crate, however, I try to provide an API that is as close as possible to that of the simd crate. Auto-vectorization can look like a matter of luck, but with the simple trick described below it succeeds in most cases.
Note
To make auto-vectorization succeed in most cases, please turn on the BB optimizer (LLVM's SLP vectorizer):
https://llvm.org/docs/Vectorizers.html#the-slp-vectorizer
For Rust, you can turn on the BB optimizer by building your project with the following command:
RUSTFLAGS="-C llvm-args=--vectorize-slp -C target-cpu=native" cargo build --release
Examples
Let's start with a very simple example:
extern crate ssimd;
use ssimd::f64x2;

// Keep the function out of line so its generated code is easy to find.
#[inline(never)]
fn test_simd(a: f64x2, b: f64x2) {
    let c = a + b;
    println!("{:?}", c);
}

fn main() {
    let a = f64x2::new(1.0, 1.0);
    let b = f64x2::new(2.0, 3.0);
    test_simd(a, b);
}
When this is compiled to LLVM IR, the function "test_simd" contains these instructions:
%7 = fadd <2 x double> %4, %6
store <2 x double> %7, <2 x double>* %c, align 16
and the equivalent assembly code is:
addpd %xmm0, %xmm2
movapd %xmm2, 16(%rsp)
In this case, even with the default build command, LLVM can successfully vectorize the code.
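To give an idea of why such code can be vectorized at all, here is a minimal sketch of a two-lane type built from ordinary scalar operations. It is an assumption made for illustration, not the actual ssimd source: the name F64x2, its layout, and its alignment are all hypothetical.

```rust
use std::ops::Add;

// Hypothetical stand-in for a two-lane vector type; NOT the ssimd source.
// The 16-byte alignment is an assumption that matches a 128-bit register.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C, align(16))]
struct F64x2([f64; 2]);

impl F64x2 {
    #[inline]
    fn new(x0: f64, x1: f64) -> Self {
        F64x2([x0, x1])
    }
}

impl Add for F64x2 {
    type Output = Self;

    // Two independent scalar adds on adjacent lanes. This is the kind of
    // pattern a vectorizer can merge into a single packed add.
    #[inline]
    fn add(self, rhs: Self) -> Self {
        F64x2([self.0[0] + rhs.0[0], self.0[1] + rhs.0[1]])
    }
}

fn main() {
    let a = F64x2::new(1.0, 1.0);
    let b = F64x2::new(2.0, 3.0);
    println!("{:?}", a + b); // prints F64x2([3.0, 4.0])
}
```

Whether the two adds actually become one packed instruction depends on the compiler version, the optimization level, and the flags discussed in the Note, so the generated code is worth checking.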
Let's try a trickier example:
extern crate ssimd;
use std::io::{self, BufRead};
use ssimd::f64x2;

fn test_simd(i: i32) {
    let a = f64x2::new((i + 1) as f64, (i + 2) as f64);
    let b = f64x2::new((i + 3) as f64, (i + 3) as f64);
    let c = a + b;
    println!("{:?}", c);
}

fn main() {
    // Read the input at run time so the compiler cannot constant-fold the result.
    let stdin = io::stdin();
    let mut line = String::new();
    stdin.lock().read_line(&mut line).unwrap();
    let i: i32 = line.trim().parse().unwrap_or(0);
    test_simd(i);
}
The LLVM IR in the function "test_simd":
%104 = bitcast %"ssimd::f64x2"* %c.i to i8*
call void @llvm.lifetime.start(i64 16, i8* nonnull %104)
%105 = fadd double %99, %103
%106 = fadd double %101, %103
and the assembly code:
cvtsi2sdl %ecx, %xmm2
addsd %xmm2, %xmm0
addsd %xmm2, %xmm1
So LLVM was not able to vectorize the code once an integer-to-float conversion was inserted inside the function. However, if the BB optimizer is enabled, the LLVM IR looks as follows:
%109 = fadd <2 x double> %108, %106
store <2 x double> %109, <2 x double>* %c.i, align 16
and the assembly code:
addpd %xmm2, %xmm0
movapd %xmm0, 96(%rsp)
So the code is successfully vectorized.
You can find more examples in the folder "examples". These examples are ported from the simd crate to work on the stable channel, with almost no modification to the original code. Some of them might not be auto-vectorized with the default build command. However, when the BB optimizer is enabled, all of them are successfully vectorized. You can experiment further with your own examples.
AVX instructions
AVX instructions are not available on some machines. If you want to use AVX instructions with the intrinsic approach (like the one in the simd crate), you need to use the Rust attribute "target_feature" to check whether the machine supports those instructions, and you also need to provide a fallback function in case it does not. With auto-vectorization, however, you only need to write one function, since LLVM generates the appropriate instructions for the machine configuration it is compiling for.
Consider the following example:
extern crate ssimd;
use ssimd::f64x4;

#[inline(never)]
fn test_simd(a: f64x4, b: f64x4) {
    let c = a + b;
    println!("{:?}", c);
}

fn main() {
    let a = f64x4::new(1.0, 1.0, 1.0, 1.0);
    let b = f64x4::new(2.0, 3.0, 4.0, 5.0);
    test_simd(a, b);
}
On machines that do not have AVX instructions, LLVM generates the following code:
addpd %xmm0, %xmm2
addpd %xmm1, %xmm3
movapd %xmm2, 64(%rsp)
movapd %xmm3, 80(%rsp)
On machines that support AVX instructions, LLVM will generate this:
vaddpd (%rsi), %ymm0, %ymm0
vmovapd %ymm0, 64(%rsp)
So everything will be done automatically for you.
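For contrast, the intrinsic approach described in this section might look roughly like the sketch below. It does not use ssimd: it relies on std::arch intrinsics, the `#[target_feature(enable = "avx")]` attribute, and the `is_x86_feature_detected!` macro from the standard library. The names add4, add4_avx, and add4_fallback are hypothetical, chosen only for illustration.

```rust
/// Portable fallback: four plain scalar adds.
fn add4_fallback(a: [f64; 4], b: [f64; 4]) -> [f64; 4] {
    [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
}

/// AVX version written with intrinsics (hypothetical helper).
/// Callers must make sure the CPU supports AVX before calling it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add4_avx(a: [f64; 4], b: [f64; 4]) -> [f64; 4] {
    use std::arch::x86_64::{_mm256_add_pd, _mm256_loadu_pd, _mm256_storeu_pd};
    let mut out = [0.0f64; 4];
    // SAFETY: both inputs and the output hold exactly four f64 values, and
    // the unaligned load/store intrinsics have no alignment requirement.
    unsafe {
        let va = _mm256_loadu_pd(a.as_ptr());
        let vb = _mm256_loadu_pd(b.as_ptr());
        _mm256_storeu_pd(out.as_mut_ptr(), _mm256_add_pd(va, vb));
    }
    out
}

/// Picks the AVX version at run time when available, else the fallback.
fn add4(a: [f64; 4], b: [f64; 4]) -> [f64; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was checked on the line above.
            return unsafe { add4_avx(a, b) };
        }
    }
    add4_fallback(a, b)
}

fn main() {
    let c = add4([1.0, 1.0, 1.0, 1.0], [2.0, 3.0, 4.0, 5.0]);
    println!("{:?}", c); // prints [3.0, 4.0, 5.0, 6.0]
}
```

This version needs two implementations plus a dispatcher, whereas the auto-vectorized f64x4 example above has a single function and leaves the instruction selection to LLVM.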