#ida #galois #matrix #maths #simd

guff-matrix

Fast Galois Field matrix multiplication

13 releases

0.1.12 Jul 29, 2021
0.1.11 Jul 23, 2021

#681 in Math


79 downloads per month
Used in 2 crates

GPL-2.0-or-later OR LGPL-2.0-or-later

2.5K SLoC


This crate uses SIMD code to achieve very fast Galois Field matrix multiplication.
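As background for what the SIMD code parallelises, here is a minimal scalar sketch of Galois Field multiplication in GF(2^8): long (bitwise) multiplication with modular reduction. The polynomial 0x11b (reducing term 0x1b) is only an example choice for illustration, not necessarily the one this crate uses.

```rust
// Scalar reference for GF(2^8) multiplication: shift-and-xor long
// multiplication, reducing modulo an example field polynomial
// (x^8 + x^4 + x^3 + x + 1, i.e. 0x11b) whenever a shift overflows.
fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut product = 0u8;
    while b != 0 {
        if b & 1 != 0 {
            product ^= a; // "add" (xor) this shifted copy of a
        }
        let carry = a & 0x80;
        a <<= 1;
        if carry != 0 {
            a ^= 0x1b; // reduce modulo the field polynomial
        }
        b >>= 1;
    }
    product
}

fn main() {
    // 0x53 and 0xca are multiplicative inverses in this field.
    println!("{:#04x}", gf256_mul(0x53, 0xca)); // prints 0x01
}
```

A SIMD engine runs exactly this loop, but over a whole register of field elements at once.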

Previous Implementation

I previously implemented a version of this algorithm on the PlayStation 3. It is available here.

SIMD Support

I will implement several engines for field multiplication across vectors:

  • x86 implementation of parallel long (bitwise) multiplication

  • Arm/Aarch64 NEON implementation using hardware polynomial multiply and table-based modular reduction (vmull/vtbl)

  • Arm NEON implementation of parallel long (bitwise) multiplication

  • 4-way armv6 (Thumb) implementation of the long multiplication routine

Support for Arm targets requires a nightly Rust build.
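The "parallel long multiplication" idea behind the engines above can be sketched portably using SWAR (SIMD within a register): eight GF(2^8) multiplications by a constant, carried out bitwise on the byte lanes of one u64. This is only an illustration of the technique, not the crate's actual implementation, and the reducing polynomial 0x1b is an example.

```rust
// SWAR sketch of parallel long multiplication: multiply each of the
// eight byte lanes of `lanes` by the constant `k` in GF(2^8).
// A real SIMD engine does the same per-bit loop on wide registers.
fn gf256_mul_x8(mut lanes: u64, k: u8) -> u64 {
    const HI: u64 = 0x8080_8080_8080_8080; // top bit of each lane
    const LO: u64 = 0x0101_0101_0101_0101; // bottom bit of each lane
    let mut product = 0u64;
    for bit in 0..8 {
        if (k >> bit) & 1 != 0 {
            product ^= lanes; // accumulate this shifted copy in all lanes
        }
        // Double every lane at once, reducing lanes that overflow.
        let carries = (lanes & HI) >> 7; // 0x01 in each overflowing lane
        lanes = (lanes << 1) & !LO;      // shift, clearing cross-lane spill
        lanes ^= carries * 0x1b;         // xor poly into overflowing lanes
    }
    product
}

fn main() {
    // Each lane computes 0x53 * 0xca = 0x01 (inverses in this field).
    let r = gf256_mul_x8(0x5353_5353_5353_5353, 0xca);
    println!("{:#018x}", r); // prints 0x0101010101010101
}
```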

Infinite Tape (Simulation)

Before I start writing arch-specific implementations, I'm focusing on clearly documenting how the algorithm works. I'm going to implement a non-SIMD version that uses the same basic ideas, but written in a more Rust-idiomatic style (infinite iterators). That's in src/arch.rs and can be enabled as a feature:

cargo test --features simulator --tests simulator

I'll also use this to prove that the algorithm works as intended.

  • Write and test simulation of non SIMD algorithm

  • Write and test simulation of SIMD algorithm
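The "infinite tape" idea can be illustrated with plain Rust iterators: treat the transform matrix as an endlessly repeating stream, so fixed-width (register-sized) reads never need end-of-matrix edge handling. The names and the 4-byte chunk size below are illustrative only, not the crate's API.

```rust
// Sketch of the simulator's infinite-tape style: cycle() turns a
// finite row-major matrix into an endless stream, and each "register
// load" just takes the next fixed-width chunk, wrapping transparently.
fn main() {
    let transform = [1u8, 2, 3, 4, 5, 6]; // 2x3 matrix, row-major
    let mut tape = transform.iter().copied().cycle();

    // Three fixed-width loads; the second and third straddle the end
    // of the matrix, which the tape hides entirely.
    for _ in 0..3 {
        let chunk: Vec<u8> = tape.by_ref().take(4).collect();
        println!("{:?}", chunk); // [1,2,3,4] then [5,6,1,2] then [3,4,5,6]
    }
}
```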

Matrix multiplication

Using the SIMD version of the field multiplication routine, I now have:

  • SIMD version of x86 matrix multiply

It needs a bit more work, but it's tested and runs around 3x faster than the reference version. See benches/vector_mul.rs for details. To run that with all relevant optimisations, you might need to turn on some compile flags:

RUSTFLAGS="-O -C target-cpu=native -C target-feature=+ssse3,+sse4.1,+sse4.2,+avx" cargo bench -q "matrix" 

