bitshifter/mathbench

Comparing performance of Rust math libraries for common 3D game and graphics tasks

7 releases

0.1.8 Aug 14, 2019
0.1.7 Aug 13, 2019
0.1.6 Jul 21, 2019
0.1.4 Jun 22, 2019
0.1.2 May 30, 2019

#311 in Graphics APIs

177 stars & 5 watchers

100KB
1.5K SLoC

mathbench

mathbench is a suite of unit tests and benchmarks comparing the output and performance of a number of different Rust linear algebra libraries for common game and graphics development tasks.

mathbench is written by the author of glam and has been used to compare the performance of glam with other similar 3D math libraries targeting games and graphics development, including cgmath, nalgebra, euclid, vek, pathfinder_geometry, static-math and ultraviolet.

The benchmarks

All benchmarks are performed using Criterion.rs. Benchmarks are logically divided into the following categories:

  • return self - attempts to measure overhead of benchmarking each type.
  • single operations - measure the performance of single common operations on types, e.g. a matrix inverse, vector normalization or multiplying two matrices.
  • throughput operations - measure the performance of common operations on batches of data. These measure operations that would commonly be processing batches of input, for example transforming a number of vectors with the same matrix.
  • workload operations - these attempt to recreate common workloads found in game development to try and demonstrate performance on real world tasks.
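The first three categories can be illustrated with a small standalone sketch. The `Vec2` and `Mat2` types and the function names below are hypothetical stand-ins written for this example; they are not taken from mathbench or from any of the benchmarked libraries.

```rust
/// Hypothetical 2D vector, used only for this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Vec2 {
    pub x: f32,
    pub y: f32,
}

/// Hypothetical 2x2 matrix stored as two column vectors.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Mat2 {
    pub x_axis: Vec2,
    pub y_axis: Vec2,
}

/// "return self": does no math, so timing it approximates the cost of
/// moving the type in and out of the benchmark loop.
pub fn return_self(m: Mat2) -> Mat2 {
    m
}

impl Mat2 {
    /// "single operation": one matrix * vector product.
    pub fn mul_vec2(&self, v: Vec2) -> Vec2 {
        Vec2 {
            x: self.x_axis.x * v.x + self.y_axis.x * v.y,
            y: self.x_axis.y * v.x + self.y_axis.y * v.y,
        }
    }
}

/// "throughput operation": the same matrix applied to a batch of vectors.
pub fn transform_batch(m: &Mat2, input: &[Vec2], output: &mut [Vec2]) {
    for (out, v) in output.iter_mut().zip(input) {
        *out = m.mul_vec2(*v);
    }
}
```

In a real benchmark each of these would be wrapped in a Criterion `bench_function` call; the sketch only shows the shape of the work being measured.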

Despite best attempts, take the results of micro benchmarks with a pinch of salt.

Operation benchmarks

  • matrix benches - performs common matrix operations such as transpose, inverse, determinant and multiply.
  • rotation 3d benches - perform common 3D rotation operations.
  • transform 2d & 3d benches - bench special purpose 2D and 3D transform types. These can be compared to 3x3 and 4x4 matrix benches to some extent.
  • transformations benches - performs affine transformations on vectors - uses the best available type for the job, either matrix or transform types depending on the library.
  • vector benches - perform common vector operations.

Workload benchmarks

  • euler bench - performs an Euler integration on arrays of 2D and 3D vectors
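As a rough idea of what this workload looks like, here is a minimal sketch of one integration step over arrays of 3D vectors. The `Vec3` type and `euler_step` function are hypothetical, and the update order (velocity first, then position using the new velocity) is an assumption of this example rather than a statement about the actual benchmark code.

```rust
/// Hypothetical 3D vector, used only for this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Vec3 {
    pub x: f32,
    pub y: f32,
    pub z: f32,
}

impl Vec3 {
    /// Returns `self + rhs * s`.
    pub fn add_scaled(self, rhs: Vec3, s: f32) -> Vec3 {
        Vec3 {
            x: self.x + rhs.x * s,
            y: self.y + rhs.y * s,
            z: self.z + rhs.z * s,
        }
    }
}

/// One Euler step over parallel arrays of positions, velocities and
/// accelerations. Assumed order: update velocity, then position.
pub fn euler_step(pos: &mut [Vec3], vel: &mut [Vec3], acc: &[Vec3], dt: f32) {
    for ((p, v), a) in pos.iter_mut().zip(vel.iter_mut()).zip(acc) {
        *v = v.add_scaled(*a, dt);
        *p = p.add_scaled(*v, dt);
    }
}
```

Because each element is independent, this kind of loop is a natural candidate for both autovectorization and the "wide" types discussed later.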

The benchmarks are currently focused on f32 types as that is all glam currently supports.

Crate differences

Different libraries have different features and different ways of achieving the same goal. To get a performance comparison, mathbench sometimes compares functionality that is similar but not exactly the same. Below is a list of differences between libraries that are notable for performance comparisons.

Matrices versus transforms

The euclid library does not support generic square matrix types like the other libraries tested. Rather, it has 2D and 3D transform types which can transform 2D and 3D vector and point types. Each library has different types for supporting transforms, but euclid is unique among the libraries tested in that it doesn't have generic square matrix types.

The Transform2D is stored as a 3x2 row major matrix that can be used to transform 2D vectors and points.
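To make the 3x2 row-major layout concrete, here is a minimal sketch. The `Transform2` type below is hypothetical and is not euclid's actual type or API. It assumes a row-vector convention, where the first two rows hold the linear part and the last row holds the translation, so points pick up the translation and vectors do not.

```rust
/// Hypothetical 3x2 row-major 2D transform, used only for this sketch.
/// Rows 0 and 1 are the linear part; row 2 is the translation.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Transform2 {
    pub m: [[f32; 2]; 3],
}

impl Transform2 {
    /// Transforms a point: linear part plus translation.
    pub fn transform_point(&self, p: [f32; 2]) -> [f32; 2] {
        let [x, y] = p;
        [
            x * self.m[0][0] + y * self.m[1][0] + self.m[2][0],
            x * self.m[0][1] + y * self.m[1][1] + self.m[2][1],
        ]
    }

    /// Transforms a vector: linear part only, translation is ignored.
    pub fn transform_vector(&self, v: [f32; 2]) -> [f32; 2] {
        let [x, y] = v;
        [
            x * self.m[0][0] + y * self.m[1][0],
            x * self.m[0][1] + y * self.m[1][1],
        ]
    }
}
```

Storing six floats instead of nine is what makes a type like this only loosely comparable to a full 3x3 matrix in the benchmarks.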

Similarly, Transform3D is used for transforming 3D vectors and points. This is represented as a 4x4 matrix, so it is more directly comparable to the other libraries; however, it doesn't support some operations like transpose.

There is no equivalent to a 2x2 matrix type in euclid.

Matrix inverse

Note that cgmath and nalgebra matrix inverse methods return an Option whereas glam and euclid do not. If a non-invertible matrix is inverted by glam or euclid the result will be invalid (it will contain NaNs).
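The difference between the two styles can be sketched with a hand-written 2x2 matrix. The `Matrix2` type and its method names are hypothetical and do not mirror the API of any of the libraries above. In this sketch a singular matrix produces non-finite values (infinities or NaNs, depending on the entries) from the unchecked version and `None` from the checked one.

```rust
/// Hypothetical 2x2 matrix [[a, b], [c, d]], used only for this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Matrix2 {
    pub a: f32,
    pub b: f32,
    pub c: f32,
    pub d: f32,
}

impl Matrix2 {
    pub fn determinant(&self) -> f32 {
        self.a * self.d - self.b * self.c
    }

    /// Unchecked style: always returns a matrix. A zero determinant
    /// leads to a division by zero and a non-finite result.
    pub fn inverse_unchecked(&self) -> Matrix2 {
        let inv_det = 1.0 / self.determinant();
        Matrix2 {
            a: self.d * inv_det,
            b: -self.b * inv_det,
            c: -self.c * inv_det,
            d: self.a * inv_det,
        }
    }

    /// Checked style: pays for a branch, but reports failure explicitly.
    pub fn inverse_checked(&self) -> Option<Matrix2> {
        if self.determinant() == 0.0 {
            None
        } else {
            Some(self.inverse_unchecked())
        }
    }

    /// True when every element is neither infinite nor NaN.
    pub fn is_finite(&self) -> bool {
        [self.a, self.b, self.c, self.d].iter().all(|v| v.is_finite())
    }
}
```

The extra branch and the `Option` wrapper are part of what is being measured when the inverse benchmarks of different libraries are compared.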

Quaternions versus rotors

Most libraries provide quaternions for performing rotations except for ultraviolet which provides rotors.

Wide benchmarks

All benchmarks are gated as either "wide" or "scalar". This division allows us to more fairly compare these different styles of libraries.

"scalar" benchmarks operate on standard scalar f32 values, doing calculations on one piece of data at a time (or in the case of a "horizontal" SIMD library like glam, one Vec3/Vec4 at a time).

"wide" benchmarks operate in a "vertical" AoSoA (Array-of-Struct-of-Arrays) fashion, which is a programming model that allows the potential to more fully use the advantages of SIMD operations. However, it has the cost of making algorithm design harder, as scalar algorithms cannot be directly used by "wide" architectures. Because of this difference in algorithms, we also can't really directly compare the performance of "scalar" vs "wide" types because they don't quite do the same thing (wide types operate on multiple pieces of data at the same time).

The "wide" benchmarks still include glam, a scalar-only library, as a comparison. Even though the comparison is somewhat apples-to-oranges, in each of these cases, when running "wide" benchmark variants, glam is configured to do the exact same amount of final work, producing the same outputs that the "wide" versions would. The purpose is to give an idea of the possible throughput benefits of "wide" types compared to writing the same algorithms with a scalar type, at the cost of extra care being needed to write the algorithm.

To learn more about AoSoA architecture, see this blog post by the author of nalgebra which goes more in depth to how AoSoA works and its possible benefits. Also take a look at the "Examples" section of ultraviolet's README, which contains a discussion of how to port scalar algorithms to wide ones, with the examples of the Euler integration and ray-sphere intersection benchmarks from mathbench.
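As a small illustration of the layout, the sketch below packs four 3D vectors into one "wide" value and computes four dot products at once. The `Vec3x4` type is hypothetical and uses plain arrays rather than real SIMD registers; it only shows how the data is arranged, not how ultraviolet or nalgebra implement their wide types.

```rust
/// Hypothetical "wide" 3D vector: four vectors stored lane-wise, so all
/// x components are contiguous, then all y, then all z.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Vec3x4 {
    pub x: [f32; 4],
    pub y: [f32; 4],
    pub z: [f32; 4],
}

impl Vec3x4 {
    /// Packs four ordinary [x, y, z] vectors into one wide vector.
    pub fn from_aos(v: [[f32; 3]; 4]) -> Self {
        Vec3x4 {
            x: std::array::from_fn(|i| v[i][0]),
            y: std::array::from_fn(|i| v[i][1]),
            z: std::array::from_fn(|i| v[i][2]),
        }
    }

    /// Four dot products at once: each lane is independent, so every
    /// multiply and add maps onto one 4-wide operation.
    pub fn dot(&self, rhs: &Vec3x4) -> [f32; 4] {
        std::array::from_fn(|i| {
            self.x[i] * rhs.x[i] + self.y[i] * rhs.y[i] + self.z[i] * rhs.z[i]
        })
    }
}
```

Note that the result is four scalars rather than one, which is why a scalar algorithm usually has to be restructured before it can use a type like this.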

Note that the nalgebra_f32x4 and nalgebra_f32x8 benchmarks require a Rust nightly compiler.

Additionally, the f32x8 benchmarks require the AVX2 instruction set. To enable it, build with RUSTFLAGS='-C target-feature=+avx2'.
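If you want to confirm that a build really was compiled with AVX2 enabled, a compile-time check is one option. The helper below is a hypothetical example and is not part of mathbench; it relies only on the standard `cfg!` macro.

```rust
/// Hypothetical helper: true when this code was compiled with the AVX2
/// target feature enabled, for example through RUSTFLAGS.
pub fn compiled_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}
```

This reports what the compiler was allowed to use, not what the CPU running the benchmark supports.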

Build settings

The default profile.bench settings are used; these are documented in the cargo reference.

Some math libraries are optimized to use specific instruction sets and may benefit from building with settings different from the defaults. Typically a game team will need to decide on a minimum specification that they will target. Deciding on a minimum specification dictates the potential audience size for a project. This is an important decision for any game and it will be different for every project. mathbench doesn't want to make assumptions about what build settings any particular project may want to use, which is why default settings are used.

I would encourage users who want to use build settings different from the defaults to run the benchmarks themselves and consider publishing their results.

Benchmark results

The following is a table of benchmarks produced by mathbench comparing glam performance to cgmath, nalgebra, euclid, vek, pathfinder_geometry, static-math and ultraviolet on f32 data.

These benchmarks were performed on an Intel i7-4710HQ CPU on Linux. They were compiled with the 1.56.1 (59eed8a2a 2021-11-01) Rust compiler. Lower (better) numbers are highlighted within a 2.5% range of the minimum for each row.
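The highlighting rule can be sketched as follows. This is only an illustration: the function name is hypothetical, it assumes every value in a row has already been converted to the same unit, and it assumes "within 2.5%" means at most 1.025 times the row minimum. N/A entries are modelled as `None`.

```rust
/// Hypothetical helper: marks the entries of one table row that are
/// within `tolerance` (e.g. 0.025 for 2.5%) of the row's minimum.
/// All values are assumed to be in the same unit; `None` stands for N/A.
pub fn highlight_best(row: &[Option<f64>], tolerance: f64) -> Vec<bool> {
    // Smallest measured value in the row, ignoring N/A entries.
    let min = row.iter().flatten().copied().fold(f64::INFINITY, f64::min);
    row.iter()
        .map(|v| matches!(v, Some(t) if *t <= min * (1.0 + tolerance)))
        .collect()
}
```

For example, with a tolerance of 0.025 a minimum of 15.95 also highlights 16.27, because 16.27 is below 15.95 × 1.025.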

The versions of the libraries tested were:

  • cgmath - 0.18.0
  • euclid - 0.22.6
  • glam - 0.20.1
  • nalgebra - 0.29.0
  • pathfinder_geometry - 0.5.1
  • static-math - 0.2.3
  • ultraviolet - 0.8.1
  • vek - 0.15.3 (repr_c types)

See the full mathbench report for more detailed results.

Scalar benchmarks

Run with the command:

cargo bench --features scalar scalar

| benchmark | glam | cgmath | nalgebra | euclid | vek | pathfinder | static-math | ultraviolet |
|---|---|---|---|---|---|---|---|---|
| euler 2d x10000 | 16.23 us | 16.13 us | 9.954 us | 16.18 us | 16.2 us | 10.42 us | 9.97 us | 16.17 us |
| euler 3d x10000 | 15.95 us | 32.11 us | 32.13 us | 32.13 us | 32.13 us | 16.27 us | 32.16 us | 32.11 us |
| matrix2 determinant | 2.0386 ns | 2.0999 ns | 2.1018 ns | N/A | 2.0997 ns | 2.0987 ns | 2.0962 ns | 2.1080 ns |
| matrix2 inverse | 2.8226 ns | 8.4418 ns | 7.6303 ns | N/A | N/A | 3.3459 ns | 9.4636 ns | 5.8796 ns |
| matrix2 mul matrix2 | 2.6036 ns | 5.0007 ns | 4.8172 ns | N/A | 9.3814 ns | 2.5516 ns | 4.7274 ns | 4.9428 ns |
| matrix2 mul vector2 x1 | 2.4904 ns | 2.6144 ns | 2.8714 ns | N/A | 4.2139 ns | 2.0839 ns | 2.8873 ns | 2.6250 ns |
| matrix2 mul vector2 x100 | 227.5271 ns | 243.3579 ns | 265.1698 ns | N/A | 400.6940 ns | 219.7127 ns | 267.8780 ns | 243.9880 ns |
| matrix2 return self | 2.4235 ns | 2.8841 ns | 2.8756 ns | N/A | 2.8754 ns | 2.4147 ns | 2.8717 ns | 2.8697 ns |
| matrix2 transpose | 2.2887 ns | 3.0645 ns | 7.9154 ns | N/A | 2.9635 ns | N/A | 3.0637 ns | 3.0652 ns |
| matrix3 determinant | 3.9129 ns | 3.8107 ns | 3.8191 ns | N/A | 3.8180 ns | N/A | 3.8151 ns | 8.9368 ns |
| matrix3 inverse | 17.5373 ns | 18.6931 ns | 12.3183 ns | N/A | N/A | N/A | 12.8195 ns | 21.9098 ns |
| matrix3 mul matrix3 | 9.9578 ns | 13.3648 ns | 7.8154 ns | N/A | 35.5802 ns | N/A | 6.4938 ns | 10.0527 ns |
| matrix3 mul vector3 x1 | 4.8090 ns | 4.9339 ns | 4.5046 ns | N/A | 12.5518 ns | N/A | 4.8002 ns | 4.8118 ns |
| matrix3 mul vector3 x100 | 0.4836 us | 0.4808 us | 0.4755 us | N/A | 1.247 us | N/A | 0.4816 us | 0.4755 us |
| matrix3 return self | 5.4421 ns | 5.4469 ns | 5.4526 ns | N/A | 5.4656 ns | N/A | 5.4718 ns | 5.4043 ns |
| matrix3 transpose | 9.9567 ns | 10.0794 ns | 10.9704 ns | N/A | 9.9257 ns | N/A | 10.7350 ns | 10.5334 ns |
| matrix4 determinant | 6.2050 ns | 11.1041 ns | 69.2549 ns | 17.1809 ns | 18.5233 ns | N/A | 16.5331 ns | 8.2704 ns |
| matrix4 inverse | 16.4386 ns | 47.0674 ns | 71.8174 ns | 64.1356 ns | 284.3703 ns | N/A | 52.6993 ns | 41.1780 ns |
| matrix4 mul matrix4 | 7.7715 ns | 26.7308 ns | 8.6500 ns | 10.4414 ns | 86.1501 ns | N/A | 21.7985 ns | 26.8056 ns |
| matrix4 mul vector4 x1 | 3.0303 ns | 7.7400 ns | 3.4091 ns | N/A | 21.0968 ns | N/A | 6.2971 ns | 6.2537 ns |
| matrix4 mul vector4 x100 | 0.6136 us | 0.9676 us | 0.627 us | N/A | 2.167 us | N/A | 0.7893 us | 0.8013 us |
| matrix4 return self | 7.1741 ns | 6.8838 ns | 7.5030 ns | N/A | 7.0410 ns | N/A | 6.7768 ns | 6.9508 ns |
| matrix4 transpose | 6.6826 ns | 12.4966 ns | 15.3265 ns | N/A | 12.6386 ns | N/A | 15.2657 ns | 12.3396 ns |
| ray-sphere intersection x10000 | 56.2 us | 55.7 us | 15.32 us | 55.45 us | 56.02 us | N/A | N/A | 50.94 us |
| rotation3 inverse | 2.3113 ns | 3.1752 ns | 3.3292 ns | 3.3311 ns | 3.1808 ns | N/A | 8.7109 ns | 3.6535 ns |
| rotation3 mul rotation3 | 3.6584 ns | 7.5255 ns | 7.4808 ns | 8.1393 ns | 14.1636 ns | N/A | 6.8044 ns | 7.6386 ns |
| rotation3 mul vector3 x1 | 6.4950 ns | 7.6808 ns | 7.5784 ns | 7.5746 ns | 18.2547 ns | N/A | 7.2727 ns | 8.9732 ns |
| rotation3 mul vector3 x100 | 0.6465 us | 0.7844 us | 0.7573 us | 0.7533 us | 1.769 us | N/A | 0.7317 us | 0.9416 us |
| rotation3 return self | 2.4928 ns | 2.8740 ns | 2.8687 ns | N/A | 2.8724 ns | N/A | 4.7868 ns | 2.8722 ns |
| transform point2 x1 | 2.7854 ns | 2.8878 ns | 4.4207 ns | 2.8667 ns | 11.9427 ns | 2.3601 ns | N/A | 4.1770 ns |
| transform point2 x100 | 0.3316 us | 0.3574 us | 0.4445 us | 0.3008 us | 1.212 us | 0.3184 us | N/A | 0.4332 us |
| transform point3 x1 | 2.9619 ns | 10.6812 ns | 6.1037 ns | 7.7051 ns | 13.2607 ns | 3.0934 ns | N/A | 6.8419 ns |
| transform point3 x100 | 0.6095 us | 1.27 us | 0.8064 us | 0.7674 us | 1.446 us | 0.6189 us | N/A | 0.8899 us |
| transform vector2 x1 | 2.4944 ns | N/A | 3.7174 ns | 2.6273 ns | 11.9424 ns | N/A | N/A | 3.0458 ns |
| transform vector2 x100 | 0.3125 us | N/A | 0.3871 us | 0.2817 us | 1.213 us | N/A | N/A | 0.3649 us |
| transform vector3 x1 | 2.8091 ns | 7.7343 ns | 5.5064 ns | 4.4810 ns | 15.4097 ns | N/A | N/A | 4.8819 ns |
| transform vector3 x100 | 0.6035 us | 0.9439 us | 0.7573 us | 0.6327 us | 1.63 us | N/A | N/A | 0.6703 us |
| transform2 inverse | 9.0256 ns | N/A | 12.2614 ns | 9.4803 ns | N/A | 8.9047 ns | N/A | N/A |
| transform2 mul transform2 | 4.5111 ns | N/A | 8.1434 ns | 5.8677 ns | N/A | 3.8513 ns | N/A | N/A |
| transform2 return self | 4.1707 ns | N/A | 5.4356 ns | 4.2775 ns | N/A | 4.1117 ns | N/A | N/A |
| transform3 inverse | 10.9869 ns | N/A | 71.4437 ns | 56.0136 ns | N/A | 23.0392 ns | N/A | N/A |
| transform3 mul transform3d | 6.5903 ns | N/A | 8.5673 ns | 10.1802 ns | N/A | 7.6587 ns | N/A | N/A |
| transform3 return self | 7.1828 ns | N/A | 7.2619 ns | 7.2407 ns | N/A | 7.3214 ns | N/A | N/A |
| vector3 cross | 2.4257 ns | 3.6842 ns | 3.7945 ns | 3.6821 ns | 3.8323 ns | N/A | 3.8622 ns | 3.6927 ns |
| vector3 dot | 2.1055 ns | 2.3179 ns | 2.3174 ns | 2.3190 ns | 2.3195 ns | N/A | 2.3204 ns | 2.3160 ns |
| vector3 length | 2.5020 ns | 2.5002 ns | 2.5986 ns | 2.5013 ns | 2.5021 ns | N/A | 2.5036 ns | 2.5017 ns |
| vector3 normalize | 4.0454 ns | 5.8411 ns | 8.4069 ns | 8.0679 ns | 8.8137 ns | N/A | N/A | 5.8440 ns |
| vector3 return self | 2.4087 ns | 3.1021 ns | 3.1061 ns | N/A | 3.1052 ns | N/A | 3.1136 ns | 3.1071 ns |

Wide benchmarks

These benchmarks were performed on an Intel i7-4710HQ CPU on Linux. They were compiled with the 1.59.0-nightly (207c80f10 2021-11-30) Rust compiler. Lower (better) numbers are highlighted within a 2.5% range of the minimum for each row.

The versions of the libraries tested were:

  • glam - 0.20.1
  • nalgebra - 0.29.0
  • ultraviolet - 0.8.1

Run with the command:

RUSTFLAGS='-C target-feature=+avx2' cargo +nightly bench --features wide wide

| benchmark | glam_f32x1 | ultraviolet_f32x4 | nalgebra_f32x4 | ultraviolet_f32x8 | nalgebra_f32x8 |
|---|---|---|---|---|---|
| euler 2d x80000 | 142.7 us | 63.47 us | 63.94 us | 69.27 us | 69.25 us |
| euler 3d x80000 | 141.2 us | 97.18 us | 95.78 us | 103.7 us | 105.7 us |
| matrix2 determinant x16 | 18.6849 ns | 11.4259 ns | N/A | 9.9982 ns | N/A |
| matrix2 inverse x16 | 39.1219 ns | 29.8933 ns | N/A | 22.8757 ns | N/A |
| matrix2 mul matrix2 x16 | 42.7342 ns | 36.4879 ns | N/A | 33.4814 ns | N/A |
| matrix2 mul matrix2 x256 | 959.1663 ns | 935.4148 ns | N/A | 862.0910 ns | N/A |
| matrix2 mul vector2 x16 | 41.2464 ns | 18.2382 ns | N/A | 17.2550 ns | N/A |
| matrix2 mul vector2 x256 | 698.1177 ns | 544.5315 ns | N/A | 540.9743 ns | N/A |
| matrix2 return self x16 | 32.7553 ns | 29.5064 ns | N/A | 21.4492 ns | N/A |
| matrix2 transpose x16 | 32.3247 ns | 46.4836 ns | N/A | 20.0852 ns | N/A |
| matrix3 determinant x16 | 53.2366 ns | 25.0158 ns | N/A | 22.1503 ns | N/A |
| matrix3 inverse x16 | 275.9330 ns | 78.3532 ns | N/A | 69.2627 ns | N/A |
| matrix3 mul matrix3 x16 | 239.6124 ns | 115.2934 ns | N/A | 116.6237 ns | N/A |
| matrix3 mul matrix3 x256 | 3.26 us | 1.959 us | N/A | 1.963 us | N/A |
| matrix3 mul vector3 x16 | 78.4972 ns | 40.4734 ns | N/A | 47.0164 ns | N/A |
| matrix3 mul vector3 x256 | 1.293 us | 1.0 us | N/A | 1.007 us | N/A |
| matrix3 return self x16 | 112.4312 ns | 78.4870 ns | N/A | 67.3272 ns | N/A |
| matrix3 transpose x16 | 116.9654 ns | 100.1097 ns | N/A | 67.4544 ns | N/A |
| matrix4 determinant x16 | 98.8388 ns | 56.1177 ns | N/A | 55.7623 ns | N/A |
| matrix4 inverse x16 | 276.2637 ns | 191.7471 ns | N/A | 163.8408 ns | N/A |
| matrix4 mul matrix4 x16 | 230.9916 ns | 222.3948 ns | N/A | 221.8563 ns | N/A |
| matrix4 mul matrix4 x256 | 3.793 us | 3.545 us | N/A | 3.67 us | N/A |
| matrix4 mul vector4 x16 | 92.9485 ns | 87.7341 ns | N/A | 90.4404 ns | N/A |
| matrix4 mul vector4 x256 | 1.58 us | 1.542 us | N/A | 1.596 us | N/A |
| matrix4 return self x16 | 175.6153 ns | 158.7861 ns | N/A | 167.6639 ns | N/A |
| matrix4 transpose x16 | 184.0498 ns | 193.5497 ns | N/A | 147.1365 ns | N/A |
| ray-sphere intersection x80000 | 567.9 us | 154.8 us | N/A | 61.49 us | N/A |
| rotation3 inverse x16 | 32.7517 ns | 32.8107 ns | N/A | 22.3662 ns | N/A |
| rotation3 mul rotation3 x16 | 58.9408 ns | 38.6848 ns | N/A | 34.3223 ns | N/A |
| rotation3 mul vector3 x16 | 130.6707 ns | 36.7861 ns | N/A | 26.1154 ns | N/A |
| rotation3 return self x16 | 32.4345 ns | 32.5213 ns | N/A | 21.8325 ns | N/A |
| transform point2 x16 | 52.6534 ns | 31.4527 ns | N/A | 32.7317 ns | N/A |
| transform point2 x256 | 888.5654 ns | 831.9341 ns | N/A | 848.0397 ns | N/A |
| transform point3 x16 | 96.9017 ns | 81.6828 ns | N/A | 82.8904 ns | N/A |
| transform point3 x256 | 1.567 us | 1.398 us | N/A | 1.43 us | N/A |
| transform vector2 x16 | 43.7679 ns | 29.9349 ns | N/A | 31.8630 ns | N/A |
| transform vector2 x256 | 858.5660 ns | 825.0261 ns | N/A | 851.7501 ns | N/A |
| transform vector3 x16 | 96.5535 ns | 80.1612 ns | N/A | 85.0659 ns | N/A |
| transform vector3 x256 | 1.557 us | 1.394 us | N/A | 1.438 us | N/A |
| vector3 cross x16 | 42.1941 ns | 26.6677 ns | N/A | 22.0924 ns | N/A |
| vector3 dot x16 | 29.1805 ns | 12.7972 ns | N/A | 12.2872 ns | N/A |
| vector3 length x16 | 32.6014 ns | 9.7692 ns | N/A | 9.4271 ns | N/A |
| vector3 normalize x16 | 65.8815 ns | 24.1661 ns | N/A | 20.3579 ns | N/A |
| vector3 return self x16 | 32.0051 ns | 42.9462 ns | N/A | 16.7808 ns | N/A |

Running the benchmarks

The benchmarks use the criterion crate, which works on stable Rust. They can be run with:

cargo bench

For the best results close other applications on the machine you are using to benchmark!

When running "wide" benchmarks, be sure you compile with the appropriate target-features enabled, e.g. +avx2, for best results.

There is a script in scripts/summary.py to summarize the results in a nice fashion. It requires Python 3 and the prettytable Python module, and generates ASCII output.

Default and optional features

All libraries except for glam are optional for running benchmarks. The default features include cgmath, ultraviolet and nalgebra. These can be disabled with:

cargo bench --no-default-features

To selectively enable a specific default feature again use:

cargo bench --no-default-features --features nalgebra

Note that you can filter which benchmarks to run at runtime by using Criterion's filtering feature. For example, to only run scalar benchmarks and not wide ones, use:

cargo bench "scalar"

You can also get more granular. For example to only run wide matrix2 benchmarks, use:

cargo bench --features wide "wide matrix2"

or to only run the scalar "vec3 length" benchmark for glam, use:

cargo bench "scalar vec3 length/glam"

Crate features

There are a few extra features in addition to the direct features referring to each benchmarked library.

  • ultraviolet_f32x4, ultraviolet_f32x8, nalgebra_f32x4, nalgebra_f32x8 - these each enable benchmarking specific wide types from each of ultraviolet or nalgebra.
  • ultraviolet_wide, nalgebra_wide - these enable benchmarking all wide types from ultraviolet or nalgebra respectively.
  • wide - enables all "wide" type benchmarks
  • all - enables all supported libraries, including wide and scalar ones.
  • unstable - see next section

unstable feature

The unstable feature requires a nightly compiler, and it allows us to tell rustc not to inline certain functions within hot benchmark loops. This is used in the ray-sphere intersection benchmark in order to simulate situations where the autovectorizer would not be able to properly vectorize your code.

Running the tests

The tests can be run using:

cargo test

Publishing results

When publishing benchmark results it is important to document the details of how the benchmarks were run, including:

  • The version of mathbench used
  • The versions of all libraries benched
  • The Rust version
  • The build settings used, especially when they differ from the defaults
  • The specification of the hardware that was used
  • The output of scripts/summary.py
  • The full Criterion output from target/criterion

Adding a new library

There are different steps involved in adding unit tests and benchmarks for a new library.

Benchmarks require an implementation of the mathbench::RandomVec trait for the types you want to benchmark. If the type implements the rand crate distributions::Distribution trait for Standard, then you can simply use the impl_random_vec! macro in src/lib.rs. Otherwise, you can provide a function that generates a new random value of your type and pass that to impl_random_vec!.

To add the new library type to a benchmark, add another bench_function call to the Criterion BenchmarkGroup.

Increment the patch version number of mathbench in the Cargo.toml.

Update CHANGELOG.md.

Build times

mathbench also includes a tool for comparing full build times in tools/buildbench. Incremental build times are not measured as it would be non trivial to create a meaningful test across different math crates.

The buildbench tool uses the -Z timings feature of the nightly build of cargo, thus you need a nightly build to run it.

buildbench generates a Cargo.toml and empty src/lib.rs in a temporary directory for each library, recording some build time information which is included in the summary table below. The temporary directory is created every time the tool is run so this is a full build from a clean state.

Each library is only built once so you may wish to run buildbench multiple times to ensure results are consistent.

By default crates are built using the release profile with default features enabled. There are options for building the dev profile or without default features, see buildbench --help for more information.

The output columns include the total build time, the self build time (the time it took to build the crate on its own, excluding dependencies), and the number of units, which is the number of dependencies (this will be 2 at minimum).

When comparing build times keep in mind that each library has different feature sets and that naturally larger libraries will take longer to build. For many crates tested the dependencies take longer than the math crate. Also keep in mind if you are already building one of the dependencies in your project you won't pay the build cost twice (unless it's a different version).

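| crate | version | total (s) | self (s) | units |
|---|---|---|---|---|
| cgmath | 0.17.0 | 6.8 | 3.0 | 17 |
| euclid | 0.22.1 | 3.4 | 1.0 | 4 |
| glam | 0.9.4 | 1.1 | 0.6 | 2 |
| nalgebra | 0.22.0 | 24.2 | 18.0 | 24 |
| pathfinder_geometry | 0.5.1 | 3.0 | 0.3 | 8 |
| static-math | 0.1.6 | 6.9 | 1.7 | 10 |
| ultraviolet | 0.5.1 | 2.5 | 1.3 | 4 |
| vek | 0.12.0 | 34.4 | 10.1 | 16 |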

These benchmarks were performed on an Intel i7-4710HQ CPU with 16GB RAM and a Toshiba MQ01ABD100 HDD (SATA 3Gbps 5400RPM) on Linux.

License

Licensed under either of

at your option.

Contribution

Contributions in any form (issues, pull requests, etc.) to this project must adhere to Rust's Code of Conduct.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Support

If you are interested in contributing or have a request or suggestion, create an issue on GitHub.

Dependencies

~8MB
~195K SLoC