#cesu8 #codec #decode #encode #mutf8

no-std simd_cesu8

An extremely fast, SIMD accelerated, encoding and decoding library for CESU-8 and Modified UTF-8

2 stable releases

1.0.1 Jul 19, 2024
0.1.0 Jul 19, 2024

#300 in Encoding

Download history 141/week @ 2024-09-15 59/week @ 2024-09-22 159/week @ 2024-09-29 48/week @ 2024-10-06 215/week @ 2024-10-13 206/week @ 2024-10-20 223/week @ 2024-10-27 206/week @ 2024-11-03 160/week @ 2024-11-10 201/week @ 2024-11-17 263/week @ 2024-11-24 226/week @ 2024-12-01 173/week @ 2024-12-08 255/week @ 2024-12-15 261/week @ 2024-12-22 410/week @ 2024-12-29

1,133 downloads per month
Used in 19 crates (2 directly)

Apache-2.0 OR MIT

315KB
545 lines

simd_cesu8Build Status Crates.io Version Crates.io MSRV

An extremely fast, SIMD-accelerated, encoding and decoding library for CESU-8 and Modified UTF-8.

Features

  • Extremely fast encoding and decoding of CESU-8 and Modified UTF-8.
  • Only allocates memory when absolutely necessary. Zero reallocations.
  • Supports lossy decoding of CESU-8 and MUTF-8.
  • Supports strict decoding of CESU-8 and MUTF-8.
  • Up to 25 times faster than known alternatives when encoding.
  • Up to 14 times faster than known alternatives when decoding.
  • Supports AVX 2 and SSE 4.2 implementations on x86 and x86-64.
  • ARM64 (aarch64) SIMD is supported.
  • WASM (wasm32) SIMD is supported.
  • PowerPC (powerpc) SIMD is (partially) supported.
  • Falls back to word-at-a-time implementations instead of byte-at-a-time.
  • No-std support.

Usage

Add this to your Cargo.toml file:

[dependencies.simd_cesu8]
version = "1"
features = ["nightly"]

It is extremely recommended to use the nightly feature, as it enables the fastest implementations of the library. If you don't want to use nightly, you can remove the feature and the library will still work, but will be slower.

Documentation

The documentation can be found on docs.rs.

For quick access to certain functions, I highly recommend pressing the "All Items" button in the top left corner of the documentation page.

MSRV

The minimum supported Rust version is 1.79.0.

Benchmarks

These benchmarks were run with an AMD Ryzen 9 7950X3D on WSL2 with Rust 1.81.0-nightly. Each of the operations were measured with an input exactly equal to 16,380 bytes. Data sets were randomly generated with the same seed for each operation. Each set contained 1,000 values, that were endlessly repeated as input. The benchmarks were run with the criterion crate. The tables got created via critcmp. The source code for the benchmarks can be found in the benches directory. Any call to simd_cesu8 was replaced with the cesu8 equivalent when collecting the data for the cesu8 benchmarks. The data collected can also be found in the benches directory.

We compare simd_cesu8 to the cesu8 library, as it's the most popular library for CESU-8 and MUTF-8 encoding and decoding in Rust. I originally started this project, however, because there's a few minor semantic errors in the cesu8 library that I wanted to fix myself.

Encoding

To understand the benchmarks, it's first a good idea to familiarize yourself with the datasets used to benchmark the encoding operations:

  • ascii_non_null_strings are strings exactly 16,380 bytes long with only ASCII characters and no null bytes.
  • ascii_null_alternating_strings are strings exactly 16,380 bytes long with only ASCII characters and null bytes alternating, where every odd byte is non-null and every even byte is null.
  • interspersed_strings are strings exactly 16,380 bytes long with every 20 bytes in the following pattern:
    • A non-null ASCII character (1 byte)
    • A 2-byte UTF-8 character
    • A 3-byte UTF-8 character
    • A 4-byte UTF-8 character
    • A null byte
    • A 2-byte UTF-8 character
    • A 3-byte UTF-8 character
    • A 4-byte UTF-8 character
  • null_strings are strings exactly 16,380 bytes long with only null bytes.
  • utf8_clamped_width_2_strings are strings exactly 16,380 bytes long with only 2-byte UTF-8 characters.
  • utf8_clamped_width_3_strings are strings exactly 16,380 bytes long with only 3-byte UTF-8 characters.
  • utf8_clamped_width_4_strings are strings exactly 16,380 bytes long with only 4-byte UTF-8 characters.

simd_cesu8::encode

Encode CESU-8 Benchmarks

simd_cesu8::encode is around %2,500 faster than cesu8::to_cesu8 in optimal conditions. In suboptimal conditions, simd_cesu8::encode is around %50 faster than cesu8::to_cesu8. The main reason for the speedup, beyond SIMD, is that simd_cesu8 doesn't quickly check for just ASCII characters, but instead, it checks for the start of 4-byte characters. This is a significant speedup over the traditional ASCII hot paths. Other speedups come in the form of cumulative, tiny optimizations that add up to a significant speedup.

simd_cesu8::mutf8::encode

Encode MUTF-8 Benchmarks

Similar to simd_cesu8::encode, simd_cesu8::mutf8::encode is around %1800 faster than cesu8::to_java_cesu8 in optimal conditions. In suboptimal conditions, simd_cesu8::mutf8::encode is around %50 faster than cesu8::to_java_cesu8. The same reason for the speedup applies here, but instead of checking for the start of 4-byte characters, we're also checking for any null bytes with SIMD. Other speedups come in the form of cumulative, tiny optimizations that add up to a significant speedup.

Decoding

To understand the benchmarks, it's first a good idea to familiarize yourself with the datasets used to benchmark the decoding operations:

  • ascii_non_null_bytes is a set of UTF-8 strings exactly 16,380 bytes long with only ASCII characters and no null bytes.
  • ascii_null_alternating_bytes is a set of UTF-8 strings exactly 16,380 bytes long with only ASCII characters and null bytes alternating, where every odd byte is non-null and every even byte is null.
  • interspersed_bytes is a set of UTF-8 strings exactly 16,380 bytes long with every 20 bytes in the following pattern:
    • A non-null ASCII character (1 byte)
    • A 2-byte UTF-8 character
    • A 3-byte UTF-8 character
    • A 4-byte UTF-8 character
    • A null byte
    • A 2-byte UTF-8 character
    • A 3-byte UTF-8 character
    • A 4-byte UTF-8 character
  • interspersed_cesu8_bytes is a set of CESU-8 strings exactly 16,380 bytes long with every 12 bytes in the following pattern:
    • An ASCII character (1 byte)
    • A 2-byte UTF-8 character
    • A 3-byte UTF-8 character
    • A 4-byte UTF-8 character encoded as a surrogate pair (6 bytes)
  • interspersed_mutf8_bytes is a set of MUTF-8 strings exactly 16,380 bytes long with every 26 bytes in the following pattern:
    • A non-null ASCII character (1 byte)
    • A non-null ASCII character (1 byte)
    • A 2-byte UTF-8 character
    • A 3-byte UTF-8 character
    • A 4-byte UTF-8 character encoded as a surrogate pair (6 bytes)
    • A null byte encoded as a 2-byte sequence
    • A 2-byte UTF-8 character
    • A 3-byte UTF-8 character
    • A 4-byte UTF-8 character encoded as a surrogate pair (6 bytes)
  • mutf8_null_bytes is a set of MUTF-8 strings exactly 16,380 bytes long with only null encoded as a 2-byte sequence.
  • null_bytes is a set of UTF-8 strings exactly 16,380 bytes long with only null bytes.
  • surrogate_pairs_bytes is a set of CESU-8 strings exactly 16,380 bytes long with only 6-byte surrogate pairs.
  • utf8_clamped_width_2_bytes is a set of UTF-8 strings exactly 16,380 bytes long with only 2-byte UTF-8 characters.
  • utf8_clamped_width_3_bytes is a set of UTF-8 strings exactly 16,380 bytes long with only 3-byte UTF-8 characters.
  • utf8_clamped_width_4_bytes is a set of UTF-8 strings exactly 16,380 bytes long with only 4-byte UTF-8 characters.

simd_cesu8::decode

Decode CESU-8 Benchmarks

In optimal conditions, simd_cesu8::decode can be around %1,400 faster than cesu8::to_cesu8. However, simd_cesu8::decode is slower than cesu8::to_cesu8 when decoding surrogate pairs. The main reason is that simd_cesu8 supports lossy decoding, but the internals of that seems to introduce a decent amount of overhead. simd_cesu8 is faster in all ways that cesu8 when LTO is enabled. However, I didn't run the benchmarks with LTO enabled because it seems disingenuous to do so as LTO shouldn't be a requirement for a library to be fast.

simd_cesu8::mutf8::decode

Decode MUTF-8 Benchmarks

Much like simd_cesu8::decode, simd_cesu8::mutf8::decode can be around %1,400 faster than cesu8::to_java_cesu8. But, the pitfalls are equally the same, as simd_cesu8::mutf8::decode is slower than cesu8::to_java_cesu8 when decoding surrogate pairs. The other catch is that we're slightly slower than cesu8 when decoding null bytes. The same reasoning applies here, as the overhead is higher.

Thanks

  • To the authors of the cesu8 library. It's a genuinely good library that helped me years ago learn some more complex Rust concepts.
  • To the authors of simdutf8 for arguably one of the biggest improvements to the borrow paths in this library.
  • To the authors of simdnbt for the inspiration of using portable_simd.

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Dependencies