#string #smaz #string-compression

fast-smaz

Pure Rust SMAZ compression implementation

1 unstable release

0.1.0 Jul 3, 2022

#414 in Compression

35 downloads per month

MIT license

44KB
523 lines

fast-smaz

Pure Rust implementation of smaz algorithm, very close to the original antirez/smaz C implementation with some differences.

Example

use fast_smaz::Smaz;

fn main() {
    let my_message = "Hello world!";
    let compressed = my_message.smaz_compress();
    let decompressed = compressed.smaz_decompress().unwrap();
    assert_eq!(my_message.as_bytes(), decompressed);
}

Inspiration

This implementation was inspired in both original antirez/smaz and the peterc/rsmaz, a Ruby implementation of smaz.

Comparison

fast-smaz and antirez/smaz

This implementation is very similar to the original ANSI C implementation of SMAZ, it uses the exact same hash-table and hashing algorithm.

I haven't measured how performance compares, but the ANSI C implementation can do a lot of assumptions both tied to ANSI C specification and correctness of the implementation, while this Rust implementation has some additional checks to ensure the safetyness of the code, which may impact the performance, but should not be that impactful.

fast-smaz and silentsokolov/rust-smaz

silentsokolov/rust-smaz uses lazy_static! and creates the codebook hash table at first execution, instead of shipping a pre-generated hash table.

This translates into a slower encoding process (benchmarks on my computer shows 73% slower encoding), not only because of the overhead of lazy_static! initialization, but also the overhead of the atomic operation for initialization check and the hash distribution itself.

I've choosen to keep antirez original bucket-size and distribution because it looked well distributed, and performance results shows that it does have a very good performance.

Note that changing lazy_static! to OnceLock does not help, since both are very similar.

The performance improvement only applies to the encoding process, the decoding process has a similar performance, since both silentsokolov/rust-smaz and fast-smaz share a similar code, which is a simple lookup to the reverse codebook.

Also, fast-smaz uses const for the codebook and reverse codebook, which allows the compiler to do aggressive optimizations, this lowers cache-misses and RAM accesses, since const variables can (and probably will) be inlined at the use-site.

Features

custom-cookbook: For development only, was used as a testing codebook before the proper translation of the origional C implementation codebook. It's way more inefficient and with a poor distribution. It was constructed using the src/hashing.rs algorithm, I didn't wast too much time improving it and ensuring its correctness, it was only used to generate a correct and functional codebook for testing. Don't use this.

No runtime deps