lossless-transform-utils
About
General purpose utility methods for creating lossless transforms for various file formats.
Utilities
This section covers the API. A usage example and an explanation of how to use the crate can be found at the bottom of this README. Additional information for each method is attached to the method itself in the source code.
Histogram
Calculates the number of times each byte appears in a dataset.
use lossless_transform_utils::histogram::histogram32_from_bytes;
use lossless_transform_utils::histogram::Histogram32;
let data = [1, 2, 3, 1, 2, 1];
let mut histogram = Histogram32::default();
histogram32_from_bytes(&data, &mut histogram);
assert_eq!(histogram.inner.counter[1], 3); // Byte value 1 appears 3 times
assert_eq!(histogram.inner.counter[2], 2); // Byte value 2 appears 2 times
assert_eq!(histogram.inner.counter[3], 1); // Byte value 3 appears 1 time
Entropy
Calculates the entropy (ideal code length) for a given histogram.
use lossless_transform_utils::histogram::Histogram32;
use lossless_transform_utils::entropy::code_length_of_histogram32;
use lossless_transform_utils::histogram::histogram32_from_bytes;
let data = [1, 2, 3, 1, 2, 1];
let mut histogram = Histogram32::default();
histogram32_from_bytes(&data, &mut histogram);
let entropy = code_length_of_histogram32(&histogram, data.len() as u64);
println!("Entropy: {}", entropy); // 'entropy' bits per byte
This allows you to estimate how compressible the data is by the entropy coding step of a compressor. The returned result is the average number of bits needed to represent each symbol (1 byte), i.e. 1.0 == 1 bit.
The code length calculation in this module favours accuracy over performance, so it uses proper f64 arithmetic with log2, which is slow in terms of CPU time. However, because the input histograms only have 256 elements, the accuracy/performance tradeoff is considered worthwhile here. As a point of reference, the runtime is a constant ~900ns per histogram on a 5900X.
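For intuition, the returned value corresponds to the Shannon entropy of the byte distribution, i.e. the sum over all byte values of -p * log2(p), where p is the byte's frequency. A minimal sketch of that formula is shown below; this is illustrative only, not the crate's actual implementation, and it takes the raw counts directly so it doesn't depend on crate internals:
/// Illustrative only: the Shannon entropy (bits per byte) that
/// code_length_of_histogram32 approximates, computed from raw byte counts.
fn shannon_entropy_bits_per_byte(counts: &[u32; 256], total_bytes: u64) -> f64 {
    counts
        .iter()
        .filter(|&&count| count > 0) // log2(0) is undefined; zero counts contribute nothing
        .map(|&count| {
            let p = count as f64 / total_bytes as f64;
            -p * p.log2()
        })
        .sum()
}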
Match Estimator
Estimates the number of >=3 byte LZ matches in a given input array.
use lossless_transform_utils::match_estimator::estimate_num_lz_matches_fast;
let data = [1, 2, 3, 1, 2, 1];
let num_lz_matches = estimate_num_lz_matches_fast(&data);
println!("Number of LZ matches: {}", num_lz_matches);
Crate Features
- std [default]: Enables x86 CPU feature detection.
  - Because x86 feature detection is implemented via CPU instruction, you can use
    the std feature in a no_std environment. It's just that the API needed here
    isn't available in no_std environments.
- c-exports: Builds the library with C exports for the public APIs.
- nightly: Enables x86 acceleration for histogram32 creation (requires naked ASM).
- bench: Enables benchmarks for non-public API items.
These exist but are currently unused:
- pgo: Additional workloads for profile guided optimization in the all benchmark project.
These features should not be used:
- estimator-avx2: Uses AVX2 instructions for the match estimator.
  - Don't use this; it's not fast enough, due to a lack of an AVX2 'scatter' instruction.
  - Loses a bit of accuracy; at around the same speed as scalar.
  - If you're a SIMD wizard with better ideas, feel free to contribute.
These features are not fully tested:
- estimator-avx512: Uses AVX512 instructions for the match estimator.
  - I (Sewer) don't own AVX512 capable hardware.
  - So this hasn't been fully benchmarked/optimized.
  - It (should) be faster than scalar, however.
  - Tested only in CI.
(If you have an AVX512 machine, please reach out with what perf. results you get!!)
Reference Performance Numbers
Tested with a regular cargo build.
R9 5900X, single core, CL16 3200MHz DDR4 RAM.
Histogram Creation:
entropy/code_length_of_histogram32/8388608
time: [885.25 ns 886.12 ns 887.10 ns]
thrpt: [1.1273 Melem/s 1.1285 Melem/s 1.1296 Melem/s]
histogram/portable/public-api/8388608
time: [1.1709 ms 1.1720 ms 1.1733 ms]
thrpt: [6.6587 GiB/s 6.6661 GiB/s 6.6723 GiB/s]
Match Estimator:
match_estimator/random_data/8388608
time: [5.8430 ms 5.8499 ms 5.8586 ms]
thrpt: [1.3335 GiB/s 1.3355 GiB/s 1.3371 GiB/s]
match_estimator/repeated_data_len_1/8388608
time: [5.2753 ms 5.2781 ms 5.2816 ms]
thrpt: [1.4792 GiB/s 1.4802 GiB/s 1.4810 GiB/s]
match_estimator/repeated_data_len_2/8388608
time: [5.3962 ms 5.4068 ms 5.4204 ms]
thrpt: [1.4413 GiB/s 1.4449 GiB/s 1.4478 GiB/s]
match_estimator/repeated_data_len_3/8388608
time: [5.6629 ms 5.6675 ms 5.6726 ms]
thrpt: [1.3772 GiB/s 1.3785 GiB/s 1.3796 GiB/s]
match_estimator/repeated_data_len_4/8388608
time: [6.0370 ms 6.0420 ms 6.0478 ms]
thrpt: [1.2918 GiB/s 1.2930 GiB/s 1.2941 GiB/s]
match_estimator/repeated_data_len_5/8388608
time: [5.5378 ms 5.5467 ms 5.5567 ms]
thrpt: [1.4060 GiB/s 1.4085 GiB/s 1.4108 GiB/s]
match_estimator/repeated_data_len_6/8388608
time: [5.4477 ms 5.4510 ms 5.4547 ms]
thrpt: [1.4323 GiB/s 1.4332 GiB/s 1.4341 GiB/s]
match_estimator/repeated_data_len_7/8388608
time: [5.4951 ms 5.5006 ms 5.5071 ms]
thrpt: [1.4186 GiB/s 1.4203 GiB/s 1.4217 GiB/s]
Note: Building with -C target-cpu=native may sometimes yield performance improvements on some CPUs; however, I've tried to arrange the code to minimize that as much as possible.
The combined speed/rate of the operations, e.g. 'splitting data', for an 8MiB file on a single thread is 1.22GB/s.

Calculation:

- histogram: 1.1720 ms (6.6661 GiB/s)
- match_estimator: 5.6675 ms (1.3785 GiB/s)
- code_length_of_histogram32: constant 886.12 ns

8388608 bytes / (1.1720ms + 5.6675ms + 886.12ns) == 1169.525 MiB/s == 1226.335 MB/s
To put this into perspective: if you have 1GiB of data, and for each file you test 4 different splits/variants, your rate (per thread) is around 305MB/s (roughly 1226 MB/s divided by 4). Performance on multiple threads has not been tested, but is expected to scale well, given the workload appears to be CPU bound.
Implementation Accuracy of the Match Estimator
Can be tested with cargo test -- --nocapture | grep -i "^\[res:"
Found Matches at (4K - 64K offsets):
[res:matches_4096_intervals_131072] matches: 126298, expected: < 126976, minimum: 113000, found: 99.5%
[res:matches_8192_intervals_131072] matches: 121123, expected: < 122880, minimum: 95000, found: 98.6%
[res:matches_16384_intervals_131072] matches: 112196, expected: < 114688, minimum: 60000, found: 97.8%
[res:matches_32768_intervals_131072] matches: 59506, expected: < 98304, minimum: 13000, found: 60.5%
[res:matches_65536_intervals_131072] matches: 3737, expected: < 65536, minimum: 450, found: 5.7%
False Positives Testing:
[res:no_matches_128k] matches: 68, expected: < 131, allowed_error: 0.1%, actual_error: 0.052%
[res:no_matches_long_distance_16777215] matches: 11949, expected: < 16777, allowed_error: 0.1%, actual_error: 0.071%
Parameters such as the hash table size were tuned for the best tradeoff between performance and accuracy.
Development
For information on how to work with this codebase, see README-DEV.MD.
License
Licensed under MIT.
How to use this crate
This guide explains how to effectively use the crate to estimate and compare the compressibility of different data.
[Please note: I (Sewer) am not a compression expert; I just deliver high perf, bleeding edge libraries for modding games. Consult experts in the field]
Overview
To determine if one piece of data is more compressible than another, you'll want to look at two key metrics:

- the number of LZ matches
- the entropy (ideal code length)

These are provided by this crate.
A significant difference in either metric suggests a likely improvement in compression ratio.
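As a rough sketch of gathering both metrics for a buffer (the helper name is mine, not part of the crate's API; the examples below inline the same calls):
use lossless_transform_utils::{entropy::*, histogram::*, match_estimator::*};

/// Hypothetical helper returning (estimated LZ matches per byte, entropy in bits per byte).
fn compressibility_metrics(data: &[u8]) -> (f64, f64) {
    let matches = estimate_num_lz_matches_fast(data) as f64 / data.len() as f64;
    let mut hist = Histogram32::default();
    histogram32_from_bytes(data, &mut hist);
    let entropy = code_length_of_histogram32(&hist, data.len() as u64);
    (matches, entropy)
}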
Example: Notes
The parameters MATCH_RATIO_THRESHOLD and ENTROPY_THRESHOLD below are provided as a reference only; appropriate numbers may vary depending on the nature of the input data.
If you are writing performance-focused transforms, you may also want to factor the number of clock cycles into the calculation of whether it's worthwhile to transform at all.
Example: Reordering Data
In this case, you have a file composed of structures, for example 16-byte structs with bit packed values. You want to rearrange the order of the bits to improve the compression ratio.
In this case, LZ match differences are most relevant.
use lossless_transform_utils::match_estimator::*;
fn should_reorder_data(original: &[u8], reordered: &[u8]) -> bool {
    // Constants for determining significance
    const MATCH_RATIO_THRESHOLD: f64 = 2.0; // needs 2x more matches

    // Calculate match ratios
    let matches_orig = estimate_num_lz_matches_fast(original) as f64 / original.len() as f64;
    let matches_reordered = estimate_num_lz_matches_fast(reordered) as f64 / reordered.len() as f64;

    matches_reordered > (matches_orig * MATCH_RATIO_THRESHOLD)
}
Tip: If you have a struct with bitfields, you can often find compression wins by rearranging bits such that fields likely to repeat are byte aligned. This is what we're testing for here.
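As a concrete (hypothetical) illustration of that tip: suppose each element packs a 4-bit flags field (high nibble, often identical across elements) and a 4-bit value field (low nibble). Reordering pairs of elements so the two flags nibbles share one byte and the two value nibbles share another keeps the frequently repeating bits byte-aligned; the result could then be checked with should_reorder_data above.
/// Hypothetical, losslessly reversible bit reordering: for each pair of input
/// bytes, pack both high nibbles (flags) into one output byte and both low
/// nibbles (values) into the other. Odd-length tails are ignored for brevity.
fn reorder_nibble_pairs(data: &[u8]) -> Vec<u8> {
    data.chunks_exact(2)
        .flat_map(|pair| {
            let (a, b) = (pair[0], pair[1]);
            let flags_byte = (a & 0xF0) | (b >> 4); // both high nibbles
            let values_byte = ((a & 0x0F) << 4) | (b & 0x0F); // both low nibbles
            [flags_byte, values_byte]
        })
        .collect()
}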
Example: Splitting Data
When you have different types of data mixed together (e.g. 🍎 mixed with 🍐), you might want to split them into separate streams for compression.
You can measure if this is beneficial by comparing both matches and entropy.
use lossless_transform_utils::{
    entropy::*,
    histogram::*,
    match_estimator::*,
};

// original data: 🍎🍐🍎🍐🍎🍐🍎🍐
// split:
// part1: 🍎🍎🍎🍎
// part2: 🍐🍐🍐🍐
fn should_split(part1: &[u8], part2: &[u8]) -> bool {
    // Constants for determining significance
    const MATCH_RATIO_THRESHOLD: f64 = 2.0; // One input needs 2x more matches
    const ENTROPY_THRESHOLD: f64 = 1.0; // Or entropy difference >1 bit per byte

    // Calculate match ratios
    let matches1 = estimate_num_lz_matches_fast(part1) as f64 / part1.len() as f64;
    let matches2 = estimate_num_lz_matches_fast(part2) as f64 / part2.len() as f64;

    // Calculate entropies
    let mut hist1 = Histogram32::default();
    let mut hist2 = Histogram32::default();
    histogram32_from_bytes(part1, &mut hist1);
    histogram32_from_bytes(part2, &mut hist2);
    let entropy1 = code_length_of_histogram32(&hist1, part1.len() as u64);
    let entropy2 = code_length_of_histogram32(&hist2, part2.len() as u64);

    // Check for significant differences
    let significant_matches = matches1 > (matches2 * MATCH_RATIO_THRESHOLD) ||
        matches2 > (matches1 * MATCH_RATIO_THRESHOLD);
    let significant_entropy = (entropy1 - entropy2).abs() > ENTROPY_THRESHOLD;

    significant_matches || significant_entropy
}
This is similar to what I do for BC7 blocks in my own dxt-lossless-transform project (soon).
Tip: Compressors periodically adapt their entropy state as the data is being compressed (by resetting the entropy table, etc.). Making smaller chunks of data within a file more similar therefore improves compression, even if the entropy of the file as a whole is unchanged.
Example: Transforming Data
In this case, you have data that is not being rearranged, but instead being transformed (mutated) in a reversible way.
fn encode_delta_diff<T>(data: &mut [T])
where
    T: Copy + std::ops::Sub<Output = T> + Default,
{
    let mut prev = T::default(); // Equivalent to 0 for numeric types
    for item in data.iter_mut() {
        let v = *item;
        *item = v - prev;
        prev = v;
    }
}

fn decode_delta_diff<T>(data: &mut [T])
where
    T: Copy + std::ops::Add<Output = T> + Default,
{
    let mut prev = T::default();
    for item in data.iter_mut() {
        let v = *item;
        let new_v = prev + v;
        *item = new_v;
        prev = new_v;
    }
}
This is an example from "Reorder floats + Delta" in Aras' excellent blog, translated to Rust.
More details are in the linked blog, but essentially, if you're encoding 'differences', then more patterns may emerge in the data: longer runs of bytes, or more repeated bytes.
In this case, when you have a 'transformed' and 'non-transformed' variant, you want to follow the same steps as when splitting data.
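A minimal sketch of that check, reusing the thresholds from the earlier examples (the function name is mine, not part of the crate's API):
use lossless_transform_utils::{entropy::*, histogram::*, match_estimator::*};

// Hypothetical check: decide whether the transformed (e.g. delta-encoded) variant
// looks more compressible than the original, using the same metrics as should_split.
fn should_transform(original: &[u8], transformed: &[u8]) -> bool {
    const MATCH_RATIO_THRESHOLD: f64 = 2.0; // transformed needs 2x more matches
    const ENTROPY_THRESHOLD: f64 = 1.0;     // or >1 bit per byte lower entropy

    let matches_orig = estimate_num_lz_matches_fast(original) as f64 / original.len() as f64;
    let matches_tran = estimate_num_lz_matches_fast(transformed) as f64 / transformed.len() as f64;

    let mut hist_orig = Histogram32::default();
    let mut hist_tran = Histogram32::default();
    histogram32_from_bytes(original, &mut hist_orig);
    histogram32_from_bytes(transformed, &mut hist_tran);
    let entropy_orig = code_length_of_histogram32(&hist_orig, original.len() as u64);
    let entropy_tran = code_length_of_histogram32(&hist_tran, transformed.len() as u64);

    matches_tran > (matches_orig * MATCH_RATIO_THRESHOLD)
        || (entropy_orig - entropy_tran) > ENTROPY_THRESHOLD
}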