1 unstable release

0.1.0 Oct 21, 2024

#321 in Compression

MIT/Apache

18KB
261 lines

Skive

An array slicer and compressor library.

Slice byte slices into sized chunks and compress them into blocks!

Each block is assigned an ID and compressed using ZLib, and can then be processed elsewhere.

It might be of use to someone, who knows.


lib.rs:

A byte array slicer that slices data into sized chunks and ZLib compresses them.

It takes a u8 slice and splits it into evenly sized chunks (the final chunk may be smaller than the given size). These chunks are then compressed using the ZlibEncoder and given an ID. They can then be processed or utilised in whatever way fits your purpose.

Each generated chunk is in the form of a BinBlock, which holds the compressed data as well as some metadata (block ID, chunk size, compressed size, and block hash). Each block supplies functions to turn it back into a Vec<u8> and to construct it from a u8 slice. Each block's checksum is a hex-encoded Sha256 hash of the uncompressed data; when a block is decompressed, the checksum is recomputed to ensure the data is valid.
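
As a minimal sketch of that checksum scheme (this uses the sha2 and hex crates directly; skive may compute it differently internally):

use sha2::{Digest, Sha256};

// Hex-encoded Sha256 of an uncompressed chunk, as described above
fn checksum(chunk: &[u8]) -> String {
    hex::encode(Sha256::digest(chunk))
}

fn main() {
    let chunk = b"some uncompressed chunk";
    let before = checksum(chunk);
    /* compress the chunk, move it around, decompress it again... */
    let after = checksum(chunk);
    assert_eq!(before, after, "chunk changed between compression and decompression");
}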

A parallel version is also supplied, which uses a ThreadPool to process larger inputs in less time. It behaves the same way as above. There is an unordered version, Slicer::par_slice, and an ordered version, Slicer::par_slice_ordered, which sorts the output chunks by their ID.
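
For example, the ordered variant might be used like this (assuming it takes the same arguments as Slicer::par_slice; check the crate docs for the exact signature):

use skive::Slicer;
use std::fs;

fn main() -> std::io::Result<()> {
    let data = fs::read("some-file.bin")?;

    // Assumed to mirror Slicer::par_slice: 1MB chunks across 4 threads, with the
    // resulting blocks returned sorted by their block ID
    let blocks = Slicer::par_slice_ordered(&data, 1024 * 1024, 4)?;

    for _block in blocks {
        /* blocks arrive here in ID order */
    }
    Ok(())
}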

Bin Block Format

A BinBlock is the output component of the Slicer operations. It holds the compressed data as well as the following set of metadata:

  • A Block ID, which signifies the block number (Max value: 2^32 - 1)
  • The block size, which is the size of the chunked data before compression
    • This is the same for all blocks except the last one, whose chunk may be smaller than the given chunk size (Max value: 2^32 - 1)
  • The compressed size, which is the size of the compressed data (Max value: compressed size < block size <= 2^32 - 1)
  • The Sha256 hash of the uncompressed data, used as a checksum
  • The compressed data itself (compressed using ZLib)
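
Conceptually, a block therefore carries something like the following (field names here are illustrative; the actual definition inside the crate may differ):

// Illustrative sketch only; the real BinBlock fields may be named differently
struct BinBlock {
    id: u32,              // block number (max 2^32 - 1)
    block_size: u32,      // size of the chunk before compression
    compressed_size: u32, // size of the ZLib-compressed payload
    hash: [u8; 32],       // Sha256 of the uncompressed chunk
    data: Vec<u8>,        // the compressed bytes themselves
}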

Each block can be converted to a Vec<u8> with the BinBlock::into_bytes() function, so it can be stored on disk or sent elsewhere. The storage format is as follows (Big Endian); a short parsing sketch follows the list below:

  • 4 Bytes for the Block ID (u32)
  • 4 Bytes for the Block size (u32)
  • 4 Bytes for the compressed data size (u32)
  • 32 Bytes for the Hash
  • Remainder of the slice is the compressed data
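
As a sketch, the header of a serialised block can be picked apart by hand like this (illustrative only; in practice you would rebuild the block through the crate's own functions):

// Manually read the big-endian header of a block produced by BinBlock::into_bytes()
fn parse_header(bytes: &[u8]) -> Option<(u32, u32, u32, &[u8], &[u8])> {
    if bytes.len() < 44 {
        return None; // 4 + 4 + 4 + 32 bytes of header
    }
    let id = u32::from_be_bytes(bytes[0..4].try_into().ok()?);
    let block_size = u32::from_be_bytes(bytes[4..8].try_into().ok()?);
    let compressed_size = u32::from_be_bytes(bytes[8..12].try_into().ok()?);
    let hash = &bytes[12..44];        // 32-byte Sha256 of the uncompressed chunk
    let compressed = &bytes[44..];    // remainder is the ZLib-compressed data
    Some((id, block_size, compressed_size, hash, compressed))
}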

Examples

Sequential Operation

Here you can chunk the data sequentially, which works well enough for small to medium inputs:

use skive::Slicer;
use std::fs;

fn main() -> std::io::Result<()> {
    let some_file = fs::read("some-file.pdf")?;
    
    // We want the data sliced into 2MB chunks and then compressed
    let cmp_blocks = Slicer::slice(&some_file, 2 * 1024 * 1024)?;

    // Now we can convert the blocks into bytes and send them on their way
    for block in cmp_blocks {
        let data = block.into_bytes().expect("unable to convert block to bytes");
        /* Send them across a network or something */
    }
    Ok(())
}

Parallel Operation

Here you can use the parallel slicer, which slices the data using a ThreadPool. You can specify the number of threads you wish to run concurrently, and the pool will queue up operations for you automatically:

use skive::Slicer;
use std::fs;

fn main() -> std::io::Result<()> {
    let some_large_file = fs::read("some-huge-file-like-a-video.mp4")?;

    // We want to slice the data into 4MB chunks across 8 threads and compress
    // them in parallel
    let cmp_blocks = Slicer::par_slice(&some_large_file, 4 * 1024 * 1024, 8)?;

    // Now we can convert the blocks into bytes and send them on their way
    for block in cmp_blocks {
        let data = block.into_bytes().expect("unable to convert block to bytes");
        /* Send them across a network or something */
    }

    Ok(())
}

Dependencies

~1.6–2.2MB
~44K SLoC