#chunking #cdc #deduplication #deduplicate

bin+lib cdc-chunkers

A collection of Content Defined Chunking algorithms

3 releases

new 0.1.2 Feb 6, 2025
0.1.1 Feb 2, 2025
0.1.0 Feb 2, 2025

#515 in Algorithms


418 downloads per month
Used in chunkfs

MIT license

55KB
1.5K SLoC


rust-chunking

Implementations of Content Defined Chunking (CDC) algorithms, each provided in its own module.

Simple code to test an algorithm is provided in filetest.rs.

Features

  • Chunkers implement the std::iter::Iterator trait and yield the source data as a sequence of chunks, each described by its position and length.
  • Chunk size parameters can be customized when a chunker is created. Default values are provided.
  • Other parameters from the corresponding papers can also be set when a chunker is created.
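
The iterator-based design listed above can be sketched roughly as follows. This is a toy illustration, not the crate's code: ToyChunker, Chunk and Sizes are hypothetical names, and the shift-and-add hash is a generic stand-in rather than any of the published algorithms. It only shows how a chunker can borrow the data, honor size limits, and yield position/length pairs through Iterator.

```rust
/// Hypothetical chunk descriptor: offset and length within the source slice.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Chunk {
    pos: usize,
    len: usize,
}

/// Hypothetical size limits in bytes (not the crate's SizeParams).
#[derive(Debug, Clone, Copy)]
struct Sizes {
    min: usize,
    avg: usize,
    max: usize,
}

/// Toy content-defined chunker that borrows the data it splits.
struct ToyChunker<'a> {
    data: &'a [u8],
    pos: usize,
    sizes: Sizes,
}

impl<'a> ToyChunker<'a> {
    fn new(data: &'a [u8], sizes: Sizes) -> Self {
        assert!(sizes.min >= 1 && sizes.min <= sizes.avg && sizes.avg <= sizes.max);
        // A power-of-two average lets the cut test look at a fixed number of hash bits.
        assert!(sizes.avg >= 2 && sizes.avg.is_power_of_two());
        ToyChunker { data, pos: 0, sizes }
    }
}

impl<'a> Iterator for ToyChunker<'a> {
    type Item = Chunk;

    fn next(&mut self) -> Option<Chunk> {
        if self.pos >= self.data.len() {
            return None;
        }
        let rest = &self.data[self.pos..];
        let limit = rest.len().min(self.sizes.max);
        let bits = self.sizes.avg.trailing_zeros();
        let mut hash: u64 = 0;
        // Fall back to the maximum size (or the end of the data) if no cut point is found.
        let mut len = limit;
        for (i, &byte) in rest[..limit].iter().enumerate() {
            // Shift-and-add hash: each byte's influence is shifted out after 64 steps.
            hash = (hash << 1).wrapping_add((byte as u64 + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15));
            // Cut when the top `bits` bits are zero, but never before the minimum size.
            if i + 1 >= self.sizes.min && hash >> (64 - bits) == 0 {
                len = i + 1;
                break;
            }
        }
        let chunk = Chunk { pos: self.pos, len };
        self.pos += len;
        Some(chunk)
    }
}

fn main() {
    // Pseudo-random sample input from a simple multiplicative sequence.
    let data: Vec<u8> = (0u32..65_536)
        .map(|i| (i.wrapping_mul(2_654_435_761) >> 24) as u8)
        .collect();
    let sizes = Sizes { min: 256, avg: 1024, max: 4096 };
    for chunk in ToyChunker::new(&data, sizes) {
        println!("start: {}, length: {}", chunk.pos, chunk.len);
    }
}
```

Because the cut decision depends only on nearby bytes, an insertion early in the data tends to shift chunk boundaries only locally, which is the property that makes this family of algorithms useful for deduplication.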

Usage

To use the algorithms in your own code, import them from the corresponding modules, e.g.:

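// Assumption: the library is importable as cdc_chunkers (derived from the
// crate name cdc-chunkers) and exports these items at the crate root.
// Adjust the path if the actual layout differs.
use cdc_chunkers::{leap_based, ultra, SizeParams};

fn main() {
    // 1 MiB of identical bytes as sample input.
    let data = vec![1; 1024 * 1024];

    // UltraCDC chunker with custom size parameters.
    let sizes = SizeParams::new(4096, 8192, 16384);
    let chunker = ultra::Chunker::new(&data, sizes);

    // Each yielded chunk reports its position and length in the source data.
    for chunk in chunker {
        println!("start: {}, length: {}", chunk.pos, chunk.len);
    }

    // Leap-based chunker with its default size parameters.
    let default_leap = leap_based::Chunker::new(&data, SizeParams::leap_default());
    for chunk in default_leap {
        println!("start: {}, length: {}", chunk.pos, chunk.len);
    }
}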
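
Since each chunk is reported as a position and a length, a typical next step is to slice the source data and look for repeated chunks. The sketch below is one way to do that, under stated assumptions: Chunk, dedup_sizes and fixed_chunks are hypothetical helpers, and fixed_chunks is only a fixed-size stand-in for a real chunker so that the example runs without the crate.

```rust
use std::collections::HashSet;

/// Hypothetical chunk descriptor mirroring the pos/len fields used above.
#[derive(Debug, Clone, Copy)]
struct Chunk {
    pos: usize,
    len: usize,
}

/// Returns (total bytes, bytes left after keeping one copy of each distinct chunk).
fn dedup_sizes(data: &[u8], chunks: impl Iterator<Item = Chunk>) -> (usize, usize) {
    // Chunks are compared by content; a real system would usually store digests instead.
    let mut seen: HashSet<&[u8]> = HashSet::new();
    let mut total = 0;
    let mut unique = 0;
    for chunk in chunks {
        let bytes = &data[chunk.pos..chunk.pos + chunk.len];
        total += chunk.len;
        if seen.insert(bytes) {
            unique += chunk.len;
        }
    }
    (total, unique)
}

/// Stand-in for a chunker: splits `len` bytes into pieces of at most `size` bytes.
fn fixed_chunks(len: usize, size: usize) -> impl Iterator<Item = Chunk> {
    assert!(size > 0, "chunk size must be positive");
    (0..len)
        .step_by(size)
        .map(move |pos| Chunk { pos, len: size.min(len - pos) })
}

fn main() {
    // Same sample input as in the usage example: 1 MiB of identical bytes.
    let data = vec![1u8; 1024 * 1024];
    let (total, unique) = dedup_sizes(&data, fixed_chunks(data.len(), 4096));
    println!("total: {} bytes, after deduplication: {} bytes", total, unique);
}
```

Any iterator that yields position/length pairs can be passed in place of fixed_chunks, so the same function could consume one of the chunkers from this crate after mapping its chunks to the helper type.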

Dependencies

~2.5MB
~35K SLoC