#content-defined #chunking #data #cdc #data-streaming #size #original

fastcdc-alt

FastCDC (content defined chunking) implementation in pure Rust with an alternative API to the original crate

1 unstable release

0.2.2 Sep 17, 2023
0.2.1 Aug 16, 2023
0.2.0 Aug 16, 2023
0.1.0 Aug 14, 2023

#911 in Algorithms

MIT license

77KB
1K SLoC

FastCDC-Alt docs.rs Crates.io Test

This crate is a fork of the original Rust implementation by Nathan Fiedler. (nlfiedler/fastcdc-rs)
Included here is the enhanced 2020 "FastCDC" content defined chunking algorithm described by Wen Xia, et al.

This fork introduces an adjusted and a bit more complicated alternative API to increase the flexibility and reduce the memory overhead.
The cut points produced by this fork are identical to the ones produced by the original crate.

This README and all docs are adapted to the adjusted API.

Whats different?

The adjusted FastCDC structure now allows you to provide the data piece by piece required to find the next cut point.
The advantages of this approach are that it is no more required to keep the entire data to cut in one contiguous memory block and therefore also save on some memory copies.
This is most useful for e.g. advanced streaming logics.

Example usage with the original crate:

fn main(consumer: Receiver<Box<[u8]>>) {

  let cursor = 0;
  let mut intermediate_buffer = vec![1024 * 1024 * 16]; // 16 MiB
  for buffer in consumer.iter() {
    &intermediate_buffer[cursor..cursor + 4096].copy_from_slice(&buffer);
    
    cursor += 4096;
    if cursor == intermediate_buffer.len() {
      
      let fastcdc = FastCDC::new(&intermediate_buffer, 65535, 1048576, 16777216);
      for chunk in fastcdc {
        // .. process chunk
      }

      cursor = 0;
    }
  }
}

Very basic example usage with this fork.
You can see a full example in the fastcdc_cut file.

use std::collections::VecDeque;

fn main(consumer: Receiver<Box<[u8]>>) {
  let mut fastcdc = FastCDC::new(65535, 1048576, 16777216).unwrap();

  // Inform the FastCDC struct how much data we are expecting.
  fastcdc.set_content_length(134217728); // 128 MiB
  

  for buffer in consumer.iter() {
    let mut cursor = 0;
    
    loop {      
      if let Some(chunk) = fastcdc.cut(&buffer[cursor..]) {
        // .. process chunk hash

        cursor += chunk.cutpoint;
        if cursor == buffer.len() {
          break;
        }
      }
        break;
      }
    }
  }
}

What else?

  • The FastCDC iterator is now accessible using the FastCDC::as_iterator(&self, buffer: &[u8]) method.
  • The AsyncStreamCDC and StreamCDC implementations have been adapted, their APIs changed just a little bit.
  • To focus solely on the 2020 version in this fork, the ronomon and v2016 implementations and examples have been removed.

Requirements

  • Rust stable (2018 edition)

Building and Testing

$ cargo clean
$ cargo build
$ cargo test

Example Usage

Examples can be found in the examples directory of the source repository, which demonstrate finding chunk boundaries in a given file. There are both streaming and non-streaming examples, where the non-streaming examples use the memmap2 crate to read large files efficiently.

$ cargo run --example v2020 -- --size 16384 test/fixtures/SekienAkashita.jpg
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/examples/v2020 --size 16384 test/fixtures/SekienAkashita.jpg`
hash=17968276318003433923 offset=0 size=21325
hash=4098594969649699419 offset=21325 size=17140
hash=15733367461443853673 offset=38465 size=28084
hash=4509236223063678303 offset=66549 size=18217
hash=2504464741100432583 offset=84766 size=24700

Non-streaming

An example using FastCDC to find chunk boundaries in data loaded into memory:

let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let mut chunker = fastcdc_alt::FastCDC::new(16384, 32768, 65536).unwrap();
for chunk in chunker.as_iterator(&contents) {
    println!("offset={} length={}", chunk.offset, chunk.get_length());
}

Streaming

The StreamCDC version takes a Read source and uses a byte vector with capacity equal to the specified maximum chunk size.

let source = std::fs::File::open("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc_alt::StreamCDC::new(source, 4096, 16384, 65535).unwrap();
for result in chunker {
  let (_data, chunk) = result.unwrap();
  println!("offset={} length={}", chunk.offset, chunk.get_length());
}

Async Streaming

There is also an async streaming version of FastCDC named AsyncStreamCDC, which takes an AsyncRead (both tokio and futures are supported via feature flags) and uses a byte vector with capacity equal to the specified maximum chunk size.

let source = std::fs::File::open("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc_alt::AsyncStreamCDC::new(&source, 4096, 16384, 65535).unwrap();
let stream = chunker.as_stream();
let chunks = stream.collect::<Vec<_>>().await;

for result in chunks {
  let chunk = result.unwrap();
  println!("offset={} length={}", chunk.offset, chunk.get_length());
}

Reference Material

The original algorithm from 2016 is described in FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication.
The improved "rolling two bytes each time" version from 2020 is detailed in The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems.

Other Implementations

Dependencies

~0–6.5MB
~25K SLoC