#content-defined #cdc #hash #split #rolling #incremental

hash-roll

Rolling hashes & Content Defined Chunking (CDC)

3 releases (breaking)

0.3.0 Aug 17, 2020
0.2.0 May 18, 2017
0.1.1 Feb 25, 2016

#798 in Algorithms


162 downloads per month

AGPL-3.0

99KB
2K SLoC

hash-roll


Provides a generic API for abstracting over various implementations of content defined chunking. Also provides implementations of a number of content defined chunking algorithms.

Comparison of the chunking options available in Rust

Metrics

  • DER: duplicate elimination ratio

CDC References


lib.rs:

hash-roll provides various content defined chunking algorithms

Content defined chunking (CDC) algorithms examine a stream of input bytes (often represented as a slice like [u8]) and provide locations within that stream at which to split (or "chunk") it into parts.

CDC algorithms generally try to optimize for the following:

  1. Processing speed (i.e., bytes/second)
  2. Stability in split locations even when insertions/deletions of bytes occur
  3. Reasonable distributions of chunk lengths
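
As a rough illustration of the idea (this is a toy rolling-sum chunker, not hash-roll's API), a CDC routine walks the bytes, maintains a hash over a small trailing window, and declares a boundary whenever that hash matches a condition:

```rust
// Toy CDC sketch: split wherever the low bits of a rolling sum over the
// last WINDOW bytes are all zero. Real algorithms use stronger rolling
// hashes and enforce minimum/maximum chunk sizes.
const WINDOW: usize = 16;
const MASK: u32 = (1 << 6) - 1; // on average, a split roughly every 64 bytes

fn split_points(data: &[u8]) -> Vec<usize> {
    let mut splits = Vec::new();
    let mut sum: u32 = 0;
    for (i, &b) in data.iter().enumerate() {
        sum = sum.wrapping_add(b as u32);
        if i >= WINDOW {
            // drop the byte that just left the window
            sum = sum.wrapping_sub(data[i - WINDOW] as u32);
        }
        if i + 1 >= WINDOW && (sum & MASK) == 0 {
            splits.push(i + 1); // chunk boundary after byte i
        }
    }
    splits
}

fn main() {
    let data: Vec<u8> = (0..10_000u32).map(|i| (i * 31 % 251) as u8).collect();
    let cuts = split_points(&data);
    println!("found {} split points in {} bytes", cuts.len(), data.len());
}
```

Because each boundary depends only on the last WINDOW bytes, an insertion or deletion early in the stream only disturbs boundaries near the edit; later boundaries reappear at the same content positions, which is what gives CDC the stability described in point 2 above.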

API Concepts

  • Configured Algorithm Instance (implements Chunk). Normally named plainly (like Bup). These can be thought of as "parameters" for an algorithm.
  • Incremental (implements ChunkIncr). Normally named with an Incr suffix.

Because CDC algorithms can be used in various ways and have differing characteristics, hash-roll provides a few ways to use them.

Configured Algorithm Instances are created from the set of configuration needed for a given algorithm. For example, this might mean configuring a window size or how to decide where to split. They don't include any mutable data; in other words, they don't keep track of what data is given to them. Configured Algorithm Instances provide the all-at-once APIs, as well as methods to obtain other kinds of APIs, like incremental-style APIs.
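
To illustrate that split between configuration and per-stream state, here is a hypothetical pairing (these are not hash-roll's actual types or method names) of a parameters-only struct and the incremental companion obtained from it:

```rust
// Parameters only: analogous to a configured instance implementing Chunk.
#[derive(Clone, Copy)]
struct SumChunker {
    mask: u32,
}

// Mutable per-stream state: analogous to a type implementing ChunkIncr.
struct SumChunkerIncr {
    params: SumChunker,
    sum: u32,
    offset: usize,
}

impl SumChunker {
    // All-at-once style: return every split offset in `data`.
    fn split_all(&self, data: &[u8]) -> Vec<usize> {
        let mut incr = self.incremental();
        data.iter().filter_map(|&b| incr.push_byte(b)).collect()
    }

    // Obtain an incremental-style API from the configured instance.
    fn incremental(&self) -> SumChunkerIncr {
        SumChunkerIncr { params: *self, sum: 0, offset: 0 }
    }
}

impl SumChunkerIncr {
    // Feed one byte; return a split offset if this byte ends a chunk.
    fn push_byte(&mut self, b: u8) -> Option<usize> {
        self.sum = self.sum.wrapping_add(b as u32);
        self.offset += 1;
        if self.sum & self.params.mask == 0 { Some(self.offset) } else { None }
    }
}

fn main() {
    let chunker = SumChunker { mask: (1 << 6) - 1 };
    let data = vec![7u8; 1024];
    println!("{} splits, all-at-once", chunker.split_all(&data).len());

    let mut incr = chunker.incremental();
    let splits: Vec<usize> = data.iter().filter_map(|&b| incr.push_byte(b)).collect();
    println!("{} splits, incremental", splits.len());
}
```

The important property is that SumChunker itself never changes: all per-stream bookkeeping lives in the value returned by incremental(), mirroring the instance/Incr naming split described above.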

CDC Algorithms and Window Buffering

Different CDC algorithms have different constraints on how they process data. Notably, some require a large amount of previously processed data in order to process additional data. This "large amount of previously processed data" is typically referred to as the "window". That said, note that some CDC algorithms that use a window concept don't need access to previously processed data.

For the window-buffering algorithms, there is an extra cost to certain types of API implementations. The documentation notes where these costs occur and suggests alternatives.

Generally, incremental CDC interfaces will be slower for window-buffering algorithms. Using an explicitly allocating interface (which emits Vec<u8> or Vec<Vec<u8>>) will have no worse performance than the incremental API, but might be more convenient. Using an all-at-once API will provide the best performance because no buffering is required (the input data can be used directly).
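
The sketch below (illustrative only, not hash-roll code) shows where that cost comes from: an incremental interface must copy the trailing window into its own buffer so it survives across calls, while an all-at-once call can borrow the window directly out of the input slice:

```rust
// Incremental side: data arrives in separate slices, so the last `window`
// bytes have to be copied into an internal buffer to be visible next call.
struct WindowedIncr {
    window: usize,
    tail: Vec<u8>, // copy of the most recent `window` bytes across calls
}

impl WindowedIncr {
    fn new(window: usize) -> Self {
        WindowedIncr { window, tail: Vec::new() }
    }

    fn push(&mut self, input: &[u8]) {
        self.tail.extend_from_slice(input);
        if self.tail.len() > self.window {
            let excess = self.tail.len() - self.window;
            self.tail.drain(..excess); // keep only the trailing window
        }
        // ... a real algorithm would hash over `self.tail` here ...
    }
}

// All-at-once side: the window is always a sub-slice of `data`, no copying.
fn count_splits_all_at_once(data: &[u8], window: usize) -> usize {
    let mut splits = 0;
    for end in window..=data.len() {
        let w = &data[end - window..end]; // borrow the window directly
        if w.iter().map(|&b| b as u32).sum::<u32>() % 64 == 0 {
            splits += 1;
        }
    }
    splits
}

fn main() {
    let data = vec![3u8; 4096];
    let mut incr = WindowedIncr::new(64);
    incr.push(&data[..2048]);
    incr.push(&data[2048..]);
    println!("{} splits (all-at-once)", count_splits_all_at_once(&data, 64));
}
```

The copy and the bookkeeping around it are the overhead warned about above; the all-at-once path avoids them entirely because the whole input outlives the computation.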

Use Cases that drive API choices

  • accumulate vecs, emit vecs

    • incremental: yes
    • input: Vec<u8>
    • internal state: Vec<Vec<u8>>
    • output: Vec<Vec<u8>>
  • stream data through

    • incremental: yes
    • input: &[u8]
  • mmap (or read entire) file, emit slices (see the sketch after this list)

    • incremental: no
    • input: &[u8]
    • output: &[u8]
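
For the last case, a sketch of the shape of the calling code (the offsets and helper below are placeholders, not hash-roll functions): the whole file is held in one buffer, and the emitted chunks are borrowed sub-slices of it.

```rust
// Turn a buffer plus a sorted list of split offsets into borrowed chunks.
fn chunks<'a>(data: &'a [u8], split_offsets: &[usize]) -> Vec<&'a [u8]> {
    let mut out = Vec::with_capacity(split_offsets.len() + 1);
    let mut start = 0;
    for &end in split_offsets {
        out.push(&data[start..end]);
        start = end;
    }
    out.push(&data[start..]); // final chunk after the last split point
    out
}

fn main() {
    // Stand-in for std::fs::read() or an mmap of the input file.
    let data: Vec<u8> = (0u32..4096).map(|i| (i % 251) as u8).collect();
    // Placeholder offsets; in practice these would come from an all-at-once
    // CDC call over `data`.
    let split_offsets = vec![1000, 2500];
    for (i, c) in chunks(&data, &split_offsets).iter().enumerate() {
        println!("chunk {}: {} bytes", i, c.len());
    }
}
```

Because the chunks are just sub-slices, this path allocates nothing beyond the offset list, which is consistent with the all-at-once style being the fastest of the three.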

Dependencies