#array #deduplication #chunks #memory #storing #byte #history

yanked array_cow

In memory array de-duplication, useful for efficient storing of a history of data versions

Uses old Rust 2015

0.1.0 Mar 19, 2017

#32 in #deduplication

Apache-2.0

92KB
2K SLoC

Array Cow

Introduction

In memory array de-duplication, useful for efficiently storing many versions of data.

This is suitable for storing undo history for example, and is effective with both binary and text data.

  • Configurable block sizes.
  • Supports array-stride to avoids overhead of detecting blocks and un-aliened offsets. (a stride of 1 for bytes works too)
  • Uses block hashing for de-duplication.
  • Each state must only reference its previous, making both linear and tree structures possible.

Further Work

It may be worth using mmap for data storage.


lib.rs:

Array storage to minimize duplication.

This is done by splitting arrays into chunks and using copy-on-write (COW), to de-duplicate chunks, from the users perspective this is an implementation detail.

Overview

Data Structure

This diagram is an overview of the structure of a single array-store.

note: The only 2 structures here which are referenced externally are the.

  • BArrayStore: The whole array store.
  • BArrayState: Represents a single state (array) of data. These can be add using a reference state, while this could be considered the previous or parent state. no relationship is kept, so the caller is free to add any state from the same BArrayStore as a reference.
<+> BArrayStore: root data-structure,
 |  can store many 'states', which share memory.
 |
 |  This can store many arrays, however they must share the same 'stride'.
 |  Arrays of different types will need to use a new BArrayStore.
 |
 +- <+> states (Collection of BArrayState's):
 |   |  Each represents an array added by the user of this API.
 |   |  and references a chunk_list (each state is a chunk_list user).
 |   |  Note that the list order has no significance.
 |   |
 |   +- <+> chunk_list (BChunkList):
 |       |  The chunks that make up this state.
 |       |  Each state is a chunk_list user,
 |       |  avoids duplicating lists when there is no change between states.
 |       |
 |       +- chunk_refs (List of BChunkRef): Each chunk_ref links to a a BChunk.
 |          Each reference is a chunk user,
 |          avoids duplicating smaller chunks of memory found in multiple states.
 |
 +- info (BArrayInfo):
 |  Sizes and offsets for this array-store.
 |  Also caches some variables for reuse.
 |
 +- <+> memory (BArrayMemory):
     |  Memory pools for storing BArrayStore data.
     |
     +- chunk_list (Pool of BChunkList):
     |  All chunk_lists, (reference counted, used by BArrayState).
     |
     +- chunk_ref (Pool of BChunkRef):
     |  All chunk_refs (link between BChunkList & BChunk).
     |
     +- chunks (Pool of BChunk):
        All chunks, (reference counted, used by BChunkList).
        These have their headers hashed for reuse so we can quickly check for duplicates.

De-Duplication

When creating a new state, a previous state can be given as a reference, matching chunks from this state are re-used in the new state.

First matches at either end of the array are detected. For identical arrays this is all thats needed.

De-duplication is performed on any remaining chunks, by hashing the first few bytes of the chunk (see: BCHUNK_HASH_TABLE_ACCUMULATE_STEPS).

\note This is cached for reuse since the referenced data never changes.

An array is created to store hash values at every 'stride', then stepped over to search for matching chunks.

Once a match is found, there is a high chance next chunks match too, so this is checked to avoid performing so many hash-lookups. Otherwise new chunks are created.

Example

let mut bs = array_cow::BArrayStore::new(1, 8);
let data_src_a = b"The quick brown fox jumps over the lazy dog";
let data_src_b = b"The quick brown fox almost jumps over the lazy dog";
let data_src_c = b"The little quick brown fox jumps over the lazy dog!";

let state_a = bs.state_add(data_src_a, None);
let state_b = bs.state_add(data_src_b, Some(state_a));
let state_c = bs.state_add(data_src_c, Some(state_b));

// Check the data is stored correctly
let data_dst = array_cow::BArrayStore::state_data_get_alloc(state_a);
assert_eq!(&data_src_a[..], &data_dst[..]);

let data_dst = array_cow::BArrayStore::state_data_get_alloc(state_b);
assert_eq!(&data_src_b[..], &data_dst[..]);

let data_dst = array_cow::BArrayStore::state_data_get_alloc(state_c);
assert_eq!(&data_src_c[..], &data_dst[..]);

No runtime deps