2 releases

0.9.1 Jan 10, 2022
0.9.0 Oct 26, 2021

Apache-2.0

bitbottle

Bitbottle: a modern archive format.

Bitbottle is a data & file format for archiving collections of files & folders, like "tar", "zip", and "winrar". Its primary differentiating features are:

  • All important POSIX attributes are preserved (owner & group by name, permissions, create/modify timestamps, symlinks).
  • File contents are stored as a database of de-duplicated chunks using buzhash, similar to common backup utilities.
  • The format is streaming-friendly for readers: Metadata and content lists appear before file contents, to allow a subset of the files to be extracted with minimal buffering.
  • Compression may occur per-file or over the whole archive, using snappy (very fast) or LZMA2 (very compact).
  • Encryption is built-in: AES-128-GCM or XCHACHA20-POLY1305(*), using an SSH-style Ed25519 key or an argon2id password for authentication.
  • The container format (bottle) is easily extensible for future compression or encryption algorithms.

(*) I apologize for the ridiculous names. I did not name any of these algorithms.

Install from crate

cargo install bitbottle

Current status

After writing a few drafts in TypeScript going back to 2015, this is a Rust version intended for a wider audience. As of Oct 2021, the basic tools work to build an archive and expand it. The file format is unlikely to change in a backward-incompatible way, though I reserve the right to do so in an emergency until reaching 1.0.

The file format is documented in docs/format.md.

There are a couple of command-line tools for testing so far. All of them respond to --help.

My intention is to make this project useful as a library, not just a set of CLI tools, but the current API is a bit awkward and needs some love before being frozen.

bitbottle

"bitbottle" creates archives from a list of files and folders. To encrypt an archive of the bitbottle source, using an SSH public (test) key, and "snappy" compression:

> ./target/release/bitbottle -v --snappy --pub ./tests/data/test-key.pub -o ./src-test.bb src
Encrypting for robey@togusa     (34fd22aae3c59072fd6f48147309eb302ea30f6ae5fc6376f683df3e74485a7c)
    drwxrwxr-x  robey     robey            2021-10-16 16:01:41  src/
    -rw-rw-r--  robey     robey     12.0K  2021-10-23 12:15:15  src/bottle.rs
    -rw-rw-r--  robey     robey      9.7K  2021-10-22 16:29:15  src/file_list.rs
    [...]
Creating archive: 30 files, 225K bytes
Scanned unique blocks: 30 blocks, 225K bytes
Wrote 85.5K bytes.

unbottle

"unbottle" can show the contents of an archive:

> ./target/release/unbottle -v --info ./src-test.bb
Bitbottle encrypted with XCHACHA20_POLY1305, 1 public key (ED25519_NACL_SEALED)
    Block size: 1.00M
    Encrypted for: robey@togusa             (34fd22aae3c59072fd6f48147309eb302ea30f6ae5fc6376f683df3e74485a7c)
ERROR: No key or password provided for encrypted bottle

If the bottle is encrypted, you must use a secret key to decrypt it. For ED25519, that means an SSH private key:

> ./target/release/unbottle -v --info --secret ./tests/data/test-key ./src-test.bb
Decrypting with key: robey@togusa
Bitbottle encrypted with XCHACHA20_POLY1305, 1 public key (ED25519_NACL_SEALED)
    Block size: 1.00M
    Encrypted for: robey@togusa             (34fd22aae3c59072fd6f48147309eb302ea30f6ae5fc6376f683df3e74485a7c)
Bitbottle compressed with SNAPPY
    drwxrwxr-x  robey     robey            2021-10-16 16:01:41  src/
    -rw-rw-r--  robey     robey     12.0K  2021-10-23 12:15:15  src/bottle.rs
    -rw-rw-r--  robey     robey      9.7K  2021-10-22 16:29:15  src/file_list.rs
    [...]
Bitbottle: 30 files, 30 blocks, 225KB -> 85.5KB (BLAKE3 hash)

It will also expand an archive:

> ./target/release/unbottle -v --secret ./tests/data/test-key ./src-test.bb -d /tmp/src-test
Decrypting with key: robey@togusa
Bitbottle encrypted with XCHACHA20_POLY1305, 1 public key (ED25519_NACL_SEALED)
    Block size: 1.00M
    Encrypted for: robey@togusa             (34fd22aae3c59072fd6f48147309eb302ea30f6ae5fc6376f683df3e74485a7c)
Bitbottle compressed with SNAPPY
    drwxrwxr-x  robey     robey            2021-10-16 16:01:41  src/
    -rw-rw-r--  robey     robey     12.0K  2021-10-23 12:15:15  src/bottle.rs
    -rw-rw-r--  robey     robey      9.7K  2021-10-22 16:29:15  src/file_list.rs
    [...]
Bitbottle: 30 files, 30 blocks, 225KB -> 85.5KB (BLAKE3 hash)
Extracted 30 file(s) (225K bytes) to /tmp/src-test

buzscan

"buzscan" is a Rust implementation of the buzhash chunking algorithm. It's mostly a demo and test tool for the algorithm used to build a bitbottle archive.

Buzhash is a type of rolling hash: it computes a hash over a sliding window of data, rolling forward until it finds a hash value with a specified number of trailing zeros. It breaks the file on these boundaries into roughly even-sized blocks, and emits each block's size and its hash (usually Blake3, but configurable). An archiver can use this to identify duplicate blocks. Because each boundary depends only on the nearby data, it's good at finding the same blocks inside large files, even after data is moved around.

Some implementations like borg (C source) use a random table or PRNG to map bytes. Buzscan uses a deterministic table built from recursive applications of CRC-32 that were selected to have a good bit distribution.
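The rolling-hash chunking described above can be sketched roughly as follows. This is only an illustration, not buzscan's actual code: the window size, the minimum block length, and the function names are assumptions, and the byte table here comes from a generic 64-bit mixer rather than the CRC-32-derived table that buzscan uses.

```rust
/// Sliding window size in bytes (an assumed value for this sketch).
const WINDOW: usize = 48;

/// Hypothetical deterministic byte table. NOT buzscan's CRC-32-derived table:
/// a splitmix64-style mixer is used here only to get well-spread bits.
fn make_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut state: u64 = 0;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^= z >> 31;
        *entry = (z >> 32) as u32;
    }
    table
}

/// Split `data` wherever the rolling hash of the last WINDOW bytes ends in
/// `zero_bits` zero bits (must be < 32). Returns the length of each block.
fn chunk_lengths(data: &[u8], zero_bits: u32, min_len: usize) -> Vec<usize> {
    let table = make_table();
    let mask = (1u32 << zero_bits) - 1;
    let mut lengths = Vec::new();
    let mut start = 0;
    let mut hash: u32 = 0;
    for (i, &byte) in data.iter().enumerate() {
        // Roll forward: rotate the old hash and mix in the incoming byte.
        hash = hash.rotate_left(1) ^ table[byte as usize];
        // Cancel the byte that just slid out of the window. It has been
        // rotated once per step since it entered, i.e. WINDOW times.
        if i >= WINDOW {
            let outgoing = table[data[i - WINDOW] as usize];
            hash ^= outgoing.rotate_left((WINDOW % 32) as u32);
        }
        let len = i + 1 - start;
        if len >= min_len && hash & mask == 0 {
            lengths.push(len);
            start = i + 1;
        }
    }
    // Whatever is left after the last boundary becomes the final block.
    if start < data.len() {
        lengths.push(data.len() - start);
    }
    lengths
}
```

With `zero_bits` set to 20, a boundary would turn up about once per 2^20 bytes on average, which is one way to get blocks of roughly 1MB. Each block would then be hashed separately (for example with Blake3) to give it an identity.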

The "buzscan" CLI tool will traverse a list of files and folders (recursively) and build up a set of blocks, looking for duplicates, and report on the de-duplicated size of the data it found. It's very slow, because it's hashing everything it finds.

> ./target/release/buzscan .
[00:00:01]      935 files,      885 blocks, total disk space:  236M,  154M unique

Build

Some of the modules are apparently not pure-Rust, including argonautica and rust-lzma. They require some local package installs:

  • pkg-config
  • liblzma-dev
  • libclang (for argonautica)

(I wish there were native versions of these packages! Please help!)

cargo build --release
./target/release/bitbottle --help

To run the full test suite, which includes some integration tests written in Python:

make test

Archive format

A standard file archive consists of:

  • (optional) an encrypted bottle containing:
    • (optional) a compressed bottle containing:
      • a file list containing:
        • one or more files (metadata, block lists)
        • one or more blocks of data

That is, the archive itself is a file list. The file list may be compressed, and the compressed data may also be encrypted. Encryption must be the outer-most layer if it is used. The file list is just a count of how many files and blocks are present, followed by a separate bottle for each file and each block.
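One way to picture this layering is as a nested type. The sketch below is a hypothetical in-memory model, not the crate's API or the on-disk layout (which is in docs/format.md); the `Bottle` enum and `is_standard_archive` are names invented for illustration.

```rust
/// Hypothetical model of the nesting described above.
#[allow(dead_code)]
enum Bottle {
    Encrypted(Box<Bottle>),
    Compressed(Box<Bottle>),
    FileList { files: usize, blocks: usize },
}

/// Checks the shape "optional encryption, then optional compression,
/// then the file list", with encryption only ever outermost.
fn is_standard_archive(bottle: &Bottle) -> bool {
    // Peel the optional encryption layer.
    let after_encryption = match bottle {
        Bottle::Encrypted(inner) => &**inner,
        other => other,
    };
    // Peel the optional compression layer.
    let after_compression = match after_encryption {
        Bottle::Compressed(inner) => &**inner,
        other => other,
    };
    // What remains must be the file list itself.
    matches!(after_compression, Bottle::FileList { .. })
}
```

A compressed bottle that wraps an encrypted one fails this check, since encryption would no longer be the outermost layer.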

To build an archive, write_archive (in archive.rs) is given a list of starting paths. It scans each path recursively, building up a list of every file to include, then uses buzhash to break each file into blocks of roughly the same size (1MB by default). Each block is identified by its size and hash (Blake3 by default). If we see multiple blocks with the same size and hash, they're duplicates, and we only need to write each block once.
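The de-duplication step can be sketched as a set keyed by (size, hash). This is a minimal illustration, not the code in archive.rs: `block_id` and `dedupe` are hypothetical names, and the standard library's 64-bit hasher stands in for Blake3 only so that the example needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// A block's identity: its size plus a content hash.
/// ASSUMPTION: a 64-bit std hash stands in for Blake3 in this sketch.
type BlockId = (usize, u64);

fn block_id(block: &[u8]) -> BlockId {
    let mut hasher = DefaultHasher::new();
    block.hash(&mut hasher);
    (block.len(), hasher.finish())
}

/// Returns the blocks that need to be written (first occurrence only),
/// plus the id of every input block so files can still refer to them.
fn dedupe<'a>(blocks: &[&'a [u8]]) -> (Vec<&'a [u8]>, Vec<BlockId>) {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    let mut ids = Vec::new();
    for &block in blocks {
        let id = block_id(block);
        // `insert` returns false if this (size, hash) pair was already seen.
        if seen.insert(id) {
            unique.push(block);
        }
        ids.push(id);
    }
    (unique, ids)
}
```

A real archiver would want a cryptographic hash here, since a 64-bit hash is too short to rule out accidental collisions between different blocks.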

Once scanning is complete, we write each file's metadata (its "atlas") as a separate bottle: The header contains its path, permissions, size, and the hash of its overall contents for extra validation. Folders and symlinks are written too, with a size of zero, no hash, and no blocks. For normal files, the bottle stream is a list of the hashes of the blocks that make up its content. (If the file has only one block, we skip this step, since the file's overall hash is also the hash of its only block.) Then we write a separate bottle for each scanned block.
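As a rough picture of what an atlas carries, here is a hypothetical struct with the single-block shortcut. The field names, types, and the `stream_hashes` helper are assumptions made for illustration; the real header fields and encoding are described in docs/format.md.

```rust
/// A 32-byte digest (the default Blake3 output size).
type Hash = [u8; 32];

/// Hypothetical in-memory form of one file's atlas.
#[allow(dead_code)]
struct Atlas {
    path: String,
    mode: u32,               // permission bits
    size: u64,
    file_hash: Option<Hash>, // None for folders and symlinks
    block_hashes: Vec<Hash>,
}

impl Atlas {
    /// Folders and symlinks: size zero, no hash, no blocks.
    fn without_content(path: &str, mode: u32) -> Atlas {
        Atlas {
            path: path.to_string(),
            mode,
            size: 0,
            file_hash: None,
            block_hashes: Vec::new(),
        }
    }
}

/// The block hashes that go into the bottle's stream.
fn stream_hashes(atlas: &Atlas) -> &[Hash] {
    if atlas.block_hashes.len() == 1 {
        // A single block shares the file's overall hash, so the list is skipped.
        &atlas.block_hashes[..0]
    } else {
        &atlas.block_hashes[..]
    }
}
```

A reader can then treat an empty stream on a non-empty file as "one block, named by the file hash".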

To expand an archive, expand_archive does the opposite: It reads the metadata for each file, and uses the list of block hashes to reassemble the file from each block.
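Reassembly amounts to looking up each block hash and concatenating the results. The sketch below assumes all blocks are already in memory in a map, which is a simplification: `reassemble` and the `store` map are hypothetical, and expand_archive itself reads from the bottle stream.

```rust
use std::collections::HashMap;

/// A 32-byte digest (the default Blake3 output size).
type Hash = [u8; 32];

/// Rebuilds one file's contents from its ordered list of block hashes.
/// Returns None if any referenced block is missing from the store.
fn reassemble(block_hashes: &[Hash], store: &HashMap<Hash, Vec<u8>>) -> Option<Vec<u8>> {
    let mut contents = Vec::new();
    for hash in block_hashes {
        // The same block may appear more than once, in one file or across files.
        contents.extend_from_slice(store.get(hash)?);
    }
    Some(contents)
}
```

After reassembly, the file's overall hash from its metadata can be checked against the rebuilt contents as the extra validation step.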

The low-level format of the bitbottle file and the structure of a bottle is documented in docs/format.md.

For encryption with SSH keys, only Ed25519 keys are currently supported, and only in OpenSSH key files (see the technical description of the OpenSSH key file format).

Authors

License

Apache 2.0 license, included in LICENSE.txt.