
bin+lib yama

Deduplicated, compressed and encrypted content pile manager

1 unstable release

0.4.0 Jun 16, 2021


Used in datman

GPL-3.0-or-later


山 (yama): deduplicated heap repository

note: this readme is not yet updated to reality…

yama
  [-w|--with [user@host:]path] [--with-encrypted true|false]

Backup Profiles

Remotes

In yama.toml, you can configure remotes:

[remote.bob]
encrypted = true
host = "bobmachine.xyz"
user = "bob"
path = "/home/bob/yama"
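
The fields of a remote line up with the pieces of the `[user@host:]path` argument shown above. As a minimal sketch, assuming a hypothetical `Remote` struct and `with_spec` helper (neither is taken from yama's source, and the mapping from config to `--with` is itself an assumption), the same information could be rendered like this:

```rust
/// Hypothetical in-memory form of one `[remote.<name>]` table from yama.toml.
/// Field names mirror the keys in the example; the type itself is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub struct Remote {
    pub encrypted: bool,
    pub host: Option<String>,
    pub user: Option<String>,
    pub path: String,
}

impl Remote {
    /// Render the remote in the `[user@host:]path` shape.
    pub fn with_spec(&self) -> String {
        match (&self.user, &self.host) {
            (Some(user), Some(host)) => format!("{}@{}:{}", user, host, self.path),
            // Without both user and host, treat the remote as a plain path.
            _ => self.path.clone(),
        }
    }
}
```

With the `bob` remote above, `with_spec()` would yield `bob@bobmachine.xyz:/home/bob/yama`.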

Subcommands

check: Check repository for consistency

Verifies the full repository satisfies the following consistency constraints:

  • all chunks have the correct hash
  • all pointers have a valid structure, recursively

Usage: yama check [--gc]

The total space occupied, and the space occupied by unused chunks, are reported.

If --gc is specified, unused chunks will be removed.
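
The hash check, the space report and the `--gc` step can be sketched roughly as follows. This is an illustration only: the chunk store is modelled as an in-memory `HashMap`, `DefaultHasher` stands in for the real chunk hash, and `chunk_id`, `CheckReport` and `check` are hypothetical names. Pointer structure validation is not covered.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Stand-in chunk ID. The README names BLAKE256; `DefaultHasher` is used here
/// only so the sketch needs nothing beyond the standard library.
pub fn chunk_id(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

#[derive(Debug, Default, PartialEq)]
pub struct CheckReport {
    pub total_bytes: usize,
    pub unused_bytes: usize,
    /// IDs whose stored content no longer hashes to the ID.
    pub corrupt: Vec<u64>,
}

/// Verify every chunk against its ID, tally total and unused space, and
/// optionally drop chunks that no pointer references (the `--gc` case).
pub fn check(
    chunks: &mut HashMap<u64, Vec<u8>>,
    referenced: &HashSet<u64>,
    gc: bool,
) -> CheckReport {
    let mut report = CheckReport::default();
    for (id, data) in chunks.iter() {
        report.total_bytes += data.len();
        if chunk_id(data) != *id {
            report.corrupt.push(*id);
        }
        if !referenced.contains(id) {
            report.unused_bytes += data.len();
        }
    }
    report.corrupt.sort_unstable();
    if gc {
        chunks.retain(|id, _| referenced.contains(id));
    }
    report
}
```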

lsp: List tree pointers

Usage: yama lsp

rmp: Remove tree pointers

Usage: yama rmp pointer/path [--force]

If --force is not specified and another pointer depends on the pointer being removed, deletion is aborted with an error.
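
A rough sketch of that dependency check, assuming a pointer records at most one parent (as with `--differential`). The `Pointer`, `RemoveError` and `remove_pointer` names are hypothetical. What `--force` does to the dependents is not stated here, so the sketch simply leaves them in place.

```rust
use std::collections::HashMap;

/// Hypothetical pointer record: only the optional parent matters here.
#[derive(Debug, Clone)]
pub struct Pointer {
    pub parent: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum RemoveError {
    NotFound,
    /// Names of pointers that still depend on the one being removed.
    HasDependents(Vec<String>),
}

pub fn remove_pointer(
    pointers: &mut HashMap<String, Pointer>,
    name: &str,
    force: bool,
) -> Result<(), RemoveError> {
    if !pointers.contains_key(name) {
        return Err(RemoveError::NotFound);
    }
    // Collect every pointer whose parent is the one being removed.
    let mut dependents: Vec<String> = pointers
        .iter()
        .filter(|(_, p)| p.parent.as_deref() == Some(name))
        .map(|(n, _)| n.clone())
        .collect();
    dependents.sort();
    if !dependents.is_empty() && !force {
        return Err(RemoveError::HasDependents(dependents));
    }
    // Assumption: with --force the dependents are left untouched.
    pointers.remove(name);
    Ok(())
}
```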

store: Store tree into repository

Usage: yama store [--dry-run] [ssh://user@host]/path/to/dir pointer/path [--exclusions path/to/exclusions.txt] [--differential pointer/parent]

The pointer must not already exist; it will be created. If --differential is specified with an existing parent pointer, then the directory listing is stored as a differential list relative to the parent. The intention of this is to reduce the size of the directory list.
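
To illustrate why a differential list is smaller, here is a minimal sketch that models a directory listing as a map from path to a content fingerprint. The `Listing` and `Differential` shapes and the `diff` and `apply` functions are assumptions made for the example, not yama's on-disk format.

```rust
use std::collections::BTreeMap;

/// A directory listing reduced to path -> content fingerprint (hypothetical shape).
pub type Listing = BTreeMap<String, u64>;

#[derive(Debug, Default, PartialEq)]
pub struct Differential {
    /// Entries that are new or whose fingerprint changed relative to the parent.
    pub upserted: BTreeMap<String, u64>,
    /// Paths present in the parent but absent from the current tree.
    pub removed: Vec<String>,
}

/// Record only what differs between the parent listing and the current one.
pub fn diff(parent: &Listing, current: &Listing) -> Differential {
    let mut d = Differential::default();
    for (path, fp) in current {
        if parent.get(path) != Some(fp) {
            d.upserted.insert(path.clone(), *fp);
        }
    }
    for path in parent.keys() {
        if !current.contains_key(path) {
            d.removed.push(path.clone());
        }
    }
    d
}

/// Rebuild the full listing from the parent plus the differential.
pub fn apply(parent: &Listing, d: &Differential) -> Listing {
    let mut out = parent.clone();
    for path in &d.removed {
        out.remove(path);
    }
    for (path, fp) in &d.upserted {
        out.insert(path.clone(), *fp);
    }
    out
}
```

Unchanged entries never appear in the differential, so a tree that barely changed since the parent produces a very short list.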

Exclusion lists

Exclusion lists have much the same format as .gitignore: one glob per line, each naming files to exclude, relative to the tree root.
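
As a simplified sketch of how such a list might be applied: the matcher below supports only `*` and `?` (neither crosses a `/`), skips blank lines and `#` comments as .gitignore does, and matches slash-free patterns against the file name alone. It does not exclude a directory's contents recursively. The `Exclusions` type is hypothetical and full .gitignore semantics are not implemented.

```rust
/// Match `text` against a glob where `*` is any run of non-'/' characters
/// and `?` is exactly one non-'/' character.
fn glob_match(pattern: &[char], text: &[char]) -> bool {
    match pattern.split_first() {
        None => text.is_empty(),
        Some((&'*', rest)) => {
            // Let `*` consume 0, 1, 2, ... characters, stopping at a '/'.
            let mut i = 0;
            loop {
                if glob_match(rest, &text[i..]) {
                    return true;
                }
                if i >= text.len() || text[i] == '/' {
                    return false;
                }
                i += 1;
            }
        }
        Some((&'?', rest)) => match text.split_first() {
            Some((&c, t)) if c != '/' => glob_match(rest, t),
            _ => false,
        },
        Some((&p, rest)) => match text.split_first() {
            Some((&c, t)) if c == p => glob_match(rest, t),
            _ => false,
        },
    }
}

/// Hypothetical parsed exclusion list.
pub struct Exclusions {
    patterns: Vec<Vec<char>>,
}

impl Exclusions {
    /// One glob per line; blank lines and `#` comments are ignored.
    pub fn parse(text: &str) -> Self {
        let patterns = text
            .lines()
            .map(str::trim)
            .filter(|l| !l.is_empty() && !l.starts_with('#'))
            .map(|l| l.chars().collect())
            .collect();
        Exclusions { patterns }
    }

    /// `rel_path` is relative to the tree root and uses '/' separators.
    pub fn is_excluded(&self, rel_path: &str) -> bool {
        let full: Vec<char> = rel_path.chars().collect();
        let name: Vec<char> = rel_path
            .rsplit('/')
            .next()
            .unwrap_or(rel_path)
            .chars()
            .collect();
        self.patterns.iter().any(|p| {
            if p.contains(&'/') {
                glob_match(p, &full)
            } else {
                glob_match(p, &name)
            }
        })
    }
}
```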

extract: Extract file(s) from repository

Usage: yama extract [--dry-run] pointer/path[:path] [ssh://user@host]/path/to/local/dir[/]

If no path is specified, the root / is extracted. A trailing slash on the target means that the file will be extracted as a child of the specified directory.
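
The two rules above (default to the root, and trailing slash means "extract as a child") can be sketched as a pair of small helpers. `parse_source` and `resolve_target` are hypothetical names, and the sketch ignores the optional `ssh://user@host` prefix.

```rust
/// Split `pointer/path[:path]` into the pointer name and the path inside the
/// tree. A missing or empty path means the root, "/".
pub fn parse_source(spec: &str) -> (&str, &str) {
    match spec.split_once(':') {
        Some((pointer, path)) if !path.is_empty() => (pointer, path),
        Some((pointer, _)) => (pointer, "/"),
        None => (spec, "/"),
    }
}

/// Decide where an extracted entry lands.
/// With a trailing slash the entry keeps its own name inside `target`;
/// without one, `target` itself is the destination.
pub fn resolve_target(target: &str, source_path: &str) -> String {
    if target.ends_with('/') {
        // Last path component of the source; empty when the source is the root.
        let name = source_path
            .trim_end_matches('/')
            .rsplit('/')
            .next()
            .unwrap_or("");
        format!("{}{}", target, name)
    } else {
        target.to_string()
    }
}
```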

remote: Run operations on a remote repository

Usage: yama remote ssh://user@host/path/to/repo <subcommand>

remote store: Store local tree into remote repository

Usage is identical to yama store except store path must be local.

remote extract: Extract remote repository into local tree

Usage is identical to yama extract except target path must be local.

slave: Remote-controlled yama

Communicates over stdin/stdout to perform specified operations. Used when a yama command involves SSH.

Repository Storage Details

Pointers are stored in pointers.lmdb and chunks are stored in chunks.lmdb. Exclusion files are expected to be kept in the same directory as the repository if they are to be used on a recurring basis.

Chunks are compressed with zstd. A zstd dictionary must first be trained and placed at <repo root>/zstd.dict. This dictionary file must not be lost or altered after chunks have been made using it; doing so voids the integrity of the entire repository.

Chunks are hashed with BLAKE256. Before a chunk is deduplicated away, its xxHash is also calculated so that a hash collision can be detected. (A detected collision aborts the backup. This is expected never to happen, but we cannot be certain.)
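
A minimal sketch of that collision guard, with both hashes reduced to plain `u64` values supplied by the caller (stand-ins for the BLAKE256 ID and the xxHash checksum). The `ChunkIndex` type and its `offer` method are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum StoreOutcome {
    /// First time this ID is seen: the chunk must be written.
    Stored,
    /// Same ID and same checksum: safe to deduplicate.
    Deduplicated,
}

/// Same primary ID but a different secondary checksum: abort the backup.
#[derive(Debug, PartialEq)]
pub struct Collision {
    pub id: u64,
}

/// Maps primary chunk ID -> secondary checksum of the content stored under it.
#[derive(Default)]
pub struct ChunkIndex {
    seen: HashMap<u64, u64>,
}

impl ChunkIndex {
    pub fn offer(&mut self, id: u64, checksum: u64) -> Result<StoreOutcome, Collision> {
        if let Some(&existing) = self.seen.get(&id) {
            return if existing == checksum {
                Ok(StoreOutcome::Deduplicated)
            } else {
                Err(Collision { id })
            };
        }
        self.seen.insert(id, checksum);
        Ok(StoreOutcome::Stored)
    }
}
```

Using two independent hashes means a chunk is only dropped as a duplicate when both agree, at the cost of computing the second hash for every chunk.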

Remote Protocol Details

  • Compression is performed on the host where the data resides.
  • Only required chunks are compressed and sent across the SSH connection.
  • There needs to be some mechanism to offer, decline and accept chunks, without buffers overflowing and bringing hosts down.
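
One possible shape for the offer/decline/accept mechanism, sketched without any networking. The receiving side answers each offer by accepting at most `max_batch` chunks it lacks, declining those it already holds, and deferring the rest so the sender offers them again later. All names here (`ChunkReceiver`, `OfferReply`, `respond`) are hypothetical, and this is not the protocol yama implements.

```rust
use std::collections::HashSet;

#[derive(Debug, Default, PartialEq)]
pub struct OfferReply {
    /// Wanted now: the sender may transmit these chunks.
    pub accepted: Vec<u64>,
    /// Already held: the sender should not transmit these.
    pub declined: Vec<u64>,
    /// Wanted, but over the batch limit: offer again later.
    pub deferred: Vec<u64>,
}

/// Receiving side of the hypothetical negotiation.
pub struct ChunkReceiver {
    pub have: HashSet<u64>,
    /// Upper bound on chunks accepted per offer, so buffers stay bounded.
    pub max_batch: usize,
}

impl ChunkReceiver {
    pub fn respond(&self, offered: &[u64]) -> OfferReply {
        let mut reply = OfferReply::default();
        for &id in offered {
            if self.have.contains(&id) {
                reply.declined.push(id);
            } else if reply.accepted.len() < self.max_batch {
                reply.accepted.push(id);
            } else {
                reply.deferred.push(id);
            }
        }
        reply
    }

    /// Record a chunk once it has actually arrived.
    pub fn store(&mut self, id: u64) {
        self.have.insert(id);
    }
}
```

Because the receiver sets the batch size, the sender can never have more chunks in flight than the receiver agreed to hold.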

Processor Details

Other notes

zstd --train FILEs -o zstd.dict

  • Candidate size: find ~/Programming -size -4k -size +64c -type f -exec grep -Iq . {} \; -printf "%s\n" | jq -s 'add'
  • Want to sample:
    • find ~/Programming -size -4k -size +64c -type f -exec grep -Iq . {} \; -exec cp {} -t /tmp/d/ \;
    • du -sh
    • find > file.list
    • wc -l < file.list → gives the number of lines
    • shuf -n 4242 file.list | xargs -x zstd --train -o zstd.dict trains on a random sample of 4242 files. This chokes on any filename containing a space; just re-run until you get a working set.
