
app btrfs-dedupe

BTRFS whole-file deduplication tool

3 stable releases

Uses old Rust 2015

1.0.2 Nov 6, 2016


MIT license


BTRFS Dedupe

This is a BTRFS deduplication utility. It operates in a batch mode, scanning for files with the same size, performing an SHA256 hash on each one, then invoking the kernel deduplication ioctl for all those that match.

It was written by James Pharaoh.

It is hosted at gitlab.wellbehavedsoftware.com (https://gitlab.wellbehavedsoftware.com/well-behaved-software/wbs-backup/tree/master/btrfs-dedupe). Please report any issues or feature requests there.

It is also available from the following locations:

General information

The utility is very simple. It takes a list of directories, scans for files with matching sizes, performs an SHA256 checksum on each one, then invokes the ioctl to deduplicate the entire file for every match it finds. Optionally, it can match filenames as well as sizes; this may make the program run faster in some cases.
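The scanning stage described above can be sketched as follows. This is a minimal illustration in Python, not the tool's actual Rust code. The names `sha256_of` and `find_duplicate_groups` are hypothetical, skipping empty files and symlinks is an assumption, and the final ioctl step is omitted entirely.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(roots, match_filename=False):
    """Group regular files under `roots` that share a size and a SHA256 digest.

    Hypothetical helper: it only reports candidate groups and never calls
    the kernel deduplication ioctl.
    """
    # Pass 1: bucket files by size (and by filename, if requested).
    candidates = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                size = os.path.getsize(path)
                if size == 0:
                    continue  # assumption: empty files are skipped
                key = (size, name) if match_filename else (size,)
                candidates[key].append(path)

    # Pass 2: hash only the buckets that hold more than one file.
    groups = []
    for paths in candidates.values():
        if len(paths) < 2:
            continue
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[sha256_of(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)
```

In this sketch, matching on filename shrinks the buckets built in the first pass, so fewer files need to be hashed in the second. That is one plausible reason the option could speed up a run.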

Usage

From the built-in help:

$ btrfs-dedupe --help

Btrfs Dedupe 

USAGE:
    btrfs-dedupe [FLAGS] [<PATH>]

FLAGS:
    -h, --help              Prints help information
        --match-filename    Match filename as well as checksum
    -V, --version           Prints version information

ARGS:
    <PATH>...    Root path to scan for files

Alternatives

There are two alternatives of which I am aware:

  • Duperemove: Performs a block-level hash on files and attempts to deduplicate parts of files. This is overkill for my purposes, although I have no reason to believe it does not work well. I believe it will be slower than this tool, since it does a far deeper analysis of file contents.

  • Bedup: Performs a similar task to this tool, and also keeps a database of files to avoid checksumming them again. The main implementation, however, does not use the kernel ioctls (which were simply not available when it was created), although a branch supports them. It can also leave filesystems in an inconsistent state when errors occur, namely with files left set as immutable, and it crashes if there are many files to deduplicate.

There is also ongoing work (http://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg32862.html) to enable automatic realtime deduplication in the filesystem itself, but this is likely to take a long time to stabilise, and there are fundamental issues with the concept which make it unsuitable for many cases.

There is a wiki page with general information about the state of deduplication in BTRFS.
