#dedup #program #cli

bin+lib multi-machine-dedup

Deduplication tool using SQLite to allow multi-machine features

1 unstable release

0.2.0 Dec 22, 2022

#9 in #program

21 downloads per month

GPL-3.0-or-later

22KB
263 lines

multi-machine-dedup

About

multi-machine-dedup is a deduplication tool using SQLite to allow multi-machine features.

multi-machine-dedup is an EDLA project.

The purpose of edla.org is to promote the state of the art in various domains.

Installation

cargo install multi-machine-dedup

How to use it

Index recursively a directory <DIRECTORY_FULL_PATH> labelled with a <LABEL> in a SQLite database <SQLITE_FILE>

 multi-machine-dedup index -l <LABEL> --db <SQLITE_FILE> <DIRECTORY_FULL_PATH>

Check a directory

 multi-machine-dedup check-integrity -l <LABEL> --db <SQLITE_FILE>

Compare two databases

 multi-machine-dedup compare --db1 <SQLITE_FILE_1> --db2 <SQLITE_FILE_2>

Example of SQL queries

You can use a convenient database tool like DBeaver CE or SQLiteStudio to query the generated SQLite database.

Find top duplicates files larger than <A_SIZE>

select label, full_path, hash,size,nb_dup from file , (select hash, count(*) as nb_dup from file where size > <A_SIZE>
group by hash order by nb_dup DESC, size DESC) as T
where file.hash = T.hash  order by nb_dup DESC, size DESC ;

Find all files with the same <CRC_VALUE>

select * from file where hash=<A_CRC_VALUE> ;

Find all files with image/jpeg MIME-type.

select * from hash where mime like "image/jpeg" ;

Tips

  • Enable debug mode in PowerShell
$Env:LOG='debug';  cargo run ...
  • Remove LOG environement variable in PorwerShell
remove-item Env:LOG
  • Show help for a <SUBCOMMAND>
multi-machine-dedup <SUBCOMMAND> --help

or

multi-machine-dedup help <SUBCOMMAND>

Roadmap

Inspired by https://github.com/hgrecco/dedup multi-machine-dedup will probably propose similar features.

License

© 2022 Olivier ROLAND. Distributed under the GPLv3 License.

Dependencies

~27–37MB
~597K SLoC