39 releases (4 stable)
Version | Released |
---|---|
1.3.0 | Jul 30, 2024 |
1.2.0 | Feb 28, 2024 |
1.1.0 | Feb 13, 2023 |
1.0.0 | Jan 23, 2022 |
0.8.3 | Nov 20, 2020 |
YADF — Yet Another Dupes Finder
It's fast on my machine.
You should probably use fclones.
Installation
Prebuilt Packages
Executable binaries for some platforms are available in the releases section.
Building from source
- Install the Rust toolchain
- Run cargo install --locked yadf
Usage
yadf defaults:
- searches the current working directory ($PWD)
- output format is the same as the "standard" fdupes: newline-separated groups
- descends automatically into subdirectories
- search includes every file (including empty files)
yadf # find duplicate files in current directory
yadf ~/Documents ~/Pictures # find duplicate files in two directories
yadf --depth 0 file1 file2 # compare two files
yadf --depth 1 # find duplicates in current directory without descending
fd --type d a | yadf --depth 1 # find directories with an "a" and search them for duplicates without descending
fd --type f a | yadf # find files with an "a" and check them for duplicates
Filtering
yadf --min 100M # find duplicate files of at least 100 MB
yadf --max 100M # find duplicate files below 100 MB
yadf --pattern '*.jpg' # find duplicate jpg files
yadf --regex '^g' # find duplicates whose names start with 'g'
yadf --rfactor over:10 # find files with more than 10 copies
yadf --rfactor under:10 # find files with fewer than 10 copies
yadf --rfactor equal:1 # find unique files
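The replication factor examples above can be read as a predicate on the number of copies in each group of identical files. The following is a minimal sketch of that reading, not yadf's actual code: the `ReplicationFactor` type and its `matches` method are hypothetical names, and the strict comparisons for `over` and `under` are taken from the "more than" and "fewer than" wording of the examples.

```rust
use std::str::FromStr;

/// Hypothetical model of the `--rfactor` argument: `[under|equal|over]:n`.
/// This is an illustration only, not the type yadf uses internally.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReplicationFactor {
    Under(usize),
    Equal(usize),
    Over(usize),
}

impl FromStr for ReplicationFactor {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Split "over:10" into the comparison keyword and the count.
        let (kind, count) = s
            .split_once(':')
            .ok_or_else(|| format!("expected [under|equal|over]:n, got {s:?}"))?;
        let count = count
            .parse::<usize>()
            .map_err(|e| format!("invalid count {count:?}: {e}"))?;
        match kind {
            "under" => Ok(Self::Under(count)),
            "equal" => Ok(Self::Equal(count)),
            "over" => Ok(Self::Over(count)),
            other => Err(format!("unknown comparison {other:?}")),
        }
    }
}

impl ReplicationFactor {
    /// Returns true when a group of `copies` identical files passes the filter.
    /// Assumption: `under` and `over` are strict comparisons.
    pub fn matches(self, copies: usize) -> bool {
        match self {
            Self::Under(n) => copies < n,
            Self::Equal(n) => copies == n,
            Self::Over(n) => copies > n,
        }
    }
}
```

Under this reading, `equal:1` keeps only groups with a single member, which is why it reports unique files.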
Formatting
Run yadf -h for the list of output formats.
yadf -f json
yadf -f fdupes
yadf -f csv
yadf -f ld-json
Help output.
Yet Another Dupes Finder
Usage: yadf [OPTIONS] [PATHS]...
Arguments:
  [PATHS]...  Directories to search
Options:
  -f, --format <FORMAT>        Output format [default: fdupes] [possible values: csv, fdupes, json, json-pretty, ld-json, machine]
  -a, --algorithm <ALGORITHM>  Hashing algorithm [default: ahash] [possible values: ahash, highway, metrohash, seahash, xxhash]
  -n, --no-empty               Excludes empty files
      --min <size>             Minimum file size
      --max <size>             Maximum file size
  -d, --depth <depth>          Maximum recursion depth
  -H, --hard-links             Treat hard links to same file as duplicates
  -R, --regex <REGEX>          Check files with a name matching a Perl-style regex, see: https://docs.rs/regex/1.4.2/regex/index.html#syntax
  -p, --pattern <glob>         Check files with a name matching a glob pattern, see: https://docs.rs/globset/0.4.6/globset/index.html#syntax
  -v, --verbose...             Increase logging verbosity
  -q, --quiet...               Decrease logging verbosity
      --rfactor <RFACTOR>      Replication factor [under|equal|over]:n
  -o, --output <OUTPUT>        Optional output file
  -h, --help                   Print help (see more with '--help')
  -V, --version                Print version
For sizes, K/M/G/T[B|iB] suffixes can be used (case-insensitive).
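As an illustration of how such size arguments could be interpreted, here is a minimal sketch of a parser for the suffixes listed above. The `parse_size` function is hypothetical and not part of yadf. It assumes that K/KB, M/MB, G/GB and T/TB are powers of 1000 and that KiB, MiB, GiB and TiB are powers of 1024; yadf's actual interpretation may differ.

```rust
/// Parses a size such as "100M", "4kb" or "2GiB" into a byte count.
///
/// ASSUMPTION (not confirmed by yadf's documentation): bare and "B" suffixes
/// are decimal (powers of 1000), "iB" suffixes are binary (powers of 1024).
pub fn parse_size(input: &str) -> Option<u64> {
    // Suffixes are case-insensitive, so normalise once up front.
    let s = input.trim().to_ascii_lowercase();

    // Split the leading digits from the unit suffix.
    let digits_end = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (number, suffix) = s.split_at(digits_end);
    let value = number.parse::<u64>().ok()?;

    let multiplier: u64 = match suffix.trim() {
        "" => 1,
        "k" | "kb" => 1_000,
        "m" | "mb" => 1_000_000,
        "g" | "gb" => 1_000_000_000,
        "t" | "tb" => 1_000_000_000_000,
        "kib" => 1u64 << 10,
        "mib" => 1u64 << 20,
        "gib" => 1u64 << 30,
        "tib" => 1u64 << 40,
        _ => return None,
    };

    // Reject values that would overflow instead of wrapping silently.
    value.checked_mul(multiplier)
}
```

With these assumptions, `parse_size("100M")` and `parse_size("100mb")` both yield 100,000,000 bytes, while an unknown suffix yields `None`.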
Notes on the algorithm
Most¹ dupe finders follow a 3-step algorithm:
- group files by their size
- group files by their first few bytes
- group files by their entire content
yadf skips the first step and only does steps 2 and 3, preferring hashing to byte comparison. In my tests, having the first step on an SSD actually slowed down the program.
yadf makes heavy use of the standard library BTreeMap, which uses a cache-aware implementation that avoids too many cache misses. yadf uses the parallel walker provided by ignore (disabling its ignore features) and rayon's parallel iterators to do each of these 2 steps in parallel.
¹: some need a different algorithm to support different features or different performance trade-offs
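Steps 2 and 3 described above can be sketched with the standard library alone. This is a simplified, sequential illustration and not yadf's implementation: it uses a hand-written FNV-1a hash as a stand-in for ahash, it does not use ignore or rayon, and the names `PREFIX_LEN`, `hash_file` and `find_duplicates` are hypothetical. The 4096-byte prefix length is an arbitrary assumption.

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{self, Read};
use std::path::{Path, PathBuf};

/// ASSUMPTION: how many leading bytes count as "the first few bytes".
const PREFIX_LEN: u64 = 4096;

// FNV-1a 64-bit constants; FNV-1a is only a stand-in for a faster hash.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Hashes at most `limit` bytes of the file at `path`.
fn hash_file(path: &Path, limit: u64) -> io::Result<u64> {
    let mut reader = File::open(path)?.take(limit);
    let mut hash = FNV_OFFSET;
    let mut buffer = [0u8; 8192];
    loop {
        let read = reader.read(&mut buffer)?;
        if read == 0 {
            return Ok(hash);
        }
        for &byte in &buffer[..read] {
            hash = (hash ^ u64::from(byte)).wrapping_mul(FNV_PRIME);
        }
    }
}

/// Returns groups of paths whose contents hash to the same value.
/// Equal hashes are treated as equal files, so a hash collision would
/// produce a false positive.
pub fn find_duplicates(paths: Vec<PathBuf>) -> io::Result<Vec<Vec<PathBuf>>> {
    // Step 2: group files by a hash of their first few bytes.
    let mut by_prefix: BTreeMap<u64, Vec<PathBuf>> = BTreeMap::new();
    for path in paths {
        let key = hash_file(&path, PREFIX_LEN)?;
        by_prefix.entry(key).or_default().push(path);
    }

    // Step 3: only files that share a prefix hash are hashed in full.
    let mut by_content: BTreeMap<u64, Vec<PathBuf>> = BTreeMap::new();
    let candidates = by_prefix.into_values().filter(|group| group.len() > 1);
    for path in candidates.flatten() {
        let key = hash_file(&path, u64::MAX)?;
        by_content.entry(key).or_default().push(path);
    }

    // Keep only groups that actually contain duplicates.
    Ok(by_content
        .into_values()
        .filter(|group| group.len() > 1)
        .collect())
}
```

Skipping singleton groups between the two passes means a file with a unique prefix is never read in full. Each pass is an independent map over paths, so both loops could be run in parallel.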
Design goals
I set out to build a high-performing artefact by assembling libraries that do the actual work. Nothing here is custom made; it's all "off-the-shelf" software.
Benchmarks
The performance of yadf is heavily tied to the hardware, specifically the NVMe SSD. I recommend fclones, as it has more hardware heuristics and, in general, more features. yadf on HDDs is terrible.
My home directory contains upwards of 700k paths and 39 GB of data, and is probably a pathological case of file duplication with all the node_modules, python virtual environments, rust target, etc. Arguably, the most important measure here is the mean time when the filesystem cache is cold.
Program (warm filesystem cache) | Version | Mean [s] | Min [s] | Max [s] |
---|---|---|---|---|
fclones | 0.29.3 | 7.435 ± 1.609 | 4.622 | 9.317 |
jdupes | 1.14.0 | 16.787 ± 0.208 | 16.484 | 17.178 |
ddh | 0.13 | 12.703 ± 1.547 | 10.814 | 14.793 |
dupe-krill | 1.4.7 | 15.555 ± 1.633 | 12.486 | 16.959 |
fddf | 1.7.0 | 18.441 ± 1.947 | 15.097 | 22.389 |
yadf | 1.1.0 | 3.157 ± 0.638 | 2.362 | 4.175 |
Program (cold filesystem cache) | Version | Mean [s] | Min [s] | Max [s] |
---|---|---|---|---|
fclones | 0.29.3 | 68.950 ± 3.694 | 63.165 | 73.534 |
jdupes | 1.14.0 | 303.907 ± 11.578 | 277.618 | 314.226 |
yadf | 1.1.0 | 52.481 ± 1.125 | 50.412 | 54.265 |
I tested fewer programs here because the cold-cache benchmark takes several hours to run.
The script used to benchmark can be read here.
Hardware used.
Extract from neofetch and hwinfo --disk:
- OS: Ubuntu 20.04.1 LTS x86_64
- Host: XPS 15 9570
- Kernel: 5.4.0-42-generic
- CPU: Intel i9-8950HK (12) @ 4.800GHz
- Memory: 4217MiB / 31755MiB
- Disk:
  - model: "SK hynix Disk"
  - driver: "nvme"