#file #command-line-tool #duplicates #rust

check_hashes

A fast and parallelized tool to detect duplicate files in a directory based on hashes

1 stable release: 1.0.0 (Apr 28, 2025) · Rust 2024 edition

#20 in #duplicates · MIT license · 15KB · 192 lines
Duplicate File Finder

A fast and efficient tool to detect duplicate files in a directory based on file content.


Features

  • Partial Hashing for quick initial grouping (reads first 4 KB; see the sketch after this list).
  • Full Hashing for final confirmation (full file read or memory-mapped).
  • Parallelized using Rayon for high performance.
  • Progress Bars for visual feedback.
  • Supports large datasets and very large files.
  • Colored terminal output for better readability.
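
As a rough illustration, the partial-hashing step could look like the sketch below. This is a minimal sketch, not the crate's actual code; the function name partial_hash is hypothetical, and the 4 KB cutoff matches the description above.

use std::fs::File;
use std::io::Read;
use std::path::Path;

// Hash only the first 4 KB of a file for cheap initial grouping.
fn partial_hash(path: &Path) -> std::io::Result<blake3::Hash> {
    let mut buf = Vec::with_capacity(4096);
    // take(4096) stops reading after the first 4 KB, even for huge files.
    File::open(path)?.take(4096).read_to_end(&mut buf)?;
    Ok(blake3::hash(&buf))
}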

Usage

1. Install Rust (if you don't have it)

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

2. Clone and build the project

git clone https://github.com/yourusername/duplicate-file-finder.git
cd duplicate-file-finder
cargo build --release

3. Run the program

cargo run -- --path /path/to/your/directory

Or using the compiled release binary:

./target/release/duplicate-file-finder --path /path/to/your/directory

Example

cargo run -- --path ./Downloads

Sample output:

Scanning files...
Found 5321 files. Computing partial hashes...
Grouping files by partial hash...
421 candidate files after partial hashing. Computing full hashes...
Grouping by full hash...

❌ Duplicates found:

Group 1 (2 files) - Hash: d2f1d7e91c8b...
  /path/to/file1.jpg
  /path/to/file1_copy.jpg

Group 2 (3 files) - Hash: a34e1b1fe98d...
  /path/to/doc1.pdf
  /path/to/backup/doc1.pdf
  /path/to/archive/old/doc1.pdf

Found 2 duplicate groups.

Summary: Scanned 5321 files in 1m 12s.

Command-Line Arguments

Argument       Description                     Example
--path, -p     Directory to scan recursively   --path ./Documents
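
A minimal clap definition producing this interface could look like the sketch below (assuming clap 4 with the derive feature enabled; the struct and field names are illustrative, not the crate's actual source):

use clap::Parser;
use std::path::PathBuf;

#[derive(Parser)]
struct Args {
    /// Directory to scan recursively
    #[arg(short = 'p', long = "path")]
    path: PathBuf,
}

fn main() {
    let args = Args::parse();
    println!("Scanning {}", args.path.display());
}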

How It Works

  • Step 1: Scan all files under the given directory recursively.
  • Step 2: Compute a partial hash (first 4 KB) of each file.
  • Step 3: Group files with identical partial hashes.
  • Step 4: Compute full hashes for the candidate groups.
  • Step 5: Report groups of true duplicates based on full file content.

This two-stage approach keeps the scan fast even for very large folders: most files are ruled out after reading only their first 4 KB, so full hashing runs on just the few remaining candidates.
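
Both grouping stages can share one helper that hashes files in parallel with Rayon and keeps only the buckets with at least two members. The sketch below is illustrative rather than the crate's actual implementation; group_by_hash and hash_fn are hypothetical names, and the map key is the raw 32-byte BLAKE3 digest:

use rayon::prelude::*;
use std::collections::HashMap;
use std::path::{Path, PathBuf};

// Hash every path in parallel, bucket paths by digest, and keep only
// buckets with at least two members (the duplicate candidates).
fn group_by_hash<F>(paths: Vec<PathBuf>, hash_fn: F) -> Vec<Vec<PathBuf>>
where
    F: Fn(&Path) -> Option<[u8; 32]> + Sync,
{
    let hashed: Vec<([u8; 32], PathBuf)> = paths
        .into_par_iter()
        .filter_map(|p| hash_fn(&p).map(|h| (h, p)))
        .collect();

    let mut groups: HashMap<[u8; 32], Vec<PathBuf>> = HashMap::new();
    for (h, p) in hashed {
        groups.entry(h).or_default().push(p);
    }
    groups.into_values().filter(|g| g.len() > 1).collect()
}

Running this helper twice, first with the cheap partial hash over every file and then with the full hash over only the surviving candidates, yields the final duplicate groups.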


Dependencies

This project uses:

  • blake3 for fast cryptographic hashing.
  • clap for argument parsing.
  • rayon for parallel processing.
  • indicatif for progress bars.
  • colored for colored terminal output.
  • walkdir for recursive file walking.
  • memmap2 for memory-mapping large files.
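
For Step 4, memmap2 lets a very large file be hashed without first buffering it in memory. A minimal sketch of that idea (illustrative, not the crate's actual code; full_hash is a hypothetical name):

use memmap2::Mmap;
use std::fs::File;
use std::path::Path;

// Hash the entire file contents through a read-only memory map.
fn full_hash(path: &Path) -> std::io::Result<blake3::Hash> {
    let file = File::open(path)?;
    // Safety: the map is only read, and we assume no other process
    // truncates the file while it is being hashed.
    let map = unsafe { Mmap::map(&file)? };
    Ok(blake3::hash(&map))
}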

All dependencies are installed automatically when you run cargo build.


License

This project is licensed under the MIT License. See LICENSE for more information.

Dependencies: ~7–16MB, ~206K SLoC