#blake3 #duplicate #filesize #hash #csv #find #identical

bin+lib find_duplicate_files

find identical files according to their size and hashing algorithm

69 releases (19 breaking)

0.28.0 Apr 11, 2024
0.26.1 Apr 8, 2024
0.21.2 Mar 31, 2024
0.18.1 Dec 30, 2023
0.4.7 Jul 31, 2023

#1997 in Command line utilities


129 downloads per month

BSD-3-Clause

78KB
1.5K SLoC

Rust 1K SLoC // 0.2% comments BASH 135 SLoC Zsh 64 SLoC Shell 4 SLoC // 0.6% comments

New project name

This project has been renamed to: find-identical-files.

Old project name: find_duplicate_files

find_duplicate_files

Find identical files according to their size and hashing algorithm.

"A hash function is a mathematical algorithm that takes an input (in this case, a file) and produces a fixed-size string of characters, known as a hash value or checksum. The hash value acts as a summary representation of the original input. This hash value is unique (disregarding unlikely collisions) to the input data, meaning even a slight change in the input will result in a completely different hash value."

Hash algorithm options are:

  1. ahash (used by hashbrown)

  2. blake3 (default)

  3. fxhash (used by Firefox and rustc)

  4. sha256

  5. sha512
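The size-then-hash idea can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the helper names `content_hash` and `group_identical` are hypothetical, file contents are passed in memory, and the standard library's `DefaultHasher` stands in for the algorithms listed above so that the example needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Stand-in 64-bit content hash (assumption: the real tool uses
/// ahash, blake3, fxhash, sha256 or sha512 instead).
fn content_hash(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

/// Hypothetical helper: group (name, contents) pairs that share
/// both the same size and the same hash.
fn group_identical<'a>(files: &[(&'a str, &'a [u8])]) -> Vec<Vec<&'a str>> {
    // Step 1: bucket by size, which is cheap and needs no hashing.
    let mut by_size: HashMap<usize, Vec<(&str, &[u8])>> = HashMap::new();
    for &(name, data) in files {
        by_size.entry(data.len()).or_default().push((name, data));
    }

    // Step 2: hash only the files whose size is shared with another file.
    let mut groups = Vec::new();
    for bucket in by_size.into_values().filter(|b| b.len() > 1) {
        let mut by_hash: HashMap<u64, Vec<&str>> = HashMap::new();
        for (name, data) in bucket {
            by_hash.entry(content_hash(data)).or_default().push(name);
        }
        groups.extend(by_hash.into_values().filter(|g| g.len() > 1));
    }

    // Sort for a deterministic result.
    for group in &mut groups {
        group.sort();
    }
    groups.sort();
    groups
}

fn main() {
    let files = [
        ("a.txt", b"hello".as_slice()),
        ("b.txt", b"hello".as_slice()),
        ("c.txt", b"world".as_slice()),
        ("d.txt", b"hi".as_slice()),
    ];
    for group in group_identical(&files) {
        println!("identical: {:?}", group);
    }
}
```

Comparing sizes first means that files with a unique size are never hashed at all, which is presumably why the size check comes before the hashing step.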

find_duplicate_files only reads files and never modifies their contents. See the function open_file() in the source code to verify.
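For readers unfamiliar with how read-only access looks in Rust, here is a small sketch. It does not reproduce the crate's open_file(); the helper `read_file_read_only` and the demo file name are hypothetical. It relies only on the fact that `std::fs::File::open` returns a handle opened in read-only mode.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical helper: read a file's bytes through a read-only handle.
fn read_file_read_only(path: &Path) -> io::Result<Vec<u8>> {
    // File::open opens the file in read-only mode, so this handle
    // cannot be used to write to the file.
    let mut file = File::open(path)?;
    let mut contents = Vec::new();
    file.read_to_end(&mut contents)?;
    Ok(contents)
}

fn main() -> io::Result<()> {
    // Demo file in the temporary directory (name chosen for illustration).
    let path = std::env::temp_dir().join("fdf_read_only_demo.txt");
    fs::write(&path, b"sample")?;

    let contents = read_file_read_only(&path)?;
    println!("read {} bytes", contents.len());

    // The file on disk is unchanged after reading.
    assert_eq!(fs::read(&path)?, b"sample");
    fs::remove_file(&path)?;
    Ok(())
}
```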

Usage examples

1. To find duplicate files in the current directory, run the command:

find_duplicate_files

2. To find files in the current directory that have at least 5 identical copies, run the command:

find_duplicate_files -n 5

The --min_number (or -n) option sets the minimum number of identical files.

The --max_number (or -N) option sets the maximum number of identical files.

If n = 0 or n = 1, all files will be reported.

If n = 2 (default), only groups of two or more identical files are reported.
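The effect of these two bounds can be pictured with a short sketch. The function `filter_groups` is hypothetical and only illustrates the rule described above; treating the maximum as inclusive is an assumption, since the text only states that the minimum means "at least".

```rust
/// Hypothetical filter mirroring -n / -N: keep only groups whose number
/// of identical files lies within the bounds (both assumed inclusive).
fn filter_groups(
    groups: Vec<Vec<String>>,
    min_number: usize,
    max_number: Option<usize>,
) -> Vec<Vec<String>> {
    groups
        .into_iter()
        .filter(|group| {
            group.len() >= min_number
                && max_number.map_or(true, |max| group.len() <= max)
        })
        .collect()
}

fn main() {
    let groups = vec![
        vec!["a".to_string()],
        vec!["b".to_string(), "c".to_string()],
        vec!["d".to_string(), "e".to_string(), "f".to_string()],
    ];

    // n = 2 drops the group with a single file.
    let duplicates = filter_groups(groups.clone(), 2, None);
    println!("groups kept with n = 2: {}", duplicates.len());

    // n = 1 keeps every group, so all files are reported.
    let everything = filter_groups(groups, 1, None);
    println!("groups kept with n = 1: {}", everything.len());
}
```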

3. To find duplicate files with the fxhash algorithm and print the result in YAML format:

find_duplicate_files -twa fxhash -r yaml

4. To find duplicate files in the Downloads directory and redirect the output to a JSON file for further analysis:

find_duplicate_files -vi ~/Downloads -r json > fdf.json

5. To find duplicate files in the current directory whose size is greater than or equal to 8 bytes:

find_duplicate_files -b 8

6. To find duplicate files in the current directory whose size is less than or equal to 1024 bytes:

find_duplicate_files -B 1024

7. To find duplicate files in the current directory whose size is between 8 and 1024 bytes:

find_duplicate_files -b 8 -B 1024

8. To find duplicate files in the current directory whose size is exactly 1024 bytes:

find_duplicate_files -b 1024 -B 1024
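Examples 5 to 8 all follow from one inclusive range check. The sketch below is illustrative only; `size_in_range` is a hypothetical name, and an absent bound is modelled as `None`.

```rust
/// Hypothetical check mirroring -b / -B: both bounds are inclusive,
/// and a missing bound places no restriction on that side.
fn size_in_range(size: u64, min_size: Option<u64>, max_size: Option<u64>) -> bool {
    min_size.map_or(true, |min| size >= min)
        && max_size.map_or(true, |max| size <= max)
}

fn main() {
    // -b 8: sizes of 8 bytes or more pass.
    assert!(size_in_range(8, Some(8), None));
    assert!(!size_in_range(7, Some(8), None));

    // -B 1024: sizes of 1024 bytes or less pass.
    assert!(size_in_range(1024, None, Some(1024)));
    assert!(!size_in_range(1025, None, Some(1024)));

    // -b 1024 -B 1024: only files of exactly 1024 bytes pass.
    assert!(size_in_range(1024, Some(1024), Some(1024)));
    assert!(!size_in_range(1023, Some(1024), Some(1024)));

    println!("all size checks passed");
}
```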

9. Export duplicate file information from the current directory to a CSV file (fdf.csv).

9.1 The CSV file will be saved in the current directory:

find_duplicate_files -c .

9.2 The CSV file will be saved in the /tmp directory:

find_duplicate_files --csv_dir=/tmp

10. Export duplicate file information from the current directory to an XLSX file (fdf.xlsx).

10.1 The XLSX file will be saved in the ~/Downloads directory:

find_duplicate_files -x ~/Downloads

10.2 The XLSX file will be saved in the /tmp directory:

find_duplicate_files --xlsx_dir=/tmp

11. To find duplicate files in the Downloads directory and export the result to /tmp/fdf.xlsx with the ahash algorithm:

find_duplicate_files -twi ~/Downloads -x /tmp -a ahash

Help

Type in the terminal find_duplicate_files -h to see the help messages and all available options:

find identical files according to their size and hashing algorithm

Usage: find_duplicate_files [OPTIONS]

Options:
  -a, --algorithm <ALGORITHM>
          Choose the hash algorithm [default: blake3] [possible values: ahash, blake3, fxhash, sha256, sha512]
  -b, --min_size <MIN_SIZE>
          Set a minimum file size (in bytes) to search for duplicate files
  -B, --max_size <MAX_SIZE>
          Set a maximum file size (in bytes) to search for duplicate files
  -c, --csv_dir <CSV_DIR>
          Set the output directory for the CSV file (fdf.csv)
  -d, --min_depth <MIN_DEPTH>
          Set the minimum depth to search for duplicate files
  -D, --max_depth <MAX_DEPTH>
          Set the maximum depth to search for duplicate files
  -f, --full_path
          Prints full path of duplicate files, otherwise relative path
  -g, --generate <GENERATOR>
          If provided, outputs the completion file for given shell [possible values: bash, elvish, fish, powershell, zsh]
  -i, --input_dir <INPUT_DIR>
          Set the input directory where to search for duplicate files [default: current directory]
  -n, --min_number <MIN_NUMBER>
          Minimum 'number of identical files' to be reported
  -N, --max_number <MAX_NUMBER>
          Maximum 'number of identical files' to be reported
  -o, --omit_hidden
          Omit hidden files (starts with '.'), otherwise search all files
  -r, --result_format <RESULT_FORMAT>
          Print the result in the chosen format [default: personal] [possible values: json, yaml, personal]
  -s, --sort
          Sort result by number of duplicate files, otherwise sort by file size
  -t, --time
          Show total execution time
  -v, --verbose
          Show intermediate runtime messages
  -w, --wipe_terminal
          Wipe (Clear) the terminal screen before listing the duplicate files
  -x, --xlsx_dir <XLSX_DIR>
          Set the output directory for the XLSX file (fdf.xlsx)
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version

Building

To build and install from source, run the following command:

cargo install find_duplicate_files

Another option is to install from GitHub:

cargo install --git https://github.com/claudiofsr/find_duplicate_files.git

Mutually exclusive features

Walking a directory recursively: jwalk or walkdir.

In general, jwalk (default) is faster than walkdir.

But if you prefer to use walkdir:

cargo install --features walkdir find_duplicate_files
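To show what "walking a directory recursively" involves, here is a minimal sketch that uses only the standard library. It is a stand-in for illustration, not how jwalk or walkdir work internally, and the function name `collect_files` is hypothetical. It is single-threaded and applies no depth limits or hidden-file filtering.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: recursively collect the paths of all
/// non-directory entries under `dir`.
fn collect_files(dir: &Path, files: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        // file_type() does not follow symlinks, so a symlink that points
        // to a directory is recorded as an entry and not descended into.
        if entry.file_type()?.is_dir() {
            collect_files(&path, files)?;
        } else {
            files.push(path);
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut files = Vec::new();
    collect_files(Path::new("."), &mut files)?;
    println!("found {} files under the current directory", files.len());
    Ok(())
}
```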

Dependencies

~16–25MB
~351K SLoC