#find #duplicate #hash #blake3 #fxhash

bin+lib find_duplicate_files

find duplicate files according to their size and hashing algorithm

44 releases (9 breaking)

0.18.1 Dec 30, 2023
0.17.2 Dec 30, 2023
0.16.5 Nov 19, 2023
0.4.7 Jul 31, 2023

#276 in Command line utilities


2,243 downloads per month

BSD-3-Clause

49KB
911 lines

find_duplicate_files

Find duplicate files according to their size and hashing algorithm.

"A hash function is a mathematical algorithm that takes an input (in this case, a file) and produces a fixed-size string of characters, known as a hash value or checksum. The hash value acts as a summary representation of the original input. This hash value is unique (disregarding unlikely collisions) to the input data, meaning even a slight change in the input will result in a completely different hash value."

Hash algorithm options are:

  1. ahash (used by hashbrown)

  2. blake3 (default)

  3. fxhash (used by Firefox and rustc)

  4. sha256

  5. sha512
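The overall approach, "according to their size and hashing algorithm", can be sketched roughly as follows: group candidate files by size first, then hash only files that share a size. This is only an illustrative sketch (non-recursive, single-threaded, blake3 only); the helper name find_duplicates is an assumption, not the crate's API.

    use std::collections::HashMap;
    use std::fs;
    use std::io;
    use std::path::PathBuf;

    // Illustrative sketch: files sharing both size and hash are duplicates.
    fn find_duplicates(dir: &str) -> io::Result<Vec<Vec<PathBuf>>> {
        // 1. Group paths by file size (cheap, no file contents read yet).
        let mut by_size: HashMap<u64, Vec<PathBuf>> = HashMap::new();
        for entry in fs::read_dir(dir)? {
            let path = entry?.path();
            if path.is_file() {
                by_size.entry(fs::metadata(&path)?.len()).or_default().push(path);
            }
        }

        // 2. For each size group with more than one file, group by hash.
        let mut duplicates = Vec::new();
        for paths in by_size.values().filter(|p| p.len() > 1) {
            let mut by_hash: HashMap<String, Vec<PathBuf>> = HashMap::new();
            for path in paths {
                let hash = blake3::hash(&fs::read(path)?);
                by_hash.entry(hash.to_hex().to_string()).or_default().push(path.clone());
            }
            duplicates.extend(by_hash.into_values().filter(|p| p.len() > 1));
        }
        Ok(duplicates)
    }

The real crate walks directories recursively (with jwalk or walkdir, see below) and supports several hash algorithms and output formats.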

find_duplicate_files only reads files and never changes their contents. See the fn open_file() function in the source to verify.
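For instance, a read-only file access in Rust looks like the following sketch (the idea, not the crate's exact fn open_file()):

    use std::fs::File;
    use std::io::{self, Read};
    use std::path::Path;

    // File::open never requests write access, so the file
    // contents cannot be modified through this handle.
    fn read_file_bytes(path: &Path) -> io::Result<Vec<u8>> {
        let mut file = File::open(path)?; // read-only handle
        let mut buffer = Vec::new();
        file.read_to_end(&mut buffer)?;
        Ok(buffer)
    }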

Usage examples

  1. To find duplicate files in the current directory, run the command:
find_duplicate_files
  2. To find duplicate files using the fxhash algorithm and print the result in yaml format (the combined flags -csta expand to -c, -s, -t and -a):
find_duplicate_files -csta fxhash -r yaml
  3. To find duplicate files in the Downloads directory and redirect the output to a json file for further analysis:
find_duplicate_files -p ~/Downloads -r json > fdf.json

Help

Type find_duplicate_files -h in the terminal to see the help message and all available options:

find duplicate files according to their size and hashing algorithm

Usage: find_duplicate_files [OPTIONS]

Options:
  -a, --algorithm <ALGORITHM>
          Choose the hash algorithm [default: blake3] [possible values: ahash, blake3, fxhash, sha256, sha512]
  -c, --clear_terminal
          Clear the terminal screen before listing the duplicate files
  -f, --full_path
          Prints full path of duplicate files, otherwise relative path
  -g, --generate <GENERATOR>
          If provided, outputs the completion file for given shell [possible values: bash, elvish, fish, powershell, zsh]
  -m, --max_depth <MAX_DEPTH>
          Set the maximum depth to search for duplicate files
  -o, --omit_hidden
          Omit hidden files (starts with '.'), otherwise search all files
  -p, --path <PATH>
          Set the path where to look for duplicate files, otherwise use the current directory
  -r, --result_format <RESULT_FORMAT>
          Print the result in the chosen format [default: personal] [possible values: json, yaml, personal]
  -s, --sort
          Sort result by file size, otherwise sort by number of duplicate files
  -t, --time
          Show total execution time
  -v, --verbose
          Show intermediate runtime messages
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version

Building

To build and install from source, run the following command:

cargo install find_duplicate_files

Another option is to install from GitHub:

cargo install --git https://github.com/claudiofsr/find_duplicate_files.git

Mutually exclusive features

Walking a directory recursively: jwalk or walkdir.

In general, jwalk (default) is faster than walkdir.

But if you prefer to use walkdir:

cargo install --features walkdir find_duplicate_files
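Internally, such a mutually exclusive feature switch typically selects the walker at compile time with cfg attributes, roughly like this sketch (an assumption about the wiring, not the crate's exact code; both crates expose a similar WalkDir API):

    // Sketch: the default build uses jwalk; `--features walkdir` switches to walkdir.
    #[cfg(not(feature = "walkdir"))]
    use jwalk::WalkDir;
    #[cfg(feature = "walkdir")]
    use walkdir::WalkDir;

    fn count_entries(dir: &str) -> usize {
        // Both crates provide WalkDir::new(..) and an iterator of Result<DirEntry, _>,
        // so the calling code can stay the same regardless of the chosen feature.
        WalkDir::new(dir).into_iter().filter_map(Result::ok).count()
    }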

Dependencies

~12–23MB
~403K SLoC