#duplicate #finder #dupe #command-line-tool #filesize #filename

bin+lib clonehunter

An ultra simple command line utility that identifies groups of identical files and displays them to the console

3 unstable releases

0.2.1 Apr 16, 2024
0.2.0 Apr 15, 2024
0.1.0 Apr 11, 2024

#652 in Filesystem

Download history 101/week @ 2024-04-08 262/week @ 2024-04-15

363 downloads per month

MIT/Apache

35KB
491 lines

CloneHunter: An ultra simple command line utility that identifies groups of identical files and displays them to the console.


Copyright (c) 2024 Venkatesh Omkaram

How to Use?

If you have the program as a binary executable, run clonehunter --help for usage. If you are running this program via Cargo, run cargo run -- --help from the root folder for usage.

To install the program permanently on your system, run cargo install clonehunter.

Example usage:

clonehunter your-folder-path -t 12 -c -v -m 50

-c stands for checksum. If you pass this option, clonehunter will find file clones (aka duplicate files or identical files) based on a partial checksum, computed by reading bytes from the beginning and the end of the file content. If you do not pass the -c option, clonehunter scans for clones based on a hash of the file name, modified time and file size combined.

-m stands for max depth. The number after -m indicates how many directory levels deep to look for clones. The default value is 10. If you do not wish to limit the depth, pass the option --no-max-depth instead.

-v stands for verbose. It prints the hashes of each and every file so you can compare them and manually figure out clones.

-t stands for threads. Choose the number of threads to allocate to the program for hunting. The example above uses 12 threads.
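For instance, an illustrative command (built only from the options described above, not taken from the original docs) that hunts with 4 threads, skips the checksum and removes the depth limit would be:

clonehunter your-folder-path -t 4 --no-max-depth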

How it works?

The program looks for duplicate files in one of two modes:

  1. Without checksum calculation
  2. With checksum calculation (by passing -c)

Without checksum calculation:

This applies when you do not pass the -c option. The program looks for clones based on a hash of the file size, file name and modified time combined.

  • If two files share only the file name and file size, that does not qualify as a clone
  • If two files share only the file size and modified time, that does not qualify as a clone
  • If two files share only the file name and modified time, that does not qualify as a clone
  • Finally, if the file names, file sizes and modified times are all the same, the files are reported as a clone
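To make that concrete, here is a minimal Rust sketch of the metadata-based mode. It is illustrative only and is not the crate's actual code: the function names (metadata_key, group_possible_clones) are hypothetical, and the standard library's DefaultHasher stands in for whatever hash function clonehunter really uses.

    use std::collections::hash_map::DefaultHasher;
    use std::collections::HashMap;
    use std::fs;
    use std::hash::{Hash, Hasher};
    use std::io;
    use std::path::{Path, PathBuf};
    use std::time::UNIX_EPOCH;

    // Hash the file name, size and modified time into a single key.
    fn metadata_key(path: &Path) -> io::Result<u64> {
        let meta = fs::metadata(path)?;
        let mut hasher = DefaultHasher::new();
        path.file_name().hash(&mut hasher);      // file name
        meta.len().hash(&mut hasher);            // file size
        meta.modified()?
            .duration_since(UNIX_EPOCH)
            .unwrap_or_default()
            .as_secs()
            .hash(&mut hasher);                  // modified time
        Ok(hasher.finish())
    }

    // Group paths that produce the same metadata key.
    fn group_possible_clones(paths: &[PathBuf]) -> HashMap<u64, Vec<PathBuf>> {
        let mut groups: HashMap<u64, Vec<PathBuf>> = HashMap::new();
        for p in paths {
            if let Ok(key) = metadata_key(p) {
                groups.entry(key).or_default().push(p.clone());
            }
        }
        // Only keys shared by more than one path are possible clones.
        groups.retain(|_, v| v.len() > 1);
        groups
    }

Any key shared by two or more paths marks those paths as possible clones.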

Now, the question may arise: what if two files have different names but exactly the same content, regardless of modified time? That must be a clone, correct? Yes. This is where the 'With checksum calculation' -c option helps.

With checksum calculation:

A checksum is also a hash, but this is performed on the file content instead of the file metadata such as name, size and time.

  • If the file size is less than (<) 1 MB, the checksum is computed over the entire file content.
  • If the file size is greater than (>) 1 MB, the program combines the first 1 MB of the file, the last 1 MB of the file and the file size, and hashes that combination. This way, we can be reasonably confident that two files are clones without checksumming the whole length of the file.
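As an illustration of that partial-checksum scheme, here is a hedged Rust sketch. Again, this is not clonehunter's actual implementation: the function name partial_checksum is hypothetical, and DefaultHasher is a stand-in for the crate's real hash function.

    use std::collections::hash_map::DefaultHasher;
    use std::fs::File;
    use std::hash::{Hash, Hasher};
    use std::io::{self, Read, Seek, SeekFrom};
    use std::path::Path;

    const ONE_MB: u64 = 1024 * 1024;

    fn partial_checksum(path: &Path) -> io::Result<u64> {
        let mut file = File::open(path)?;
        let size = file.metadata()?.len();
        let mut hasher = DefaultHasher::new();

        if size <= ONE_MB {
            // Small file: hash the entire content.
            let mut buf = Vec::with_capacity(size as usize);
            file.read_to_end(&mut buf)?;
            buf.hash(&mut hasher);
        } else {
            // Large file: hash the first 1 MB, the last 1 MB and the total size.
            let mut head = vec![0u8; ONE_MB as usize];
            file.read_exact(&mut head)?;
            head.hash(&mut hasher);

            let mut tail = vec![0u8; ONE_MB as usize];
            file.seek(SeekFrom::End(-(ONE_MB as i64)))?;
            file.read_exact(&mut tail)?;
            tail.hash(&mut hasher);

            size.hash(&mut hasher);
        }
        Ok(hasher.finish())
    }

Two files whose partial checksums match are then grouped as possible clones, in the same way as in the metadata-based sketch above.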

Some considerations

The program scans and reports identical files on a best-effort basis. This means that not every file it reports can be deemed 'absolutely identical'; the key term here is "possibly identical". This tool can be used when you want a quick analysis of which files are POSSIBLY identical. It must not be used in critical places or business solutions, and must not be considered the source of truth for deleting any of the identical files it finds.

Also, using this tool will not destroy any files on your machine. There are no delete or write operations performed in the code. If you find any such strangeness, please raise an Issue. At most, the tool reports incorrect identical files or skips files that are not accessible due to file permission restrictions.

Dependencies

~7–19MB
~207K SLoC