5 releases (1 stable)

1.0.0 Aug 26, 2024
0.2.0 Feb 4, 2024
0.1.2 Feb 13, 2023
0.1.1 Feb 10, 2023
0.1.0 Feb 10, 2023

#97 in Images

AGPL-3.0

21KB
456 lines

dupimg

A minimal command-line duplicate image finder with persistent caching.

Summary

Checks the similarity of images specified on the command line by hashing them with the image_hasher library and computing the Hamming distance between the hashes. Likely duplicates are printed in groups to the terminal.
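To illustrate what the distance value means, here is a minimal sketch of a Hamming distance over two hashes stored as raw bytes. It is not dupimg's actual code (dupimg relies on image_hasher for this); the function name `hamming_distance` and the byte-slice representation are assumptions made for the example.

```rust
/// Counts the bits that differ between two hashes of equal length.
/// Returns `None` when the lengths differ, since such hashes are not comparable.
/// (Hypothetical helper, not part of dupimg or image_hasher.)
fn hamming_distance(a: &[u8], b: &[u8]) -> Option<u32> {
    if a.len() != b.len() {
        return None;
    }
    // XOR leaves a 1 bit wherever the two bytes disagree; count those bits.
    Some(a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum())
}
```

A distance of 0 means the hashes are identical; the larger the distance, the less similar the images are likely to be.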

Both the hash and the Hamming distance computations are multithreaded; comparing 2762 images takes ~25 seconds on a Ryzen 9 5900X.

Computed image hashes are stored persistently in CSV files under ~/.cache/dupimg/, so that only the Hamming distance calculations have to be redone on subsequent comparisons of the same images.
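The caching idea can be sketched roughly as follows. The actual layout of the CSV files is not documented here, so the `<path>,<hash>` line format, the hex-string hash, and the names `parse_cache` and `hash_with_cache` are all assumptions for illustration only.

```rust
use std::collections::HashMap;

/// Parses cache text in which every line is `<path>,<hash>`.
/// This layout is hypothetical; dupimg's real CSV columns may differ.
fn parse_cache(text: &str) -> HashMap<String, String> {
    text.lines()
        // Split at the last comma so that paths containing commas survive.
        .filter_map(|line| line.rsplit_once(','))
        .map(|(path, hash)| (path.to_string(), hash.to_string()))
        .collect()
}

/// Returns the cached hash for `path`, or computes it with `compute`
/// and remembers the result for later lookups.
fn hash_with_cache<F>(cache: &mut HashMap<String, String>, path: &str, compute: F) -> String
where
    F: FnOnce(&str) -> String,
{
    if let Some(hit) = cache.get(path) {
        return hit.clone();
    }
    let hash = compute(path);
    cache.insert(path.to_string(), hash.clone());
    hash
}
```

With a cache like this, the expensive hashing step runs only for images that have not been seen before.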

Installation

Either run cargo install dupimg, or clone this repository and run cargo install --path . from the project root.

Basic usage

dupimg [-r directory1/ directory2/ ...] file1.jpg file2.png ...

For additional information, see dupimg --help.

Output format

Output groups are sorted alphabetically by path.

<PATH 1>            # first image
<DIST>  <PATH 2>    # hamming distance, first likely duplicate
<DIST>  [PATH ...]  # other likely duplicates

<PATH 3>            # second image
<DIST>  <PATH 4>
<DIST>  [PATH ...]

[...]
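As a rough sketch of how output in this shape could be produced, the following renders groups sorted by path, with one blank line between groups. The `Group` struct, the `render_groups` function, and the two-space column separator are assumptions for the example, not dupimg's internals.

```rust
/// One output group: an image and its likely duplicates with their distances.
/// (Hypothetical type used only for this example.)
struct Group {
    path: String,
    duplicates: Vec<(u32, String)>,
}

/// Renders groups sorted alphabetically by path, separated by blank lines.
fn render_groups(mut groups: Vec<Group>) -> String {
    groups.sort_by(|a, b| a.path.cmp(&b.path));
    let blocks: Vec<String> = groups
        .iter()
        .map(|group| {
            // First line is the image itself, followed by "<DIST>  <PATH>" lines.
            let mut lines = vec![group.path.clone()];
            lines.extend(group.duplicates.iter().map(|(dist, path)| format!("{dist}  {path}")));
            lines.join("\n")
        })
        .collect();
    blocks.join("\n\n")
}
```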

Recurse

-r (--recurse) may be specified to traverse the directories given on the command line.

When recurse is enabled, only PNG and JPG files will be checked. This also applies to filenames specified on the command line.

"Left-right" comparison

-l <FILE/DIRECTORY> may be specified to perform comparisons between two distinct sets of images, i.e. to determine which images in the "left" set are also present in the "right" set, instead of comparing all images with each other.

-l must be specified once for each file or directory that should be assigned to the "left" set. It works in combination with -r/--recurse: e.g. -r -l dir1/ dir2/ compares all images under dir1/ with all images under dir2/.

When -l is specified for a single file only, dupimg effectively becomes a local reverse image search utility.
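The difference from an all-pairs comparison can be sketched like this: every "left" image is compared only against the "right" images. The function name `left_right_matches`, the use of `u64` for an 8-byte hash, and the inclusive threshold check are assumptions for illustration.

```rust
/// For each image in `left`, collects the images in `right` whose Hamming
/// distance is at most `threshold`. Left images without matches are omitted.
/// Hashes are modelled as u64 (8 bytes); this is a simplification.
fn left_right_matches(
    left: &[(&str, u64)],
    right: &[(&str, u64)],
    threshold: u32,
) -> Vec<(String, Vec<(u32, String)>)> {
    left.iter()
        .filter_map(|(left_path, left_hash)| {
            let hits: Vec<(u32, String)> = right
                .iter()
                // XOR plus popcount gives the Hamming distance of two u64 hashes.
                .map(|(right_path, right_hash)| {
                    ((left_hash ^ right_hash).count_ones(), right_path.to_string())
                })
                .filter(|(dist, _)| *dist <= threshold)
                .collect();
            if hits.is_empty() {
                None
            } else {
                Some((left_path.to_string(), hits))
            }
        })
        .collect()
}
```

With a single entry in `left`, this reduces to the reverse image search case: one query image checked against a collection.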

Threshold

-t <THRESHOLD>, where THRESHOLD is a non-negative integer, may be specified to adjust the duplicate detection threshold.

The default is 5, which with the default hash size errs on the side of caution, somewhat preferring false positives over false negatives. 0 gives very few false positives, but might miss some duplicates (e.g. due to compression artifacts).

Hash size

-h <SIZE>, where SIZE is a positive integer, may be specified to change the size of image hashes.

Different hash sizes are not comparable and are thus stored separately under ~/.cache/dupimg/.

The default hash size is 8 bytes, which works reasonably well for most images. Note that the detection threshold must be increased together with the hash size, because larger hashes yield larger Hamming distances for the same visual difference.
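One possible starting point for choosing a threshold at a larger hash size is to scale the default proportionally. This is only a heuristic and an assumption of this example, not something dupimg does automatically; it also assumes the number of hash bits grows linearly with SIZE. The name `scaled_threshold` is hypothetical.

```rust
/// Defaults taken from this README: 8-byte hashes and a threshold of 5.
const DEFAULT_HASH_BYTES: u32 = 8;
const DEFAULT_THRESHOLD: u32 = 5;

/// Scales the default threshold in proportion to the hash size,
/// rounding to the nearest integer. A heuristic, not dupimg behaviour.
fn scaled_threshold(hash_bytes: u32) -> u32 {
    (DEFAULT_THRESHOLD * hash_bytes + DEFAULT_HASH_BYTES / 2) / DEFAULT_HASH_BYTES
}
```

For instance, doubling the hash size to 16 bytes would suggest passing -t 10 as a first guess, to be tuned against real images.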

Credits

All code in this crate was written by me.

All credits for libraries used go to their respective authors.
