7 releases (4 breaking)

new 0.6.0 Jun 24, 2024
0.5.0 Jun 20, 2024
0.4.0 Jun 17, 2024
0.3.0 Nov 7, 2023
0.2.2 Jun 9, 2022


MIT license

21KB
395 lines

refine

Refine your file collection using Rust!

What it does

This tool scans the given paths and runs a command on the files it finds.

The dupes command analyzes and reports possibly duplicated files, both by size and by name. It also loads a sample from each file to confirm that they really match.

The new rebuild command is a marvel of engineering, if I say so myself! It rebuilds the filenames of your entire collection!

It is blazingly fast and tiny, made 100% in Rust 🦀!

In the future, this tool could do much more, such as moving duplicated files or offering a GUI to act upon them easily, hence the open name refine...

How it works

The dupes command

  1. recursively detect all files in the given paths (excluding hidden .folders)
    • can optionally run only a shallow scan too.
  2. sort all the files by their sizes and by the words in their names
    • the word extractor ignores repetition systems such as -1, -2, copy, and copy 2.
  3. for each group with the exact same value, retrieve a sample of each file and compare them
  4. list each match as a possible duplicate:
-- by size

132.1kB x3
/Users/you/Downloads/path/file.ext
/Users/you/Downloads/another-path/other.any
/Volumes/External/backup-path/back.001

248.6MB x2
/Users/you/Downloads/video.mp4
/Volumes/External/backup-path/video.mpg.bak

-- by name

["bin", "cache", "query"] x2
904.2kB: ./target/debug/incremental/refine-1uzt8yoeb0t1e/s-gx7knsxvbx-1oc90bk-working/query-cache.bin
904.9kB: ./target/debug/incremental/refine-1uzt8yoeb0t1e/s-gx7knwsqka-w784iw-6s3nzkfcj1wxagnjubj1pm4v6/query-cache.bin

Finally, a brief summary is printed:

total files: 13512
  by size: 339 duplicates
  by name: 12 duplicates
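
The steps above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions, not the crate's actual code: the names `SAMPLE_SIZE`, `center_sample`, `possible_duplicates`, and `word_key` are hypothetical, the 2KB center sample is taken from the changelog entries, and dropping every purely numeric word is a simplification of whatever the real word extractor does.

```rust
use std::collections::HashMap;
use std::fs::{self, File};
use std::io::{self, Read, Seek, SeekFrom};
use std::path::{Path, PathBuf};

/// Hypothetical constant: the changelog mentions a 2KB sample.
const SAMPLE_SIZE: u64 = 2 * 1024;

/// Read up to SAMPLE_SIZE bytes from the center of a file.
fn center_sample(path: &Path) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    let len = file.metadata()?.len();
    // Start so that the sample is centered; small files are read whole.
    let start = len.saturating_sub(SAMPLE_SIZE) / 2;
    file.seek(SeekFrom::Start(start))?;
    let mut buf = Vec::new();
    file.take(SAMPLE_SIZE).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Group files by size, then split each group by its center sample.
/// Only groups with two or more members are returned.
fn possible_duplicates(paths: &[PathBuf]) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut by_size: HashMap<u64, Vec<&PathBuf>> = HashMap::new();
    for path in paths {
        by_size.entry(fs::metadata(path)?.len()).or_default().push(path);
    }
    let mut groups = Vec::new();
    for candidates in by_size.into_values().filter(|g| g.len() > 1) {
        let mut by_sample: HashMap<Vec<u8>, Vec<PathBuf>> = HashMap::new();
        for path in candidates {
            by_sample.entry(center_sample(path)?).or_default().push(path.clone());
        }
        groups.extend(by_sample.into_values().filter(|g| g.len() > 1));
    }
    Ok(groups)
}

/// Assumed word extractor: lowercase words of a file name, sorted,
/// ignoring "copy" and purely numeric parts such as -1 or 2.
fn word_key(file_name: &str) -> Vec<String> {
    let mut words: Vec<String> = file_name
        .split(|c: char| !c.is_alphanumeric())
        .map(str::to_lowercase)
        .filter(|w| {
            !w.is_empty() && w.as_str() != "copy" && !w.chars().all(|c| c.is_ascii_digit())
        })
        .collect();
    words.sort();
    words.dedup();
    words
}
```

Under these assumptions, `word_key("query-cache.bin")` yields the same `["bin", "cache", "query"]` key shown in the by-name listing, and files whose keys are equal would land in the same name group.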

The rebuild command

  1. strip parts of the filenames, either before or after some matches, or exact ones in the middle;
  2. remove all sequence numbers they might have, like "copy 2" or "-3";
  3. smartly remove spaces and underscores to detect misspelled names;
  4. group the names according to the rest;
  5. smartly choose the most likely correct name among the group;
  6. sort the group entries by created date;
  7. regenerate a unified sequence with this new order; <-- Note this occurs on the whole group, regardless of the directory each file resides in!
  8. rename the files to the new pattern.
/Users/you/Downloads/path/file.mp4 --> file-1.mp4
/Users/you/Downloads/path/video ok.mp4 --> video__ok-1.mp4
/Users/you/Downloads/another-path/video_ok.mp4 --> video__ok-2.mp4
/Volumes/External/backup-path/Video__OK.mp4 --> video__ok-3.mp4
/Users/you/Downloads/another-path/video not ok.mp4 --> video_not_ok-1.mp4

Finally, a brief summary is printed:

total files: 21126
  changes: 1432
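
Steps 2 to 7 of the rebuild command can be sketched in memory as follows. Everything here is an assumption made for illustration: the `Entry` struct, the function names, and the plain `u64` timestamp (a real tool would more likely read `std::fs::Metadata::created`). The text does not say how the most likely correct name is chosen, so this sketch simply picks the longest lowercase variant with spaces turned into underscores. No files are renamed; the function only returns the proposed names.

```rust
use std::collections::BTreeMap;

/// Hypothetical input record: file stem, extension, and a creation timestamp.
struct Entry {
    stem: String,
    ext: String,
    created: u64,
}

/// Remove a trailing sequence marker such as "-3", " copy" or " copy 2".
fn strip_sequence(stem: &str) -> &str {
    let trimmed = stem.trim_end();
    let no_digits = trimmed.trim_end_matches(|c: char| c.is_ascii_digit());
    // "-3" style: trailing digits preceded by a dash.
    if no_digits.len() < trimmed.len() {
        if let Some(rest) = no_digits.strip_suffix('-') {
            return rest.trim_end();
        }
    }
    // "copy" / "copy 2" style (lowercase only in this sketch).
    if let Some(rest) = no_digits.trim_end().strip_suffix(" copy") {
        return rest.trim_end();
    }
    trimmed
}

/// Grouping key: no sequence marker, no spaces or underscores, lowercase.
fn group_key(stem: &str) -> String {
    let compact: String = strip_sequence(stem)
        .chars()
        .filter(|c| *c != ' ' && *c != '_')
        .collect();
    compact.to_lowercase()
}

/// Candidate display name: lowercase, spaces replaced by underscores.
fn candidate_name(stem: &str) -> String {
    strip_sequence(stem).to_lowercase().replace(' ', "_")
}

/// Return the proposed new file name for each entry, in input order.
fn rebuild(entries: &[Entry]) -> Vec<String> {
    let mut groups: BTreeMap<String, Vec<usize>> = BTreeMap::new();
    for (i, e) in entries.iter().enumerate() {
        groups.entry(group_key(&e.stem)).or_default().push(i);
    }
    let mut new_names = vec![String::new(); entries.len()];
    for mut members in groups.into_values() {
        // Placeholder heuristic: the longest candidate wins.
        let name = members
            .iter()
            .map(|&i| candidate_name(&entries[i].stem))
            .max_by_key(|n| n.len())
            .unwrap_or_default();
        // One sequence per group, ordered by creation time, across directories.
        members.sort_by_key(|&i| entries[i].created);
        for (seq, &i) in members.iter().enumerate() {
            new_names[i] = format!("{}-{}.{}", name, seq + 1, entries[i].ext);
        }
    }
    new_names
}
```

Fed with the five stems from the example above and increasing timestamps, this sketch proposes the same names as the listing, but that only shows the placeholder heuristic is compatible with the example, not that it matches the real one.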

How to use it

Install with cargo install refine, then just:

 refine dupes ~/Downloads /Volumes/Drive ...

Or:

 refine rebuild ~/Downloads /Volumes/Drive ...

Pass as many source paths as you want.

Changelog

  • 0.6.0 Jun 24, 2024: new rebuild command, general polishing overall.
  • 0.5.0 Jun 20, 2024: support for shallow scan, verbose mode, dupes cmd ignores repetition systems.
  • 0.4.0 Jun 17, 2024: include dupes command, support match case and changing sample size.
  • 0.3.0 Nov 07, 2023: include dedup by both size and name.
  • 0.2.2 Jun 04, 2022: use 2KB sample size.
  • 0.2.1 Jun 04, 2022: improve error handling.
  • 0.2.0 Jun 01, 2022: publish, use split crate human-repr.
  • 0.1.1 May 27, 2022: samples the center of the files, which seems to fix false positives.
  • 0.1.0 May 25, 2022: first release, detects duplicated files, simple sampling strategy (1KB from the start of the files).

License

This software is licensed under the MIT License. See the LICENSE file in the top distribution directory for the full license text.


Maintaining an open source project is hard and time-consuming, and I've put much ❤️ and effort into this.

If you've appreciated my work, you can back me up with a donation! Thank you 😊



Dependencies

~3.5–5MB
~89K SLoC