dupdup

A suite of Python 2.7 programs to solve the following problems:

  • Lots of files of all sorts, scattered on a file system
  • Lots of them are dupes
  • More than one directory has lots of music or photos, most of them being dupes, but some of them being the only copy
  • The disk(s) are on a remote machine that is not super fast, and the link to the local machine is not super fast (i.e. it's faster to run scripts locally on the machine that has the disk than to mount the disk remotely and run analysis over the network)
  • File names are not a good de-duplication key (files get renamed automatically by Lightroom or by music taggers, for example).

This happens when you buy a NAS for backups and then back up multiple machines like an animal, without having decided on a sensible archival strategy, so you end up with multiple copies of everything, plus some files that are exclusive to each machine.

dupdup.py

This program hashes all the files under the specified directories and finds the duplicates. With -o file.json, it writes a report in JSON format for further analysis.

It's multi-pass, because the disk is slow: the first pass hashes the first 4k of each file; the second pass fully hashes the files that are "possibly dupes", to make sure they really are dupes.

dupdup.py ../some_directory some/directory -o output_file.json
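
The two-pass structure can be illustrated with a small sketch (this is not the actual dupdup.py source; quick_hash, full_hash and find_dupes are illustrative names, and SHA-1 is an assumption):

import hashlib
import os
from collections import defaultdict

def quick_hash(path, size=4096):
    # First pass: hash only the first 4k of the file.
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(size)).hexdigest()

def full_hash(path, chunk=1 << 20):
    # Second pass: hash the whole file, reading in chunks to bound memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def find_dupes(roots):
    # Group files by the hash of their first 4k; only groups with more
    # than one member can contain real dupes, so only those get re-read.
    candidates = defaultdict(list)
    for root in roots:
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                candidates[quick_hash(path)].append(path)
    dupes = defaultdict(list)
    for paths in candidates.values():
        if len(paths) > 1:
            for path in paths:
                dupes[full_hash(path)].append(path)
    return dict((h, p) for h, p in dupes.items() if len(p) > 1)

The point of the first pass is that a 4k read per file is cheap on a slow disk; full reads are only paid for files that could possibly be dupes.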

A slightly out-of-date Rust version also lives in dupdup-rs.

JSON format

{
  "hash1": ["duplicated file 1a",
            "duplicated file 1b",
            ...],
  "hash2": ["duplicated file 2a",
            "duplicated file 2b",
            ...]
}
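
Since the report is plain JSON, further analysis can start from a few lines of Python (assuming the output_file.json name used above):

import json

with open("output_file.json") as f:
    report = json.load(f)

for h, paths in report.items():
    print("%d copies with hash %s:" % (len(paths), h))
    for path in paths:
        print("  " + path)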

merge.py

Given a source directory A and a destination directory B, this finds all files in A that are not in B, skipping the files that are in both.

This generates a shell script, full of mv and mkdir -p commands, that is meant to be inspected and then run.

Again, this is not based on the name of the files, but on their content.

merge.py -i source_directory -o destination_directory -f merge_script.sh
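
A sketch of the idea (hypothetical code, not the actual merge.py; write_merge_script is a made-up name, and the SHA-1-based full_hash is the same assumption as in the earlier sketch):

import hashlib
import os
import pipes  # shlex.quote on Python 3

def full_hash(path, chunk=1 << 20):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def write_merge_script(src, dst, script_path):
    # Hashes of everything already in dst: files with these contents are skipped.
    seen = set()
    for d, _, names in os.walk(dst):
        for name in names:
            seen.add(full_hash(os.path.join(d, name)))
    with open(script_path, "w") as out:
        out.write("#!/bin/sh\n")
        for d, _, names in os.walk(src):
            for name in names:
                path = os.path.join(d, name)
                if full_hash(path) in seen:
                    continue  # same content already in dst
                target = os.path.join(dst, os.path.relpath(path, src))
                out.write("mkdir -p %s\n" % pipes.quote(os.path.dirname(target)))
                out.write("mv %s %s\n" % (pipes.quote(path), pipes.quote(target)))

Emitting a script rather than moving files directly keeps a human in the loop, which matches the "inspect, then run" workflow above.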

dupdup.html

A web page that can be opened directly, without a server, and that helps with deleting dupes.

It accepts a JSON file generated by dupdup.py, and displays each dupe tuple on a line.

One can then click on the file to keep amongst all the copies, and also shift-click to select a range: click an item, hold shift, then click another item, and everything in between is selected.

Once a good number of files have been picked, clicking the export-script button generates a shell script, to be inspected and then copied to the remote machine, that deletes all the files that:

  • Have not been picked
  • Have a duplicate file that has been picked

i.e., it will not touch files that have no element picked on their line.
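
In other words, the export logic boils down to something like this (a Python sketch of what the page does in JavaScript; deletions is an illustrative name):

def deletions(report, picked):
    # report: hash -> list of duplicate paths, as produced by dupdup.py
    # picked: set of paths the user clicked to keep
    to_delete = []
    for paths in report.values():
        if any(p in picked for p in paths):
            # A keeper was picked on this line: every unpicked copy goes.
            to_delete.extend(p for p in paths if p not in picked)
        # else: nothing picked on this line, so it is left untouched
    return to_delete

Each returned path then becomes an rm line in the exported script.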
