Duplicate file remover using md5sum--good enough?

This is not a typical question. I have a fully working script but I'm interested in optimizing it.

I frequently back up photos and movies from my digital camera and cell phone to my home and work desktops, my laptop, my wife's netbook, and my home NAS, and I often end up with multiple versions of the same files in folders of varying completeness. I previously wrote a very slow script that checked whether files of the same length were identical, and it was limited to a single directory.

I recently wrote this much faster script. It digs recursively through a directory tree (with find instead of ls), builds a record containing size, checksum, basename, and full path for each file, sorts the records, and deletes all but the first file when size and checksum are identical. The script uses pipes to avoid arrays and runs quite fast; the alphabetically first basename is the one that gets kept. A comma separates the size-checksum part (the part tested for equality) from the basename, and a backslash precedes the path name, so everything can be passed along the pipe and separated afterwards. The dash between size and checksum was useful for debugging and I don't think it affects speed. The path ends up being part of the sort key, but that is harmless since it comes after the basename. The parameter substitutions are robust to either a comma or a backslash occurring in the file name or path (God forbid!).

#! /bin/bash

#Delete duplicate files starting at $1 recursive

dir=${1:-.}                                                                 #defaults to current directory '.'

find "$dir" | { while read path; do
    name=${path##*/}                                                        #basename
    if [ -f "$path" ]; then                                                 #if regular file
        sum=$(md5sum "$path")
    echo $(stat -c %s "$path")'-'${sum%%' '*}','"$name"'\\'"$path"      #length-md5sum,basename\path
    else continue                                                           #skip if not regular file
    fi
done } | sort | {                                                           #sort files
test=''
while read line; do
    front=${line%%,*}                                                       #size-md5sum
    back=${line##*\\}                                                       #full file name with path
    if [ "$front" = "$test" ]; then                                         #same size-md5sum as previous file?
        echo 'deleting duplicate file '"$back"; rm "$back"                  #if so, delete it.
    fi
    test="$front"
done }

Since it works so well, I'd like to use it as a general tool. Is md5sum good enough? Does including the size (a legacy optimization carried over from my first script) really add any benefit in collision resistance, or any execution speed in the sort command? Any ideas on how to make this script faster or more robust?

Edit: sha1sum is about 13% slower than md5sum but probably eliminates collision concerns.
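
If you want to check on your own hardware, timing both against a big file gives a rough idea (bigfile.mov is just a placeholder for one of the larger video files):

time md5sum  bigfile.mov
time sha1sum bigfile.mov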

Mike

First, if your system does not have the md5sum and stat commands, or they are not on the default PATH, the script risks deleting every file under the current folder (the checksum field comes out empty for every file, so everything after the first looks like a duplicate).

Second, the stat command is redundant in your script: if the MD5 checksums match, the file sizes should match too.

The first find and while loop can be replaced by:

find . -type f -exec md5sum {} \;  > /tmp/file_list

For example, I got this output:

$ cat /tmp/file_list
d41d8cd98f00b204e9800998ecf8427e *./abc
323ba8e2da815f896181c53564c4b1d2 *./abcd/abc
323ba8e2da815f896181c53564c4b1d2 *./abcd/def
ae70e0b0a0077a006942c876250bc0f5 *./infile
5f1b0a73a2b4dc51bcad52c357d55d19 *./outfile
323ba8e2da815f896181c53564c4b1d2 *./xyx

The second while loop can be replaced by the following (no sort needed: the awk array a[] counts how many times each checksum has been seen, so the rm is printed only from the second occurrence of a checksum onward):

 awk 'a[$1]++ {gsub(/^\*/,"",$2); print "rm ", $2}' /tmp/file_list |sh

Wow! Thanks for pointing this out. I'll do something to detect this and abort the script.
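
Probably something like this near the top, which just checks that every external command the script depends on can actually be found, and aborts otherwise:

for cmd in find stat md5sum sort; do
    command -v "$cmd" >/dev/null 2>&1 ||
        { echo "$0: required command '$cmd' not found, aborting" >&2; exit 1; }
done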

Not strictly true in the case of an MD5 or SHA collision, but probably not a practical risk.

Yes, but I want to keep the first file in alphabetical basename order, not alphabetical directory-name order. That will require more than just the checksum command.

-exec can only call a single command or script (which really slows things down, since a separate process is launched for every file), and it cannot call a shell function, because find resolves the command through $PATH rather than through the shell; it might work if I export the function, however (more on that below the snippet). It is also subject to race conditions, while piping the completed output of find is not. However, the -type flag does away with the if/then/else test:

find "$dir" -type f | { while read -r path; do
    name=${path##*/}                                                        #basename
    sum=$(sha1sum "$path")
    echo "${sum%%' '*}"','"$name"\\"$path"                                  #sha1sum,basename\path.
done }
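
If I ever do go the -exec route, exporting the function would look roughly like this (untested sketch; hashit is just a made-up name, and it still launches a separate bash for every file):

hashit() { local sum; sum=$(sha1sum "$1"); echo "${sum%% *},${1##*/}\\$1"; }
export -f hashit
find "$dir" -type f -exec bash -c 'hashit "$0"' {} \;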

Is using $1 and $2 in this way robust to whitespace in the file name or path? I suspect it is not, but I don't know enough awk to be sure.
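
If it is not, maybe something along these lines would be safer (an untested sketch on my part; it relies on md5sum printing the file name starting at a fixed column, and it still would not cope with newlines in file names):

awk 'seen[$1]++ { print substr($0, 35) }' /tmp/file_list |
    while IFS= read -r f; do rm -- "$f"; done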

Mike