This is not a typical question. I have a fully working script but I'm interested in optimizing it.
I frequently back up photos and movies from my digital camera and cell phone to my home and work desktops, my laptops, my wife's netbook, and my home NAS, and I often end up with multiple versions of the same files in folders of varying completeness. I previously wrote a very slow script that checked whether files of the same length were identical; it was also limited to a single directory.

I recently wrote this much faster script that digs recursively through a directory tree (with find instead of ls), builds a line containing size, checksum, basename, and full path for each file, sorts those lines, and deletes all but the first file when size and checksum are identical. The script uses pipes to avoid arrays and executes quite fast; the alphabetically first basename is the one that gets kept.

I use a comma to separate the size-and-checksum part (the part tested for equality) from the rest, and a backslash before the path so I can pass everything on the pipe and split it apart afterwards. There is a dash between size and checksum that was useful for debugging, and I don't think it impacts speed. The path ends up being part of the sort key, but that is harmless since it comes after the basename. The parameter substitutions should be robust to a comma or backslash occurring in the file name or path (God forbid!).
#!/bin/bash
# Delete duplicate files starting at $1, recursively
dir=${1:-.}                                      # defaults to current directory '.'
find "$dir" | { while IFS= read -r path; do     # -r and empty IFS keep odd file names intact
    name=${path##*/}                             # basename
    if [ -f "$path" ]; then                      # only regular files
        sum=$(md5sum "$path")
        size=$(stat -c %s "$path")
        echo "$size-${sum%% *},$name\\$path"     # size-md5sum,basename\path
    else
        continue                                 # skip anything that is not a regular file
    fi
done } | sort | {                                # sort so duplicates become adjacent
    test=''
    while IFS= read -r line; do
        front=${line%%,*}                        # size-md5sum
        back=${line##*\\}                        # full file name with path
        if [ "$front" = "$test" ]; then          # same size-md5sum as previous file?
            echo "deleting duplicate file $back"
            rm "$back"                           # if so, delete it
        fi
        test=$front
    done }
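For reference, this is how I invoke it (the script name and path here are just examples):

./dedup.sh ~/Pictures    # dedupe everything under ~/Pictures
./dedup.sh               # no argument defaults to the current directory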
Since it works so well, I'd like to use it as a general tool. Is md5sum good enough? Does prepending the size (a legacy optimization carried over from my first script) really add any benefit, either in collision resistance or in the speed of the sort? Any ideas on how to make this script faster or more robust?
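One speed idea I have been toying with, partly to answer my own size question: only hash files whose size occurs more than once, so files with a unique size never hit md5sum at all. This is just an untested sketch (it assumes GNU find, awk, sort, and uniq, and it only reports duplicate groups rather than deleting anything):

#!/bin/bash
# Sketch: hash only files whose size occurs more than once (untested)
dir=${1:-.}
find "$dir" -type f -printf '%s\t%p\n' |                  # size<TAB>path (breaks on tabs in names)
awk -F'\t' '{cnt[$1]++; line[NR]=$0}
    END {for (i = 1; i <= NR; i++) {split(line[i], f, "\t"); if (cnt[f[1]] > 1) print f[2]}}' |
while IFS= read -r path; do
    md5sum "$path"                                        # only candidate duplicates get hashed
done | sort | uniq -w32 --all-repeated                    # group files whose md5 repeats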
Edit: sha1sum is about 13% slower than md5sum but probably eliminates collision concerns.
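A quick way to compare the two hashes on your own data is something along these lines (the path is just an example):

time find ~/Pictures -type f -exec md5sum {} + > /dev/null
time find ~/Pictures -type f -exec sha1sum {} + > /dev/null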
Mike