Thank you for the suggestions frans and radoulov.
I'm not familiar with Perl, so could you elaborate on what that Perl script does? It looks like it compares two directories looking for duplicate files rather than duplicate filenames; is that correct?
I have now written two scripts that try to find duplicate filenames, but they are so slow that I really need to optimise the algorithm.
In all the methods I first create a complete file list of the directory with full paths. My only problem is how time-consuming the scripts are. All the methods work, but which is the most time-efficient for long lists?
Method 1
Go through the path list one entry at a time, looking for matching filenames further down the list.
Paths with matching filenames are removed from the list so that the next filename has fewer entries to compare against.
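To make the pruning idea concrete, here is a toy run of Method 1 on a three-line list (the sample paths are made up, and I'm assuming filenames contain no regex metacharacters):

```shell
#!/bin/sh
# Toy sketch of Method 1: take the first path, print every path sharing
# its filename (if there is more than one), then prune the whole set so
# later passes scan a shorter list. The sample paths are made up.
printf '%s\n' /a/x.txt /b/y.txt /c/x.txt > paths.txt
while [ -s paths.txt ]; do
path=`head -n 1 paths.txt`
name=`basename "$path"`
matches=`grep "/$name\$" paths.txt` # this path plus any matches below it
if [ `printf '%s\n' "$matches" | wc -l` -gt 1 ]; then
echo "" # new line for new set
printf '%s\n' "$matches"
fi
grep -v "/$name\$" paths.txt > rest.txt # prune the matched set
mv rest.txt paths.txt
done
```

On this input it prints one set, /a/x.txt and /c/x.txt, and /b/y.txt is silently dropped as unique.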
Method 2
In addition to the path list (list 1), create a second list of duplicate filenames using uniq (list 2). Filter the path list with grep against these duplicate filenames (list 2) to get a smaller path list (list 1).
Go through each duplicate filename (in list 2), looking for the matching paths in the path list (list 1).
Remove matching paths so that the next duplicate filename has fewer entries to compare against.
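The two lists in Method 2 can be exercised on their own like this (the sample paths are made up; list 2 ends up holding each filename that occurs more than once):

```shell
#!/bin/sh
# Sketch of Method 2's two lists; the sample paths are made up.
printf '%s\n' /a/x.txt /b/y.txt /c/x.txt > paths.txt # list 1: all paths
awk -F'/' '{print $NF}' paths.txt | sort | uniq -d > dupnames.txt # list 2: duplicated filenames
grep -F -f dupnames.txt paths.txt # list 1 narrowed to candidate paths
```

Here dupnames.txt contains just x.txt, and the grep keeps /a/x.txt and /c/x.txt. I use grep -F because the filenames are fixed strings, not regexes.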
My questions are:
1) Is the extra file I/O needed to remove previously matched paths worth it?
2) Which algorithm is better in terms of speed: method 1, method 2, or some other way?
3) I'd like to add a progress bar, but I do not want it on stdout since that would interfere with the actual output of duplicates. How do I do this? Should I use stderr?
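For question 3, what I have in mind is a sketch like this: progress written to stderr (file descriptor 2) so that stdout stays clean for the duplicate sets. The loop body is just a placeholder for the real comparison work.

```shell
#!/bin/sh
# Progress on stderr, results on stdout; the per-item work is a placeholder.
total=3
count=0
for item in one two three; do
count=$((count + 1))
printf 'processed %d of %d\r' "$count" "$total" >&2 # progress, overwritten in place
echo "result for $item" # the real output stays on stdout
done
echo "" >&2 # move past the progress line when done
```

Running the script as ./scriptname.sh > dupes.txt would then leave the progress visible on the terminal while the duplicate sets go to the file.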
The scripts
The scripts for both methods are below. They both work, but directories with many, many files (I tested with 25,000) take considerable time, and I'd really like to speed them up.
If you want to test either one, create a simple text file with example paths to duplicate files, then run
./scriptname.sh -f List_of_file_paths.txt
If you want to actually look for duplicate filenames in a directory, just run the script and it will search the current working directory. For another directory, use
./scriptname.sh directory
Method 1
#!/bin/sh
# Filenames in shared memory directory
filepathlist=/dev/shm/filelist
filepathlistcomp=/dev/shm/filelistcomp
# Usage help printed
usage="$0 [-f list_file] [Directory]"
# Option processing
usefilelist=false
while test $# -gt 0 ; do
case "$1" in
-f) usefilelist=true; filelist="$2"; shift 2 ;;
--help) echo "$usage"; exit 1 ;;
--*) break ;;
-*) echo "$usage"; exit 1 ;;
*) break ;;
esac
done
# store search directory given as first remaining command line argument
if [ ! -z "$1" ]; then
finddir="$1"
else
finddir="."
fi
if $usefilelist ;then
cp "$filelist" "$filepathlist"
else
find "$finddir" -type f > "$filepathlist"
fi
echo -n "" > "$filepathlistcomp"
while true ;do
IFS= read -r path < "$filepathlist"
if [ "$path" = "" ];then
exit
fi
filename=`basename "$path"`
printfirst=true
while IFS= read -r pathcomp ;do
if [ "$path" != "$pathcomp" ];then
filenamecomp=`basename "$pathcomp"`
if [ "$filename" = "$filenamecomp" ];then
if [ $printfirst = true ];then
echo "" #new line for new set
echo "$path"
printfirst=false
fi
echo "$pathcomp"
else
echo "$pathcomp" >> "$filepathlistcomp"
fi
fi
done < "$filepathlist"
cp "$filepathlistcomp" "$filepathlist"
echo -n "" > "$filepathlistcomp"
done
Method 2
#!/bin/sh
# Filenames in shared memory directory
filepathlist=/dev/shm/filelist
filepathlistcomp=/dev/shm/filelistcomp
filedupeslist=/dev/shm/filedupeslist
# Usage help printed
usage="$0 [-f list_file] [Directory]"
# Option processing
usefilelist=false
while test $# -gt 0 ; do
case "$1" in
-f) usefilelist=true; filelist="$2"; shift 2 ;;
--help) echo "$usage"; exit 1 ;;
--*) break ;;
-*) echo "$usage"; exit 1 ;;
*) break ;;
esac
done
# store search directory given as first remaining command line argument
if [ ! -z "$1" ]; then
finddir="$1"
else
finddir="."
fi
if $usefilelist ;then
cp "$filelist" "$filepathlist"
else
find "$finddir" -type f > "$filepathlist"
fi
cat "$filepathlist" | awk -F'/' '{print $NF}' | sort | uniq -d > "$filedupeslist"
grep -F -f "$filedupeslist" "$filepathlist" > "$filepathlistcomp" # -F: filenames are fixed strings, not regexes
while IFS= read -r filedupe ;do
: > "$filepathlist"
while IFS= read -r path ;do
if [ "$path" = "" ];then
break
fi
filename=`basename "$path"`
if [ "$filename" = "$filedupe" ];then
echo "$path"
else
echo "$path" >> "$filepathlist"
fi
done < "$filepathlistcomp"
cp "$filepathlist" "$filepathlistcomp"
echo ""
done < "$filedupeslist"