Duplicate filename algorithm

Over the years I've created a bit of a mess in my directories with duplicate files. I've used fdupes to remove exact duplicates, but there are still files that are almost identical, which fdupes doesn't look for.
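(For the exact duplicates, a typical fdupes run over a directory tree looks something like this; the directory is only an example:)

fdupes -r ~/documents    # recurse and list sets of byte-identical files, separated by blank lines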

These have the same (or very similar) filenames, so I have tried to create a script to look for them and list them like fdupes does (sets of duplicates separated by a blank line). What I have so far is this very inelegant script.

#!/bin/sh

filepathlist="filepathlist.txt"
filepathlistcomp="filepathlistcomp.txt"
duplicatefilenamelist="duplicatefilenamelist.txt"

echo "" > "$filenamelist"
echo "" > "$filepathlistcomp"
find > "$filepathlist"

while read path ;do
 filename=`basename "$path"`
 dupes=0
 while read pathcomp ;do
  filenamecomp=`basename "$pathcomp"`
  if [ "$filename" = "$filecomp" ];then
   if [ $dupes -gt 0 ];then
    echo "$filename" >> "$duplicatefilenamelist"
   fi
   dupes=1
  else
   echo "$path" >> "$filepathlistcomp" 
  fi
 done < "$filepathlist"
 
 echo "" >> "$duplicatefilenamelist"
 "$filepathlist" < "$filepathlistcomp"
done < "$filepathlist"

I'm sure there is a better way of doing this. Would this script even work, since I'm trying to change the file inside the loop that's reading it? My main concern is the efficiency of the algorithm. I tried to drop paths that have already been accounted for by removing them from the list as the script progresses through it, but I have a feeling this will actually make it less efficient because of the added file operations. Any ideas on how best to approach this problem?
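(For what it's worth, a minimal sketch of the usual pattern for pruning a list that a loop is reading, reusing the variable names above; the keep flag is just a placeholder for whatever test applies. Survivors go to the comparison file, and it only replaces the original once the loop has finished reading.)

while read path ;do
 keep=yes
 # ... decide whether this path should stay in the list ...
 [ "$keep" = yes ] && echo "$path" >> "$filepathlistcomp"
done < "$filepathlist"
mv "$filepathlistcomp" "$filepathlist"    # swap in the pruned list only after the reading loop is done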

I'm searching for a similar tool.
I found a simple way to print the duplicate file names:

#!/bin/bash
FILES=/dev/shm/filelist    # temp list kept on the ramdisk
# collect the basenames that occur more than once
find -type f | awk -F'/' '{print $NF}' | sort | uniq -d > $FILES
while read F
do
	find -type f -name "$F"    # print every path bearing that duplicate name
	echo                       # blank line between sets
done < $FILES


A complete script, which could be optimized further:

#!/bin/bash
# Usage: find-dup [Path [Name]]
if [ -z "$1" ]
then
    read -p "Path to scan: " DIR    # Ask for a base path if not given as argument
else
    DIR="$1"
    if [ -z "$2" ]
    then
        read -p "File Names: " NAME    # Ask for a file pattern if not given as argument
        [ -n "$NAME" ] && NAME="-name $NAME"
    fi
fi
cd $DIR || exit 1
LIST=/dev/shm/filelist    # to store the temp filelists (ramdisk)
find -type f $NAME | awk -F'/' '{print $NF}' | sort | uniq -d > $LIST-1
while read F
do
    find -type f -name "$F" > $LIST-2
    i=0
    unset FILE
    while read L    # Creates an array with duplicate files
    do    ((i++)); FILE[$i]="$L"
    done < $LIST-2
    FILE[0]="Do not delete"
    OPT=""
    for ((i=0; i<${#FILE[@]}; i++))    # Displays the files with numbers for deletion
    do    OPT+=$i; echo -e "$i. ${FILE[$i]}"
    done
    K1=""
    until [[ $K1 = [$OPT] ]]
    do    read -s -n1 K1 <&1
    done
    if (($K1))
    then
        read -s -n1 -p "Confirm deletion of ${FILE[$K1]} (Y/N): " K2 <&1
        [[ $K2 = [yY] ]] && { echo; rm -v "${FILE[$K1]}"; } || echo "No deletion"
    else
        echo "No deletion"
    fi
    echo
done < $LIST-1
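For reference, an invocation following the Usage comment at the top could look like this (the path and file name are only examples):

./find-dup /home/user/Documents report.txt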

You can use the non-standard Perl module File::Find::Duplicates
if you need to compare the content:

perl -MFile::Find::Duplicates -e'
    @dupes = find_duplicate_files("dir1", "dir2");
    printf "Files %s (of size %d) hash to %s\n", 
      (join "," , @{$_->files}), $_->size, $_->md5
        for @dupes'

Thank you for the suggestions, frans and radoulov.

I'm not familiar with Perl, so could you please elaborate on what that Perl script does? It looks like it compares two directories looking for duplicate file content rather than duplicate filenames; is that correct?

I have now created two scripts that try to find duplicate filenames, but they are so slow that I really need to optimise the algorithm.

In both methods I create a complete file list of the directory with full paths. My only problem is how time-consuming the scripts are. Both methods work, but which is the most time-efficient for long lists?

Method 1
Go through the path list one entry at a time, looking for matching filenames further down the list.

Paths with matching filenames are removed from the list so that the next filename has fewer entries to compare against.

Method 2
In addition to the path list, create a second list of duplicate filenames using uniq (list 2). Filter the path list with grep, using these duplicate filenames (list 2), to get a smaller path list (list 1).
Go through each duplicate filename (in list 2), looking for the matching paths in the path list (list 1).

Matching paths are removed so that the next duplicate filename has fewer entries to compare against.

My questions are:
1) Is the added file operation required to remove previously matched paths worth it?
2) Which algorithm is better in terms of speed: method 1, method 2, or some other way entirely?
3) I'd like to add a progress bar, but I do not want it on stdout since that would interfere with the actual output of duplicates. How do I do this? Should I use stderr? (See the sketch after this list.)
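On question 3: one way, sketched here with the temp-list names from the Method 2 script below (the counter handling is only an example), is to send the progress text to stderr with printf ... >&2, so stdout stays a clean list of duplicates that can still be piped or redirected:

total=`wc -l < "$filedupeslist"`
count=0
while read filedupe ;do
 count=$((count + 1))
 printf "\r%d/%d processed" "$count" "$total" >&2    # progress goes to stderr, not stdout
 # ... duplicate matching and printing to stdout as before ...
done < "$filedupeslist"
printf "\n" >&2    # finish the progress line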

The scripts
The scripts for both methods are below and they both work, but directories with many files (I tested with 25,000) take considerable time. I'd really like to speed them up.

If you want to test either one, you can create a simple text file with example paths to duplicate files, then use

./scriptname.sh -f List_of_file_paths.txt

If you want to actually look for duplicate filenames in a directory, just run the script and it will search the current working directory. For another directory, use

./scriptname.sh directory

Method 1

#!/bin/sh

# Filenames in shared memory directory
filepathlist=/dev/shm/filelist
filepathlistcomp=/dev/shm/filelistcomp

# Usage help printed
usage="$0 [-f list_file] [Directory]"

# Option processing
usefilelist=false
while test $# -gt 0 ; do
 case "$1" in
  -f) usefilelist=true; filelist="$2"; shift 2 ;;
  --help) echo "$usage"; exit 1 ;;
  --*) break ;;
  -*) echo "$usage"; exit 1 ;;
  *)  break ;;
 esac
done

# store search directory given as command line argument
if [ ! -z "$1" ]; then
 finddir="$1"
else
 finddir="."
fi


if $usefilelist ;then
 cp "$filelist" "$filepathlist"
else
 find "$finddir" -type f > "$filepathlist"
fi

echo -n "" > "$filepathlistcomp"

while true ;do
 read path < "$filepathlist"
 filename=`basename "$path"`
 printfirst=true
 if [ "$path" = "" ];then
  exit
 fi
 while read pathcomp ;do
  if [ "$path" != "$pathcomp" ];then
   filenamecomp=`basename "$pathcomp"`
   if [ "$filename" = "$filenamecomp" ];then
     if [ $printfirst = true ];then
       echo "" #new line for new set
       echo "$path"
       printfirst=false
     fi
     echo "$pathcomp"    
   else
     echo "$pathcomp" >> "$filepathlistcomp"
   fi
  fi
 done < "$filepathlist"
 cp "$filepathlistcomp" "$filepathlist"

 echo -n "" > "$filepathlistcomp"
done

Method 2

#!/bin/sh

# Filenames in shared memory directory
filepathlist=/dev/shm/filelist
filepathlistcomp=/dev/shm/filelistcomp
filedupeslist=/dev/shm/filedupeslist

# Usage help printed
usage="$0 [-f list_file] [Directory]"

# Option processing
usefilelist=false
while test $# -gt 0 ; do
 case "$1" in
  -f) usefilelist=true; filelist="$2"; shift 2 ;;
  --help) echo "$usage"; exit 1 ;;
  --*) break ;;
  -*) echo "$usage"; exit 1 ;;
  *)  break ;;
 esac
done

# store search directory given as command line argument
if [ ! -z "$1" ]; then
 finddir="$1"
else
 finddir="."
fi

if $usefilelist ;then
 cp "$filelist" "$filepathlist"
else
 find "$finddir" -type f > "$filepathlist"
fi

cat "$filepathlist" | awk -F'/' '{print $NF}' | sort | uniq -d > "$filedupeslist"
grep -f "$filedupeslist" "$filepathlist" > "$filepathlistcomp"

while read filedupe ;do
 
 echo -n "" > "$filepathlist"

 while read path ;do
  if [ "$path" = "" ];then
   break
  fi
  filename=`basename "$path"`
  if [ "$filename" = "$filedupe" ];then
   echo "$path"
  else
   echo "$path" >> "$filepathlist"
  fi
 done < "$filepathlistcomp"
 
 cp "$filepathlist" "$filepathlistcomp"
 echo ""

done < "$filedupeslist"

If you want to look for duplicate file content, you can take a look at this tool of mine: finddup | Get finddup at SourceForge.net, which is written in Perl.

Thanks for creating that, thegeek, but isn't that a content comparison? Can I ask how it differs from fdupes? The thing with fdupes is that it does a byte-for-byte content comparison. I used it to remove duplicate files (i.e. files that are exactly the same). However, files that differed only slightly were not listed as "duplicates", and rightly so. For example, my filing system is in such a mess that I have multiple versions of the same file in different directories, where I might have added something to the newer one. The files are probably 90% the same, but they were not exact duplicates, so fdupes did not list them. I do not know of any tools (or how) to list files that are almost the same. Can this be done with finddup? If so, that would be great.

This is why I'm comparing their filenames instead, since I assume I probably didn't rename the files.

I've now solved the efficiency problem too, if anybody is interested. The extra file operations were not worth it, and the "grep -f" line was extremely taxing, so I moved the grep inside the loop and avoided the extra loop iterations as well. The previous script took hours to go through 25,000 files; this one takes less than 5 minutes. Forgive the unnecessary use of cat; file redirection gave me some trouble for some reason.

#!/bin/sh

# Filenames in shared memory directory
filepathlist=/dev/shm/filelist4
filepathlistcomp=/dev/shm/filelistcomp4
filedupeslist=/dev/shm/filedupeslist4

# Usage help printed
usage="$0 [-f list_file] [Directory]"

# Option processing
usefilelist=false
while test $# -gt 0 ; do
 case "$1" in
  -f) usefilelist=true; filelist="$2"; shift 2 ;;
  --help) echo "$usage"; exit 1 ;;
  --*) break ;;
  -*) echo "$usage"; exit 1 ;;
  *)  break ;;
 esac
done

# store search directory given as command line argument
if [ ! -z "$1" ]; then
 finddir="$1"
else
 finddir="."
fi

if $usefilelist ;then
 cp "$filelist" "$filepathlist"
else
 find "$finddir" -type f > "$filepathlist"
fi

cat "$filepathlist" | awk -F'/' '{print $NF}' | sort | uniq -d > "$filedupeslist"

while read filedupe ;do
 grep "$filedupe" "$filepathlist" > "$filepathlistcomp"
 
 while read path ;do
  if [ "$path" = "" ];then
   break
  fi
  filename=`basename "$path"`
  if [ "$filename" = "$filedupe" ];then
   echo "$path"
  fi
 done < "$filepathlistcomp"

 echo ""

done < "$filedupeslist"

Yes,
as already stated, the previous Perl solutions compare the content of the files.
Could you try this Perl code and compare its performance with your shell script?

perl -MFile::Find -e'
  $d = shift || die "$0 dir\n";
  find { 
    wanted => sub {
      -f and push @{$u{$_}}, $File::Find::name;
      }
    }, $d;
  @{$u{$_}} > 1 and printf "found %s in: \n\n%s\n\n", 
    $_, join $/, @{$u{$_}} for keys %u;    
  ' <dirname>

Thanks, radoulov. I tried the Perl script and compared it to the shell script on a directory with 25,000 files, and it is faster.

Perl script from directory: 1 minute 57 seconds

Shell script from directory: 2 minutes 10 seconds
Shell script from pre-created file list: 0 minutes 53 seconds

I'm not familiar with Perl at all, so could you please tell me how I could edit the Perl script to read piped input or a file rather than search a directory with find? The Perl script does indeed seem faster and better.

Sure,
could you post some small, representative sample input and the output you'd like to get given that input?

sample input

./some/path/file1
./some/path/file2
./some/other/path/file1

./another/path/file2
./another/path/file3

sample output

./some/path/file1
./some/other/path/file1

./some/path/file2
./another/path/file2

I think this outputs the way I would like it to:

perl -MFile::Find -e'
$d = shift || die "$0 dir\n";
find { wanted => sub { -f and push @{$u{$_}}, $File::Find::name;}}, $d;
@{$u{$_}} > 1 and printf "%s\n\n", join $/, @{$u{$_}} for keys %u;' "$finddir"

but I don't know how to pipe data to it or give it a file argument, so that I can do general things like

find . -type f -size +10000 | SameFilenamePerlScript
find . -atime +6 | SameFilenamePerlScript

or

SameFilenamePerlScript filelist.txt

I have this in the shell script below for reference; I would like to do the same in Perl using your faster method of finding and grouping duplicate filenames.

#!/bin/sh

# Filenames in shared memory directory
filepathlist=/dev/shm/filelist
filepathlistcomp=/dev/shm/filelistcomp
filedupeslist=/dev/shm/filedupeslist

cat /dev/null > $filepathlist

if readlink /proc/$$/fd/0 | grep -q "^pipe:"; then
  cat > $filepathlist
fi

# Usage help printed
usage="$0 [Directory_or_File]"

# Option processing
while test $# -gt 0 ; do
 case "$1" in
  --help) echo $usage; exit 1 ;;
  --*) break ;;
  -*) echo $usage; exit 1 ;;
  *)  break ;;
 esac
done

#if filepathlist created with pipe
if [ -s $filepathlist ] ;then 
  if [ ! -z "$1" ] ;then
      echo "$0 : $1 :Too many arguments">&2
      exit
  fi     
else
  if [ ! -z "$1" ] ;then #if CL argument is given check if its directory or file
    if [ -d "$1" ] ;then # if CL argument is a directory
      finddir="$1"
      find "$finddir" -type f > "$filepathlist"  
    elif [ -f "$1" ] ;then # if CL argument is a file
      cp "$1" "$filepathlist"
    else
      echo "$0 : $1 :Not a directory or file">&2
      exit        
    fi
  else  #if CL argument is NOT given search current directory
    finddir='.'
    find "$finddir" -type f > "$filepathlist"
  fi
fi

cat "$filepathlist" | awk -F'/' '{print $NF}' | sort | uniq -d > "$filedupeslist"

while read filedupe ;do
 grep "$filedupe" "$filepathlist" > "$filepathlistcomp"
 while read path ;do
  if [ "$path" = "" ];then
   break
  fi
  filename=`basename "$path"`
  if [ "$filename" = "$filedupe" ];then
   echo "$path"
  fi
 done < "$filepathlistcomp"
 echo ""
done < "$filedupeslist"

Perl has all of that built in:

perl -MFile::Find -e'
  $d = shift || die "$0 dir\n";
  find { 
    wanted => sub { 
      push @{$u{$_}}, $File::Find::name if -f and -s > 10000;
        }
    }, $d;
  @{$u{$_}} > 1 
    and printf "%s\n\n", join $/, @{$u{$_}} 
      for keys %u;
      ' "$finddir"    
perl -MFile::Find -e'
  $d = shift || die "$0 dir\n";
  find { 
    wanted => sub { 
      6 < -A and push @{$u{$_}}, $File::Find::name;
        }
    }, $d;
  @{$u{$_}} > 1 
    and printf "%s\n\n", join $/, @{$u{$_}} 
      for keys %u;
      ' "$finddir"    

Anyway, if you're already familiar with the find command, this should be easier:

find . -type f -size +10000 |
  perl -F/ -lane'
     push @{$_{$F[-1]}}, $_;
     END {
       @{$_{$_}} > 1 and print +(join $/, @{$_{$_}}), $/ 
         for keys %_;
       }'    
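Since the -n loop behind -lane reads either standard input or any file arguments, the same filter should also cover the pre-created list case mentioned earlier, for example:

perl -F/ -lane'
   push @{$_{$F[-1]}}, $_;
   END {
     @{$_{$_}} > 1 and print +(join $/, @{$_{$_}}), $/
       for keys %_;
     }' filelist.txt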

Thank you again, radoulov. That's exactly what I was looking for. I need to learn Perl some day.