Search compare and determine duplicate files

jao_madn · March 27, 2011, 5:59pm

Hi

May I ask if anyone knows of a package that searches a directory recursively and identifies duplicate files by filename, modification date, or any other attribute that indicates duplication?

If there is none, where should I start, and which shell commands would be useful for this task? Or does some other scripting language have an advantage here, for example Perl over shell scripting?

Could someone share a sample script for this task, if they have one?

thanks in advance.

Chubler_XL · March 27, 2011, 8:12pm

Some knowledge of your goals might help us in recommending packages. What are you trying to achieve with this?

jao_madn · March 28, 2011, 7:26am

@Chubler_XL: My goal is to find out whether a package is available, or whether I should write a script, that does the following: search a specified directory recursively and identify duplicate files inside it. Duplicate means the same content under the same filename in a different directory, or the same size and content under a different filename. For example, the same ebook PDF residing in different directories, or the same ebook PDF saved under different filenames in different directories.

Thanks for the reply

Chubler_XL · March 28, 2011, 5:50pm

I'm not aware of any packages that do it. How about this script to find identical files:

if [ $# -ne 1 ]
then
    echo "usage: $0 <directory>"
    exit 1
fi
find "$1" -type f -ls | awk '$7 > 0 { if($7 in sizes) { sizes[$7]=sizes[$7] SUBSEP $11; dup[$7]++} else sizes[$7]=$11 } END {for(i in dup) print sizes[i] }' | while read
do
   OIFS="$IFS"
   IFS=$(printf \\034)
   F=( $REPLY )
   IFS="$OIFS"
   i=0
   while [ $i -lt ${#F[@]} ]
   do
       let j=i+1
       while [ $j -lt ${#F[@]} ]
       do
            if cmp -s "${F[i]}" "${F[j]}"
            then
                echo "${F[i]} and ${F[j]} are identical"
            fi
            let j=j+1
       done
       let i=i+1
   done
done

It runs find <directory> -type f -ls and stores each filename in an awk array indexed by file size. When another file with the same size turns up, its name is appended to that entry and the size is recorded in the dup array. At the end, every size that occurred more than once is printed as one line of filenames separated by SUBSEP.

Each line is read into the F array, and cmp with the -s flag (i.e. no output, just exit status) is used to compare the files pairwise. cmp stops comparing as soon as the first difference is found, which is usually cheaper than calculating a CRC for each file, since a checksum has to read every file in full.
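To see the exit-status behaviour of cmp -s on its own, here is a minimal sketch. The helper name same_content and the throwaway files are made up for this example and are not part of the script above.

```shell
#!/bin/bash
# same_content FILE1 FILE2 (hypothetical helper):
# exit status 0 when both files hold identical bytes, non-zero otherwise.
same_content() {
    cmp -s "$1" "$2"
}

# Demo with throwaway files; a and c have the same size but differ in content.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/a"
printf 'hello\n' > "$tmp/b"
printf 'hellp\n' > "$tmp/c"

if same_content "$tmp/a" "$tmp/b"; then echo "a and b are identical"; fi
if ! same_content "$tmp/a" "$tmp/c"; then echo "a and c differ"; fi
rm -rf "$tmp"
```

Files a and c show why the size check alone is not enough: they are the same size, so only the byte comparison tells them apart.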

Note: if files A, B and C are identical you will get this output:
A and B are identical
A and C are identical
B and C are identical
This can be fixed with more logic, but I don't consider it an issue.
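One possible shape for that extra logic, as a rough sketch only: mark every file that has already matched an earlier one and skip it later, so each set of identical files is printed once. The function name group_identical and the output format are assumptions for illustration.

```shell
#!/bin/bash
# group_identical FILE... (hypothetical helper):
# print one line per set of identical files instead of one line per pair.
group_identical() {
    local -a files=("$@") seen=()
    local n=$# i j group dups
    for ((i = 0; i < n; i++)); do
        # Skip files that already matched an earlier file.
        [ -n "${seen[i]}" ] && continue
        group="\"${files[i]}\""
        dups=0
        for ((j = i + 1; j < n; j++)); do
            [ -n "${seen[j]}" ] && continue
            if cmp -s "${files[i]}" "${files[j]}"; then
                group="$group \"${files[j]}\""
                seen[j]=1
                dups=1
            fi
        done
        if [ "$dups" -eq 1 ]; then
            echo "$group are identical"
        fi
    done
}

# Demo: A, B and C share the same content, D does not.
tmp=$(mktemp -d)
for f in A B C; do printf 'same\n' > "$tmp/$f"; done
printf 'diff\n' > "$tmp/D"
(cd "$tmp" && group_identical A B C D)   # prints: "A" "B" "C" are identical
rm -rf "$tmp"
```

In the script above, this would take the place of the two nested while loops, called with "${F[@]}" as its arguments.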

danmero · March 28, 2011, 6:31pm

What about fdupes?

jao_madn · April 3, 2011, 6:07pm

Hi Chubler_XL:

Thanks for your script. I'm now digesting it. Thanks very much.

Hi Chubler_XL:

Your script runs very well. May I ask for an additional requirement? The script scans the files and compares them by size, right? How can I extend it so that, in addition to comparing file sizes, it also (or first) compares the filenames of the files found and reports matches in the output? For example, display:
dir1/filename1 and /dir1/dir2/**/filename1 are the same in filename

Thanks in advance.

Also, how can I run "find . -type f" (or a similar command) and get only the filenames?
Example:
dir1/dir2/filename1
dir1/filename2
dir1/dir2/dir3/filename3

Output
filename1
filename2
filename3

Chubler_XL · April 3, 2011, 6:33pm

Glad the script was useful. See my post in this thread for an update that also supports spaces in filenames.
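On the filename-only question: a minimal sketch that strips the directory part of each path with a shell parameter expansion. The function name list_basenames is made up for this example. With GNU find, -printf '%f\n' should give the same result in one step.

```shell
#!/bin/bash
# list_basenames [DIR] (hypothetical helper):
# print only the filename part of every regular file below DIR (default: .).
list_basenames() {
    find "${1:-.}" -type f | while IFS= read -r path; do
        # ${path##*/} removes everything up to and including the last slash.
        printf '%s\n' "${path##*/}"
    done
}

# Demo on a throwaway tree mirroring the example paths.
tmp=$(mktemp -d)
mkdir -p "$tmp/dir1/dir2/dir3"
: > "$tmp/dir1/dir2/filename1"
: > "$tmp/dir1/filename2"
: > "$tmp/dir1/dir2/dir3/filename3"
list_basenames "$tmp" | sort
rm -rf "$tmp"
```

This reads paths line by line, so it assumes no filename contains a newline.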

I can't understand the enhancement you are after. Can you reword it please, something like this:

Say we have 3 different contents (A, B and C)

Files with contents A:
dir1/dir2/filename1
dir1/filename2
dir1/dir2/dir3/filename3

Files with contents B:
dir1/File_one
dir2/File_two

Files with contents C:
dir3/File_one

In this situation, what would the input and output to the script be like?

Eg:
input

dir1/dir2/filename1
dir1/filename2
dir1/dir2/dir3/filename3
dir1/File_one
dir2/File_two
dir3/File_one

Output:

"dir1/dir2/filename1" and "dir1/filename2" are identical
"dir1/dir2/filename1" and "dir1/dir2/dir3/filename3" are identical
"dir1/filename2" and "dir1/dir2/dir3/filename3" are identical
"dir1/File_one" and "dir2/File_two" are identical
"dir1/File_one" and "dir3/File_one" have same name but are different

jao_madn · April 4, 2011, 7:24am

@CHUBLER_XL:

Hello, the script above determines duplicates according to byte size, right? I would like to add a condition that also compares filenames residing in different subdirectories.

output:
dir1/dir2/dir3/linux-ebook.pdf and /dir1/linux-ebook.pdf is identical in filename
or
dir1/dir2/C++_programming.pdf and /dir1/dir2/dir3/dir4/dir5/dir6/dir7/dir8/C++_programming.pdf is identical filename

When I tried the script and changed the byte size of linux-ebook.pdf (or of other files), the comparison failed.

Chubler_XL · April 4, 2011, 7:51pm

This should do it

if [ $# -ne 1 ] || [ ! -d "$1" ]
then
    echo "usage: $0 <directory>"
    exit 1
fi
SHOWSAME=1
find "$1" -type f -ls | awk '
  $8 > 0 {
     gsub("\\\\ ", SUBSEP); F=$12; gsub(SUBSEP, " ", F); # Deal with space(s) in filename
     if($8 in sizes) {
         sizes[$8]=sizes[$8] SUBSEP F;
         dup[$8]++
     } else sizes[$8]=F
     bn=F;
     sub(".*/", "", bn);
     if (bn in basenames) {
         basenames[bn]=basenames[bn] SUBSEP F;
         dupname[bn]++
     } else basenames[bn]=F;
  }
  END {for(i in dup) print sizes[i]; print "-NOSAME-" ; for(i in dupname) print basenames[i]; }' | while read
do
  # SUBSEP (34 Octal) between each filename that has same size
  # Change IFS to Load Array F with a group of 2 (or more) files
  OIFS="$IFS"
  IFS=$(printf \\034)
  F=( $REPLY )
  IFS="$OIFS"
  [ "${F[0]}" = "-NOSAME-" -a ${#F[@]} -eq 1 ] && SHOWSAME=0
  i=0
  while [ $i -lt ${#F[@]} ]
  do
     let j=i+1
     while [ $j -lt ${#F[@]} ]
     do
        if cmp -s "${F[i]}" "${F[j]}"
        then
           [ $SHOWSAME -eq 1 ] && echo "\"${F[i]}\"" and "\"${F[j]}\"" are identical
        else
           [ "$(basename "${F[i]}")" = "$(basename "${F[j]}")" ] &&
               echo "\"${F[i]}\"" and "\"${F[j]}\"" have same filename but are different
        fi
        let j=j+1
     done
     let i=i+1
  done
done
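The IFS step in the loop above can be tried in isolation. A minimal sketch, assuming bash: it splits one \034-separated record into the array F, so that filenames containing spaces stay whole. The helper name split_group and the sample paths are made up for this example, and set -f is an extra precaution that the script above does not take.

```shell
#!/bin/bash
# split_group RECORD (hypothetical helper):
# split a \034-separated record into the global array F.
split_group() {
    local OIFS="$IFS"
    IFS=$'\034'
    set -f          # keep names such as "*.pdf" from being glob-expanded
    F=( $1 )        # deliberately unquoted: word splitting happens on \034 only
    set +f
    IFS="$OIFS"
}

# Three filenames joined with the octal 034 character, two of them with spaces.
record=$(printf 'dir1/my book.pdf\034dir2/copy of book.pdf\034dir3/c.pdf')
split_group "$record"
echo "${#F[@]} files"        # prints: 3 files
printf '<%s>\n' "${F[@]}"    # one <...> line per filename, spaces preserved
```

Because IFS holds only the \034 character during the assignment, ordinary spaces no longer act as separators.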

jao_madn · April 5, 2011, 4:25pm

chubler_xl:

Hi

I tested the new script and it only outputs this line, repeated many times:

"" and "" have same filename but are different
"" and "" have same filename but are different
"" and "" have same filename but are different
"" and "" have same filename but are different
and so on
...........
...........
...........

thanks for the efforts

Chubler_XL · April 5, 2011, 6:44pm

Sorry, I tested it on a system that has a space in the group name so my field numbers were out.

The new version works by taking the size from the field four places before the last one, $(NF-4), so it should be much more robust. During testing I also found that some systems don't have SUBSEP as \034, so it is safer to use the actual octal value instead of SUBSEP for output strings:

#!/bin/bash
if [ $# -ne 1 ] || [ ! -d "$1" ]
then
    echo "usage: $0 <directory>"
    exit 1
fi
SHOWSAME=1
find "$1" -type f -ls | awk '
  {
     gsub("\\\\ ", SUBSEP); F=$NF; gsub(SUBSEP, " ", F); # Deal with space(s) in filename
     $1=$1
     SZ=$(NF-4)
     if(SZ > 0 && SZ in sizes) {
         sizes[SZ]=sizes[SZ] "\034" F;
         dup[SZ]++
     } else sizes[SZ]=F
     bn=F;
     sub(".*/", "", bn);
     if (bn in basenames) {
         basenames[bn]=basenames[bn] "\034" F;
         dupname[bn]++
     } else basenames[bn]=F;
  }
  END {for(i in dup) print sizes[i]; print "-NOSAME-" ; for(i in dupname) print basenames[i]; }' | while read
do
  # SUBSEP (34 Octal) between each filename that has same size
  # Change IFS to Load Array F with a group of 2 (or more) files
  OIFS="$IFS"
  IFS=$'\034'
  F=( $REPLY )
  IFS="$OIFS"
  [ "${F[0]}" = "-NOSAME-" -a ${#F[@]} -eq 1 ] && SHOWSAME=0
  i=0
  while [ $i -lt ${#F[@]} ]
  do
     let j=i+1
     while [ $j -lt ${#F[@]} ]
     do
        if cmp -s "${F[i]}" "${F[j]}"
        then
           [ $SHOWSAME -eq 1 ] && echo "\"${F[i]}\"" and "\"${F[j]}\"" are identical
        else
           [ "$(basename "${F[i]}")" = "$(basename "${F[j]}")" ] &&
               echo "\"${F[i]}\"" and "\"${F[j]}\"" have same filename but are different
        fi
        let j=j+1
     done
     let i=i+1
  done
done
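To check the field arithmetic separately, here is a small sketch that feeds two made-up lines in the usual find -ls layout (size, month, day, time, name at the end) through the same $(NF-4) expression. The helper name size_and_name and the sample values are assumptions, and the sample names contain no spaces.

```shell
#!/bin/bash
# size_and_name (hypothetical helper): read "find -ls" style lines on stdin
# and print "<size> <name>". Counting from the end of the line still works
# when the owner or group name contains a space, which shifts every field
# that is counted from the left.
size_and_name() {
    awk '{ print $(NF-4), $NF }'
}

# The second line has a two-word group name ("domain users").
printf '%s\n' \
    '131 4 -rw-r--r-- 1 alice staff 1234 Apr 5 18:44 ./dir1/a.pdf' \
    '132 4 -rw-r--r-- 1 alice domain users 1234 Apr 5 18:44 ./dir2/b.pdf' |
    size_and_name
# prints:
# 1234 ./dir1/a.pdf
# 1234 ./dir2/b.pdf
```

Counting from the left, the size would be field 7 on the first line and field 8 on the second, which is the kind of mismatch that caused the empty output reported earlier.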

jao_madn · April 7, 2011, 9:35am

@Chubler_XL

Thanks for the script, it's working now for both conditions.

@Danmero

Thanks also. I find the fdupes package useful for finding and deleting duplicate files from the command line.