Find duplicate files by file size

Hi!

I want to find duplicate files (criteria: file size) in my download folder.

I try it like this:

find /Users/frodo/Downloads \! -type d -exec du {} \; | sort > /Users/frodo/Desktop/duplicates_1.txt;
cut -f 1 /Users/frodo/Desktop/duplicates_1.txt | uniq -d | grep -hif - /Users/frodo/Desktop/duplicates_1.txt > /Users/frodo/Desktop/duplicates_2.txt;

But this doesn't work. Can anybody tell me what's wrong, or provide another/better solution? Thanks!

Dirk

How about cksum? That is far easier to use: it gives a file size as well as a checksum, and either field works for comparison.
This code assumes your cksum implementation prints:

cksum filename
checksum  filesize filename
cksum /path/to/files/* |
  awk ' { if( $2 in arr) 
            {print "duplicates ", $3, arr[$2], "duplicate filesize = ", $2} 
              else 
            {arr[$2]=$3} }'

Hi!

Well, cksum is too slow; there can be files larger than 2 GB. I also want to scan all subdirectories. The total size of the duplicated files is not important.
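If only the size matters, checksums can be skipped entirely. A minimal sketch (assuming GNU find, whose -printf directive is not in POSIX) that walks a directory tree and groups files by exact byte size:

```shell
#!/bin/sh
# Hypothetical sketch: report files that share an exact byte size.
# Assumes GNU find (-printf is a GNU extension). The directory to
# scan is the first argument, defaulting to the current directory.
dir=${1:-.}

find "$dir" -type f -printf '%s\t%p\n' |
  sort -n |
  awk -F'\t' '
    $1 == prev { if (!shown) print prevline; print; shown = 1 }
    $1 != prev { prev = $1; prevline = $0; shown = 0 }
  '
```

Same-sized files are printed together, one group after another. A matching size is only a heuristic, of course, so a byte-wise comparison (e.g. cmp) is still needed to confirm that two files really are duplicates.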

Dirk

You probably don't want to use du for this either: du reports the space allocated to a file, not its actual byte size.
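The difference is easy to demonstrate with a sparse file. A sketch assuming GNU coreutils (on BSD/macOS, `stat -f %z` would replace the `wc -c` call):

```shell
# Create a 1 MiB sparse file: seek past byte 1048575 and write one byte.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1 count=1 seek=1048575 2>/dev/null

wc -c < "$f"   # logical byte size: 1048576
du -k "$f"     # allocated space: usually only a few KiB on most filesystems
rm -f "$f"
```

A size-based duplicate scan should therefore compare the logical size, not du's block count.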

Regards,
Alister

#!/bin/sh

# We find all files in path, feed them into ls -l with xargs,
# and sort them on the size column.
# We can't depend on ls' own sort when using xargs, since with
# enough files the list gets split across several ls invocations.
# Then we read the lines in order, and check for duplicate sizes.
LASTSIZE=-1     # sentinel so the first size comparison is well-defined
find /path/to/dir -type f -print0 | xargs --null ls -l | sort -k 5,6 |
while read PERMS LINKS USER GROUP SIZE M D Y FILE
do
        # Skip symbolic links
        [ -h "$FILE" ] && continue

        if [ "$SIZE" -eq "$LASTSIZE" ]
        then
                echo "$FILE same size as $LASTFILE"
        else
                LASTSIZE="$SIZE"        ;       LASTFILE="$FILE"
        fi
# find will spew errors when it can't access a file, so send them to /dev/null.
done 2> /dev/null

---------- Post updated at 05:35 PM ---------- Previous update was at 04:43 PM ----------

Here's an improved version that checks checksums. It can churn through about 4 gigs of random files in 7 seconds, uncached, on my not-so-great system.

The trick is, it only checks checksums against files of the same size, and does a quick checksum on their first 512 bytes to filter out files that're obviously different. Maybe the first 16K, or first 256K would be better.

#!/bin/bash

TMP=$(mktemp)

# Given a list of files of the same size, "$TMP",
# it will check which ones have the same checksums.
function checkgroup
{
        local FILE
        local LASTSUM
        local LASTFILE

        [ -s "$TMP" ] || return

        # Check first 512 bytes of files.
        # If that differs, who cares about the rest?
        while read FILE
        do
                SUM=$(dd count=1 < "$FILE" | md5sum)
                # md5sum on stdin prints "<hash>  -"; keep the hash, discard the "-".
                read SUM G <<<"$SUM"
                echo "$SUM $FILE"
        done < "$TMP" | sort | while read SUM FILE
        do
                if [ "$LASTSUM" != "$SUM" ]
                then
                        LASTSUM="$SUM"
                        LASTFILE="$FILE"
                        UNPRINTED=1
                        continue
                fi

                [ -z "$UNPRINTED" ] || echo "$LASTFILE"
                UNPRINTED=""
                echo "$FILE"
        done | xargs -d '\n' md5sum | sort |
        while read SUM FILE
        do
                if [ "$SUM" != "$LASTSUM" ]
                then
                        LASTSUM="$SUM"
                        LASTFILE="$FILE"
                else
                        echo "$FILE == $LASTFILE"
                fi
        done
}

# Find all files, feed them through ls, sort them on size.
# We can't depend on ls' own sorting when there are too many files,
# since ls could be run more than once.
# Once we have the output, loop through looking for files
# the same size and make a list to feed into checkgroup.
LASTSIZE=-1     # sentinel so the first size comparison is well-defined
find ~/public_html -type f -print0 | xargs --null ls -l | sort -k 5,6 |
while read PERMS LINKS USER GROUP SIZE M D Y FILE
do
        # Skip symbolic links
        [ -h "$FILE" ] && continue

        if [ "$SIZE" -eq "$LASTSIZE" ]
        then
                [ -s "$TMP" ] || echo "$LASTFILE" > "$TMP"
                echo "$FILE" >> "$TMP"
        else
                checkgroup "$LASTSIZE"
                LASTSIZE="$SIZE"        ;       LASTFILE="$FILE"
                :>"$TMP"
        fi
done

checkgroup

rm -f "$TMP"

Hi Dirk Einecke,

Another option:

Byte precision is used to make the size comparison as exact as possible.

#!/bin/bash
find . -type f -print0 | (
    while read -d "" FILE ; do FILES=("${FILES[@]}" "$FILE") ; done

    ls -la "${FILES[@]}" | awk '{$1=$2=$3=$4=$6=$7="";print}' > /Users/frodo/Desktop/Listed_Files.txt
    ls -la "${FILES[@]}" | awk '{print $5}' | sort -k1,1nr | uniq -d > /Users/frodo/Desktop/Repeated_Sizes.txt

)

awk 'BEGIN{print "Size (bytes)  Files"}FNR==NR{a[$1];next} $1 in a' /Users/frodo/Desktop/Repeated_Sizes.txt /Users/frodo/Desktop/Listed_Files.txt > Duplicates_Files.txt

rm /Users/frodo/Desktop/Listed_Files.txt
rm /Users/frodo/Desktop/Repeated_Sizes.txt

Hope it helps

Regards

Why don't you try this?
Go to your Downloads dir and run this:

ls -l | awk '$1!~/^d/{if(size[$5]!=""){ print}size[$5]=$8}'

$1 !~ /^d/ is an error-prone approach. Better to simply use /^-/ .

Regards,
Alister

@alister
In the first post he tried to list everything except directories, so that is what I matched.
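For reference, on systems where `ls -l` prints the usual nine columns (mode, links, owner, group, size, month, day, time, name), a variant combining both suggestions might look like this; the field numbers are an assumption about your ls output:

```shell
# Print a file's name next to an earlier file of the same size.
# Assumes a 9-column `ls -l` layout and filenames without spaces.
ls -l | awk '/^-/ { if ($5 in size) print size[$5], $9; size[$5] = $9 }'
```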

Here is a solution that uses cmp -s, a utility designed to compare binary files, so it will probably be much quicker than cksum and the like. Again, only files of identical byte size are compared.
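As a quick illustration of the cmp -s semantics the script below relies on (per POSIX: silent, with the result reported only through the exit status):

```shell
# cmp -s prints nothing; equality is signalled by exit status 0.
a=$(mktemp); b=$(mktemp); c=$(mktemp)
printf 'abc' > "$a"
printf 'abc' > "$b"
printf 'abd' > "$c"

cmp -s "$a" "$b" && echo "identical"   # prints: identical
cmp -s "$a" "$c" || echo "different"   # prints: different
rm -f "$a" "$b" "$c"
```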

if [ $# -ne 1 ] || [ ! -d "$1" ]
then
    echo "usage: $0 <directory>"
    exit 1
fi
find "$1" -type f -ls | awk '
  $8 > 0 {
     gsub("\\\\ ", SUBSEP); F=$12; gsub(SUBSEP, " ", F); # Deal with space(s) in filename
     if($8 in sizes) {
         sizes[$8]=sizes[$8] SUBSEP F;
         dup[$8]++
     } else sizes[$8]=F
  }
  END {for(i in dup) print sizes[i] }' | while read
do
   # SUBSEP (34 Octal) between each filename that has same size
   # Change IFS to Load Array F with a group of 2 (or more) files 
   OIFS="$IFS"
   IFS=$(printf \\034)
   F=( $REPLY )
   IFS="$OIFS"
   i=0
   while [ $i -lt ${#F[@]} ]
   do
       let j=i+1
       while [ $j -lt ${#F[@]} ]
       do
            cmp -s "${F[i]}" "${F[j]}" &&
                echo "\"${F[i]}\"" and "\"${F[j]}\"" are identical
           let j=j+1
       done
       let i=i+1
    done
done