Find duplicate files by file size

Hi!

I want to find duplicate files (criteria: file size) in my download folder.

I try it like this:

find /Users/frodo/Downloads \! -type d -exec du {} \; | sort > /Users/frodo/Desktop/duplicates_1.txt;
cut -f 1 /Users/frodo/Desktop/duplicates_1.txt | uniq -d | grep -hif - /Users/frodo/Desktop/duplicates_1.txt > /Users/frodo/Desktop/duplicates_2.txt;

But this doesn't work. Can anybody tell me what's wrong, or provide another/better solution? Thanks!

Dirk

How about cksum? That is far easier to use: it gives a file size as well as a checksum, and either field works for comparison.
This code assumes your cksum implementation prints:

cksum filename
checksum  filesize filename
cksum /path/to/files/* |
  awk ' { if( $2 in arr) 
            {print "duplicates ", $3, arr[$2], "duplicate filesize = ", $2} 
              else 
            {arr[$2]=$3} }'

Hi!

Well, cksum is too slow; there can be files larger than 2 GB. I also want to scan all subdirectories. The total size of the duplicated files is not important.
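If only the size matters, checksums can be skipped entirely. A minimal sketch (assuming GNU find, whose -printf directive is not in POSIX) that walks a directory tree and groups files by exact byte size:

```shell
#!/bin/sh
# Hypothetical sketch: report files that share an exact byte size.
# Assumes GNU find (-printf is a GNU extension). The directory to
# scan is the first argument, defaulting to the current directory.
dir=${1:-.}

find "$dir" -type f -printf '%s\t%p\n' |
  sort -n |
  awk -F'\t' '
    $1 == prev { if (!shown) print prevline; print; shown = 1 }
    $1 != prev { prev = $1; prevline = $0; shown = 0 }
  '
```

Same-sized files are printed together, one group after another. A matching size is only a heuristic, of course, so a byte-wise comparison (e.g. cmp) is still needed to confirm that two files really are duplicates.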

Dirk

You probably don't want to use du for this either: du reports the space allocated to a file, not its actual byte size.
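The difference is easy to demonstrate with a sparse file. A sketch assuming GNU coreutils (on BSD/macOS, `stat -f %z` would replace the `wc -c` call):

```shell
# Create a 1 MiB sparse file: seek past byte 1048575 and write one byte.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1 count=1 seek=1048575 2>/dev/null

wc -c < "$f"   # logical byte size: 1048576
du -k "$f"     # allocated space: usually only a few KiB on most filesystems
rm -f "$f"
```

A size-based duplicate scan should therefore compare the logical size, not du's block count.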

Regards,
Alister

#!/bin/sh

# We find all files in path, feed them into ls -l with xargs,
# and sort them on the size column.
# We can't depend on ls' own sort when using xargs, since with
# enough files the list gets split across several ls invocations.
# Then we read the lines in order, and check for duplicate sizes.
LASTSIZE=-1     # sentinel so the first size comparison is well-defined
find /path/to/dir -type f -print0 | xargs --null ls -l | sort -k 5,6 |
while read PERMS LINKS USER GROUP SIZE M D Y FILE
do
        # Skip symbolic links
        [ -h "$FILE" ] && continue

        if [ "$SIZE" -eq "$LASTSIZE" ]
        then
                echo "$FILE same size as $LASTFILE"
        else
                LASTSIZE="$SIZE"        ;       LASTFILE="$FILE"
        fi
# find will spew errors when it can't access a file, so send them to /dev/null.
done 2> /dev/null

---------- Post updated at 05:35 PM ---------- Previous update was at 04:43 PM ----------

Here's an improved version that checks checksums. It can churn through about 4 gigs of random files in 7 seconds, uncached, on my not-so-great system.

The trick is, it only checks checksums against files of the same size, and does a quick checksum on their first 512 bytes to filter out files that're obviously different. Maybe the first 16K, or first 256K would be better.

#!/bin/bash

TMP=$(mktemp)

# Given a list of files of the same size, "$TMP",
# it will check which ones have the same checksums.
function checkgroup
{
        local FILE
        local LASTSUM
        local LASTFILE

        [ -s "$TMP" ] || return

        # Check first 512 bytes of files.
        # If that differs, who cares about the rest?
        while read FILE
        do
                SUM=$(dd count=1 < "$FILE" | md5sum)
                # md5sum on stdin prints "<hash>  -"; keep the hash, discard the "-".
                read SUM G <<<"$SUM"
                echo "$SUM $FILE"
        done < "$TMP" | sort | while read SUM FILE
        do
                if [ "$LASTSUM" != "$SUM" ]
                then
                        LASTSUM="$SUM"
                        LASTFILE="$FILE"
                        UNPRINTED=1
                        continue
                fi

                [ -z "$UNPRINTED" ] || echo "$LASTFILE"
                UNPRINTED=""
                echo "$FILE"
        done | xargs -d '\n' md5sum | sort |
        while read SUM FILE
        do
                if [ "$SUM" != "$LASTSUM" ]
                then
                        LASTSUM="$SUM"
                        LASTFILE="$FILE"
                else
                        echo "$FILE == $LASTFILE"
                fi
        done
}

# Find all files, feed them through ls, sort them on size.
# We can't depend on ls' own sorting when there are too many files,
# since ls could be run more than once.
# Once we have the output, loop through looking for files
# the same size and make a list to feed into checkgroup.
LASTSIZE=-1     # sentinel so the first size comparison is well-defined
find ~/public_html -type f -print0 | xargs --null ls -l | sort -k 5,6 |
while read PERMS LINKS USER GROUP SIZE M D Y FILE
do
        # Skip symbolic links
        [ -h "$FILE" ] && continue

        if [ "$SIZE" -eq "$LASTSIZE" ]
        then
                [ -s "$TMP" ] || echo "$LASTFILE" > "$TMP"
                echo "$FILE" >> "$TMP"
        else
                checkgroup "$LASTSIZE"
                LASTSIZE="$SIZE"        ;       LASTFILE="$FILE"
                :>"$TMP"
        fi
done

checkgroup

rm -f "$TMP"

Hi Dirk Einecke,

Another option:

Byte precision is used to make the size comparison as exact as possible.

#!/bin/bash
find . -type f -print0 | (
    while read -d "" FILE ; do FILES=("${FILES[@]}" "$FILE") ; done

    ls -la "${FILES[@]}" | awk '{$1=$2=$3=$4=$6=$7="";print}' > /Users/frodo/Desktop/Listed_Files.txt
    ls -la "${FILES[@]}" | awk '{print $5}' | sort -k1,1nr | uniq -d > /Users/frodo/Desktop/Repeated_Sizes.txt

)

awk 'BEGIN{print "Size (bytes)  Files"}FNR==NR{a[$1];next} $1 in a' /Users/frodo/Desktop/Repeated_Sizes.txt /Users/frodo/Desktop/Listed_Files.txt > Duplicates_Files.txt

rm /Users/frodo/Desktop/Listed_Files.txt
rm /Users/frodo/Desktop/Repeated_Sizes.txt

Hope it helps

Regards

Why don't you try this?
Go to your Downloads dir and run this:

ls -l | awk '$1!~/^d/{if(size[$5]!=""){ print}size[$5]=$8}'

$1 !~ /^d/ is an error-prone approach. Better to simply use /^-/ .

Regards,
Alister

@alister
In the first post he tried to list everything except directories, so that is what I matched.
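For reference, on systems where `ls -l` prints the usual nine columns (mode, links, owner, group, size, month, day, time, name), a variant combining both suggestions might look like this; the field numbers are an assumption about your ls output:

```shell
# Print a file's name next to an earlier file of the same size.
# Assumes a 9-column `ls -l` layout and filenames without spaces.
ls -l | awk '/^-/ { if ($5 in size) print size[$5], $9; size[$5] = $9 }'
```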

Here is a solution that uses cmp -s, a utility designed to compare binary files, so it will probably be much quicker than cksum and the like. Again, only files of identical byte size are compared.
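As a quick illustration of the cmp -s semantics the script below relies on (per POSIX: silent, with the result reported only through the exit status):

```shell
# cmp -s prints nothing; equality is signalled by exit status 0.
a=$(mktemp); b=$(mktemp); c=$(mktemp)
printf 'abc' > "$a"
printf 'abc' > "$b"
printf 'abd' > "$c"

cmp -s "$a" "$b" && echo "identical"   # prints: identical
cmp -s "$a" "$c" || echo "different"   # prints: different
rm -f "$a" "$b" "$c"
```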

if [ $# -ne 1 ] || [ ! -d "$1" ]
then
    echo "usage: $0 <directory>"
    exit 1
fi
find "$1" -type f -ls | awk '
  $8 > 0 {
     gsub("\\\\ ", SUBSEP); F=$12; gsub(SUBSEP, " ", F); # Deal with space(s) in filename
     if($8 in sizes) {
         sizes[$8]=sizes[$8] SUBSEP F;
         dup[$8]++
     } else sizes[$8]=F
  }
  END {for(i in dup) print sizes[i] }' | while read
do
   # SUBSEP (34 Octal) between each filename that has same size
   # Change IFS to Load Array F with a group of 2 (or more) files 
   OIFS="$IFS"
   IFS=$(printf \\034)
   F=( $REPLY )
   IFS="$OIFS"
   i=0
   while [ $i -lt ${#F[@]} ]
   do
       let j=i+1
       while [ $j -lt ${#F[@]} ]
       do
            cmp -s "${F[i]}" "${F[j]}" &&
                echo "\"${F[i]}\"" and "\"${F[j]}\"" are identical
           let j=j+1
       done
       let i=i+1
    done
done