How about cksum? That is far easier to use. It gives a file size, or you can use the checksum; either way works.
This code assumes your cksum implementation gives: checksum, size, filename (the POSIX output order).
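A minimal sketch along those lines (POSIX cksum prints CRC, byte count, then filename; filenames with spaces would need more care than this):

find /path/to/dir -type f -exec cksum {} + 2> /dev/null | LC_ALL=C sort |
awk '{ key = $1 "/" $2               # CRC and size together make the key
       if (key == last) print $3 " duplicates " lastfile
       last = key; lastfile = $3 }'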
Well, cksum is too slow. There can be files larger than 2 GB, and I also want to scan all subdirectories. The sum of the file sizes of all duplicated files is not important.
#!/bin/sh
# We find all files in path, feed them into ls -l with xargs,
# and sort them on the size column.
# We can't rely on ls' own sorting when using xargs, since with
# enough files xargs will split them across several ls invocations.
# Then we read the lines in order, and check for duplicate sizes.
LASTSIZE=-1  # So the first -eq test below has something to compare against.
find /path/to/dir -type f -print0 2> /dev/null | xargs --null ls -l | sort -k 5,6 |
while read PERMS LINKS USER GROUP SIZE M D Y FILE
do
# Skip symbolic links
[ -h "$FILE" ] && continue
if [ "$SIZE" -eq "$LASTSIZE" ]
then
echo "$FILE same size as $LASTFILE"
else
LASTSIZE="$SIZE" ; LASTFILE="$FILE"
fi
# find's errors were already discarded above; this catches any
# remaining noise from ls or the loop itself.
done 2> /dev/null
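If your find is GNU find, -printf can sidestep the ls parsing entirely. A sketch, assuming filenames contain no tabs or newlines:

find /path/to/dir -type f -printf '%s\t%p\n' 2> /dev/null | sort -n |
awk -F'\t' '{ if ($1 == last) print $2 " same size as " lastfile
              last = $1; lastfile = $2 }'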
---------- Post updated at 05:35 PM ---------- Previous update was at 04:43 PM ----------
Here's an improved version that checks checksums. It can churn through about 4 gigs of random files in 7 seconds, uncached, on my not-so-great system.
The trick is, it only compares checksums between files of the same size, and it first does a quick checksum of each file's first 512 bytes to filter out files that are obviously different. Maybe the first 16K or 256K would be better.
#!/bin/bash
TMP=$(mktemp)
# Given a list of files of the same size, "$TMP",
# it will check which ones have the same checksums.
function checkgroup
{
local FILE
local LASTSUM
local LASTFILE
[ -s "$TMP" ] || return
# Check first 512 bytes of files.
# If that differs, who cares about the rest?
while read FILE
do
# dd's default block size is 512 bytes, so count=1 reads just the first block.
SUM=$(dd count=1 2> /dev/null < "$FILE" | md5sum)
# md5sum prints the hash first, then the filename ("-" for stdin).
read SUM G <<<"$SUM"
echo "$SUM $FILE"
done < "$TMP" | sort | while read SUM FILE
do
if [ "$LASTSUM" != "$SUM" ]
then
LASTSUM="$SUM"
LASTFILE="$FILE"
UNPRINTED=1
continue
fi
[ -z "$UNPRINTED" ] || echo "$LASTFILE"
UNPRINTED=""
echo "$FILE"
done | xargs -r -d '\n' md5sum | sort |
while read SUM FILE
do
if [ "$SUM" != "$LASTSUM" ]
then
LASTSUM="$SUM"
LASTFILE="$FILE"
else
echo "$FILE == $LASTFILE"
fi
done
}
# Find all files, feed them through ls, sort them on size.
# Can't depend on ls' own sorting when there are too many files,
# since xargs may end up running ls more than once.
# Once we have the output, loop through looking for files
# the same size and make a list to feed into checkgroup.
LASTSIZE=-1  # So the first -eq test has something to compare against.
find ~/public_html -type f -print0 2> /dev/null | xargs --null ls -l | sort -k 5,6 |
while read PERMS LINKS USER GROUP SIZE M D Y FILE
do
# Skip symbolic links
[ -h "$FILE" ] && continue
if [ "$SIZE" -eq "$LASTSIZE" ]
then
[ -s "$TMP" ] || echo "$LASTFILE" > "$TMP"
echo "$FILE" >> "$TMP"
else
checkgroup "$LASTSIZE"
LASTSIZE="$SIZE" ; LASTFILE="$FILE"
:>"$TMP"
fi
done
checkgroup
rm -f "$TMP"
Here is a solution that uses cmp -s, a utility designed for comparing binary files, so it will probably be much quicker than cksum and the like. Again, only files of identical byte size are compared.
if [ $# -ne 1 ] || [ ! -d "$1" ]
then
echo "usage: $0 <directory>"
exit 1
fi
find "$1" -type f -ls | awk '
$8 > 0 {
gsub("\\\\ ", SUBSEP); F=$12; gsub(SUBSEP, " ", F); # Deal with space(s) in filename
if($8 in sizes) {
sizes[$8]=sizes[$8] SUBSEP F;
dup[$8]++
} else sizes[$8]=F
}
END {for(i in dup) print sizes[i] }' | while read
do
# SUBSEP (octal 034) separates filenames that have the same size.
# Change IFS to load array F with a group of 2 (or more) files.
OIFS="$IFS"
IFS=$(printf \\034)
F=( $REPLY )
IFS="$OIFS"
i=0
while [ $i -lt ${#F[@]} ]
do
let j=i+1
while [ $j -lt ${#F[@]} ]
do
cmp -s "${F}" "${F[j]}" &&
echo "\"${F}\"" and "\"${F[j]}\"" are identical
let j=j+1
done
let i=i+1
done
done
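Note this last one needs bash or ksh for the arrays. Saved as finddups.sh (a name picked just for illustration), it runs as:

bash finddups.sh /path/to/dir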