List duplicate files based on Name and size

prvnrk · January 1, 2014, 1:06pm

Hello,

I have a huge directory (with millions of files) and need to find duplicates based on both file name and file size.

I know fdupes, but it calculates MD5 checksums, which is very time-consuming and takes forever with millions of files.

Can anyone please suggest a script or tool to find duplicates just based on file name "and" file size. It would be nice to be able to filter based on minimum file size.

Thanks

ctsgnb · January 1, 2014, 3:13pm

Proceed in two steps:

Log the size and filename in a temp file (removing the path from the filename).
Then sort it and get the duplicates.

find /huge_dir -type f -printf "%s %p\n" | sed 's:/.*/::' >/tmp/mytmp
sort /tmp/mytmp | uniq -d

Note that for processing such a number of objects it would be advisable to use a database instead.

find /huge_dir -type f -printf "%s %f\n" >/tmp/mytmp
sort /tmp/mytmp | uniq -d
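
A minimal sketch of the two steps above as a shell function, with the minimum-size filter asked for in the first post. The function name dupes_by_name_and_size and its arguments are made up for illustration. It assumes GNU find (for -printf) and file names without embedded newlines, and it uses a tab between size and name so that names containing spaces stay intact.

```shell
# Hypothetical helper, not from the thread.
# Prints "size<TAB>name" once for every name+size pair that occurs
# more than once under directory $1.
# $2 is an optional minimum size in bytes (default 0).
dupes_by_name_and_size() {
    dir=$1
    min=${2:-0}
    # Step 1: log size and bare file name (no path), dropping small files.
    # Step 2: sort so identical lines are adjacent, then keep repeated ones.
    find "$dir" -type f -printf '%s\t%f\n' |
        awk -F '\t' -v min="$min" '$1 + 0 >= min + 0' |
        sort |
        uniq -d
}
```

For example, dupes_by_name_and_size /huge_dir 1048576 would only consider files of at least 1 MiB. No temp file is needed because the steps are piped together.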

RudiC · January 1, 2014, 3:43pm

I guess duplicate filenames means files in different directories? Do you need the full path of the dupes? Then, if your versions of find and uniq allow for it, use -printf "%h %f %s\n" and uniq -d --skip-fields=1
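
If the full paths are needed, one possible sketch is to sort on size and name and let awk print every member of each group, which avoids depending on uniq options that may not exist everywhere. The function name dupes_with_paths is hypothetical. It still assumes GNU find for -printf, and it breaks on file names containing tabs or newlines.

```shell
# Hypothetical helper, not from the thread.
# Prints "size<TAB>name<TAB>directory" for every file under $1 whose
# name AND size both match at least one other file.
dupes_with_paths() {
    find "$1" -type f -printf '%s\t%f\t%h\n' |
        LC_ALL=C sort -t "$(printf '\t')" -k1,1n -k2,2 |
        awk -F '\t' '
            { key = $1 FS $2 }               # size + name identify a group
            key == prev {                    # same group as the previous line
                if (!shown) print held       # emit the first member once
                print
                shown = 1
                next
            }
            { prev = key; held = $0; shown = 0 }   # a new group starts
        '
}
```

Only the previous line is held in memory, so the awk stage stays small even for millions of files. The sort does the heavy lifting.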

ctsgnb · January 1, 2014, 4:26pm

Ha! Yup, I missed the %h and %f.

prvnrk · January 2, 2014, 11:20am

I tried the below but it doesn't show correct results.

find . -type f -printf "%s %f\n" |sort |uniq -d -f 2

Here, I was just trying to list duplicates by file size only.

# ls -l
total 104696
-rwx------+ 1 Admin None 24867520 Jan  1 21:08 Anand-My_Career_1-SDVL.7z
-rwx------+ 1 Admin None 28732186 Jan  1 21:09 Anand-My_Career_2-SDVL.7z
-rwx------+ 1 Admin None 24867520 Jan  1 21:08 Anand-My-Career-1-SDVL.7z
-rwx------+ 1 Admin None 28732186 Jan  1 21:08 Anand-My-Career-2-SDVL.7z

# find . -type f -printf "%s %f\n" |sort |uniq -d

# find . -type f -printf "%s %f\n" |sort |uniq -d -f 2
24867520 Anand-My_Career_1-SDVL.7z

#

It is supposed to display 2 duplicates but shows only one.
What mistake am I making here?

ctsgnb · January 2, 2014, 12:21pm

You could try something like:

find /huge_dir -type f -printf "%s %f %h\n" >/tmp/mytmp

Then, to display those having the same size:

sort /tmp/mytmp | awk '{z=y;y=$1;w=x;x=$0;v=u;u=$2}(z==y){print w RS x}'

Then, to display those having the same name:

sort -k 2,2 /tmp/mytmp | awk '{z=y;y=$1;w=x;x=$0;v=u;u=$2}(v==u){print w RS x}'
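
On the earlier size-only attempt: uniq -f 2 skips the first two fields, and with "%s %f" lines nothing is left to compare afterwards, so all lines look identical and -d prints just one of them. Plain uniq -d prints nothing there because no two files in that listing share both name and size. A sketch that compares only the first field, reading "size name" lines on standard input, could look like this (the function name same_size is made up for illustration):

```shell
# Hypothetical helper, not from the thread.
# Reads lines of the form "size name..." on stdin and prints every line
# whose size (first field) also appears on at least one other line.
same_size() {
    sort -n |
        awk '
            NR > 1 && $1 == prev {       # same size as the previous line
                if (!shown) print held   # emit the first member once
                print
                shown = 1
                next
            }
            { prev = $1; held = $0; shown = 0 }   # a new size starts
        '
}
```

It can be fed directly from find, for example: find . -type f -printf "%s %f\n" | same_size. With the four files in the listing above it should print all four lines, since they form two same-size pairs.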

RudiC · January 2, 2014, 3:33pm

Try

find . -printf "%h\t%f\t%s\n" | sort -k2 | uniq -Df1

if your tools allow for those options...

prvnrk · January 2, 2014, 6:55pm

@ctsgnb - your solution shows incorrect results (it shows all 4 files for the first command and nothing for the second command).

@RudiC - I use Cygwin, where your solution displays nothing (and no error either). I also tried it on Red Hat Linux, with the same result.

Thanks for your efforts.
Could anyone please offer a universal solution?