List duplicate files based on name and size

Hello,

I have a huge directory (with millions of files) and need to find duplicates based on BOTH file name and file size.

I know fdupes, but it calculates MD5 checksums, which is very time-consuming and takes forever with millions of files.

Can anyone please suggest a script or tool that finds duplicates based on just file name and file size? It would be nice to be able to filter by minimum file size.

Thanks

Proceed in two steps:

  1. Log the size and file name in a temp file (removing the path from the file name).
  2. Then sort it and get the duplicates:

find /huge_dir -type f -printf "%s %p\n" | sed 's:/.*/::' >/tmp/mytmp
sort /tmp/mytmp | uniq -d

Note that for processing this many files it would be advisable to use a database instead.

find /huge_dir -type f -printf "%s %f\n" >/tmp/mytmp
sort /tmp/mytmp | uniq -d
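
To keep the full path in the output and add the minimum-size filter asked for in the question, the same idea can be done with awk instead of uniq. This is only a sketch, assuming GNU find (for -printf) and file names without tabs or newlines; the function name dupes_by_name_and_size and its optional MIN_BYTES argument are made up for illustration.

```shell
# Hypothetical helper: list files that share BOTH base name and size.
# Usage: dupes_by_name_and_size DIR [MIN_BYTES]
dupes_by_name_and_size() {
    # %s = size in bytes, %f = base name, %p = full path (GNU find).
    find "$1" -type f -printf '%s\t%f\t%p\n' |
    awk -F '\t' -v min="${2:-0}" '
        # Keep only files of at least MIN_BYTES, grouped by size + name.
        $1 + 0 >= min + 0 {
            key = $1 FS $2
            count[key]++
            lines[key] = lines[key] $0 "\n"
        }
        END {
            # Print every group that has more than one member.
            for (key in count)
                if (count[key] > 1) printf "%s", lines[key]
        }' |
    # Order the report by name, then by size.
    sort -t "$(printf '\t')" -k2,2 -k1,1n
}
```

For example, dupes_by_name_and_size /huge_dir 1048576 would report only duplicates of at least 1 MiB. The awk step keeps one entry per file in memory, so with millions of files it may need a fair amount of RAM.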

I guess duplicate file names means files in different directories? Do you need the full path of the dupes? Then, if your versions of find and uniq allow for it, use -printf "%h %f %s\n" and uniq -d --skip-fields=1

Ha, yup! I missed the %h and %f...

I tried the command below, but it doesn't show the correct results.

find . -type f -printf "%s %f\n" |sort |uniq -d -f 2

Here, I was just trying to get a list of duplicates by file size only.

# ls -l
total 104696
-rwx------+ 1 Admin None 24867520 Jan  1 21:08 Anand-My_Career_1-SDVL.7z
-rwx------+ 1 Admin None 28732186 Jan  1 21:09 Anand-My_Career_2-SDVL.7z
-rwx------+ 1 Admin None 24867520 Jan  1 21:08 Anand-My-Career-1-SDVL.7z
-rwx------+ 1 Admin None 28732186 Jan  1 21:08 Anand-My-Career-2-SDVL.7z

# find . -type f -printf "%s %f\n" |sort |uniq -d

# find . -type f -printf "%s %f\n" |sort |uniq -d -f 2
24867520 Anand-My_Career_1-SDVL.7z

#

It is supposed to display 2 duplicates but shows only one.
What mistake am I making here?
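
The reason only one line appears: -f 2 tells uniq to skip the first two fields before comparing, and each line here has only two fields (size and name, with no spaces in the names), so nothing is left to compare. Every line then counts as equal to the one before it, and -d prints a single line for that one big group. uniq can limit the comparison to the first N characters (-w in GNU uniq), but not to the first field, and sizes vary in length. For size-only duplicates, a small awk filter after a numeric sort is one way to go. A sketch, assuming GNU find and no newlines in file names; same_size_files is just a name picked for this example:

```shell
# Hypothetical helper: list every file whose size matches another file's size.
# Usage: same_size_files DIR
same_size_files() {
    find "$1" -type f -printf '%s %f\n' |
    sort -n |
    awk '
        # Same size as the previous line: print both members of the pair,
        # but print the first member of a group only once.
        NR > 1 && $1 == prev_size {
            if (!shown) print prev_line
            print
            shown = 1
        }
        # A new size starts a new group.
        NR == 1 || $1 != prev_size { shown = 0 }
        { prev_size = $1; prev_line = $0 }'
}
```

Run against the four sample files above, this should list all four, since they form two same-size pairs.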

You could try something like:

find /huge_dir -type f -printf "%s %f %h\n" >/tmp/mytmp

Then, to display those having the same size:

sort /tmp/mytmp | awk '{z=y;y=$1;w=x;x=$0;v=u;u=$2}(z==y){print w RS x}'

Then, to display those having the same name:

sort -k 2,2 /tmp/mytmp | awk '{z=y;y=$1;w=x;x=$0;v=u;u=$2}(v==u){print w RS x}'

Try

find . -type f -printf "%h\t%f\t%s\n" | sort -k2 | uniq -Df1

if your tools allow for those options...
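
Note that this prints something only when at least two files share both name and size. A quick way to check that the tools support the options is to run the pipeline on a throwaway directory containing a known duplicate. The sketch below wraps it in a function (dupes_with_uniq is a made-up name) and assumes GNU find and GNU uniq (-D is not in POSIX uniq), plus directory names without blanks, because uniq -f splits fields on blanks:

```shell
# Hypothetical wrapper around the find | sort | uniq pipeline.
# Usage: dupes_with_uniq DIR
dupes_with_uniq() {
    # %h = directory, %f = base name, %s = size in bytes (GNU find).
    find "$1" -type f -printf '%h\t%f\t%s\n' |
    # Sort on name + size so that equal pairs end up on adjacent lines.
    LC_ALL=C sort -t "$(printf '\t')" -k2 |
    # Skip the directory field, then print all lines that repeat.
    uniq -D -f 1
}

# Throwaway directory with one known duplicate pair.
tmp=$(mktemp -d)
mkdir "$tmp/a" "$tmp/b"
printf 'same' > "$tmp/a/report.txt"
printf 'same' > "$tmp/b/report.txt"
printf 'other' > "$tmp/b/notes.txt"
dupes_with_uniq "$tmp"    # expected to list the two report.txt lines
rm -rf "$tmp"
```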

@ctsgnb - your solution shows incorrect results (it shows all 4 files for the first command and nothing for the second command).

@RudiC - I use Cygwin, where your solution displays nothing (no error either). I also tried it on Red Hat Linux, with the same result.

Thanks for your efforts.
Could anyone please offer a universal solution?