Directory containing files: print the names of the files in the directory that have exactly the same content

Given a directory containing say a few thousand files,
please output a list of all the names of the files in the directory that are exactly the same, i.e. have the same contents.

func(a_directory_name) output -> {"matches": [[fn1, fn2 ...], [fn3, fn4 ...] ... ]}

e.g. func("/home/my/files"), where the directory /home/my/files might contain foo.txt, foo.iso, foo.jpeg, bar.txt, bar.doc, baz.csv, baz.ppt etc. If, say, foo.txt is the same as bar.doc, and foo.iso is the same as baz.csv and baz.ppt, then the output would be:

{
  "matches": [
    [
      "foo.txt",
      "bar.doc"
    ],
    [
      "foo.iso",
      "baz.csv",
      "baz.ppt"
    ]
  ]
}

Where exactly are you stuck?

I tried the code below:

for i in TEST/*
do
    for a in TEST/*
    do
        if [[ "$i" == "$a" ]]; then
            echo "============"
        else
            comp=$(comm -3 "$i" "$a")
            if [[ "$comp" != "" ]]; then
                echo "=============="
            else
                echo "Matches the $i and $a"
            fi
        fi
    done
done

You are comparing every pair twice. How about

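md5sum TEST/* |
awk '
        # Store checksum and file name of every input line
        {CS[NR] = $1
         FN[NR] = $2
        }
        # Compare each pair once and print the pairs with equal checksums
END     {for (i=1; i<=NR; i++)
          for (j=i+1; j<=NR; j++) if (CS[i] == CS[j]) print FN[i] "=" FN[j]
        }
'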
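
The pairwise loop above prints one line per matching pair. To get the grouped {"matches": ...} structure asked for in the first post, one option is to collect the names per checksum in a single awk pass. This is only a minimal sketch: it assumes GNU md5sum and file names without newlines, double quotes or backslashes (no JSON escaping is done), and the function name matches_json is made up for illustration.

```shell
# Hypothetical helper: print {"matches": [[...], ...]} for the files in
# directory $1 that share an MD5 checksum.
matches_json() {
    # cd in a subshell so md5sum reports bare file names, as in the example.
    ( cd "$1" && md5sum -- * 2>/dev/null ) |
    awk '
        {
            sum  = $1
            # md5sum prints the checksum, two separator characters, then the name.
            name = substr($0, length(sum) + 3)
            if (count[sum]++ == 0) {
                order[++n] = sum              # remember first-seen order
                members[sum] = "\"" name "\""
            } else {
                members[sum] = members[sum] ", \"" name "\""
            }
        }
        END {
            printf "{\"matches\": ["
            sep = ""
            for (i = 1; i <= n; i++)
                if (count[order[i]] > 1) {
                    printf "%s[%s]", sep, members[order[i]]
                    sep = ", "
                }
            print "]}"
        }
    '
}
```

Called as matches_json /home/my/files, it prints everything on one line; groups and names appear in the order in which the shell glob lists the files.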

Another route would be to run the sum command on everything in the directory and pipe the output through sort. If the output of sum is the same for a pair (trio, etc.) of files, they are very likely identical; sum is only a 16-bit checksum, so confirm the candidates with cmp.
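
The sort-based idea can be sketched as follows. Sorting puts equal checksums on adjacent lines, and cmp then confirms each adjacent candidate pair, since an equal checksum alone is only a strong hint. Note that this sketch swaps in POSIX cksum for sum (an assumption, chosen for its fixed "CRC size name" output), and the function name confirm_sorted_sums is hypothetical.

```shell
# Hypothetical helper: checksum every file in directory $1, sort so that equal
# checksums end up next to each other, then confirm neighbours with cmp.
# Assumes file names without newlines.
confirm_sorted_sums() {
    prev_key= prev_name=
    cksum "$1"/* 2>/dev/null | sort |
    while read -r crc size name; do
        # Same CRC and size as the previous line: compare byte for byte.
        if [ "$crc $size" = "$prev_key" ] && cmp -s "$prev_name" "$name"; then
            echo "Matches the $prev_name and $name"
        fi
        prev_key="$crc $size" prev_name=$name
    done
}
```

Only neighbouring lines are compared, so three identical files are reported as two pairs (first and second, second and third) rather than as one group.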

Or, in pure shell:

for i in TEST/*
     do for a in $(ls -r TEST/*)
          do    [ "$i" = "$a" ] && break
                cmp -s "$i" "$a" && echo "Matches the $i and $a"
          done
     done

Hi.

You may also wish to consider some programs from the class of utilities that deal with the general idea of differences:

        15) fdupes, rdfind, duff, jdupes find duplicate files

Some details:

rdfind  finds duplicate files (man)
Path    : /usr/bin/rdfind
Version : 1.3.4
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help    : probably available with --help
Repo    : Debian 8.7 (jessie) 

fdupes  finds duplicate files in a given set of directories (man)
Path    : /usr/bin/fdupes
Version : 1.51
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Repo    : Debian 8.7 (jessie) 

jdupes  finds and performs actions upon duplicate files (man)
Path    : ~/executable/jdupes
Version : 1.5.1 (2016-11-01)
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)

duff    duplicate file finder (man)
Path    : /usr/bin/duff
Version : 0.5.2
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Repo    : Debian 8.7 (jessie)

Best wishes ... cheers, drl

To reduce the ls invocations, you could apply a small adaptation:

for i in $(ls -r TEST/*)
      do for a in  TEST/*
           do    [ "$i" = "$a" ] && break
                 cmp -s "$i" "$a" && echo "Matches the $i and $a"
           done
      done