When I read the question, I had in mind a solution using arrays like vgersh99's.
Then I tried to see whether it was easy to do without arrays, and that is the solution I posted.
vgersh99's solution is simpler and more readable.
I wanted to compare the performance of the two solutions on a large volume of data.
To do that, I adapted both solutions to count the duplicated file names on my system.
I built a file containing the list of all the files (field 1: directory path, field 2: file name).
The resulting file contains approximately 64000 duplicate file names.
# find / | sed 's!/\([^/]*\)$!/ \1!' > files.txt
# wc files.txt
534733 1069473 34359804 files.txt
# head -10 files.txt
/
/ lost+found
/ home
/home/ lost+found
/home/ guest
/home/guest/ .sh_history
/home/ gseyjr
/home/gseyjr/ .profile
/home/ usertest
/home/usertest/ .profile
#
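The sed command splits each path into two fields by inserting a space before the last path component. A quick check on a single made-up path shows the effect:

```shell
echo '/home/guest/.sh_history' | sed 's!/\([^/]*\)$!/ \1!'
# /home/guest/ .sh_history
```

Field 1 is the directory part (with its trailing slash) and field 2 is the bare file name, which is what the awk scripts compare.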
The solution with arrays:
$ cat dup1.sh
awk '
{
    # Accumulate all records sharing the same file name ($2)
    Files[$2] = ($2 in Files) ? Files[$2] ORS $0 : $0;
    FilesCnt[$2]++
}
END {
    for (f in Files) {
        if (FilesCnt[f] > 1) {
            print Files[f];
            duplicates++;
        }
    }
    print "\nDuplicates : " duplicates;
}
' files.txt
$ time dup1.sh > /dev/null
real 0m27.22s
user 0m26.74s
sys 0m0.40s
$
The solution without arrays:
The -T option of the sort command was required because there wasn't sufficient space available for work files on the current filesystem.
$ cat dup2.sh
sort -T /refiea/tmp -k2,2 files.txt |
awk '
BEGIN { first_duplicate = 1 }
{
    file = $2;
    if (file == prv_file) {
        # Same name as the previous record: duplicate found
        if (first_duplicate) {
            print prv_rec;
            duplicates++
        }
        print $0;
        first_duplicate = 0;
    } else {
        # New file name: remember the record
        prv_file = file;
        prv_rec = $0;
        first_duplicate = 1;
    }
}
END {
    print "Duplicates : " duplicates;
}
'
$ time dup2.sh > /dev/null
real 0m39.85s
user 0m2.92s
sys 0m0.10s
$
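The streaming logic can also be checked on a toy input (again, made-up paths). Because the input is sorted on field 2, duplicates arrive as consecutive lines and no array is needed:

```shell
printf '%s\n' '/a/ f1' '/c/ f2' '/b/ f1' |
sort -k2,2 |
awk '
BEGIN { first_duplicate = 1 }
{
    if ($2 == prv_file) {
        # Same name as the previous line: duplicate group
        if (first_duplicate) { print prv_rec; duplicates++ }
        print $0
        first_duplicate = 0
    } else {
        prv_file = $2
        prv_rec = $0
        first_duplicate = 1
    }
}
END { print "Duplicates : " duplicates }'
```

This prints the two `f1` records followed by `Duplicates : 1`, matching the array version.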
In fact, the sort alone takes more time to run than the complete solution with arrays.
$ time sort -T /refiea/tmp -k2,2 files.txt > /dev/null
real 33.06
user 32.28
sys 0.73
$
Conclusion:
Arrays win the contest.
Awk arrays are your friends.
They are easy to use and powerful.