Sort, Uniq, Duplicates

Input File is :
-------------
25060008,0040,03,
25136437,0030,03,
25069457,0040,02,
80303438,0014,03,1st
80321837,0009,03,1st
80321977,0009,03,1st
80341345,0007,03,1st
84176527,0047,03,1st
84176527,0047,03,
20000735,0018,03,1st
25060008,0040,03,

I am using the following in the script :
------------------------------------
cat InputFile | sort -t, -k1,2 | uniq -d > "Duplicates"

This gets 25060008,0040,03, into the Duplicates file.
But I also want 84176527,0047,03, in the Duplicates file.

Basically I want the script to sort on the first two fields (comma-delimited), and whenever two records share the same first two fields, write them to the "Duplicates" file.
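For context, the reason the original pipeline misses that pair: `uniq -d` compares entire lines, and the two 84176527 records differ in their 4th field. A minimal demonstration:

```shell
# uniq -d reports a line only when the ENTIRE line repeats.
# These two records share fields 1-2 but differ in field 4,
# so uniq -d prints nothing for them.
printf '%s\n' '84176527,0047,03,1st' '84176527,0047,03,' | sort | uniq -d
```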

Please guide.

Try this:

sort -t, -k1,2 InputFile | awk -F, '{ if ((key=$1 "," $2)==prv_key) print; prv_key=key}' > "Duplicates"
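The one-liner above prints the second and later occurrences of each duplicated key. If every occurrence should land in the file, including the first, one possible variant is a two-pass awk (a sketch, reading the input twice; `InputFile` and `Duplicates` are the names used in this thread):

```shell
# Pass 1 (NR==FNR): count each two-field key.
# Pass 2: print every line whose key occurred more than once,
# preserving the original input order.
awk -F, 'NR==FNR { cnt[$1 FS $2]++; next }
         cnt[$1 FS $2] > 1' InputFile InputFile > Duplicates
```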

Jean-Pierre.

25060008,0040,03,

This is the only line that is duplicated.

In the above sample of records only the third field is common '03'
and not the first or the second field.

How would you expect that to be termed as duplicates based on two fields? :)

Hi MatrixMadhan,
Please look at the inputfile :
84176527,0047,03,1st
84176527,0047,03,
Is a duplicate record if I want to sort on 1st and 2nd field.

I resolved the issue with:
sort -t, -k1,2 -u inputfile > unq
sort -t, -k1,2 inputfile > non-unq
comm -23 non-unq unq > duplicates
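For anyone following along, here is what that `comm` step does: column 1 of comm's output holds lines found only in the first file, and `-23` suppresses the other two columns. Since `sort -u` keeps one line per key, what remains is exactly the "extra" occurrences beyond the first. A tiny self-contained demo:

```shell
# non-unq has two copies of "a"; unq has one.
# comm -23 leaves the surplus copy -- the duplicate occurrence.
printf 'a\na\nb\n' > non-unq
printf 'a\nb\n'    > unq
comm -23 non-unq unq
```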

MatrixMadhan, Jean-Pierre : Thanks.

Thanks.

awk -F"," '{ key = $1 "," $2      # composite key: first two fields
             line[key] = $0       # remember a line for this key
             arr[key]++           # count occurrences of the key
           }
END{     for (i in arr) {
            if ( arr[i] > 1 ){
               print line[i] > "duplicates"
            }
         }
 }' file
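One caveat with the hash approach above: the array stores only one representative line per key, so only one of each duplicate pair is written out. If all occurrences are wanted, a possible variant (a sketch, output in arbitrary hash order) accumulates the lines per key:

```shell
# Accumulate every line under its two-field key, then emit all
# lines belonging to keys seen more than once.
awk -F, '{ k = $1 "," $2
           lines[k] = lines[k] $0 "\n"
           cnt[k]++ }
     END { for (k in cnt) if (cnt[k] > 1) printf "%s", lines[k] }' file > duplicates
```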