Sort, Uniq, Duplicates

Input File is :
-------------
25060008,0040,03,
25136437,0030,03,
25069457,0040,02,
80303438,0014,03,1st
80321837,0009,03,1st
80321977,0009,03,1st
80341345,0007,03,1st
84176527,0047,03,1st
84176527,0047,03,
20000735,0018,03,1st
25060008,0040,03,

I am using the following in the script :
------------------------------------
cat InputFile | sort -t, -k1,2 | uniq -d > "Duplicates"

This gets 25060008,0040,03, into the Duplicates file.
But I also want 84176527,0047,03, in the Duplicates file.

Basically I want the script to sort on the first two fields (comma-delimited), and whenever two records share the same first two fields, write them to the "Duplicates" file.
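For context, the reason the original pipeline misses that pair: `uniq -d` compares entire lines, and the two 84176527 records differ in their 4th field. A minimal demonstration:

```shell
# uniq -d reports a line only when the ENTIRE line repeats.
# These two records share fields 1-2 but differ in field 4,
# so uniq -d prints nothing for them.
printf '%s\n' '84176527,0047,03,1st' '84176527,0047,03,' | sort | uniq -d
```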

Please guide.

Try this:

sort -t, -k1,2 InputFile | awk -F, '{ if ((key=$1 "," $2)==prv_key) print; prv_key=key}' > "Duplicates"
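The one-liner above prints the second and later occurrences of each duplicated key. If every occurrence should land in the file, including the first, one possible variant is a two-pass awk (a sketch, reading the input twice; `InputFile` and `Duplicates` are the names used in this thread):

```shell
# Pass 1 (NR==FNR): count each two-field key.
# Pass 2: print every line whose key occurred more than once,
# preserving the original input order.
awk -F, 'NR==FNR { cnt[$1 FS $2]++; next }
         cnt[$1 FS $2] > 1' InputFile InputFile > Duplicates
```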

Jean-Pierre.

25060008,0040,03,

This is the only line that is duplicated.

In the above sample of records only the third field is common '03'
and not the first or the second field.

How would you expect that to be termed as duplicates based on two fields? :)

Hi MatrixMadhan,
Please look at the inputfile :
84176527,0047,03,1st
84176527,0047,03,
Is a duplicate record if I want to sort on 1st and 2nd field.

I resolved the issue with:
sort -t, -k1,2 -u inputfile > unq
sort -t, -k1,2 inputfile > non-unq
comm -23 non-unq unq > duplicates
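For anyone following along, here is what that `comm` step does: column 1 of comm's output holds lines found only in the first file, and `-23` suppresses the other two columns. Since `sort -u` keeps one line per key, what remains is exactly the "extra" occurrences beyond the first. A tiny self-contained demo:

```shell
# non-unq has two copies of "a"; unq has one.
# comm -23 leaves the surplus copy -- the duplicate occurrence.
printf 'a\na\nb\n' > non-unq
printf 'a\nb\n'    > unq
comm -23 non-unq unq
```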

MatrixMadhan, Jean-Pierre : Thanks.

Thanks.

awk -F"," '{ key = $1 "," $2      # composite key: first two fields
             line[key] = $0       # remember a line for this key
             arr[key]++           # count occurrences of the key
           }
END{     for (i in arr) {
            if ( arr[i] > 1 ){
               print line[i] > "duplicates"
            }
         }
 }' file
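One caveat with the hash approach above: the array stores only one representative line per key, so only one of each duplicate pair is written out. If all occurrences are wanted, a possible variant (a sketch, output in arbitrary hash order) accumulates the lines per key:

```shell
# Accumulate every line under its two-field key, then emit all
# lines belonging to keys seen more than once.
awk -F, '{ k = $1 "," $2
           lines[k] = lines[k] $0 "\n"
           cnt[k]++ }
     END { for (k in cnt) if (cnt[k] > 1) printf "%s", lines[k] }' file > duplicates
```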