Remove duplicates based on the two key columns

kmsekhar · October 21, 2010, 4:54am

Hi All,
I needs to fetch unique records based on a keycolumn(ie., first column1) and also I needs to get the records which are having max value on column2 in sorted manner... and duplicates have to store in another output file.

Input :

Input.txt
1234,0,x
1234,1,y
5678,10,z
9999,10,k
5678,9,l

Desired Output:

Duplicates.txt
1234,0,x
5678,9,l

Uniqrecords.txt
1234,1,y
5678,10,z
9999,10,k

Regards,
MuniSekhar

Thanks in Advance....

Scrutinizer · October 21, 2010, 5:22am

Hi, what have you tried so far?

kmsekhar · October 21, 2010, 5:32am

earlier i am deleting all occurences on key column1 and stored into seperate file duplicates & uniq records, for that below

sort -t\| -k1 input1.txt|awk '{
x[$1]++
y[NR] = $0
} END {
for(i=1; i<=NR; i++)
{
tmp = y
[i]split(tmp,z)
print tmp> ((x[z[1]]>1) ? "output.txt" : "output2.txt")
}
}' SUBSEP="|" FS="|"

Scrutinizer · October 21, 2010, 5:39am

Well you certainly had the right idea. I noticed that you are using a | as field separator for sort, while the file is comma delimited. Also you should sort on the 2nd field I think if you want to keep the latest max value. If you reverse sort then you can get the max value first so the rest can go in the duplicates bin.

For the awk part you did not use a file separator (FS) which should be set to ',' then you do not need to use the split function. SUBSEP is used for 2-dimensional arrays and is not required here.

I would suggest something like this:

sort -t, -k2,2rn input.txt |
awk -F, '{print > ((A[$1]++)?"Duplicates.txt":"Uniqrecords.txt")}'

ctsgnb · October 21, 2010, 6:24am

OMG this is a tricky nice one!
ok in awk, when i see question about duplicate, i should think using ++

... question : in awk ---> "" + 1 = 0 ????

indeed :

# sort -t, -k2rn infile
5678,10,z
9999,10,k
5678,9,l
1234,1,y
1234,0,x
# sort -t, -k2,2rn infile | nawk -F, '{print A[$1]}'





# sort -t, -k2,2rn infile | nawk -F, '{print A[$1]++}'
0
0
1
0
1
#

Scrutinizer · October 21, 2010, 6:33am

Not exactly. Variables are initialized as empty string, and become zero if converted to a number.
The ++ gets done after the value gets printed. Compare:

$ sort -t, -k2,2rn input.txt | nawk -F, '{print A[$1]++,A[$1]}'
0 1
0 1
1 2
0 1
1 2

and

$ sort -t, -k2,2rn input.txt | nawk -F, '{print ++A[$1]}'
1
1
2
1
2

ctsgnb · October 21, 2010, 11:04am

@Scru1Linizer

Damned ! i missunderstood (once more ...) ... Thx for clarifiying again my fuzzy brain Dude !

Just tell me when you'll be tired of my question and i will just stfu ...

Scrutinizer · October 21, 2010, 11:12am

No worries, you are eager to learn and you ask good questions. Nothing wrong with that, is there?