Remove duplicates based on the two key columns

Hi All,
I needs to fetch unique records based on a keycolumn(ie., first column1) and also I needs to get the records which are having max value on column2 in sorted manner... and duplicates have to store in another output file.

Input :

Input.txt
1234,0,x
1234,1,y
5678,10,z
9999,10,k
5678,9,l

Desired Output:

Duplicates.txt
1234,0,x
5678,9,l

Uniqrecords.txt
1234,1,y
5678,10,z
9999,10,k

Regards,
MuniSekhar

Thanks in Advance....

Hi, what have you tried so far?

earlier i am deleting all occurences on key column1 and stored into seperate file duplicates & uniq records, for that below

sort -t\| -k1 input1.txt|awk '{
x[$1]++
y[NR] = $0
} END {
for(i=1; i<=NR; i++)
{
tmp = y
[i]split(tmp,z)
print tmp> ((x[z[1]]>1) ? "output.txt" : "output2.txt")
}
}' SUBSEP="|" FS="|"

Well you certainly had the right idea. I noticed that you are using a | as field separator for sort, while the file is comma delimited. Also you should sort on the 2nd field I think if you want to keep the latest max value. If you reverse sort then you can get the max value first so the rest can go in the duplicates bin.

For the awk part you did not use a file separator (FS) which should be set to ',' then you do not need to use the split function. SUBSEP is used for 2-dimensional arrays and is not required here.

I would suggest something like this:

sort -t, -k2,2rn input.txt |
awk -F, '{print > ((A[$1]++)?"Duplicates.txt":"Uniqrecords.txt")}'
1 Like

OMG this is a tricky nice one!
ok in awk, when i see question about duplicate, i should think using ++

... question : in awk ---> "" + 1 = 0 ????

indeed :

# sort -t, -k2rn infile
5678,10,z
9999,10,k
5678,9,l
1234,1,y
1234,0,x
# sort -t, -k2,2rn infile | nawk -F, '{print A[$1]}'





# sort -t, -k2,2rn infile | nawk -F, '{print A[$1]++}'
0
0
1
0
1
#

Not exactly. Variables are initialized as empty string, and become zero if converted to a number.
The ++ gets done after the value gets printed. Compare:

$ sort -t, -k2,2rn input.txt | nawk -F, '{print A[$1]++,A[$1]}'
0 1
0 1
1 2
0 1
1 2

and

$ sort -t, -k2,2rn input.txt | nawk -F, '{print ++A[$1]}'
1
1
2
1
2
1 Like

@Scru1Linizer

Damned ! i missunderstood (once more ...) ... Thx for clarifiying again my fuzzy brain Dude !

Just tell me when you'll be tired of my question and i will just stfu ...

:smiley:

No worries, you are eager to learn and you ask good questions. Nothing wrong with that, is there? :wink: