Awk: Remove Duplicates

siramitsharma · January 23, 2014, 1:52am

I have the following code for removing duplicate records from inputfile, based on the first eight fields. The first awk appends the duplicate records to the Duplicates file; the second awk writes the non-duplicate entries to a tmp file, which I then move over the original file.

Requirement:
Can both awk commands be combined into a single call, or is there a more efficient way to do the same?

awk -F, 'dupentries[$1,$2,$3,$4,$5,$6,$7,$8]++' inputfile >> Duplicates
awk -F, '!dupentries[$1,$2,$3,$4,$5,$6,$7,$8]++' inputfile > inputfile.tmp
mv inputfile.tmp inputfile

Don_Cragun · January 23, 2014, 2:30am

Try:

awk -F, 'dupentries[$1,$2,$3,$4,$5,$6,$7,$8]++ {print > "Duplicates"; next};print' inputfile > inputfile.tmp
mv inputfile.tmp inputfile

siramitsharma · January 23, 2014, 4:22am

Hi Don,
It gives the following syntax error at "print":

awk: dupentries[$1,$2,$3,$4,$5,$6,$7,$8]++ {print > "Duplicates"; next}; print
awk:                                                                     ^ syntax error

inputfile

24253886,1,9137,179274,20140111000049,1,N,,0,928678,67340,C2506Qkz,533,SSCHHA01S201401110005000000.PDSN,0,MB
24253886,1,9137,179274,20140111000049,0,N,,0,0,0,C2506Qkz,336,SSCHHA01S201401110005000000.PDSN,0,MB
24253886,1,9137,179274,20140111000049,0,N,,0,0,0,C2506Qkz,335,SSCHHA01S201401110005000000.PDSN,0,MB
24253886,1,9137,179274,20140111000049,1,N,,0,5589,7171,C2506Qkz,534,SSCHHA01S201401110005000000.PDSN,0,MB
24253886,1,9137,179274,20140111000049,0,N,,0,0,0,C2506Qkz,338,SSCHHA01S201401110005000000.PDSN,0,MB
24253886,1,9137,179274,20140111000049,0,N,,0,0,0,C2506Qkz,334,SSCHHA01S201401110005000000.PDSN,0,MB
4000050706,1,9137,275541,20140111000411,10,N,,0,8246472,1791142,C2706RXa,533,SSCHHA01S201401110005000000.PDSN,0,MB
4000050706,1,9137,275541,20140111000411,1,N,,0,344071,105732,C2706RXa,534,SSCHHA01S201401110005000000.PDSN,0,MB
4000050706,1,9137,275541,20140111001259,10,N,,0,6171716,4289817,C2706RZV,533,SSCHHA01S201401110015000002.PDSN,0,MB
4000050706,1,9137,275541,20140111001259,1,N,,0,17662,9883,C2706RZV,534,SSCHHA01S201401110015000002.PDSN,0,MB

RavinderSingh13 · January 23, 2014, 4:26am

Hello,

Just use {print} in place of the bare print.
It should work then.

Thanks,
R. Singh

Franklin52 · January 23, 2014, 4:27am

Should be:

awk -F, 'dupentries[$1,$2,$3,$4,$5,$6,$7,$8]++ {print > "Duplicates"; next}{print}' inputfile > inputfile.tmp
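For anyone who wants to try the single-pass approach on a small case, here is a minimal sketch. The sample data, the three-field key, the array name seen, the variable dups and the temporary directory are all made up for illustration; the real file uses the first eight fields as the key. It also uses >> inside awk, on the assumption that Duplicates should be appended to across runs as in the original two-command version.

```shell
# Hypothetical demo: 3 key fields instead of 8, throwaway files in a temp directory.
workdir=$(mktemp -d)

printf '%s\n' 'a,1,x,first' 'b,2,y,first' 'a,1,x,second' 'a,1,x,third' > "$workdir/inputfile"

# seen[key]++ is 0 (false) the first time a key appears and non-zero afterwards,
# so repeated records are appended to the duplicates file and skipped with next;
# every other record falls through to the {print} rule and goes to standard output.
awk -F, -v dups="$workdir/Duplicates" \
    'seen[$1,$2,$3]++ {print >> dups; next} {print}' \
    "$workdir/inputfile" > "$workdir/inputfile.tmp" &&
mv "$workdir/inputfile.tmp" "$workdir/inputfile"

cat "$workdir/inputfile"     # a,1,x,first and b,2,y,first
cat "$workdir/Duplicates"    # a,1,x,second and a,1,x,third
```

Inside awk, print > "file" truncates the file the first time it is opened during that run, while print >> "file" appends to whatever is already there, so pick whichever matches how Duplicates is meant to accumulate. The bare print after the closing brace fails because awk expects each rule to be a pattern, an action in braces, or both; print is a statement, so it has to sit inside {}.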