Duplicate removal

duplicate · May 13, 2013, 6:18am

I have a 5 GB input file that contains duplicate records, and I have to remove the duplicates while retaining the first instance of each record.

The duplicates have to be identified based on 5 fields.

Could you please help me write a Unix script for this?

Thanks
Asim

pamu · May 13, 2013, 6:19am

try

awk '!A[$0]++' file
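
For anyone reading along: this compares whole lines and keeps the first copy of each, in the original order, with no sorting needed. A small sketch of it wrapped in a shell function follows; the name dedup_whole_line is made up for illustration. Note that awk keeps one array entry per distinct line in memory, so on a 5 GB file with mostly unique lines the memory use could be large.

```shell
# dedup_whole_line is a hypothetical helper name, not from the original post.
# Prints each line of the file given as $1 only the first time it appears.
dedup_whole_line() {
    awk '!A[$0]++' "$1"
}
```

Usage would look like: dedup_whole_line file > file.dedup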

vidyadhar85 · May 13, 2013, 6:38am

What are the 5 fields, and what's the field separator?

duplicate · May 13, 2013, 8:10am

It's a positional file, so the fields have to be picked out by their position.

The 5 fields are FirstName, LastName, DOB, ZipCode, IdentificationID.

Field separator is "|".

vidyadhar85 · May 13, 2013, 8:13am

Sorry, I'm confused. Does this mean these 5 fields can be anywhere in the file, or are they the first 5 fields of each record?

Could you please post some sample data?

duplicate · May 14, 2013, 6:15am

These 5 fields are in every record, at fixed positions. Example:

EmNo. FirstName LastName MidName DOB Gender ZipCode IdentificationID.

123456 John Kerry   M    26051952 M 760012 123456789
135628 John Kerry    K    26051952 M 760012 123456789
789456 Alex stewart  M  27071972 M 235612 987654321
986542 John Kerry    L    26051952 M 760012 123456789

My output should be:

123456 John Kerry   M 26051952 M 760012 123456789
789456 Alex stewart  M 27071972 M 235612 987654321

Thanks!!

vidyadhar85 · May 14, 2013, 6:27am

try..

awk '!A[$2$3$5$7$8]++'  filename
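
One caveat with gluing the fields together directly: a key built as $2$3 cannot tell "AB" followed by "C" from "A" followed by "BC". A hedged variant is to separate the key parts with commas, which awk joins using its SUBSEP character, so the field boundaries stay distinct. The function name dedup_by_fields is made up, and the field numbers assume the whitespace-separated layout of the sample above.

```shell
# dedup_by_fields is a hypothetical helper name.
# Key on FirstName, LastName, DOB, ZipCode, IdentificationID ($2,$3,$5,$7,$8);
# the commas make awk insert SUBSEP between the parts of the key.
dedup_by_fields() {
    awk '!A[$2,$3,$5,$7,$8]++' "$1"
}
```

On the four sample records this should keep the first John Kerry line and the Alex stewart line.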

duplicate · May 14, 2013, 7:33am

Thanks Vidya,

The script you gave works only for records whose fields are separated by white space.

Since my file is positional, it may not always contain white space between fields, so please help me write a script for this scenario.

0033912087101ASIM                          JOHN           W19210403MEE4101 W ILES AVE APT 2215                                    SPRINGFIELD         IL62711-0000NN

This is one sample record. We are eliminating duplicates based on these 5 fields and their respective positions:

FIELD     POSITION (START POS - END POS)
CUSTNO    3-11
LNAME     14-30
FNAME     44-59
DOB       60-68
ZIP       153-163

Please help me write the script based on the above scenario.

vidyadhar85 · May 14, 2013, 8:12am

In that case you can try the below:

awk '!A[substr($0,3,9)substr($0,14,17)substr($0,44,16)substr($0,60,9)substr($0,153,11)]++' filename

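If the positions change, feel free to adjust it. Each substr($0, start, length) takes the start position and the field length (end - start + 1), not the end position.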
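
If the positions are likely to change, it may be easier to keep them in one place. Below is a sketch of the same idea with the key built from a list of start,length pairs. The function name dedup_fixed_width and the format of the spec string are my own assumptions for illustration, not anything from the file layout.

```shell
# dedup_fixed_width FILE SPEC   (hypothetical helper)
# SPEC is a space-separated list of "start,length" pairs, a format invented
# for this sketch, e.g. "3,9 14,17 44,16 60,9 153,11".
dedup_fixed_width() {
    awk -v spec="$2" '
        BEGIN { n = split(spec, part, " ") }
        {
            # Build the key from each fixed-width slice of the record.
            key = ""
            for (i = 1; i <= n; i++) {
                split(part[i], p, ",")
                key = key substr($0, p[1], p[2]) SUBSEP
            }
        }
        !A[key]++
    ' "$1"
}
```

For the layout above the call would be: dedup_fixed_width filename "3,9 14,17 44,16 60,9 153,11"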

duplicate · May 14, 2013, 10:12am

Thanks Vidya,

It's working fine.

I have one more scenario: what if the input file doesn't have any duplicate records? How should I handle that, and do I have to handle any errors?

vidyadhar85 · May 15, 2013, 12:12am

If the file doesn't have any duplicate records, I don't think you have to handle any error messages. Again, it depends on your requirements.
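
With no duplicates, the command simply prints every record and exits normally, so there is nothing to trap. If you do want to know whether anything was dropped, a sketch like the one below writes a count to standard error. The function name and the message wording are only illustrative.

```shell
# dedup_report is a hypothetical helper name.
# Prints the de-duplicated records on stdout and a summary line on stderr.
dedup_report() {
    awk '
        {
            key = substr($0,3,9) substr($0,14,17) substr($0,44,16) substr($0,60,9) substr($0,153,11)
        }
        A[key]++ { dropped++; next }   # key seen before: count it and skip the record
        { print }                      # first record with this key: keep it
        END { printf "%d duplicate record(s) removed\n", dropped | "cat 1>&2" }
    ' "$1"
}
```

A count of 0 then just means the output is identical to the input.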

Regards,
Vidya

duplicate · May 15, 2013, 3:24am

Thanks Vidya,

Can you please explain what this part means (its functionality): awk '!A

awk '!A[substr($0,3,9)substr($0,14,17)substr($0,44,16)substr($0,60,9)substr($0,153,11)]++' filename
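
For reference, here is a sketch of the same logic written out long-hand, which may make the short form easier to read. A is just an array name chosen by the poster. A[key]++ yields the count before incrementing, so it is 0 (false) the first time a key appears, and the leading ! turns that into true, which makes awk print the record. The function name dedup_longhand is made up for this example.

```shell
# dedup_longhand is a hypothetical helper name; it is meant to behave
# like the one-liner above, only spelled out step by step.
dedup_longhand() {
    awk '
        {
            # Build the lookup key from the five fixed-width fields.
            key = substr($0,3,9) substr($0,14,17) substr($0,44,16) substr($0,60,9) substr($0,153,11)
            if (A[key] == 0) {   # A[key] is empty (0) until this key has been seen
                print            # first record with this key: keep it
            }
            A[key]++             # remember that the key has now been seen
        }
    ' "$1"
}
```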