I have a 5 GB input file that contains duplicate records, and I have to remove the duplicates while retaining the first instance of each record.
Duplicates have to be identified based on 5 fields.
Could you please help me write a Unix script for this?
Thanks
Asim
What are the 5 fields and what's the field separator?
It's a positional file, so the fields have to be picked out by their position.
The 5 fields are FirstName, LastName, DOB, ZipCode, IdentificationID.
Field separator is "|".
Sorry, I'm confused. Does this mean these 5 fields can be anywhere in the file, or are they the first 5 fields of each record?
Could you please post some sample data?
These 5 fields are present in every record, at fixed positions. Ex:
EmNo. FirstName LastName MidName DOB Gender ZipCode IdentificationID.
123456 John Kerry M 26051952 M 760012 123456789
135628 John Kerry K 26051952 M 760012 123456789
789456 Alex stewart M 27071972 M 235612 987654321
986542 John Kerry L 26051952 M 760012 123456789
My output should be:
123456 John Kerry M 26051952 M 760012 123456789
789456 Alex stewart M 27071972 M 235612 987654321
Thanks!!
Try this:
awk '!A[$2$3$5$7$8]++' filename
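One thing to watch with the command above: `$2$3$5$7$8` glues the fields together with nothing in between, so two different records could in principle produce the same key (for example "ab" + "c" and "a" + "bc"). A slightly safer sketch, assuming the whitespace-separated sample from this thread, is to put commas in the subscript so awk joins the fields with its SUBSEP character. The file names `sample.txt` and `deduped.txt` are only illustrative.

```shell
# Sample data from the thread, written to a scratch file (name is illustrative).
cat > sample.txt <<'EOF'
123456 John Kerry M 26051952 M 760012 123456789
135628 John Kerry K 26051952 M 760012 123456789
789456 Alex stewart M 27071972 M 235612 987654321
986542 John Kerry L 26051952 M 760012 123456789
EOF

# The commas make awk join the fields with SUBSEP, so field boundaries
# stay part of the key. Only the first record for each key is printed.
awk '!A[$2,$3,$5,$7,$8]++' sample.txt > deduped.txt

cat deduped.txt
```

On this sample the result should be the two records shown as the desired output earlier in the thread (EmNo. 123456 and 789456).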
Thanks Vidya,
The script you have given works only for records whose fields are separated by white space.
Since my file is positional, it may not always contain white space between fields, so please help me write a script for this scenario.
0033912087101ASIM JOHN W19210403MEE4101 W ILES AVE APT 2215 SPRINGFIELD IL62711-0000NN
This is one sample record. We are eliminating duplicates based on these 5 fields and their respective positions:
FIELD - POSITION (START POS-END POS)
CUSTNO - 3-11
LNAME - 14-30
FNAME - 44-59
DOB - 60-68
ZIP - 153-163
Please help me write the script based on the above scenario.
In that case you can try the command below:
awk '!A[substr($0,3,9)substr($0,14,17)substr($0,44,16)substr($0,60,9)substr($0,153,11)]++' filename
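Each substr($0,start,length) call takes a start position and a length, so for example 14-30 becomes substr($0,14,17). If the positions change, feel free to adjust these values.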
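If the layout is likely to change, it may be easier to keep the positions in one place. Below is a minimal sketch of the same idea with the key built in a small awk function. The function name `key`, the array name `seen`, the shell helper `mkrec` and the file names are all made up for this illustration, and the test records are padded dummies that only follow the positions listed above.

```shell
# mkrec is a hypothetical helper that pads its arguments into the layout
# from this thread: cols 1-2 prefix, 3-11 CUSTNO, 12-13 filler, 14-30 LNAME,
# 31-43 non-key data, 44-59 FNAME, 60-68 DOB, 69-152 filler, 153-163 ZIP.
mkrec() {
    printf '%-2s%-9s%-2s%-17s%-13s%-16s%-9s%-84s%-11s\n' \
        "$1" "$2" "" "$3" "$4" "$5" "$6" "" "$7"
}

{
    mkrec 00 339120871 ASIM X JOHN 19210403M 62711-0000   # first instance
    mkrec 00 339120871 ASIM Y JOHN 19210403M 62711-0000   # same key, differs only in cols 31-43
    mkrec 00 339120872 ASIM X JOHN 19210403M 62711-0000   # different CUSTNO
} > fixed.txt

awk '
# Build the dedup key from the fixed column ranges (start, length).
function key(rec,    k) {
    k = substr(rec, 3, 9)              # CUSTNO 3-11
    k = k substr(rec, 14, 17)          # LNAME  14-30
    k = k substr(rec, 44, 16)          # FNAME  44-59
    k = k substr(rec, 60, 9)           # DOB    60-68
    k = k substr(rec, 153, 11)         # ZIP    153-163
    return k
}
!seen[key($0)]++                       # print only the first record per key
' fixed.txt > fixed_dedup.txt

wc -l < fixed_dedup.txt                # 2 records remain
```

Keep in mind that awk holds one array entry per distinct key in memory, so for a 5 GB file the memory needed grows with the number of unique records, not with the file size as such.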
Thanks Vidya,
It's working fine.
I have another scenario: what if the input file doesn't have any duplicate records? How should I handle that?
Do I have to handle any errors?
If the file doesn't have any duplicate records, I don't think you have to handle any error messages; the command simply prints every record unchanged. Again, it depends on your requirements.
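If your requirements do call for knowing whether anything was removed, one possible approach is to let awk count the dropped records and report the total on stderr. This is only a sketch: the counter name `dups`, the message text and the file names are assumptions, and it uses the whitespace-separated sample layout for brevity. For the positional file, the subscript would be the substr() expression instead.

```shell
# Two records with different keys, so nothing should be dropped.
cat > nodup.txt <<'EOF'
123456 John Kerry M 26051952 M 760012 123456789
789456 Alex stewart M 27071972 M 235612 987654321
EOF

awk '
!A[$2,$3,$5,$7,$8]++ { print; next }    # first occurrence: keep it
{ dups++ }                               # repeat of a known key: count and drop
END {
    # Piping to "cat 1>&2" is a portable way to write to stderr from awk.
    printf "%d duplicate record(s) removed\n", dups | "cat 1>&2"
}
' nodup.txt > nodup_out.txt 2> dedup.log

cat dedup.log
cmp -s nodup.txt nodup_out.txt && echo "output identical to input"
```

With no duplicates in the input, the count is 0 and the output file is the same as the input, so there is nothing to treat as an error.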
Regards,
Vidya
Thanks Vidya,
Can you please explain what this means (its functionality): awk '!A
awk '!A[substr($0,3,9)substr($0,14,17)substr($0,44,16)substr($0,60,9)substr($0,153,11)]++' filename
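As a rough explanation of that idiom: `A` is an awk associative array, and the text inside `[...]` is the key built from the five substr() pieces. `A[key]++` returns the old count for that key (empty, treated as 0, the first time) and then increments it. The `!` turns "0" into true, and a true pattern with no action makes awk print the record. So only the first record with a given key gets printed. The sketch below spells this out on a toy input where the whole line is the key; the variable name `seen_before` and the file names are just for illustration.

```shell
# Toy input: the whole line is used as the key.
printf 'a\nb\na\nc\nb\n' > toy.txt

# Short form, same shape as the command in the thread.
awk '!A[$0]++' toy.txt > short.txt

# Expanded form that does the same thing step by step.
awk '{
    seen_before = A[$0]      # unset entries read as 0, meaning "not seen yet"
    A[$0]++                  # remember that this key has now been seen
    if (!seen_before)        # true only for the first occurrence
        print
}' toy.txt > expanded.txt

cat short.txt
cmp -s short.txt expanded.txt && echo "both forms give the same output"
```

Both forms should print a, b and c once each, in the order they first appear.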