I am a beginner in Unix, but I have been asked to write a script that filters (removes duplicates from) data in a .dat file. The file is very large, containing billions of records.
The contents of the file look like this:
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40343369,OTC,mart_rec,99, ,0
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0
The first, second, and third fields together constitute a primary key.
Thus, from these entries:
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40342424,OTC,mart_rec,100, ,0
only the first one is valid (the complete line may or may not be duplicated).
Similarly, from these,
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40343369,OTC,mart_rec,99, ,0
only the first entry is valid, i.e.,
30002157,40343369,OTC,mart_rec,95, ,0
I need to write a script that creates a file (by manipulating the input file) containing:
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40345665,OTC,mart_rec,100, ,0
Only the first occurrence of each key combination is kept; the rest are ignored. Thus, I cannot even sort the file, because sorting might place a second occurrence of a combination before the first one.
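A one-pass approach I have been considering (just a sketch, assuming POSIX awk is available; the file names input.dat and output.dat are placeholders) would keep only the first line seen for each key of fields 1-3, preserving the original order without sorting:

```shell
# Build a small sample in the same layout as the real .dat file
# (hypothetical file name; the real file is much larger).
cat > input.dat <<'EOF'
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40343369,OTC,mart_rec,99, ,0
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0
EOF

# Keep only the first line for each (field1, field2, field3) key.
# The seen[] array counts how often a key has appeared; the pattern
# is true (line printed) only when the count was zero before this line.
awk -F',' '!seen[$1 FS $2 FS $3]++' input.dat > output.dat

cat output.dat
```

My concern is memory: awk holds one array entry per unique key, so with billions of records this only works if the number of distinct keys fits in RAM.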
I would be grateful if any of you could advise me on how to do this.
I hope I have explained the problem clearly.