I am a beginner in Unix, but I have been asked to write a script that filters (removes duplicates from) data in a .dat file. The file is very large, containing billions of records.
The contents of the file look like this:
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40343369,OTC,mart_rec,99, ,0
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0
The first, second, and third fields together constitute a primary key.
Thus, from these entries:
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40342424,OTC,mart_rec,100, ,0
only the first one is valid (the complete line may or may not be duplicated).
Similarly, from these,
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40343369,OTC,mart_rec,99, ,0
only the first entry is valid, i.e.,
30002157,40343369,OTC,mart_rec,95, ,0
I need to write a script that creates a file (by manipulating the input file) containing:
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40345665,OTC,mart_rec,100, ,0
Only the first occurrence of each key combination is kept; the rest are ignored. Thus, I cannot even sort the file, because sorting might place a second occurrence of a combination before the first one.
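A one-pass approach I have been considering (just a sketch, assuming POSIX awk is available; the file names input.dat and output.dat are placeholders) would keep only the first line seen for each key of fields 1-3, preserving the original order without sorting:

```shell
# Build a small sample in the same layout as the real .dat file
# (hypothetical file name; the real file is much larger).
cat > input.dat <<'EOF'
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,95, ,0
30002157,40342424,OTC,mart_rec,98, ,0
30002157,40343369,OTC,mart_rec,99, ,0
30002157,40342424,OTC,mart_rec,100, ,0
30002157,40343369,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0
30002157,40345665,OTC,mart_rec,100, ,0
EOF

# Keep only the first line for each (field1, field2, field3) key.
# The seen[] array counts how often a key has appeared; the pattern
# is true (line printed) only when the count was zero before this line.
awk -F',' '!seen[$1 FS $2 FS $3]++' input.dat > output.dat

cat output.dat
```

My concern is memory: awk holds one array entry per unique key, so with billions of records this only works if the number of distinct keys fits in RAM.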
I would be grateful if any of you could advise me on how to do this.
I hope I have explained the problem clearly.