Remove duplicates based on a field's value

Hi All,

I have a text file with three columns. I'd like a simple script that removes lines where column 1 has duplicate entries, keeping the line with the largest value in column 3. For example:

Input file:

12345a rerere.rerere   len=23
11111c fsdfdf.dfsdfdsf   len=33 
22222a fds.fdsfdff.dsfdsf len=43
33333a ffffffffffff.ffff    len=53
33333a ererfdggg.g     len=55
33333a wewew.e        len=23
44444a  e.vv.ffffffffff    len=22

Output file:

12345a rerere.rerere   len=23
11111c fsdfdf.dfsdfdsf   len=33 
22222a fds.fdsfdff.dsfdsf len=43
33333a ererfdggg.g     len=55
44444a  e.vv.ffffffffff    len=22

Any help is appreciated!

try:

awk '
{n=$0;sub(".*=", "", n);n+=0}
!a[$1] {mx[$1]=n}
{a[$1]=$0; if (n>=mx[$1]) {mx[$1]=n; o[$1]=$0}}
END {for (i in o) print o[i]}
' input
awk -F'[ =]' '
  !($1 in a) || l[$1]<$NF {a[$1]=$0;l[$1]=$NF}
  END {
    for (i in a)
        print a[i]
  }' myFile

It worked! Thanks rdrtx1!
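One caveat with both answers: awk's `for (i in a)` loop walks array keys in an unspecified order, so the output lines may not come out in the same order as the input. Here is a sketch of the same idea that also records the order in which each key was first seen and prints in input order (the file name input.txt and the sample data are just for the demo; it assumes whitespace-separated columns with a numeric value after "="):

```shell
# Recreate the sample input from the question.
cat > input.txt <<'EOF'
12345a rerere.rerere   len=23
11111c fsdfdf.dfsdfdsf   len=33
22222a fds.fdsfdff.dsfdsf len=43
33333a ffffffffffff.ffff    len=53
33333a ererfdggg.g     len=55
33333a wewew.e        len=23
44444a  e.vv.ffffffffff    len=22
EOF

awk '
{
    n = $0
    sub(".*=", "", n)                      # strip everything up to the "="
    n += 0                                 # force a numeric comparison
    if (!($1 in mx)) order[++cnt] = $1     # remember first-seen key order
    if (!($1 in mx) || n > mx[$1]) { mx[$1] = n; best[$1] = $0 }
}
END { for (i = 1; i <= cnt; i++) print best[order[i]] }
' input.txt
```

With the sample input above this prints the five desired lines in their original order. Note that `n > mx[$1]` keeps the first line on ties; use `>=` to keep the last.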