awk questions

nua7 · March 25, 2013, 12:18pm

Hi All,
I have a file in the format shown below. I need to pick up only those lines which have the highest value in column 4 , based on column1.

The lines marked as bold should be the output. Any help is appreciated!

Input file :

-10.5|6|L|32|MPP||R||012009|E
-10.5|6|L|43|MPF|B|R||012009|E
-10.5|6|L|44|MPF||R||012009|E
-10.5|6|L|54|MR|V|R||012009|E
-10.5|6|L|55|MR|B|R||012009|E
-10.5|6|L|56|MR||R||012009|E
-10.5|6|P|08|MPP||R|F|012009|E
-10.5|6|P|13|MPF|B|R|F|012009|E
-10.5|6|P|14|MPF||R|F|012009|E
-10.5|6|P|22|MR|V|R|F|012009|E
-10.5|6|P|23|MR|B|R|F|012009|E
-10.5|6|P|24|MR||R|F|012009|E
-10.75|6|L|16|||P||012008|E
-10.75|6|L|18|||U||012008|E
-11.3|6|L|51|MR|V|P||012009|E
-11.3|6|P|19|MR|V|P|G|012009|E
-11.4|6|L|35|MR|B|R||IN2010|E
-11.85|6|L|41|MPF|B|P||012009|E
-11.85|6|L|42|MPF||P||012009|E
-11.85|6|L|52|MR|B|P||012009|E
-11.85|6|P|11|MPF|B|P|G|012009|E
-11.85|6|P|12|MPF||P|G|012009|E
-11.85|6|P|20|MR|B|P|G|012009|E
-11.95|6|L|53|MR||P||012009|E
-11.95|6|P|21|MR||P|G|012009|E
-11|6|L|17|||R||012008|E
-12.4|6|L|22|MR|V|R||IN2012|E
-12.4|6|L|34|MR|V|R||IN2010|E
-12.4|6|L|50|MR|V|R||IN2013|E
-12.7|6|L|23|MR|B|R||IN2011|E
-13.67|6|L|16|||P||012007|E
-13.67|6|L|17|||R||012007|E
-13.67|6|L|18|||U||012007|E
-14.92|7|P|09|MPP||P|I|012009|E
-14.93|5|L|25|MPP||P||012009|E
-14.9|5|L|19|MR|B|R||IN2012|E
-14.9|6|L|22|MR|V|R||IN2011|E
-14|6|L|24|MR||R||IN2012|E
-14|6|L|51|MR||R||IN2013|E
-15.14|6|L|16|||P||012006|E
-15.14|6|L|17|||R||012006|E
-15.14|6|L|18|||U||012006|E
-15.18|6|L|33|MPP|V|U||012009|E
-15.18|6|L|34|MPP|E|U||012009|E
-15.18|6|L|35|MPP|V|X||012009|E
-15.18|6|L|36|MPP|E|X||012009|E
-15.18|6|L|45|MPF|B|U||012009|E
-15.18|6|L|46|MPF||U||012009|E
-15.18|6|L|47|MPF|B|X||012009|E
-15.18|6|L|48|MPF||X||012009|E
-15.18|6|L|57|MR|V|U||012009|E

anbu23 · March 25, 2013, 12:38pm

What have you tried?

rangarasan · March 25, 2013, 12:38pm

Hi, Try this,

awk -F'\|' 'a[$1]>$4{a[$1]=$4;v[$1]=$0;}END{for(i in v){print v;}}' file

If your file is large then, use sort cmd to sort the file based on col1 and col4 and then, print the last match. Cheers, Ranga:-)

Yoda · March 25, 2013, 12:40pm

awk -F\| ' {
                if ( $1 in A )
                {
                        split (A[$1], R, "|")
                        if ( R[4] < $4 )
                                A[$1] = $0
                }
                else
                        A[$1] = $0
} END {
        for ( i in A )
                print A
} ' file | sort -nr

nua7 · March 25, 2013, 12:54pm

Thank you all for the replies. But the problem is that your code does not eliminate unwanted entries.

I can use reverse sort command to sort the entries based on 1st and 4th column , but then I will have to print only the first occurence of the string.

sort -ur  -k1,4 NPBR2.XTR > tst1

0||P|30|MR||R|S|IN2010|E
0||P|29|MR||R|K|IN2010|E
0||P|28|MR|B|R|S|IN2010|E
0||P|27|MR|B|R|K|IN2010|E
0||P|26|MR|V|R|S|IN2010|E
0||P|25|MR|V|R|K|IN2010|E
0||P|24|MR||P|T|IN2010|E
0||P|23|MR||P|L|IN2010|E
0||P|22|MR|B|P|T|IN2010|E
0||P|21|MR|B|P|L|IN2010|E
0||P|20|MR|V|P|T|IN2010|E
0||P|19|MR|V|P|L|IN2010|E
0||P|15|PDP||X|S|IN2010|E
0||P|14|PDP||X|K|IN2010|E
0||P|13|PDP||R|S|IN2010|E
0||P|12|PDP||R|K|IN2010|E
0||P|11|PDP||P|S|IN2010|E
0||P|10|PDP||P|K|IN2010|E
0|7|P|18|MR||R|H|IN2010|E
0|7|P|17|MR|B|R|H|IN2010|E
0|7|P|16|MR|V|R|H|IN2010|E
0|7|P|15|MR||P|I|IN2010|E
0|7|P|14|MR|B|P|I|IN2010|E
0|7|P|13|MR|V|P|I|IN2010|E
0|7|P|09|PDP||X|H|IN2010|E
0|7|P|08|PDP||R|H|IN2010|E
0|7|P|07|PDP||P|H|IN2010|E
0|7|L|56|MR||A||IN2013|E

rdrtx1 · March 25, 2013, 2:27pm

try also:

awk '
!col1[$1] {col1[$1]=$1; cnt++; mx[$1]=$4}
{if ($4>=mx[$4]) {mx[$1]=$4; out[cnt-1]=$0}}
END {for (i=0; i<cnt; i++) print out}' FS="|" infile

ctsgnb · March 25, 2013, 3:18pm

sort -t\| +0 -1 +3 -4 -r yourfile | awk -F\| '!a[$1]++'

+m Start at the first character of the m+1th field.
-n End at the last character of the nth field (if -N omitted, assume the end of the line).

Just to clarify, if you consider using keys (-k options) and if you consider the line :

-10.5|6|L|32|MPP||R||012009|E

using -k1,4 will use the following key for sorting :
-10.5|6|L|32|MPP||R||012009|E

using -k4 will use the following key for sorting :
-10.5|6|L|32|MPP||R||012009|E

using -k4,4 will use the following key for sorting :
-10.5|6|L|32|MPP||R||012009|E

using -k1,1 will use the following key for sorting :
-10.5|6|L|32|MPP||R||012009|E

using -k1 will use the following key for sorting :
-10.5|6|L|32|MPP||R||012009|E

If no end position is specified the end of line is assumed
so using key :

sort -t\| -k1,1 -k4,4 -r yourfile | awk -F\| '!a[$1]++'

mstafreshi · March 25, 2013, 3:58pm

sort -t"|" -k1nr -k4nr filename | awk -F"|" '$1 != index1 { print; index1=$1}'

ctsgnb · March 25, 2013, 4:38pm

Won't work : give a test with the example provided and check :

you will get 2 entries whose $1 is -10.5

This is because of the key used ... check my previous post, i have meanwhile updated it.

Scrutinizer · March 25, 2013, 8:15pm

With a single awk you could also try this which reads the input file twice to preserve the order in the input file and would have modest memory requirements:

awk -F\| 'NR==FNR{if($4>M[$1] || !($1 in M)) M[$1]=$4; next} $4==M[$1]' file file

mstafreshi · March 26, 2013, 3:19am

sort -t"|" -k1nr -k4nr filename | awk -F"|" '$1 != index1 { print; index1=$1}'

And this is oputput

-10.5|6|L|56|MR||R||012009|E
-10.75|6|L|18|||U||012008|E
-11|6|L|17|||R||012008|E
-11.3|6|L|51|MR|V|P||012009|E
-11.4|6|L|35|MR|B|R||IN2010|E
-11.85|6|L|52|MR|B|P||012009|E
-11.95|6|L|53|MR||P||012009|E
-12.4|6|L|50|MR|V|R||IN2013|E
-12.7|6|L|23|MR|B|R||IN2011|E
-13.67|6|L|18|||U||012007|E
-14|6|L|51|MR||R||IN2013|E
-14.9|6|L|22|MR|V|R||IN2011|E
-14.92|7|P|09|MPP||P|I|012009|E
-14.93|5|L|25|MPP||P||012009|E
-15.14|6|L|18|||U||012006|E
-15.18|6|L|57|MR|V|U||012009|E

test is done on freeBSD .

ctsgnb · March 26, 2013, 4:12am

The test i did with your code was on a linux ubuntu but a reliable code should work independently of the unix distrib.

Scrutinizer · March 26, 2013, 4:15am

The correct way is to use -k1,1 and -k4,4 instead of -k1 and -k4 .