Printing uniq first field with the the highest second field

ailnilanjan · May 9, 2013, 7:48am

Hi All,

I am searching for a script which will produce an output file with the uniq first field with the second field having highest value among all the duplicates..

The output file will produce only the uniqs which are duplicate 3 times..

Input file

X 9
B 5
A 1
Z 9
T 4
C 9
A 4
C 3
B 2
T 7 
C 10
A 5
B 1
X 8

Output file

B 5
A 5
C 10

I am lost in various options using awk, sort, printf etc

Help me please..

mjf · May 9, 2013, 8:03am

Here is one solution using awk. Is the order of the output important?

 awk '($2 > array[$1] ) {array[$1] = $2} END {for(x in array)  {split(x,field); print field[1], array[x]}}' input.txt                         
A 5
B 5
C 10

ailnilanjan · May 10, 2013, 1:45am

I ran the command using input.txt content as

X 9
B 5
A 1
Z 9
T 4
C 9
A 4
C 3
B 2
T 7 
C 10
A 5
B 1
X 8

Ouput became

A 5
B 5
C 10
T 7
X 9
Z 9

But I need output of 3 occurrence only not all..

So the expected output should be

A 5
B 5
C 10

RudiC · May 10, 2013, 3:07am

Try

$ sort -k1,1 -k2,2g file | awk 'TMP1 != $1 && A[TMP1] >= 3 {print TMP0} {TMP0=$0; TMP1=$1; A[$1]++} END {if (A[TMP1] >=3) print TMP0}'
A 5
B 5
C 10

ailnilanjan · May 10, 2013, 3:15am

Showing the usage

usage: sort [-cmu] [-o output] [-T directory] [-S mem] [-z recsz]
        [-dfiMnr] [-b] [-t char] [-k keydef] [+pos1 [-pos2]] files...

hanson44 · May 10, 2013, 3:22am

Here is a way to do it using awk.

$ awk '{cnts[$1]++} $2 > fld2[$1] {fld2[$1]=$2; fld0[$1]=$0} END {for (key in cnts) { if (cnts[key] == 3) print fld0[key] }}' input3
A 5
B 5
C 10

Scrutinizer · May 10, 2013, 3:51am

That should work, what is your OS and version?

---
Alternative for 3 or more occurrences, not exactly 3 (not sure what you require):

awk '$2>=M[$1]{M[$1]=$2} ++A[$1]==3{print $1,M[$1]}' file

*edit *
this solution is broken, since it will not always print the highest value if there are more than 3 values.. (thanks RudiC)

hanson44 · May 10, 2013, 4:15am

I read it as "n == 3". But you're right, it's not really clear.

If the intent is "n >= 3", then you either:

Use the scrutinizer shorter solution designed for "n >= 3", or
Change == 3 to >= 3 in previous solution designed for "n == 3":

$ awk '{cnts[$1]++} $2 > fld2[$1] {fld2[$1]=$2; fld0[$1]=$0} END {for (key in cnts) { if (cnts[key] >= 3) print fld0[key] }}' input
A 5
B 5
C 10

Scrutinizer · May 10, 2013, 4:37am

It could also be read it as 4 (or more)

RudiC · May 10, 2013, 4:43am

Your proposal might run into trouble if there's more than 3 duplicates and the third one does not have the max ($2) associated.

Scrutinizer · May 10, 2013, 4:45am

You are right, it is a bad solution, so I guess it would need a sort first:

sort -k1,1 -k2,2rn file | awk '$2>=M[$1]{M[$1]=$2} ++A[$1]==3{print $1,M[$1]}'

or

sort -k1,1 -k2,2rn file | awk '!A[$1]++{M[$1]=$2} A[$1]==3{print $1,M[$1]}'

RudiC · May 10, 2013, 5:07am

Don't forget - the requestor's sort doesn't work ...

Scrutinizer · May 10, 2013, 6:03am

@hanson44: minor thing: if there are 3 occurrences that are 0, then an empty line will get printed..

ailnilanjan · May 10, 2013, 12:01pm

i had problem with sort so i used hanson44's code..

It ran perfectly..

awk '{cnts[$1]++} $2 > fld2[$1] {fld2[$1]=$2; fld0[$1]=$0} END {for (key in cnts) { if (cnts[key] >= 3) print fld0[key] }}' input