Printing uniq first field with the the highest second field

Hi All,

I am searching for a script which will produce an output file with the uniq first field with the second field having highest value among all the duplicates..

The output file will produce only the uniqs which are duplicate 3 times..

Input file

X 9
B 5
A 1
Z 9
T 4
C 9
A 4
C 3
B 2
T 7 
C 10
A 5
B 1
X 8

Output file

B 5
A 5
C 10

I am lost in various options using awk, sort, printf etc :confused:

Help me please..

Here is one solution using awk. Is the order of the output important?

 awk '($2 > array[$1] ) {array[$1] = $2} END {for(x in array)  {split(x,field); print field[1], array[x]}}' input.txt                         
A 5
B 5
C 10

I ran the command using input.txt content as

X 9
B 5
A 1
Z 9
T 4
C 9
A 4
C 3
B 2
T 7 
C 10
A 5
B 1
X 8

Ouput became

A 5
B 5
C 10
T 7
X 9
Z 9

But I need output of 3 occurrence only not all..

So the expected output should be

A 5
B 5
C 10

Try

$ sort -k1,1 -k2,2g file | awk 'TMP1 != $1 && A[TMP1] >= 3 {print TMP0} {TMP0=$0; TMP1=$1; A[$1]++} END {if (A[TMP1] >=3) print TMP0}'
A 5
B 5
C 10

Showing the usage

usage: sort [-cmu] [-o output] [-T directory] [-S mem] [-z recsz]
        [-dfiMnr] [-b] [-t char] [-k keydef] [+pos1 [-pos2]] files...

Here is a way to do it using awk.

$ awk '{cnts[$1]++} $2 > fld2[$1] {fld2[$1]=$2; fld0[$1]=$0} END {for (key in cnts) { if (cnts[key] == 3) print fld0[key] }}' input3
A 5
B 5
C 10
1 Like

That should work, what is your OS and version?

---
Alternative for 3 or more occurrences, not exactly 3 (not sure what you require):

awk '$2>=M[$1]{M[$1]=$2} ++A[$1]==3{print $1,M[$1]}' file

*edit *
this solution is broken, since it will not always print the highest value if there are more than 3 values.. (thanks RudiC)

1 Like

I read it as "n == 3". But you're right, it's not really clear.

If the intent is "n >= 3", then you either:

  • Use the scrutinizer shorter solution designed for "n >= 3", or
  • Change == 3 to >= 3 in previous solution designed for "n == 3":
$ awk '{cnts[$1]++} $2 > fld2[$1] {fld2[$1]=$2; fld0[$1]=$0} END {for (key in cnts) { if (cnts[key] >= 3) print fld0[key] }}' input
A 5
B 5
C 10
1 Like

It could also be read it as 4 (or more) :slight_smile:

Your proposal might run into trouble if there's more than 3 duplicates and the third one does not have the max ($2) associated.

1 Like

You are right, it is a bad solution, so I guess it would need a sort first:

sort -k1,1 -k2,2rn file | awk '$2>=M[$1]{M[$1]=$2} ++A[$1]==3{print $1,M[$1]}' 

or

sort -k1,1 -k2,2rn file | awk '!A[$1]++{M[$1]=$2} A[$1]==3{print $1,M[$1]}'

Don't forget - the requestor's sort doesn't work ...

@hanson44: minor thing: if there are 3 occurrences that are 0, then an empty line will get printed..

i had problem with sort so i used hanson44's code..

It ran perfectly..

awk '{cnts[$1]++} $2 > fld2[$1] {fld2[$1]=$2; fld0[$1]=$0} END {for (key in cnts) { if (cnts[key] >= 3) print fld0[key] }}' input