Hi All,
I am searching for a script which will produce an output file with the uniq first field with the second field having highest value among all the duplicates..
The output file will produce only the uniqs which are duplicate 3 times..
Input file
X 9
B 5
A 1
Z 9
T 4
C 9
A 4
C 3
B 2
T 7
C 10
A 5
B 1
X 8
Output file
B 5
A 5
C 10
I am lost in various options using awk, sort, printf etc
Help me please..
mjf
May 9, 2013, 8:03am
2
Here is one solution using awk. Is the order of the output important?
awk '($2 > array[$1] ) {array[$1] = $2} END {for(x in array) {split(x,field); print field[1], array[x]}}' input.txt
A 5
B 5
C 10
I ran the command using input.txt content as
X 9
B 5
A 1
Z 9
T 4
C 9
A 4
C 3
B 2
T 7
C 10
A 5
B 1
X 8
Ouput became
A 5
B 5
C 10
T 7
X 9
Z 9
But I need output of 3 occurrence only not all..
So the expected output should be
A 5
B 5
C 10
RudiC
May 10, 2013, 3:07am
4
Try
$ sort -k1,1 -k2,2g file | awk 'TMP1 != $1 && A[TMP1] >= 3 {print TMP0} {TMP0=$0; TMP1=$1; A[$1]++} END {if (A[TMP1] >=3) print TMP0}'
A 5
B 5
C 10
Showing the usage
usage: sort [-cmu] [-o output] [-T directory] [-S mem] [-z recsz]
[-dfiMnr] [-b] [-t char] [-k keydef] [+pos1 [-pos2]] files...
Here is a way to do it using awk.
$ awk '{cnts[$1]++} $2 > fld2[$1] {fld2[$1]=$2; fld0[$1]=$0} END {for (key in cnts) { if (cnts[key] == 3) print fld0[key] }}' input3
A 5
B 5
C 10
1 Like
That should work, what is your OS and version?
---
Alternative for 3 or more occurrences, not exactly 3 (not sure what you require):
awk '$2>=M[$1]{M[$1]=$2} ++A[$1]==3{print $1,M[$1]}' file
*edit *
this solution is broken, since it will not always print the highest value if there are more than 3 values.. (thanks RudiC)
1 Like
I read it as "n == 3". But you're right, it's not really clear.
If the intent is "n >= 3", then you either:
Use the scrutinizer shorter solution designed for "n >= 3", or
Change == 3
to >= 3
in previous solution designed for "n == 3":
$ awk '{cnts[$1]++} $2 > fld2[$1] {fld2[$1]=$2; fld0[$1]=$0} END {for (key in cnts) { if (cnts[key] >= 3) print fld0[key] }}' input
A 5
B 5
C 10
1 Like
It could also be read it as 4 (or more)
RudiC
May 10, 2013, 4:43am
10
Your proposal might run into trouble if there's more than 3 duplicates and the third one does not have the max ($2) associated.
1 Like
You are right, it is a bad solution, so I guess it would need a sort first:
sort -k1,1 -k2,2rn file | awk '$2>=M[$1]{M[$1]=$2} ++A[$1]==3{print $1,M[$1]}'
or
sort -k1,1 -k2,2rn file | awk '!A[$1]++{M[$1]=$2} A[$1]==3{print $1,M[$1]}'
RudiC
May 10, 2013, 5:07am
12
Don't forget - the requestor's sort
doesn't work ...
@hanson44 : minor thing: if there are 3 occurrences that are 0, then an empty line will get printed..
i had problem with sort so i used hanson44's code..
It ran perfectly..
awk '{cnts[$1]++} $2 > fld2[$1] {fld2[$1]=$2; fld0[$1]=$0} END {for (key in cnts) { if (cnts[key] >= 3) print fld0[key] }}' input