Filter Row Based On Max Column Value After Group BY

angshuman · August 7, 2016, 9:26am

Hello Team,

Need your expertise on following:

Here is the set of data:

C1|4|C1SP1|A1|C1BP1|T1
C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|2|C2SP2|A2|C2BP2|T1

I need to filter above date base on following two steps:

Group them by column 1 and 4
Once grouped, print the row where column 6 is maxium

For example:
In column 1 we have C1, C2 and C3. There are two rows where column 1 = C1 and column 4 = A1.
Out of above two columns , second row should be printed because Column 6 has greater value T2.

Following will be the output:

C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2

Note - 1 < 2 which meams T1 < T2 and so on.

Thanks
Angsuman

RavinderSingh13 · August 7, 2016, 9:57am

Hello angshuman,

Could you please try following and let me know if this helps.

awk -F"|" FNR==NR'{A[$1,$4]=A[$1,$4]>$2?A[$1,$2]:$2;next}  (($1,$4) in A){print;delete A[$1,$4]}'  Input_file  Input_file

Output will be as follows.

C1|4|C1SP1|A1|C1BP1|T1
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2

Thanks,
R. Singh

angshuman · August 7, 2016, 10:19am

Hello Ravinder,

Thank you for your help. In your code, you have used $2. Are you referring to field 2 by $2? If yes, I think it should be $6. Because I wish to find the row where $6 is greater after grouping them based on $1 and $4.

However, it did not give the output using your code.

Thanks
Angsuman

joker · August 7, 2016, 10:23am

Here's my try:

 awk -F'|' '{if(d[$1$4"_2"]<$6){d[$1$4"_1"]=$0;d[$1$4"_2"]=$6}} END{ for(v in d) {if(substr(v,length(v))==1) print d[v]}}' input_file1 ...

RavinderSingh13 · August 7, 2016, 10:40am

Hello angshuman,

Could you please try following and let me know if this helps you.

awk -F"|" 'FNR==NR{sub(/[[:alpha:]]/,X,$NF);A[$1,$4]=A[$1,$4]>$NF+0?A[$1,$4]:$NF+0;next} {Q=$NF;sub(/[[:alpha:]]/,X,Q);if(A[$1,$4]==(Q+0)){print}}'  Input_file  Input_file

Output will be as follows.

C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2

Thanks,
R. Singh

angshuman · August 7, 2016, 10:51am

Hello Ravinder,

Sorry this time also it did not work. I am using enterprise Linux and awk version is GNU awk 3.1.5

Hello stomp,

Your try worked for me. I need to understand the code that you have given. Can you please guide me on this? What is the purpose of "_2" and "_1"

Thanks
Angsuman

joker · August 7, 2016, 10:55am

awk -F'|' '{if(d[$1$4"_2"]<$6){d[$1$4"_1"]=$0;d[$1$4"_2"]=$6}} END{ for(v in d) {if(substr(v,length(v))==1) print d[v]}}' input_file1 ...

d[..._1] contains the whole line.
d[..._2] contains the value for the comparison.

only the ..._1 value should be printed at the end.

RavinderSingh13 · August 7, 2016, 10:59am

Hello angshuman,

It is working fine for me. I tried it with following Input_file.

cat  Input_file
C1|4|C1SP1|A1|C1BP1|T1
C1|4|C1SP2|A1|C1BP2|T2
C1|4|C1SP1|A1|C1BP1|T8
C1|4|C1SP2|A1|C1BP2|T3
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|2|C2SP2|A2|C2BP2|T1
C2|4|C1SP1|A3|C1BP1|T11
C2|4|C1SP2|A3|C1BP2|T21

Then running following command with above Input_file, I got the output as per your request.

awk -F"|" 'FNR==NR{sub(/[[:alpha:]]/,X,$NF);A[$1,$4]=A[$1,$4]>$NF+0?A[$1,$4]:$NF+0;next} {Q=$NF;sub(/[[:alpha:]]/,X,Q);if(A[$1,$4]==(Q+0)){print}}'  Input_file  Input_file

Output will be as follows.

C1|4|C1SP1|A1|C1BP1|T8
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|4|C1SP2|A3|C1BP2|T21

Let me know if you have any queries on same.

Thanks,
R. Singh

Scrutinizer · August 7, 2016, 11:06am

Also try:

awk -F \| '$6>M[$1,$4]{M[$1,$4]=$6; R[$1,$4]=$0} END{for(i in R) print R}'  file

Please note that the comparison of $6 is a string comparison, not a numerical comparison. It is not clear what is regarded to be the "greater value"

joker · August 7, 2016, 11:06am

Ravinders solutions works for me if I I exactly execute it as he stated it.

Especially:

awk .... Input_file Input_file

The Input_file has to be specified twice.

@Scrutinizer: Thanks for optimizing

angshuman · August 7, 2016, 11:11am

Thank you Stomp. Now here is my understanding. Please correct me if wrong:

awk scans each row of the file.
It stores entire row in d[$1$4"_1" and column to compare in d[$1$4"_2"
After comparing it prints the row where column 6 value is greater.

Few more questions:

What is the purpose of substr(v,length(v))==1). Does this check for the value _1?
Does awk scan and store the entire file first and then starts comapring?
How does awk know that column 1 and column 4 need to be identical and compare value of column 6 between 1st and 2nd row or so on?

These questions may appear very naive. I am not very familiar with array in awk. Sorry to trouble you on this.

Thanks
Angsuman

joker · August 7, 2016, 11:16am

Hi Angshuman,

take Scrutinizers solution. It's same as mine, but simplified. The process is that way:

Read input file line by line
if stored group($1$4) is found and comparison-value is lower or not-existing then store the line in array / replace existing line
at the end print all stored lines in array

Akshay_Hegde · August 17, 2016, 7:15am

What if file contains data like below? I am sorry if I understood requirement wrong

[akshay@localhost tmp]$ cat file
C1|4|C1SP1|A1|C1BP1|T1
C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|2|C2SP2|A2|C2BP2|T1111

Try if output order doesn't matter

[akshay@localhost tmp]$ awk -F\| '{g=$1 FS $4; v=$6; gsub(/[^0-9]*/,"",v)}(g in A && A[g]<v)||!(g in A){A[g]=v; B[g]=$0}END{for(i in B)print B}' file
C3|3|C3SP1|A3|C3BP1|T2
C1|4|C1SP2|A1|C1BP2|T2
C2|2|C2SP2|A2|C2BP2|T1111