Filter Row Based On Max Column Value After Group BY

Hello Team,

Need your expertise on following:

Here is the set of data:

C1|4|C1SP1|A1|C1BP1|T1
C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|2|C2SP2|A2|C2BP2|T1

I need to filter above date base on following two steps:

  1. Group them by column 1 and 4
  2. Once grouped, print the row where column 6 is maxium

For example:
In column 1 we have C1, C2 and C3. There are two rows where column 1 = C1 and column 4 = A1.
Out of above two columns , second row should be printed because Column 6 has greater value T2.

Following will be the output:

C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2

Note - 1 < 2 which meams T1 < T2 and so on.

Thanks
Angsuman

Hello angshuman,

Could you please try following and let me know if this helps.

awk -F"|" FNR==NR'{A[$1,$4]=A[$1,$4]>$2?A[$1,$2]:$2;next}  (($1,$4) in A){print;delete A[$1,$4]}'  Input_file  Input_file

Output will be as follows.

C1|4|C1SP1|A1|C1BP1|T1
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2

Thanks,
R. Singh

1 Like

Hello Ravinder,

Thank you for your help. In your code, you have used $2. Are you referring to field 2 by $2? If yes, I think it should be $6. Because I wish to find the row where $6 is greater after grouping them based on $1 and $4.

However, it did not give the output using your code.

Thanks
Angsuman

Here's my try:

 awk -F'|' '{if(d[$1$4"_2"]<$6){d[$1$4"_1"]=$0;d[$1$4"_2"]=$6}} END{ for(v in d) {if(substr(v,length(v))==1) print d[v]}}' input_file1 ...

Hello angshuman,

Could you please try following and let me know if this helps you.

awk -F"|" 'FNR==NR{sub(/[[:alpha:]]/,X,$NF);A[$1,$4]=A[$1,$4]>$NF+0?A[$1,$4]:$NF+0;next} {Q=$NF;sub(/[[:alpha:]]/,X,Q);if(A[$1,$4]==(Q+0)){print}}'  Input_file  Input_file

Output will be as follows.

C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2

Thanks,
R. Singh

Hello Ravinder,

Sorry this time also it did not work. I am using enterprise Linux and awk version is GNU awk 3.1.5

Hello stomp,

Your try worked for me. I need to understand the code that you have given. Can you please guide me on this? What is the purpose of "_2" and "_1"

Thanks
Angsuman

awk -F'|' '{if(d[$1$4"_2"]<$6){d[$1$4"_1"]=$0;d[$1$4"_2"]=$6}} END{ for(v in d) {if(substr(v,length(v))==1) print d[v]}}' input_file1 ...

d[..._1] contains the whole line.
d[..._2] contains the value for the comparison.

only the ..._1 value should be printed at the end.

Hello angshuman,

It is working fine for me. I tried it with following Input_file.

cat  Input_file
C1|4|C1SP1|A1|C1BP1|T1
C1|4|C1SP2|A1|C1BP2|T2
C1|4|C1SP1|A1|C1BP1|T8
C1|4|C1SP2|A1|C1BP2|T3
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|2|C2SP2|A2|C2BP2|T1
C2|4|C1SP1|A3|C1BP1|T11
C2|4|C1SP2|A3|C1BP2|T21

Then running following command with above Input_file, I got the output as per your request.

awk -F"|" 'FNR==NR{sub(/[[:alpha:]]/,X,$NF);A[$1,$4]=A[$1,$4]>$NF+0?A[$1,$4]:$NF+0;next} {Q=$NF;sub(/[[:alpha:]]/,X,Q);if(A[$1,$4]==(Q+0)){print}}'  Input_file  Input_file

Output will be as follows.

C1|4|C1SP1|A1|C1BP1|T8
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|4|C1SP2|A3|C1BP2|T21

Let me know if you have any queries on same.

Thanks,
R. Singh

Also try:

awk -F \| '$6>M[$1,$4]{M[$1,$4]=$6; R[$1,$4]=$0} END{for(i in R) print R}'  file

Please note that the comparison of $6 is a string comparison, not a numerical comparison. It is not clear what is regarded to be the "greater value"

1 Like

Ravinders solutions works for me if I I exactly execute it as he stated it.

Especially:

awk .... Input_file Input_file 

The Input_file has to be specified twice.

@Scrutinizer: Thanks for optimizing :slight_smile:

Thank you Stomp. Now here is my understanding. Please correct me if wrong:

  1. awk scans each row of the file.
  2. It stores entire row in d[$1$4"_1" and column to compare in d[$1$4"_2"
  3. After comparing it prints the row where column 6 value is greater.

Few more questions:

  1. What is the purpose of substr(v,length(v))==1). Does this check for the value _1?
  2. Does awk scan and store the entire file first and then starts comapring?
  3. How does awk know that column 1 and column 4 need to be identical and compare value of column 6 between 1st and 2nd row or so on?

These questions may appear very naive. I am not very familiar with array in awk. Sorry to trouble you on this.

Thanks
Angsuman

Hi Angshuman,

take Scrutinizers solution. It's same as mine, but simplified. The process is that way:

  1. Read input file line by line
  2. if stored group($1$4) is found and comparison-value is lower or not-existing then store the line in array / replace existing line
  3. at the end print all stored lines in array

What if file contains data like below? I am sorry if I understood requirement wrong

[akshay@localhost tmp]$ cat file
C1|4|C1SP1|A1|C1BP1|T1
C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|2|C2SP2|A2|C2BP2|T1111

Try if output order doesn't matter

[akshay@localhost tmp]$ awk -F\| '{g=$1 FS $4; v=$6; gsub(/[^0-9]*/,"",v)}(g in A && A[g]<v)||!(g in A){A[g]=v; B[g]=$0}END{for(i in B)print B}' file
C3|3|C3SP1|A3|C3BP1|T2
C1|4|C1SP2|A1|C1BP2|T2
C2|2|C2SP2|A2|C2BP2|T1111