Hello Team,
Need your expertise on following:
Here is the set of data:
C1|4|C1SP1|A1|C1BP1|T1
C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|2|C2SP2|A2|C2BP2|T1
I need to filter above date base on following two steps:
Group them by column 1 and 4
Once grouped, print the row where column 6 is maxium
For example:
In column 1 we have C1, C2 and C3. There are two rows where column 1 = C1 and column 4 = A1.
Out of above two columns , second row should be printed because Column 6 has greater value T2.
Following will be the output:
C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
Note - 1 < 2 which meams T1 < T2 and so on.
Thanks
Angsuman
Hello angshuman,
Could you please try following and let me know if this helps.
awk -F"|" FNR==NR'{A[$1,$4]=A[$1,$4]>$2?A[$1,$2]:$2;next} (($1,$4) in A){print;delete A[$1,$4]}' Input_file Input_file
Output will be as follows.
C1|4|C1SP1|A1|C1BP1|T1
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
Thanks,
R. Singh
1 Like
Hello Ravinder,
Thank you for your help. In your code, you have used $2. Are you referring to field 2 by $2? If yes, I think it should be $6. Because I wish to find the row where $6 is greater after grouping them based on $1 and $4.
However, it did not give the output using your code.
Thanks
Angsuman
joker
August 7, 2016, 10:23am
4
Here's my try:
awk -F'|' '{if(d[$1$4"_2"]<$6){d[$1$4"_1"]=$0;d[$1$4"_2"]=$6}} END{ for(v in d) {if(substr(v,length(v))==1) print d[v]}}' input_file1 ...
angshuman:
Hello Ravinder,
Thank you for your help. In your code, you have used $2. Are you referring to field 2 by $2? If yes, I think it should be $6. Because I wish to find the row where $6 is greater after grouping them based on $1 and $4.
However, it did not give the output using your code.
Thanks
Angsuman
Hello angshuman,
Could you please try following and let me know if this helps you.
awk -F"|" 'FNR==NR{sub(/[[:alpha:]]/,X,$NF);A[$1,$4]=A[$1,$4]>$NF+0?A[$1,$4]:$NF+0;next} {Q=$NF;sub(/[[:alpha:]]/,X,Q);if(A[$1,$4]==(Q+0)){print}}' Input_file Input_file
Output will be as follows.
C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
Thanks,
R. Singh
Hello Ravinder,
Sorry this time also it did not work. I am using enterprise Linux and awk version is GNU awk 3.1.5
Hello stomp,
Your try worked for me. I need to understand the code that you have given. Can you please guide me on this? What is the purpose of "_2" and "_1"
Thanks
Angsuman
joker
August 7, 2016, 10:55am
7
awk -F'|' '{if(d[$1$4"_2"]<$6){d[$1$4"_1"]=$0;d[$1$4"_2"]=$6}} END{ for(v in d) {if(substr(v,length(v))==1) print d[v]}}' input_file1 ...
d[..._1] contains the whole line.
d[..._2] contains the value for the comparison.
only the ..._1 value should be printed at the end.
Hello angshuman,
It is working fine for me. I tried it with following Input_file.
cat Input_file
C1|4|C1SP1|A1|C1BP1|T1
C1|4|C1SP2|A1|C1BP2|T2
C1|4|C1SP1|A1|C1BP1|T8
C1|4|C1SP2|A1|C1BP2|T3
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|2|C2SP2|A2|C2BP2|T1
C2|4|C1SP1|A3|C1BP1|T11
C2|4|C1SP2|A3|C1BP2|T21
Then running following command with above Input_file, I got the output as per your request.
awk -F"|" 'FNR==NR{sub(/[[:alpha:]]/,X,$NF);A[$1,$4]=A[$1,$4]>$NF+0?A[$1,$4]:$NF+0;next} {Q=$NF;sub(/[[:alpha:]]/,X,Q);if(A[$1,$4]==(Q+0)){print}}' Input_file Input_file
Output will be as follows.
C1|4|C1SP1|A1|C1BP1|T8
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|4|C1SP2|A3|C1BP2|T21
Let me know if you have any queries on same.
Thanks,
R. Singh
Also try:
awk -F \| '$6>M[$1,$4]{M[$1,$4]=$6; R[$1,$4]=$0} END{for(i in R) print R}' file
Please note that the comparison of $6 is a string comparison, not a numerical comparison. It is not clear what is regarded to be the "greater value"
1 Like
joker
August 7, 2016, 11:06am
10
Ravinders solutions works for me if I I exactly execute it as he stated it.
Especially:
awk .... Input_file Input_file
The Input_file has to be specified twice.
@Scrutinizer : Thanks for optimizing
Thank you Stomp. Now here is my understanding. Please correct me if wrong:
awk scans each row of the file.
It stores entire row in d[$1$4"_1" and column to compare in d[$1$4"_2"
After comparing it prints the row where column 6 value is greater.
Few more questions:
What is the purpose of substr(v,length(v))==1). Does this check for the value _1?
Does awk scan and store the entire file first and then starts comapring?
How does awk know that column 1 and column 4 need to be identical and compare value of column 6 between 1st and 2nd row or so on?
These questions may appear very naive. I am not very familiar with array in awk. Sorry to trouble you on this.
Thanks
Angsuman
joker
August 7, 2016, 11:16am
12
Hi Angshuman,
take Scrutinizers solution. It's same as mine, but simplified. The process is that way:
Read input file line by line
if stored group($1$4) is found and comparison-value is lower or not-existing then store the line in array / replace existing line
at the end print all stored lines in array
What if file contains data like below? I am sorry if I understood requirement wrong
[akshay@localhost tmp]$ cat file
C1|4|C1SP1|A1|C1BP1|T1
C1|4|C1SP2|A1|C1BP2|T2
C2|3|C2SP1|A2|C2BP1|T2
C3|3|C3SP1|A3|C3BP1|T2
C2|2|C2SP2|A2|C2BP2|T1111
Try if output order doesn't matter
[akshay@localhost tmp]$ awk -F\| '{g=$1 FS $4; v=$6; gsub(/[^0-9]*/,"",v)}(g in A && A[g]<v)||!(g in A){A[g]=v; B[g]=$0}END{for(i in B)print B}' file
C3|3|C3SP1|A3|C3BP1|T2
C1|4|C1SP2|A1|C1BP2|T2
C2|2|C2SP2|A2|C2BP2|T1111