awk to grep rows by multiple fields

yifangt · May 1, 2012, 2:06pm

Hello,
I'm stuck trying to extract part of a table. I'd like to grep the first row for each combination of field 1 and field 2, keeping only the first three such matches. Input:

D A 92.85   1315    83      11     
D A 95.90   757     28      3      
D A 94.38   480     20      7      
D A 91.21   307     21      6      
D A 94.26   244     14      0      
D A 93.66   142     9       0      
D B 91.82   1321    92      16     
D B 94.85   757     30      4      
D B 94.17   480     22      6      
D B 90.79   304     26      2      
D B 93.39   242     16      0      
D B 90.97   144     11      2      
D C 89.86   1321    119     15

Output

D A 92.85   1315    83      11     
D B 91.82   1321    92      16     
D C 89.86   1321    119     15

A similar question about picking up the first match was posted before, but this one is more brain-twisting, and I feel there must be a simple script to do the job. Thanks a lot!

neutronscott · May 1, 2012, 2:19pm

This is the standard awk "unique" program:

$ awk '!a[$1,$2]++' input
D A 92.85   1315    83      11
D B 91.82   1321    92      16
D C 89.86   1321    119     15

If you also want to limit the output to just 3 matches: awk '!a[$1,$2]++&&++m;m==3{exit}' input
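
For readers who find the one-liner dense, here is a rough long-form sketch of the same idea. It is not the poster's code: the names rows, key, seen, and printed are made up for illustration, and the sample is a subset of the rows from the question.

```shell
# A few rows copied from the question's input.
rows='D A 92.85   1315    83      11
D A 95.90   757     28      3
D B 91.82   1321    92      16
D B 94.85   757     30      4
D C 89.86   1321    119     15'

first_per_key=$(printf '%s\n' "$rows" | awk '
{
    key = $1 SUBSEP $2          # the same key that a[$1,$2] builds internally
    if (!(key in seen)) {       # first time this field 1 / field 2 pair appears
        seen[key] = 1
        print
        if (++printed == 3)     # stop once three distinct pairs are printed
            exit
    }
}')

printf '%s\n' "$first_per_key"
```

This should print the first "D A", "D B", and "D C" rows, the same as the one-liner.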

joeyg · May 1, 2012, 2:22pm

$ cut -d" " -f1,2 <sample6.txt | sort -u > sample6a.txt

$ while IFS= read -r line; do grep "^$line " sample6.txt | head -n 1; done <sample6a.txt
D A 92.85   1315    83      11
D B 91.82   1321    92      16
D C 89.86   1321    119     15

The first command extracts the unique field 1 / field 2 pairs.
The second prints the first line that matches each pair.

Scrutinizer · May 1, 2012, 3:01pm

Or:

nl infile | sort -k2,3 -k1 | sort -u -k2,3 | cut -f2- | head -n3

GNU sort:

sort -u -k1,2 infile | head -n3

yifangt · May 1, 2012, 3:08pm

Thanks Scott! I tried your first way myself, but thought it was wrong. Your second way is what I was struggling with. Awesome!
Thank you Joey and Scrutinizer! You've widened my ideas about solving the problem in different ways. I felt embarrassed when I saw Scrutinizer's script. There is lots to learn, and the tools are already there. Thank you guys again.

yanglei_fage · May 2, 2012, 5:46am

neutronscott:

This is the standard awk "unique" program:
$ awk '!a[$1,$2]++' input
D A 92.85   1315    83      11
D B 91.82   1321    92      16
D C 89.86   1321    119     15
If you also want to limit the output to just 3 matches: awk '!a[$1,$2]++&&++m;m==3{exit}' input

Can you explain what a[$1,$2]++ means? I have never seen this usage :(. I don't know what I should google to find an explanation.

neutronscott · May 2, 2012, 9:01am

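This creates an element in array a, keyed on fields 1 and 2, and increments its value. The ++ is a post-increment, so the expression returns the old value: the first time a key is seen that is 0, so !a[$1,$2]++ is true and the line is printed. The first line makes a[D,A]=1, so the next time there is a "D A", !a[D,A] is false and the line is skipped.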
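
To watch the counter change, a small trace like the following may help. It is only a sketch: the rows are shortened to three fields from the question's data, and the variable names trace and before are invented for the demo.

```shell
trace=$(printf 'D A 92.85\nD A 95.90\nD B 91.82\n' | awk '
{
    before = a[$1,$2]++     # post-increment: "before" is the value prior to the bump
    printf "%s %s before=%d after=%d printed=%s\n", $1, $2, before, a[$1,$2], (before ? "no" : "yes")
}')

printf '%s\n' "$trace"
# D A before=0 after=1 printed=yes
# D A before=1 after=2 printed=no
# D B before=0 after=1 printed=yes
```

A row is printed by the one-liner only when "before" is 0, that is, the first time its field 1 / field 2 pair shows up.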

I do not know a better search term than "awk unique". A popular method is awk '!a[$0]++' for unique lines. The difference here is that we only use columns 1 and 2 as the key.
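
A quick side-by-side of the two keys, as a sketch. The sample here is invented for the comparison: the duplicated first row is not in the original input and is added only so that the whole-line version has something to drop.

```shell
# Shortened rows; the repeated "D A 92.85" line is added for illustration.
rows='D A 92.85
D A 92.85
D A 95.90
D B 91.82'

by_line=$(printf '%s\n' "$rows" | awk '!a[$0]++')      # key is the whole line
by_key=$(printf '%s\n' "$rows" | awk '!a[$1,$2]++')    # key is fields 1 and 2 only

printf 'whole line:\n%s\n' "$by_line"
printf 'fields 1 and 2:\n%s\n' "$by_key"
```

The whole-line version keeps three rows (only the exact duplicate is dropped), while the two-field version keeps just the first "D A" row and the first "D B" row.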