Output counts of all matching strings less than a number using awk

The awk below is supposed to group the lines by the string in $5 and count, for each string, how many $7 values are less than 20. I don't think I need the portion in bold, as I do not need any decimal point or format, but I cannot seem to get the correct counts. Thank you :).

file

chr5    77316500    77316628    chr5:77316500-77316628    AP3B1    62    152
chr5    77316500    77316628    chr5:77316500-77316628    AP3B1    63    153
chr16    14041460    14042214    chr16:14041460-14042214    ERCC4    333    19
chr16    14041460    14042214    chr16:14041460-14042214    ERCC4    334    19
chr16    14041460    14042214    chr16:14041460-14042214    ERCC4    335    19
chr15    31196856    31198110    chr15:31196856-31198110    FAN1    5    62
chr15    31196856    31198110    chr15:31196856-31198110    FAN1    6    62

desired output

AP3B1 0
ERCC4 3    
FAN1 0

awk with current output

awk '{sum[$5]+=$7 < 20; count[$5]++}  
    END{for(k in sum) printf "%s %.1f\n",  k, sum[k]/count[k]}' file
AP3B1 0.0
ERCC4 1.0
FAN1 0.0

awk \
'count[$5]=="" { count[$5]=0 }
 $7 < 20       { count[$5]++ }
END {
    for (k in count)
        printf "%s %d\n", k, count[k]
}' file

Thank you very much.

How does a $5 with a $7 greater than 20 get distinguished from a $5 with a $7 less than 20 when counting? The code works great, I am just trying to learn. Thank you :).

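The second line of the awk program checks whether $7 < 20. If it is, it adds one to the count for the corresponding $5 value, count[$5]. Rows where the test fails leave the count unchanged, and the first line makes sure every $5 starts at 0, so it is still printed.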
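
To watch the two rules fire row by row, here is a minimal sketch on three of the sample rows. The scratch file, the variable name trace and the extra print rule are additions for illustration only; the first two rules are the ones from the reply above.

```shell
# Scratch copy of three sample rows; awk splits on any whitespace by default.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
chr5 77316500 77316628 chr5:77316500-77316628 AP3B1 62 152
chr16 14041460 14042214 chr16:14041460-14042214 ERCC4 333 19
chr16 14041460 14042214 chr16:14041460-14042214 ERCC4 334 19
EOF

trace=$(awk '
count[$5] == "" { count[$5] = 0 }                     # first sighting of this $5: start at 0
$7 < 20         { count[$5]++ }                       # skipped unless $7 is below 20
                { print $5, $7, "count=" count[$5] }  # illustration only: running count per row
' "$tmp")

echo "$trace"
rm -f "$tmp"
```

The AP3B1 row prints count=0 because only the first rule matched it, while the two ERCC4 rows print count=1 and count=2.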


Hello cmccabe,

Could you please try the following and let me know if this helps you.

awk 'FNR==NR{if($7<20){B[$5]++};C[$5]=B[$5];next}  ($5 in C){printf("%s %01d\n",$5,C[$5]);delete C[$5];}' Input_file  Input_file

Output will be as follows.

AP3B1 0
ERCC4 3
FAN1 0

Thanks,
R. Singh
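
Reading the file twice has the side effect that the names come out in the order they first appear in the input, whereas the order of for (k in array) is unspecified in awk. A single-pass sketch that keeps the same ordering might look like the following; the array order, the counter n, the scratch file and the variable ordered are illustrative names, not taken from the posts above.

```shell
# Scratch copy of the sample rows; awk splits on any whitespace by default.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
chr5 77316500 77316628 chr5:77316500-77316628 AP3B1 62 152
chr5 77316500 77316628 chr5:77316500-77316628 AP3B1 63 153
chr16 14041460 14042214 chr16:14041460-14042214 ERCC4 333 19
chr16 14041460 14042214 chr16:14041460-14042214 ERCC4 334 19
chr16 14041460 14042214 chr16:14041460-14042214 ERCC4 335 19
chr15 31196856 31198110 chr15:31196856-31198110 FAN1 5 62
chr15 31196856 31198110 chr15:31196856-31198110 FAN1 6 62
EOF

ordered=$(awk '
!($5 in count) { order[++n] = $5; count[$5] = 0 }  # new name: remember its position, start at 0
$7 < 20        { count[$5]++ }                     # count rows whose $7 is below 20
END { for (i = 1; i <= n; i++) printf "%s %d\n", order[i], count[order[i]] }
' "$tmp")

echo "$ordered"
rm -f "$tmp"
```

On the sample data this prints AP3B1 0, ERCC4 3 and FAN1 0, in input order.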


Another modification to the first post:

awk '{count[$5]+=($7 < 20)}
    END{for(k in count) printf "%s %d\n",  k, count[k]}' file
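
The one-liner above works because a comparison such as $7 < 20 evaluates to 1 when true and 0 when false, so adding it to an array element counts the matching rows. The sketch below, with made-up names below, total and summary, prints the count next to the fraction that the awk in the first post computed by dividing by the number of rows, which is where ERCC4 1.0 came from. The sort is there only to make the output order predictable.

```shell
# Scratch copy of the sample rows; awk splits on any whitespace by default.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
chr5 77316500 77316628 chr5:77316500-77316628 AP3B1 62 152
chr5 77316500 77316628 chr5:77316500-77316628 AP3B1 63 153
chr16 14041460 14042214 chr16:14041460-14042214 ERCC4 333 19
chr16 14041460 14042214 chr16:14041460-14042214 ERCC4 334 19
chr16 14041460 14042214 chr16:14041460-14042214 ERCC4 335 19
chr15 31196856 31198110 chr15:31196856-31198110 FAN1 5 62
chr15 31196856 31198110 chr15:31196856-31198110 FAN1 6 62
EOF

summary=$(awk '
{ below[$5] += ($7 < 20); total[$5]++ }   # the comparison contributes 1 or 0 per row
END {
    for (k in below)                      # columns: name, count, rows, count/rows
        printf "%s %d %d %.1f\n", k, below[k], total[k], below[k] / total[k]
}' "$tmp" | sort)

echo "$summary"
rm -f "$tmp"
```

For ERCC4 the line reads 3 3 1.0: three rows below 20 out of three rows in total.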

All the awk codes work great and thank you for the explanations, I really appreciate it :).