awk to print percent based on values in file

cmccabe · February 28, 2020, 2:40pm

I am trying to use awk to calculate a percent based on the count of each matching $5 in file divided by the count of each $7 that is greater than or equal to 20. The portion of code before the first | gets the count of each matching $5, then the next portion before the second | gets the count of each $7 that is greater than or equal to 20. The last part gets the overall %. The awk does execute, but no output results, and there is probably a better way. I hope my logic makes sense. Thank you :).

file

chr1	1787320	1787324	chr1:1787320-1787324	GNB1_1	1	394
chr1	1787320	1787324	chr1:1787320-1787324	GNB1_1	2	398
chr1	1787320	1787324	chr1:1787320-1787324	GNB1_1	3	17
chr1	1787320	1787324	chr1:1787320-1787324	GNB1_1	4	19
chr7	99203095	99203098	chr7:99203095-99203098	KPNA7_9	66	12
chr7	99203095	99203098	chr7:99203095-99203098	KPNA7_9	67	2
chr7	99203095	99203098	chr7:99203095-99203098	KPNA7_9	68	0
chrX	154370862	154370864	chrX:154370862-154370864	FLNA_26	375	0
chrX	154370862	154370864	chrX:154370862-154370864	FLNA_26	376	0

desired

GNB1_1 4 2 50.0%
KPNA7_9 3 2 33.3%
FLNA_26 2 0 0.0%

awk

awk -F '\t' '{c[$5]++}
END{
for (i in c) printf("%s\t%s\n",i,c)
}' file | awk 'count[$5]==""{ count[$5]=0 } 
            $7 <= 20{ count[$5]++} 
END{
              for(k in count) 
                 printf "%s %d\n",  k, count[k]
}' | awk '{A[$1]=$2;next} ($1 in A){X=(A[$1]/$3)*100;printf("%s %.1f\n",$1,  100-X)}' > output

RudiC · February 28, 2020, 3:50pm

Your desired output cannot be calculated from your input sample, as none of the "KPNA7_9" lines has a $7 greater than or equal to 20.
Your first awk script prints two fields per line, so a $5 or $7 as referenced in the second will be empty. The only element printed is count[""], which is 3.
In your third awk script, ALL lines on stdin will be operated upon by the first action, and then next will skip the rest of the script. Nothing will ever be printed.
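
To illustrate the last point, here is a minimal sketch with a made-up two-column input (the values a/1 and b/2 are hypothetical, not from the thread). It runs the same pair of rules once with next and once without, and captures what each prints:

```shell
# Hypothetical two-column input, made up for illustration.
input=$(printf 'a 1\nb 2\n')

# The first action has no pattern, so it runs for every record, and
# "next" skips the second rule each time: nothing is printed.
out_next=$(printf '%s\n' "$input" |
    awk '{A[$1]=$2; next} ($1 in A){print $1, A[$1]}')

# Without "next", the second rule is reached and prints every record.
out_no_next=$(printf '%s\n' "$input" |
    awk '{A[$1]=$2} ($1 in A){print $1, A[$1]}')

printf 'with next:    [%s]\n' "$out_next"
printf 'without next: [%s]\n' "$out_no_next"
```

The {A[...]=...; next} idiom is normally guarded by a pattern such as NR==FNR when two files are read, so that it applies to the first file only. With a single stream on stdin there is no second pass for the later rule to act on.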

Decent, consistent structuring (indenting, block building, etc.; your choice, but stick to it) of the program(s) will help you (later) and others read and understand your logic.

Try instead

awk -F '\t' '
        {CNT5[$5]++                     # lines seen for this $5
         CNT7[$5] += ($7 >= 20)         # the comparison yields 1 or 0
        }
END     {for (c in CNT5) printf("%s\t%d\t%d\t%7.2f%%\n", c, CNT5[c], CNT7[c], CNT7[c]/CNT5[c]*100)
        }
' file
FLNA_26    2    0    0.00%
GNB1_1     4    2   50.00%
KPNA7_9    3    0    0.00%
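
The for (c in CNT5) loop visits keys in an unspecified order, which is why the rows above do not follow the order of the input. If the order of first appearance and a single decimal place (as in the desired output) matter, one possible variant is sketched below. The array names total, hits and order are only illustrative, and the sample rows are copied from the question into a shell variable so the sketch runs on its own:

```shell
# Sample rows copied from the question (tab-separated, seven fields each).
data=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr1 1787320 1787324 chr1:1787320-1787324 GNB1_1 1 394 \
    chr1 1787320 1787324 chr1:1787320-1787324 GNB1_1 2 398 \
    chr1 1787320 1787324 chr1:1787320-1787324 GNB1_1 3 17 \
    chr1 1787320 1787324 chr1:1787320-1787324 GNB1_1 4 19 \
    chr7 99203095 99203098 chr7:99203095-99203098 KPNA7_9 66 12 \
    chr7 99203095 99203098 chr7:99203095-99203098 KPNA7_9 67 2 \
    chr7 99203095 99203098 chr7:99203095-99203098 KPNA7_9 68 0 \
    chrX 154370862 154370864 chrX:154370862-154370864 FLNA_26 375 0 \
    chrX 154370862 154370864 chrX:154370862-154370864 FLNA_26 376 0)

result=$(printf '%s\n' "$data" | awk -F '\t' '
    # Remember each name the first time it appears.
    !($5 in total) { order[++n] = $5 }
    # Count all lines per name, and those with $7 of at least 20.
    { total[$5]++; hits[$5] += ($7 >= 20) }
    END {
        # Walk the names in input order rather than with for (k in total).
        for (i = 1; i <= n; i++) {
            k = order[i]
            printf "%s\t%d\t%d\t%.1f%%\n", k, total[k], hits[k], hits[k] / total[k] * 100
        }
    }')

printf '%s\n' "$result"
```

With the sample rows this prints GNB1_1, KPNA7_9 and FLNA_26 in that order, with 50.0%, 0.0% and 0.0%. KPNA7_9 stays at 0.0% because none of its $7 values reaches 20.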

cmccabe · March 2, 2020, 8:28am

Thank you very much :).