The awk below, run on the sample input, would output the following. Basically, it averages $7 for the lines that share the same $5, using the $7 value only if it is < 30.
awk '{if(len==0){last=$5;total=$7;len=1;getline}if($5!=last){printf("%s\t%f\n", last, total/len);last=$5;total=$7;len=1}else{total+=$7;len+=1}}END{printf("%s\t%f\n", last, total/len)}' Input.txt > output.txt
My question: I cannot work out the correct syntax to also output the total # of lines in $6 that belong to each $5, and the % of those lines where $7 < 30. I know my wording may not be all that helpful, so hopefully the desired output will help. Thank you :).
Desired output
ID             Average Reads    % of Baits
AGRN:exon.1    4.5714285        3.16742 (221 (# of lines in $6) / the # of lines < 30 in $7)
The bold text is only to show the math and does not need to be included.
I'm a bit lost about what you want to average, as $5 is either a + sign or the text "AGRN:exon.1". The same is true for $7. Also, the condition $7 < 30 is never tested in your code.
Where does the line AGRN:exon.1 4.5714285 come from? I can't seem to follow the arithmetic...
You may want to revise your spec so that others can jump in and help.
The awk (maybe not the best) calculates the average of $7 for all lines with the same $5, and is meant to use the value in $7 only if it is < 30. In the desired output that is the 4.5 number. What I would also like to include is the % of the $6 lines that make up that number. I am not sure of the best way, and included the math in post 1 to try and help. Did this help any? Thank you :).
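For what it's worth, here is a minimal sketch of one way the clarified request could be done. It is not a drop-in fix for the original one-liner. It assumes every line carries the ID in $5 and a numeric read count in $7, and that each input line counts as one $6 entry. The file name sample.txt and its contents are made up for illustration, since the real Input.txt is not shown, and the array names (lines, low, sum, order) are my own.

```shell
# Hypothetical sample data (NOT the original Input.txt): ID in $5, reads in $7.
cat > sample.txt <<'EOF'
chr1 100 200 x AGRN:exon.1 a 4
chr1 100 200 x AGRN:exon.1 b 50
chr1 100 200 x AGRN:exon.1 c 6
chr1 100 200 x AGRN:exon.1 d 30
chr1 300 400 x AGRN:exon.2 e 10
chr1 300 400 x AGRN:exon.2 f 45
EOF

result=$(awk '
    {
        id = $5
        if (!(id in lines)) order[++n] = id         # remember first-seen order of IDs
        lines[id]++                                 # count every line for this ID
        if ($7 < 30) { low[id]++; sum[id] += $7 }   # only reads below 30 enter the average
    }
    END {
        for (i = 1; i <= n; i++) {
            id = order[i]
            avg = low[id] ? sum[id] / low[id] : 0   # avoid dividing by zero
            printf "%s\t%f\t%f\n", id, avg, 100 * low[id] / lines[id]
        }
    }' sample.txt)

# Columns: ID, average of $7 where $7 < 30, % of lines where $7 < 30
printf '%s\n' "$result"
```

Unlike the original, this keeps per-ID totals in arrays, so the input does not need to be grouped by $5. With the numbers from post 1 (7 lines below 30 out of 221), the last column would come out as 100 * 7 / 221, about 3.167, which matches the desired 3.16742.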