awk calculation with zero as N/A

In the below awk , I am trying to calculate percent for a given id. It is very close the problem is when the # being used in the calculation is zero. I am not sure how to code this condition into the awk as it happens frequently. The portion in italics was an attempt but that lead to an error. Thank you :).

file1 ( $2 is used in the calculation )

ABCA2 0
ABHD12 10

file2 ( $3 is used in the calculation )

ABCA2 9 232
ABHD12 211 648 
ABL2 83 0 

awk with error

awk 'function ceil(v) {return int(v)==v?v:int(v+1)}
        NR==FNR{f1[$1]=$2; next} 
       $1 in f1{print $1, ceil(10000*(1-f1[$1]/$3))/100 "%"} {print ($2 == 0 ? "NaN" : $1 / $2)}' file1 file2
awk: cmd. line:3: (FILENAME=file2 FNR=1) fatal: division by zero attempted 

Try something more like:

awk '
function ceil(v) {
     return int(v)==v?v:int(v+1)
}
NR==FNR{
     f1[$1]=$2
     next
} 
$1 in f1{
     print $1, ($3 == 0) ? "NaN" : (ceil(10000*(1-f1[$1]/$3))/100 "%")
}
{    print ($2 == 0 ? "NaN" : $1 / $2)
}' file1 file2

Note, however, that $1/$2 is always going to be zero when $1 is a string that starts with non-numeric characters (other than a few magic strings like Infinity and NaN ).

Hello Don,

Thank you for nice code, just correcting a typo here.

 awk '
function ceil(v) {
     return int(v)==v?v:int(v+1)
}
NR==FNR{
     f1[$1]=$2
     next
} 
$1 in f1{
     print $1, ($3 == 0) ? "NaN" : (ceil(10000*(1-f1[$1]/$3))/100 "%")
}
{    print ($2 == 0 ? "NaN" : $1 / $2)
}' file1 file2
 

Thanks,
R. Singh

1 Like

You're testing if $2 == 0 - which is good! But, you don't test if $3 == 0 , and I guess that's where the error occurs.

Yes. Thanks. I'll correct my earlier post in a couple of minutes.

  • Don

How would this be fixed, is there a better approach.

awk '
function ceil(v) {
     return int(v)==v?v:int(v+1)
}
NR==FNR{
     f1[$1]=$2
     next
} 
$1 in f1{
     print $1, ($3 == 0) ? "NaN" : (ceil(10000*(1-f1[$1]/$3))/100 "%")
}
{    print ($2 == 0 ? "NaN" : $1 / $2)
}' file1 file2
CHUK 3.85%
0
CHRD NaN
0

The zero's as you already thought are an issue. Thank you :).

With your sample file2 containing:

ABCA2 9 232
ABHD12 211 648 
ABL2 83 0

What output do you hope to produce when dividing ABCA2 by 9 , ABHD12 by 211 , or ABL2 by 83 ?

But isn't he or she calculating (1-f1[$1]/$3) ? With $1 == ABHD12 , f1["ABHD12"] == 10 (from file1), so the result should be around 1 - 0,015 = 0.985 , shouldn't it?

1 Like

I need to clean-up the data a bit more and will post back tomorrow. Thank you very much :).

I think part of my problem is that in the attached files, using the awk below Iam getting the correct counts for most of the ids. However, in cases like RYK I get an output of 250 in $2 , but if I manually look at each of the files I count 259 in $2 .

awk

awk '{A[$3] += $2} END{for (i in A) print i, A}' NA12878_newheader_base_counts_lessthan_30reads_perbase_lessthan_genes.txt (file1) NS12911_newheader_base_counts_lessthan_30reads_perbase_lessthan_genes.txt(file2) > all_genes_bases.txt 

I'm confused...

You ask us to download over a megabyte of data in a gzipped tar file that contains two files:

NA12878_newheader_base_counts_lessthan_30reads_perbase.bed
NS12911_newheader_base_counts_lessthan_30reads_perbase.bed

Neither of which are referenced by the above above code (even after removing the parenthetical elements from the file list).
And, if we change the script above to:

awk '{A[$3] += $2} END{for (i in A) print i, A}' NA12878_newheader_base_counts_lessthan_30reads_perbase.bed NS12911_newheader_base_counts_lessthan_30reads_perbase.bed > all_genes_bases.txt

the output produced is never going to have anything with an alphabetic string in the 1st output field because neither of these input file contain any alphabetic characters in their third fields.

Please explain what is going on here!

1 Like

I apologize for the confusion and will post back in a bit with a better example. Part of the issue that I am having, besides the zero line after most cases, is that some of the initial calculations are incorrect. The awk posted works for most but not all. Again I apologize and will post better examples with an explanation. Thank you :).

---------- Post updated at 08:03 AM ---------- Previous update was at 05:16 AM ----------

I believe I found my error on the miscalculation issue I was having in the above confusing post. I am not sure why in the output there are leading and trailing zero's or how to fix that. As you suspected that is happening but why is a mystery to me :). Thank you :).

current output

0
0
0
0
0
0
0
0
0
0
0
AASS 99.26%
0
0
ABCA10 97.61%
0
ABCA12 99.97%

desired output

AASS 99.26%
ABCA10 97.61%
ABCA12 99.97%

files used to calculate those:
file1

AASS 24
ABCA10 103
ABCA12 3

file2

AASS 23 3241
ABCA10 28 4301
ABCA12 52 8804

calculation math:
$1 in is matched to $1 of file2 and the file1 $2 / the file2 $3 x 100 =x
100 - x = %

example

AASS  24/3241 x 100 = 0.74
100 - 0.74 = 99.26%

Hello cmccabe,

Could you please try following and let me know if this helps.

awk 'FNR==NR{A[$1]=$2;next} ($1 in A){X=(A[$1]/$3)*100;printf("%s %.2f\n",$1,  100-X)}' file1 file2

Output will be as follows.

AASS 99.26
ABCA10 97.61
ABCA12 99.97

Thanks,
R. Singh

1 Like

works great.... thank you very much :).