awk calculation with zero as N/A

cmccabe · May 23, 2016, 2:25pm

In the below awk , I am trying to calculate percent for a given id. It is very close the problem is when the # being used in the calculation is zero. I am not sure how to code this condition into the awk as it happens frequently. The portion in italics was an attempt but that lead to an error. Thank you :).

file1 ( $2 is used in the calculation )

ABCA2 0
ABHD12 10

file2 ( $3 is used in the calculation )

ABCA2 9 232
ABHD12 211 648 
ABL2 83 0

awk with error

awk 'function ceil(v) {return int(v)==v?v:int(v+1)}
        NR==FNR{f1[$1]=$2; next} 
       $1 in f1{print $1, ceil(10000*(1-f1[$1]/$3))/100 "%"} {print ($2 == 0 ? "NaN" : $1 / $2)}' file1 file2
awk: cmd. line:3: (FILENAME=file2 FNR=1) fatal: division by zero attempted

Don_Cragun · May 23, 2016, 2:46pm

Try something more like:

awk '
function ceil(v) {
     return int(v)==v?v:int(v+1)
}
NR==FNR{
     f1[$1]=$2
     next
} 
$1 in f1{
     print $1, ($3 == 0) ? "NaN" : (ceil(10000*(1-f1[$1]/$3))/100 "%")
}
{    print ($2 == 0 ? "NaN" : $1 / $2)
}' file1 file2

Note, however, that $1/$2 is always going to be zero when $1 is a string that starts with non-numeric characters (other than a few magic strings like Infinity and NaN ).

RavinderSingh13 · May 23, 2016, 2:50pm

Hello Don,

Thank you for nice code, just correcting a typo here.

 awk '
function ceil(v) {
     return int(v)==v?v:int(v+1)
}
NR==FNR{
     f1[$1]=$2
     next
} 
$1 in f1{
     print $1, ($3 == 0) ? "NaN" : (ceil(10000*(1-f1[$1]/$3))/100 "%")
}
{    print ($2 == 0 ? "NaN" : $1 / $2)
}' file1 file2

Thanks,
R. Singh

RudiC · May 23, 2016, 2:50pm

You're testing if $2 == 0 - which is good! But, you don't test if $3 == 0 , and I guess that's where the error occurs.

Don_Cragun · May 23, 2016, 3:39pm

ravindersingh13:

Hello Don,

Thank you for nice code, just correcting a typo here.

 awk '
function ceil(v) {
   return int(v)==v?v:int(v+1)
}
NR==FNR{
   f1[$1]=$2
   next
} 
$1 in f1{
   print $1, ($3 == 0) ? "NaN" : (ceil(10000*(1-f1[$1]/$3))/100 "%")
}
{    print ($2 == 0 ? "NaN" : $1 / $2)
}' file1 file2

Thanks,
R. Singh

Yes. Thanks. I'll correct my earlier post in a couple of minutes.

Don

cmccabe · May 23, 2016, 5:21pm

How would this be fixed, is there a better approach.

awk '
function ceil(v) {
     return int(v)==v?v:int(v+1)
}
NR==FNR{
     f1[$1]=$2
     next
} 
$1 in f1{
     print $1, ($3 == 0) ? "NaN" : (ceil(10000*(1-f1[$1]/$3))/100 "%")
}
{    print ($2 == 0 ? "NaN" : $1 / $2)
}' file1 file2
CHUK 3.85%
0
CHRD NaN
0

The zero's as you already thought are an issue. Thank you :).

Don_Cragun · May 23, 2016, 5:32pm

With your sample file2 containing:

ABCA2 9 232
ABHD12 211 648 
ABL2 83 0

What output do you hope to produce when dividing ABCA2 by 9 , ABHD12 by 211 , or ABL2 by 83 ?

RudiC · May 23, 2016, 5:56pm

But isn't he or she calculating (1-f1[$1]/$3) ? With $1 == ABHD12 , f1["ABHD12"] == 10 (from file1), so the result should be around 1 - 0,015 = 0.985 , shouldn't it?

cmccabe · May 23, 2016, 6:22pm

I need to clean-up the data a bit more and will post back tomorrow. Thank you very much :).

cmccabe · May 23, 2016, 10:17pm

I think part of my problem is that in the attached files, using the awk below Iam getting the correct counts for most of the ids. However, in cases like RYK I get an output of 250 in $2 , but if I manually look at each of the files I count 259 in $2 .

awk

awk '{A[$3] += $2} END{for (i in A) print i, A}' NA12878_newheader_base_counts_lessthan_30reads_perbase_lessthan_genes.txt (file1) NS12911_newheader_base_counts_lessthan_30reads_perbase_lessthan_genes.txt(file2) > all_genes_bases.txt

Don_Cragun · May 23, 2016, 11:57pm

cmccabe:

I think part of my problem is that in the attached files, using the awk below Iam getting the correct counts for most of the ids. However, in cases like RYK I get an output of 250 in $2 , but if I manually look at each of the files I count 259 in $2 .

awk
awk '{A[$3] += $2} END{for (i in A) print i, A}' NA12878_newheader_base_counts_lessthan_30reads_perbase_lessthan_genes.txt (file1) NS12911_newheader_base_counts_lessthan_30reads_perbase_lessthan_genes.txt(file2) > all_genes_bases.txt 

I'm confused...

You ask us to download over a megabyte of data in a gzipped tar file that contains two files:

NA12878_newheader_base_counts_lessthan_30reads_perbase.bed
NS12911_newheader_base_counts_lessthan_30reads_perbase.bed

Neither of which are referenced by the above above code (even after removing the parenthetical elements from the file list).
And, if we change the script above to:

awk '{A[$3] += $2} END{for (i in A) print i, A}' NA12878_newheader_base_counts_lessthan_30reads_perbase.bed NS12911_newheader_base_counts_lessthan_30reads_perbase.bed > all_genes_bases.txt

the output produced is never going to have anything with an alphabetic string in the 1st output field because neither of these input file contain any alphabetic characters in their third fields.

Please explain what is going on here!

cmccabe · May 24, 2016, 9:03am

I apologize for the confusion and will post back in a bit with a better example. Part of the issue that I am having, besides the zero line after most cases, is that some of the initial calculations are incorrect. The awk posted works for most but not all. Again I apologize and will post better examples with an explanation. Thank you :).

---------- Post updated at 08:03 AM ---------- Previous update was at 05:16 AM ----------

I believe I found my error on the miscalculation issue I was having in the above confusing post. I am not sure why in the output there are leading and trailing zero's or how to fix that. As you suspected that is happening but why is a mystery to me :). Thank you :).

current output

0
0
0
0
0
0
0
0
0
0
0
AASS 99.26%
0
0
ABCA10 97.61%
0
ABCA12 99.97%

desired output

AASS 99.26%
ABCA10 97.61%
ABCA12 99.97%

files used to calculate those:
file1

AASS 24
ABCA10 103
ABCA12 3

file2

AASS 23 3241
ABCA10 28 4301
ABCA12 52 8804

calculation math:
$1 in is matched to $1 of file2 and the file1 $2 / the file2 $3 x 100 =x
100 - x = %

example

AASS  24/3241 x 100 = 0.74
100 - 0.74 = 99.26%

RavinderSingh13 · May 24, 2016, 9:20am

Hello cmccabe,

Could you please try following and let me know if this helps.

awk 'FNR==NR{A[$1]=$2;next} ($1 in A){X=(A[$1]/$3)*100;printf("%s %.2f\n",$1,  100-X)}' file1 file2

Output will be as follows.

AASS 99.26
ABCA10 97.61
ABCA12 99.97

Thanks,
R. Singh

cmccabe · May 24, 2016, 1:09pm

works great.... thank you very much :).