Normalization using awk

Hi

I have a file with

chr22_190_200    XXY    0    0    
 chr22_201_210    XXY    0    30    
 chr22_211_220    XXY    3    0    
 chr22_221_230    XXY    0    0    
 chr22_231_240    XXY    5    0    
 chr22_241_250    ABC    0    0    
 chr22_251_260   ABC    22    11    
 chr22_261_270    ABC    20    0    
 chr22_271_280    ABC    0    0  

I want to perform normalization in order to get a constant .. for instance for gene XXY i want to separate the reads and calculate the constant by summing up counts in column 3 and column 4 and based on the greater value divide the other column sum and get a constant

for example from the above file I just picked the reads for gene XXY and listed below:

chr22_190_200    XXY    0    0    
 chr22_201_210    XXY    0    30    
 chr22_211_220    XXY    3    0    
 chr22_221_230    XXY    0    0    
 chr22_231_240    XXY    5    0    

Total sum of column 3 is 8 and column 4 is 30

In the above sum of column 4 is higher than column 3 so the constant (c) will be 30/8 which is ~3.7

I can perform the above in excel for each gene but my file has 348000 genes. So I want to perform it using scripting.

The output should have all columns as above along with the constant listed in column 5

o/p:

chr22_190_200    XXY    0    0    3.7
 chr22_201_210    XXY    0    30     3.7

Thanks,

Diya

nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } FNR==NR {f3[$2]+=$3; f4[$2]+=$4;next}{print $0, (f3[$2]>f4[$2])?f3[$2]/f4[$2]:f4[$2]/f3[$2]}' myFile

or a bit shorter:

nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } FNR==NR {f3[$2]+=$3; f4[$2]+=$4;next}{div=f4[$2]/f3[$2];print $0, (f3[$2]>f4[$2])?1/div:div}' myFile

Thanks a lot for the quick response.

When I tried with my original file it dint work.. It worked with my example file which I posted.

The only difference is column 2 has names with hyphens and underscores. Do you think that will make difference.

Thanks,

Diya

In what way did it "not work"?

repost the portion of the real file that "didn't work" - please use code tags when doing so.

Thank you so much.

It worked..I had some issues on my end.

In my example above some of the symbol names in column 2 are like XXY_abc etc.. So when I execute the code below its actually treating XXY and XXY_abc or XXY_abc_XXY_bcd as different, but they are the same( as their starting is XXY)

How can I tell awk to iterate for each gene based on the first value( For instance if it sees XXY or XXY_abc it should consider both as same and normalize the counts)

Thanks,

Diya

Is it safe to assume that the gene name is anything preceding the first '' (XXY_abc or XXY_xyz or XXY_def_xyz) or simply XXY (if there's no trailing ''?

Hi,

Every gene precedes with a underscore after it.

Thanks,

Diya

nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } {gene=substr($2,1,index($2,"_")-1} FNR==NR {f3[gene]+=$3; f4[gene]+=$4;next}{div=f4[gene]/f3[gene];print $0, (f3[gene]>f4[gene])?1/div:div}' myFile

Thank you..

Hi,

I tried the code above and it gives me lot of syntax errors.

nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } {gene=substr($2,1,index($2,"_")-1} FNR==NR {f3[gene]+=$3; f4[gene]+=$4;next}{div=f4[gene]/f3[gene];print $0, (f3[gene]>f4[gene])?1/div:div}' myFile

.

I have color coded the text in the code in "red" where syntax errors appeared.

Thanks,

Diya

sorry:

nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } {gene=substr($2,1,index($2,"_")-1)} FNR==NR {f3[gene]+=$3; f4[gene]+=$4;next}{div=f4[gene]/f3[gene];print $0, (f3[gene]>f4[gene])?1/div:div}' myFile