Normalization using awk

Diya123 · May 31, 2011, 3:12pm

Hi

I have a file with

chr22_190_200    XXY    0    0
chr22_201_210    XXY    0    30
chr22_211_220    XXY    3    0
chr22_221_230    XXY    0    0
chr22_231_240    XXY    5    0
chr22_241_250    ABC    0    0
chr22_251_260    ABC    22    11
chr22_261_270    ABC    20    0
chr22_271_280    ABC    0    0

I want to normalize the counts by computing one constant per gene. For each gene (for instance XXY), I want to take its reads, sum the counts in column 3 and in column 4, and divide the larger sum by the smaller one. The result is the constant.

For example, from the file above I picked only the reads for gene XXY and listed them below:

chr22_190_200    XXY    0    0
chr22_201_210    XXY    0    30
chr22_211_220    XXY    3    0
chr22_221_230    XXY    0    0
chr22_231_240    XXY    5    0

The total of column 3 is 8 and the total of column 4 is 30.

Here the column 4 sum is higher than the column 3 sum, so the constant (c) is 30/8 = 3.75 (about 3.7).
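As a quick check of this arithmetic, here is a minimal awk sketch (not part of the original post) that prints one summary line per gene: the two column sums and the resulting constant. The file name `sample.txt`, the output file `summary.txt`, and the `NA` placeholder for a zero denominator are assumptions made for illustration.

```shell
# Sample data copied from the post above (whitespace-separated).
cat > sample.txt <<'EOF'
chr22_190_200    XXY    0    0
chr22_201_210    XXY    0    30
chr22_211_220    XXY    3    0
chr22_221_230    XXY    0    0
chr22_231_240    XXY    5    0
chr22_241_250    ABC    0    0
chr22_251_260    ABC    22    11
chr22_261_270    ABC    20    0
chr22_271_280    ABC    0    0
EOF

# Sum columns 3 and 4 per gene, then divide the larger sum by the smaller.
awk '
{ s3[$2] += $3; s4[$2] += $4 }
END {
    for (g in s3) {
        hi = (s3[g] > s4[g]) ? s3[g] : s4[g]
        lo = (s3[g] > s4[g]) ? s4[g] : s3[g]
        # "NA" is a placeholder chosen here for a zero denominator.
        c = (lo == 0) ? "NA" : hi / lo
        print g, s3[g], s4[g], c
    }
}' sample.txt | sort > summary.txt

cat summary.txt
```

For gene XXY this gives sums of 8 and 30 and a constant of 3.75, matching the hand calculation.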

I can do this in Excel for each gene, but my file has 348000 genes, so I want to script it.

The output should have all the columns above, with the constant listed in column 5.

o/p:

chr22_190_200    XXY    0    0    3.7
chr22_201_210    XXY    0    30    3.7

Thanks,

Diya

vgersh99 · May 31, 2011, 3:35pm

nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } FNR==NR {f3[$2]+=$3; f4[$2]+=$4;next}{print $0, (f3[$2]>f4[$2])?f3[$2]/f4[$2]:f4[$2]/f3[$2]}' myFile

or a bit shorter:

nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } FNR==NR {f3[$2]+=$3; f4[$2]+=$4;next}{div=f4[$2]/f3[$2];print $0, (f3[$2]>f4[$2])?1/div:div}' myFile
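Both one-liners read the file twice: `ARGV[ARGC++] = ARGV[1]` appends the file name to the argument list a second time, and `FNR==NR` is true only during the first read, when the per-gene sums are collected. The sketch below spells out the same two-pass idea with comments. It makes a few assumptions that are not in the answer above: it uses plain `awk` instead of `nawk`, passes the file name twice on the command line instead of editing `ARGV`, and prints a placeholder `NA` when the smaller sum is zero, since dividing by a zero sum would otherwise abort with a division-by-zero error. The file names are hypothetical.

```shell
# Sample data copied from the first post.
cat > sample.txt <<'EOF'
chr22_190_200    XXY    0    0
chr22_201_210    XXY    0    30
chr22_211_220    XXY    3    0
chr22_221_230    XXY    0    0
chr22_231_240    XXY    5    0
chr22_241_250    ABC    0    0
chr22_251_260    ABC    22    11
chr22_261_270    ABC    20    0
chr22_271_280    ABC    0    0
EOF

awk '
# Pass 1 (FNR == NR holds only while reading the first file argument):
# accumulate per-gene sums of columns 3 and 4.
FNR == NR { f3[$2] += $3; f4[$2] += $4; next }

# Pass 2: append the per-gene constant to every line.
{
    a = f3[$2]; b = f4[$2]
    if (a < b) { t = a; a = b; b = t }   # make sure a >= b
    # "NA" is a placeholder chosen here for a zero denominator.
    print $0, (b == 0 ? "NA" : a / b)
}' sample.txt sample.txt > normalized.txt

cat normalized.txt
```

With the sample data, every XXY line gets 3.75 (30/8) in column 5.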

Diya123 · May 31, 2011, 3:47pm

Thanks a lot for the quick response.

When I tried it with my original file it didn't work. It worked with the example file I posted.

The only difference is that column 2 has names with hyphens and underscores. Do you think that will make a difference?

Thanks,

Diya

Corona688 · May 31, 2011, 3:49pm

In what way did it "not work"?

vgersh99 · May 31, 2011, 3:57pm

Please repost the portion of the real file that "didn't work", and use code tags when doing so.

Diya123 · May 31, 2011, 4:02pm

Thank you so much.

It worked. I had some issues on my end.

Diya123 · June 2, 2011, 2:58pm

In my example above, some of the symbol names in column 2 look like XXY_abc. When I run the code, it treats XXY, XXY_abc and XXY_abc_XXY_bcd as different genes, but they are the same gene (they all start with XXY).

How can I tell awk to group by the first part of the name? For instance, if it sees XXY or XXY_abc it should treat both as the same gene and normalize the counts together.

Thanks,

Diya

vgersh99 · June 2, 2011, 8:37pm

Is it safe to assume that the gene name is anything preceding the first '_' (XXY_abc or XXY_xyz or XXY_def_xyz), or simply XXY if there's no trailing '_'?

Diya123 · June 2, 2011, 11:10pm

Hi,

Every gene name is followed by an underscore.

Thanks,

Diya

vgersh99 · June 3, 2011, 7:48am

nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } {gene=substr($2,1,index($2,"_")-1} FNR==NR {f3[gene]+=$3; f4[gene]+=$4;next}{div=f4[gene]/f3[gene];print $0, (f3[gene]>f4[gene])?1/div:div}' myFile

Diya123 · June 3, 2011, 2:10pm

Thank you..

Diya123 · June 7, 2011, 4:45pm

Hi,

I tried the code above and it gives me a lot of syntax errors.

nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } {gene=substr($2,1,index($2,"_")-1} FNR==NR {f3[gene]+=$3; f4[gene]+=$4;next}{div=f4[gene]/f3[gene];print $0, (f3[gene]>f4[gene])?1/div:div}' myFile

I have color coded the text in the code in "red" where syntax errors appeared.

Thanks,

Diya

vgersh99 · June 7, 2011, 4:52pm

Sorry, the closing parenthesis of substr() was missing:

nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } {gene=substr($2,1,index($2,"_")-1)} FNR==NR {f3[gene]+=$3; f4[gene]+=$4;next}{div=f4[gene]/f3[gene];print $0, (f3[gene]>f4[gene])?1/div:div}' myFile
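One caveat with the prefix extraction: when column 2 contains no underscore, `index($2,"_")` returns 0 and `substr($2,1,-1)` yields an empty string, so all such names end up in one group with an empty key. A gene whose smaller sum is zero also triggers a division by zero. The following is a hedged sketch of a more defensive variant, not the answer's own code. The helper function `key`, the `NA` placeholder, the use of plain `awk` with the file passed twice, and the sample rows in `genes.txt` are all assumptions made up for illustration.

```shell
# Hypothetical sample: XXY, XXY_abc and XXY_abc_XXY_bcd should share one group.
cat > genes.txt <<'EOF'
chr22_190_200    XXY_abc    0    0
chr22_201_210    XXY    0    30
chr22_211_220    XXY_abc_XXY_bcd    3    0
chr22_231_240    XXY_abc    5    0
chr22_251_260    ABC_x    22    11
chr22_261_270    ABC_y    20    0
chr22_281_290    DEF_z    0    0
EOF

awk '
# Group key: text before the first "_" in the name, or the whole name
# when there is no underscore (index() returns 0 in that case).
function key(name,   p) {
    p = index(name, "_")
    return p ? substr(name, 1, p - 1) : name
}

# Pass 1: accumulate sums per group key.
FNR == NR { g = key($2); f3[g] += $3; f4[g] += $4; next }

# Pass 2: append the constant for the line'"'"'s group.
{
    g = key($2)
    a = f3[g]; b = f4[g]
    if (a < b) { t = a; a = b; b = t }   # make sure a >= b
    # "NA" is a placeholder chosen here for a zero denominator.
    print $0, (b == 0 ? "NA" : a / b)
}' genes.txt genes.txt > grouped.txt

cat grouped.txt
```

In this made-up sample all four XXY rows share the sums 8 and 30, so each gets 3.75, while the DEF row, whose sums are both zero, gets `NA` instead of aborting the run.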