Hi
I have a file with
chr22_190_200 XXY 0 0
chr22_201_210 XXY 0 30
chr22_211_220 XXY 3 0
chr22_221_230 XXY 0 0
chr22_231_240 XXY 5 0
chr22_241_250 ABC 0 0
chr22_251_260 ABC 22 11
chr22_261_270 ABC 20 0
chr22_271_280 ABC 0 0
I want to perform normalization in order to get a constant .. for instance for gene XXY i want to separate the reads and calculate the constant by summing up counts in column 3 and column 4 and based on the greater value divide the other column sum and get a constant
for example from the above file I just picked the reads for gene XXY and listed below:
chr22_190_200 XXY 0 0
chr22_201_210 XXY 0 30
chr22_211_220 XXY 3 0
chr22_221_230 XXY 0 0
chr22_231_240 XXY 5 0
Total sum of column 3 is 8 and column 4 is 30
In the above sum of column 4 is higher than column 3 so the constant (c) will be 30/8 which is ~3.7
I can perform the above in excel for each gene but my file has 348000 genes. So I want to perform it using scripting.
The output should have all columns as above along with the constant listed in column 5
o/p:
chr22_190_200 XXY 0 0 3.7
chr22_201_210 XXY 0 30 3.7
Thanks,
Diya
nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } FNR==NR {f3[$2]+=$3; f4[$2]+=$4;next}{print $0, (f3[$2]>f4[$2])?f3[$2]/f4[$2]:f4[$2]/f3[$2]}' myFile
or a bit shorter:
nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } FNR==NR {f3[$2]+=$3; f4[$2]+=$4;next}{div=f4[$2]/f3[$2];print $0, (f3[$2]>f4[$2])?1/div:div}' myFile
Thanks a lot for the quick response.
When I tried with my original file it dint work.. It worked with my example file which I posted.
The only difference is column 2 has names with hyphens and underscores. Do you think that will make difference.
Thanks,
Diya
In what way did it "not work"?
diya123:
Thanks a lot for the quick response.
When I tried with my original file it dint work.. It worked with my example file which I posted.
The only difference is column 2 has names with hyphens and underscores. Do you think that will make difference.
Thanks,
Diya
repost the portion of the real file that "didn't work" - please use code tags when doing so.
Thank you so much.
It worked..I had some issues on my end.
In my example above some of the symbol names in column 2 are like XXY_abc etc.. So when I execute the code below its actually treating XXY and XXY_abc or XXY_abc_XXY_bcd as different, but they are the same( as their starting is XXY)
How can I tell awk to iterate for each gene based on the first value( For instance if it sees XXY or XXY_abc it should consider both as same and normalize the counts)
Thanks,
Diya
diya123:
In my example above some of the symbol names in column 2 are like XXY_abc etc.. So when I execute the code below its actually treating XXY and XXY_abc or XXY_abc_XXY_bcd as different, but they are the same( as their starting is XXY)
How can I tell awk to iterate for each gene based on the first value( For instance if it sees XXY or XXY_abc it should consider both as same and normalize the counts)
Thanks,
Diya
Is it safe to assume that the gene name is anything preceding the first '' (XXY_abc or XXY_xyz or XXY_def_xyz) or simply XXY (if there's no trailing ' '?
Hi,
Every gene precedes with a underscore after it.
Thanks,
Diya
nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } {gene=substr($2,1,index($2,"_")-1} FNR==NR {f3[gene]+=$3; f4[gene]+=$4;next}{div=f4[gene]/f3[gene];print $0, (f3[gene]>f4[gene])?1/div:div}' myFile
Hi,
I tried the code above and it gives me lot of syntax errors.
nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } {gene=substr($2,1,index($2,"_")-1} FNR==NR {f3[gene]+=$3; f4[gene]+=$4;next}{div=f4[gene]/f3[gene];print $0, (f3[gene]>f4[gene])?1/div:div}' myFile
.
I have color coded the text in the code in "red" where syntax errors appeared.
Thanks,
Diya
sorry:
nawk 'BEGIN{ ARGV[ARGC++] = ARGV[1] } {gene=substr($2,1,index($2,"_")-1)} FNR==NR {f3[gene]+=$3; f4[gene]+=$4;next}{div=f4[gene]/f3[gene];print $0, (f3[gene]>f4[gene])?1/div:div}' myFile