Hi,
I have 2 files.
file1:
ABC 1160 1260
DEF 1360 1580
DEF 2300 2800
XYZ 1600 2200
file2:
chr1_1000_1050
chr1_1100_1150
chr3_1151_1200
chr3_1201_1250
chr6_1301_1350
chr6_1351_1400
chr6_1550_1600
chrX_1600_1650
chrX_1851_1900
For each row in file2 I want to know whether it falls between column 2 and column 3 of a row in file1. If so, that file1 row should be assigned that many counts.
output:
ABC 1160 1260 2
DEF 1360 1580 2
DEF 2300 2800 0
XYZ 1600 2200 2
If I am not clear, I can explain again in detail.
Thanks,
How about:
awk 'NR==FNR{from[NR]=$2;to[NR]=$3;next}
{c=0;for(i in to)
if(from[i]<$3&&to[i]>$2||from[i]>$2&&to[i]<$3) c++
print $0 OFS c }' FS="_" file2 FS="[ \t]*" file1
I am not sure if I am doing something wrong, but I get a syntax error. I have colored the text in red.
NR==FNR{from[NR]=$2;to[NR]=$3;next}{c=0;for(i in to)if(from[i]<$3&&to[i]>$2||from[i]>$2&&to[i]<$3) c++ print $0 OFS c }
If you must put two commands on the same line, put a ; between them.
You'll also want to put { } around all the commands you wish to be in the for-loop; otherwise the loop will only take the first command that follows it.
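Applied to the one-liner above, that advice might look like the following sketch. It assumes the array subscripts [i] from the multi-line version, and it recreates the sample files from the first post in a scratch directory (the tmp directory and the result variable are only there for illustration). FS=" " is awk's default whitespace splitting, used here in place of the bracket expression.

```shell
# Recreate the sample data from the first post in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'ABC 1160 1260' 'DEF 1360 1580' 'DEF 2300 2800' 'XYZ 1600 2200' > "$tmp/file1"
printf '%s\n' chr1_1000_1050 chr1_1100_1150 chr3_1151_1200 chr3_1201_1250 \
    chr6_1301_1350 chr6_1351_1400 chr6_1550_1600 chrX_1600_1650 chrX_1851_1900 > "$tmp/file2"

# Same program on one line: braces delimit the for-loop body,
# and a ";" separates the loop from the print that follows it.
result=$(awk 'NR==FNR{from[NR]=$2;to[NR]=$3;next}{c=0;for(i in to){if(from[i]<$3&&to[i]>$2||from[i]>$2&&to[i]<$3)c++};print $0 OFS c}' FS="_" "$tmp/file2" FS=" " "$tmp/file1")
printf '%s\n' "$result"
```

With the sample data this should reproduce the four-line output requested in the first post.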
Hi,
I have tried the above code with my original dataset and it does not seem to give me the right output, although it runs perfectly on the example files. My original files are more complex, so I have changed the example files accordingly.
file1:
chr1 87333735 87334735
chr1 94522156 94523156
chr1 179102446 179103446
chr2 1230097 1231097
chr1 6342783 6343783
chr2 147131761 147132761
chr1 167787600 167788600
chr1 167853465 167854465
chr3 167867712 167868712
chr3 167870899 167871899
file2:
chr1 245025451 245025500
chr1 245025951 245026000
chr1 245026151 245026200
chr2 245027551 245027600
chr1 245027601 245027650
chr2 245027651 245027700
chr1 247003001 247003050
chr1 247047901 247047950
chr4 247048701 247048750
chr1 247050751 247050800
chr3 247051101 247051150
chr1 247061401 247061450
chr3 247071451 247071500
What I want is this: for each row in file2, based on column 1 (chr1, chr2, etc.), check whether it falls within the interval given by columns 2 and 3 of a file1 row with the same column 1. In other words, if column 1 of a file2 row is chr1, it should only be counted against the chr1 rows of file1, and the number of matching rows should be written as column 4 of file1.
Let me know if I am not clear.
Thanks,
For the new format of file2, change the last line above to:
print $0 OFS c }' FS="[ \t]*" file2 file1
Note that none of the values in your test file2 (approx 245-247 million) fall within the ranges in file1 (approx 1-179 million), so all counts are zero in the output.
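A quick way to spot this kind of mismatch before debugging the counting script is to print the smallest start and the largest end found in each file. The sketch below does that for the two files posted above; the scratch directory and the ranges variable are only for illustration.

```shell
# Write the posted file1 and file2 to a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
chr1 87333735 87334735
chr1 94522156 94523156
chr1 179102446 179103446
chr2 1230097 1231097
chr1 6342783 6343783
chr2 147131761 147132761
chr1 167787600 167788600
chr1 167853465 167854465
chr3 167867712 167868712
chr3 167870899 167871899
EOF
cat > "$tmp/file2" <<'EOF'
chr1 245025451 245025500
chr1 245025951 245026000
chr1 245026151 245026200
chr2 245027551 245027600
chr1 245027601 245027650
chr2 245027651 245027700
chr1 247003001 247003050
chr1 247047901 247047950
chr4 247048701 247048750
chr1 247050751 247050800
chr3 247051101 247051150
chr1 247061401 247061450
chr3 247071451 247071500
EOF

# For each file, print: file name, smallest column 2, largest column 3.
ranges=$(cd "$tmp" && awk '
FNR == 1 {                                  # first line of a file: start a new summary
    if (name != "") print name, min, max    # flush the summary of the previous file
    name = FILENAME; min = $2; max = $3
}
$2 + 0 < min + 0 { min = $2 }               # force numeric comparison
$3 + 0 > max + 0 { max = $3 }
END { print name, min, max }
' file1 file2)
printf '%s\n' "$ranges"
```

If the largest value reported for one file is below the smallest value reported for the other, no interval can overlap and every count will be zero.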
Thanks for the reply. But where in the code is it considering the chr name in column 1? When you are looking at rows which have chr1 in file2, it should look only at chr1 rows in file1 as well. Only if the names match should the counts be assigned.
Regards,
Oops sorry I missed that additional requirement. Try this updated version:
awk 'NR==FNR{key[NR]=$1;from[NR]=$2;to[NR]=$3;next}
{c=0;for(i in to)
if($1 == key[i]&&
(from[i]<$3&&to[i]>$2||from[i]>$2&&to[i]<$3)) c++
print $0 OFS c }' FS="[ \t]*" file2 file1
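Since the posted file2 produces only zeros, a small check on made-up rows may help confirm that the chromosome match behaves as intended. The coordinates below are hypothetical, chosen so that some intervals overlap. The sketch assumes whitespace-separated columns and therefore relies on awk's default field splitting instead of setting FS.

```shell
# Hypothetical test rows, not taken from the real dataset.
tmp=$(mktemp -d)
printf '%s\n' 'chr1 100 200' 'chr2 100 200' 'chr3 500 600' > "$tmp/file1"
printf '%s\n' 'chr1 150 160' 'chr1 190 250' 'chr2 150 160' 'chr3 100 200' > "$tmp/file2"

counts=$(awk '
# First file on the command line (file2): remember every interval.
NR == FNR { key[NR] = $1; from[NR] = $2; to[NR] = $3; next }
# Second file (file1): count stored intervals on the same chromosome that overlap this row.
{
    c = 0
    for (i in to)
        if ($1 == key[i] &&
            (from[i] < $3 && to[i] > $2 || from[i] > $2 && to[i] < $3))
            c++
    print $0 OFS c
}' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$counts"
```

Here the chr3 row of file2 lies far from the chr3 interval in file1 and the chr2 row overlaps only the chr2 interval, so the counts come out as 2, 1 and 0. Every file1 row is compared against every stored file2 row, which is fine for small inputs but may be slow on very large files.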