search and count

Diya123 · May 10, 2012, 7:49pm

Hi,

I have 2 files.

file1:

ABC  1160  1260
DEF   1360 1580
DEF   2300 2800
XYZ  1600  2200

file2:

chr1_1000_1050
chr1_1100_1150
chr3_1151_1200
chr3_1201_1250
chr6_1301_1350
chr6_1351_1400
chr6_1550_1600
chrX_1600_1650
chrX_1851_1900

For each row in file2, I want to know whether it falls between column 2 and column 3 of a row in file1. If so, it should be counted for that file1 row, and each file1 row should be assigned that many counts in a new last column.

output

ABC  1160  1260  2
DEF   1360 1580  2
DEF   2300 2800 0
XYZ  1600  2200  2

If I am not clear.. I can explain again in detail.

Thanks,

Chubler_XL · May 10, 2012, 8:17pm

How about:

awk 'NR==FNR{from[NR]=$2;to[NR]=$3;next}
{c=0;for(i in to)
  if(from[i]<$3&&to[i]>$2||from[i]>$2&&to[i]<$3) c++
 print $0 OFS c }' FS="_" file2 FS="[ \t]*" file1

Diya123 · May 11, 2012, 12:28pm

I am not sure if I am doing something wrong, but I get a syntax error. I have colored the text with red.

NR==FNR{from[NR]=$2;to[NR]=$3;next}{c=0;for(i in to)if(from[i]<$3&&to[i]>$2||from[i]>$2&&to[i]<$3) c++ print $0 OFS c }

Corona688 · May 11, 2012, 12:44pm

If you must put two commands in a row, put a ; between them.

You'll also want to put { } around all the commands you wish to be in the for-loop, otherwise it will just take the first command after the for-loop.
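As a rough sketch of that advice (not necessarily how the original author would lay it out), here is the script with braces around the loop body and the print as its own statement. The temporary directory, the `result` variable, and the use of `FS=" "` (awk's default whitespace splitting) for file1 are assumptions made for this illustration; the data is the sample from the first post.

```shell
# Sample data from the first post, written to temporary files (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'ABC 1160 1260' 'DEF 1360 1580' 'DEF 2300 2800' 'XYZ 1600 2200' > "$tmp/file1"
printf '%s\n' chr1_1000_1050 chr1_1100_1150 chr3_1151_1200 chr3_1201_1250 \
    chr6_1301_1350 chr6_1351_1400 chr6_1550_1600 chrX_1600_1650 chrX_1851_1900 > "$tmp/file2"

result=$(awk '
NR == FNR { from[NR] = $2; to[NR] = $3; next }    # first file (file2): remember each range
{
    c = 0
    for (i in to) {                                # braces group the whole loop body
        if ((from[i] < $3 && to[i] > $2) || (from[i] > $2 && to[i] < $3))
            c++
    }
    print $0 OFS c                                 # separate statement, after the loop
}' FS="_" "$tmp/file2" FS=" " "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

If everything has to stay on one line, the same effect comes from writing `c++; print $0 OFS c` with an explicit `;` between the two statements.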

Diya123 · May 14, 2012, 2:45pm

Hi,

I have tried the above code with my original dataset and it does not seem to give me the right output, although the code runs perfectly on the example files. My original files are more complex, so I have changed my example files accordingly.

file1:

chr1    87333735        87334735
chr1    94522156        94523156
chr1    179102446       179103446
chr2    1230097 1231097
chr1    6342783 6343783
chr2    147131761       147132761
chr1    167787600       167788600
chr1    167853465       167854465
chr3    167867712       167868712
chr3    167870899       167871899

file2:

chr1	245025451	245025500
chr1	245025951	245026000
chr1	245026151	245026200
chr2	245027551	245027600
chr1	245027601	245027650
chr2	245027651	245027700
chr1	247003001	247003050
chr1	247047901	247047950
chr4	247048701	247048750
chr1	247050751	247050800
chr3	247051101	247051150
chr1	247061401	247061450
chr3	247071451	247071500

What I want is this: for each row in file2, based on column 1 (chr1, chr2, etc.), check whether it falls in the interval given by column 2 and column 3 of a file1 row with the same column 1. In other words, if column 1 of a file2 row is chr1, then it should only be compared against the chr1 rows of file1, and each file1 row should get the number of matching file2 rows as a count in column 4.

Let me know if I am not clear.

Thanks,

Chubler_XL · May 14, 2012, 6:49pm

For the new format of file2, change the last line above to: print $0 OFS c }' FS="[ \t]*" file2 file1

Note that none of the values in your test file2 (approx 245-247 million) fall within the ranges in file1 (approx 1-179 million), so all counts were zero in the output.

Diya123 · May 14, 2012, 8:17pm

Thanks for the reply. But where in the code is it considering the chr name in column 1? When you are looking at rows which have chr1 in file2, then in file1 it should also look only at chr1 rows. Only if the chr matches should the counts be assigned.

Regards,

Chubler_XL · May 14, 2012, 10:19pm

Oops sorry I missed that additional requirement. Try this updated version:

awk 'NR==FNR{key[NR]=$1;from[NR]=$2;to[NR]=$3;next}
{c=0;for(i in to)
  if($1 == key[i]&&
    (from[i]<$3&&to[i]>$2||from[i]>$2&&to[i]<$3)) c++
 print $0 OFS c }' FS="[ \t]*" file2 file1
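Since the test files posted in this thread contain no overlapping intervals, here is a small sketch that exercises the chr-aware version on made-up data. Everything in it is an assumption for illustration only: the sample rows, the temporary directory, the `result` variable, and the use of awk's default whitespace splitting instead of an explicit FS.

```shell
# Hypothetical sample data: regions in file1, intervals to be counted in file2.
tmp=$(mktemp -d)
printf '%s\n' 'chr1 100 200' 'chr2 100 200' 'chr3 500 600' > "$tmp/file1"
printf '%s\n' 'chr1 150 160' 'chr1 190 250' 'chr2 300 350' \
    'chr3 550 560' 'chr1 10 20' > "$tmp/file2"

result=$(awk '
NR == FNR { key[NR] = $1; from[NR] = $2; to[NR] = $3; next }   # load file2 rows
{
    c = 0
    for (i in to) {
        # count a file2 row only if the chr name matches and the intervals overlap
        if ($1 == key[i] &&
            ((from[i] < $3 && to[i] > $2) || (from[i] > $2 && to[i] < $3)))
            c++
    }
    print $0 OFS c
}' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Without the `$1 == key[i]` test, the chr2 region in this sample would also pick up the two chr1 intervals that overlap 100-200, which is the mismatch described above.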

Diya123 · May 16, 2012, 11:55am

Thanks,

It worked..