Compare intervals (columns) from two files (awk, grep, Perl?)

jcvivar · January 17, 2012, 1:58pm

Hi dear users,

I need to compare numeric columns in two files. These files have the following structure.

K.txt (4 columns)

A001      chr21      9805831      9846011
A002      chr21      9806202      9846263
A003      chr21      9887188      9988593
A003      chr21      9887188      9988593
A004      chr21      9895249      9988593
......
......

In K.txt, columns 3 and 4 are the start and end positions of the interval for the gene named in column 1.

S.txt (4 columns)

chr21    9411326    9411327    rs75025155
chr21    9411409    9411410    rs71235072
chr21    9805830    9805831    rs78200054
chr21    9887190    9887191    rs71235073
chr21    9895220    9895221    rs78302045
chr21    9988593    9988594    rs71220654
......
......

In S.txt, columns 2 and 3 are also intervals (but shorter than those in K.txt). Also, S.txt is larger than K.txt.

These are the possible outcomes (or intersections among the intervals):

S$3 <= K$3 (don't print to output)
S$2 <= K$3 AND S$3 >= K$3 (print to output)
S$2 >= K$3 AND S$3 <= K$4 (print to output)
S$2 <= K$4 AND S$3 >= K$4 (print to output)
S$2 >= K$4 (don't print to output)
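
For reference, the three "print" cases above appear to collapse into a single test: the two intervals share at least one position exactly when S$2 <= K$4 and S$3 >= K$3. The Python sketch below is only an illustration; the function names are made up, and it assumes each start is no greater than its end. It also assumes that touching endpoints count as a match, which the list leaves open because the "print" and "don't print" cases both include equality.

```python
def overlaps(s_start, s_end, k_start, k_end):
    """Closed-interval overlap: True if [s_start, s_end] and [k_start, k_end] intersect.

    Assumption: touching endpoints (for example s_end == k_start) count as a match.
    """
    return s_start <= k_end and s_end >= k_start


def listed_cases(s_start, s_end, k_start, k_end):
    """The three 'print to output' cases from the list, written out literally."""
    return (
        (s_start <= k_start and s_end >= k_start)   # S covers the start of K
        or (s_start >= k_start and s_end <= k_end)  # S lies inside K
        or (s_start <= k_end and s_end >= k_end)    # S covers the end of K
    )
```

Using the single comparison keeps the matching code short; the literal version is there only to show that both forms give the same answer.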

The output should have 2 columns (tab-separated): the first is column 4 from S.txt (S$4) and the second is column 1 from K.txt (K$1). If there are multiple matches, as in the example, they should be separated by commas.

rs71235073    A003
rs78200054    A001,B001
rs78302045    A004
.....
.....
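
One way the whole task could be sketched in Python is shown below. It is a rough illustration, not a tuned solution. The function names are hypothetical. It assumes the chromosome columns must also agree (the post does not say so), that touching endpoints count as a match, and that a gene listed twice in K.txt should appear only once per output line. Filler lines such as "......" are skipped. Because the excerpts above are partial, running it on them need not reproduce the example output exactly.

```python
def load_genes(k_lines):
    """Parse K.txt-style rows: gene, chromosome, start, end."""
    genes = []
    for line in k_lines:
        fields = line.split()
        # Skip blank lines and filler rows that are not real records.
        if len(fields) != 4 or not fields[2].isdigit():
            continue
        gene, chrom, start, end = fields
        genes.append((gene, chrom, int(start), int(end)))
    return genes


def match_snps(k_lines, s_lines):
    """Return 'S$4<TAB>K$1[,K$1...]' for every S row overlapping at least one K row."""
    genes = load_genes(k_lines)
    results = []
    for line in s_lines:
        fields = line.split()
        if len(fields) != 4 or not fields[1].isdigit():
            continue
        chrom, s_start, s_end, snp = fields[0], int(fields[1]), int(fields[2]), fields[3]
        hits = []
        for gene, k_chrom, k_start, k_end in genes:
            # Assumption: same chromosome, closed-interval overlap.
            if k_chrom == chrom and s_start <= k_end and s_end >= k_start:
                if gene not in hits:  # drop duplicate K rows such as A003
                    hits.append(gene)
        if hits:
            results.append(snp + "\t" + ",".join(hits))
    return results


def run(k_path, s_path):
    """Read both files from disk and print the matches."""
    with open(k_path) as k_file, open(s_path) as s_file:
        for row in match_snps(k_file, s_file):
            print(row)
```

Calling run("K.txt", "S.txt") would print the result to standard output. The sketch keeps K.txt in memory and checks every S.txt row against every gene, which is simple but slow for large files; sorting the intervals or indexing them by chromosome would be the usual next step.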

Any suggestion will be very welcome.
Thank you!

Corona688 · January 17, 2012, 2:27pm

So they both have the same number of rows, to be read line by line and compared?

jcvivar · January 17, 2012, 2:31pm

They have a different number of rows, but I'm afraid that either K.txt or S.txt has to be read line by line.