Find if the value of a column is within the interval of other two

overcraft · March 6, 2024, 9:19pm

Hi there I have a file as such:

0       1868    21656294
207666  247863  21820641
535988  576070
1126564 1126946
1154056 1154132
1155526 1155964
1156013 1157103
1902995 1902997
.
.
.

where the first two columns indicate chromosome intervals and the third one the beginning/ending of a region of interest.

What I need to know is whether and how many times the two values in the third column appear in any interval defined in the first two, the problem of course is that the first two columns have 938089 pairs of values, so I was looking for a potential way to do it with awk possibly unless there are better solutions.

Thanks in advance for you help!

vgersh99 · March 6, 2024, 9:34pm

A small MORE representative sample would help.
Also give the a revised data sample, could you provide a desired output given the input. pls?
Anything you've tried on your own yet?

overcraft · March 6, 2024, 9:41pm

Unfortunately, I've looked up a bit of options but I'm not sure they are actually working well since the shell stuck at some point. Here is the reference at one of them.

Again, I kind of have the concept in my mind of what to do but due to my limited experience I resorted to look up/ask whether someone more practical already came up with something.

As per the mwe, I can attach the first 1000 lines of the file. I hope this helps!
upload.txt (15.6 KB)

P. S. any output similar to the one in the link above saying what value and in which interval is found will work, the main thing is to know the number of times this happens

vgersh99 · March 6, 2024, 10:07pm

@overcraft
well... there were some useful leads in this SO post.
Have you tried any of them?
If so, what did and what didn't work?
Here's my take with the simplified out:

[3rdColumnValue] [numberOfRangesItAppearsIn]

awk -f over.awk inputFile.txt inputFile.txt where over.awk is:

FNR==1{next}
FNR==NR && NF==3 { f3[$3];next}
{
 for (i in f3)
   if (i>=$1 && i<=$2)
     count[i]++
}
END {
  for(i in f3)
    printf("[%d]: [%d]\n", i, count[i])
}

Given your upload.txt:
awk -f over.awk upload.txt upload.txt produces:

[21656294]: [1]
[21820641]: [1981]

given the data sample in SOpost.txt on SO:

tart   End     Value
199146  199389  772617
377581  379032  779286
277876  279322  779287
255497  255843  809151
224415  224896  809550
246516  246946  810776
700000  800000  810777
363597  364598  810802
365242  365735  810898
454121  548113
413324  844816

awk -f over.awk SOpost.txt SOpost.txt produces:

[810777]: [2]
[810802]: [2]
[810776]: [2]
[809550]: [2]
[810898]: [2]
[779287]: [3]
[809151]: [2]
[772617]: [3]
[779286]: [3]

Are you looking for something else?
If so, what is it? And what's wrong with the SO post suggestions?

Looks like the counting is what you're after....

overcraft · March 7, 2024, 1:03pm

@vgersh99 that's great! Thanks, I'm getting the values in the third column and how many times are repeated in the intervals specified in the first two columns over the whole document.

For instance, in my whole big file I get the following:

[21656294]: [29]
[21820641]: [91361]

However, is there a way to tell also the ranges where this happens like in the SO post? Possibly adding something like:

print "Value", i, "is found between", ... , "and", ...

but I don't know how to refer to the intervals from your script... So, in the end I would imagine and output as such:

[21656294]: [29]
Value 21656294 is found between <start_interval> and <end_interval>
.
.
.
[21820641]: [91361]
.
.
.

Although it would create an inconvenient stdout I might redirect it to a file so that I can inspect those intervals and see how to better move forward for the next steps of the analyses.

vgersh99 · March 7, 2024, 1:14pm

@overcraft
I'm not going to gold-plate the initial implementation to fit your/SO's new output format, but I'll slightly change it.

FNR==1{next}
FNR==NR && NF==3 { f3[$3];next}
{
 for (i in f3)
   if (i>=$1 && i<=$2) {
     printf("Value %d is found between %d and %d\n", i, $1, $2)
#     count[i]++
   }
}
END {
#  for(i in f3)
#    printf("[%d]: [%d]\n", i, count[i])
}

overcraft · March 7, 2024, 2:09pm

@vgersh99 great that's awesome and I accepted it as a solution.

One thing I was wondering is that thanks to the output I can now inspect the intervals, so one thing happening is this one:

Value 21820641 is found between 535988 and 576070

which, in principle, shouldn't be based on the conditional...

vgersh99 · March 7, 2024, 2:21pm

Yeah, it's comparing strings - need to cast values to int:
change:

if (i>=$1 && i<=$2) {

to either:

if (i+0>=$1+0 && i+0<=$2+0) {

or

if (int(i)>=int($1) && int(i)<=int($2)) {

overcraft · March 7, 2024, 2:24pm

@vgersh99 makes sense! Really appreciated the help, thanks a lot.

MadeInGermany · March 7, 2024, 7:38pm

In my test the previous solution printed some duplicates.(?)
Had to pipe it to sort -u

The following version of over.awk prints the data in the wanted format (I think):

FNR == 1 { next } # Skip 1st line (header)
{ sub(/\r$/, "") } # DOS text to Unix text
# Pass 1: store fields 1 (as index) and 2 (as value)
NR == FNR { f12[$1]=$2; next }
# Pass 2: process for field 3 only
NF == 3 {
  count=0
  for (f in f12) {
    if ($3 >= f+0 && $3 <= f12[f]+0) {
      printf("[%d] Value %d is found between %d and %d\n", ++count, $3, f, f12[f])
    }
  }
}

overcraft · March 7, 2024, 10:08pm

@MadeInGermany well spotted! I actually, just realized that over the whole file I get the same lines stacked twice e.g

first_set
Value 21656294 is found between 21655867 and 21656781
Value 21820641 is found between 21820636 and 21820669
Value 21820641 is found between 21778502 and 21828502
Value 21820641 is found between 21820613 and 21821539
Value 21820641 is found between 21820527 and 21820756
Value 21820641 is found between 21820530 and 21820674
second_set
Value 21656294 is found between 21655867 and 21656781
Value 21820641 is found between 21820636 and 21820669
Value 21820641 is found between 21778502 and 21828502
Value 21820641 is found between 21820613 and 21821539
Value 21820641 is found between 21820527 and 21820756
Value 21820641 is found between 21820530 and 21820674

Your fix does that trick and adds some nice feature like re-starting the numbering at new instances!

MadeInGermany · March 8, 2024, 10:44am

I have found the bug here:

Fix: the next must happen regardless of the NF==3 condition.
FNR==NR { if (NF==3) f3[$3]; next }