Hi there I have a file as such:
0 1868 21656294
207666 247863 21820641
535988 576070
1126564 1126946
1154056 1154132
1155526 1155964
1156013 1157103
1902995 1902997
.
.
.
where the first two columns indicate chromosome intervals and the third one the beginning/ending of a region of interest.
What I need to know is whether and how many times the two values in the third column appear in any interval defined in the first two, the problem of course is that the first two columns have 938089 pairs of values, so I was looking for a potential way to do it with awk
possibly unless there are better solutions.
Thanks in advance for you help!
A small MORE representative sample would help.
Also give the a revised data sample, could you provide a desired output given the input. pls?
Anything you've tried on your own yet?
1 Like
Unfortunately, I've looked up a bit of options but I'm not sure they are actually working well since the shell stuck at some point. Here is the reference at one of them.
Again, I kind of have the concept in my mind of what to do but due to my limited experience I resorted to look up/ask whether someone more practical already came up with something.
As per the mwe, I can attach the first 1000 lines of the file. I hope this helps!
upload.txt (15.6 KB)
P. S. any output similar to the one in the link above saying what value and in which interval is found will work, the main thing is to know the number of times this happens
@overcraft
well... there were some useful leads in this SO post.
Have you tried any of them?
If so, what did and what didn't work?
Here's my take with the simplified out:
[3rdColumnValue] [numberOfRangesItAppearsIn]
awk -f over.awk inputFile.txt inputFile.txt
where over.awk
is:
FNR==1{next}
FNR==NR && NF==3 { f3[$3];next}
{
for (i in f3)
if (i>=$1 && i<=$2)
count[i]++
}
END {
for(i in f3)
printf("[%d]: [%d]\n", i, count[i])
}
Given your upload.txt
:
awk -f over.awk upload.txt upload.txt
produces:
[21656294]: [1]
[21820641]: [1981]
given the data sample in SOpost.txt on SO:
tart End Value
199146 199389 772617
377581 379032 779286
277876 279322 779287
255497 255843 809151
224415 224896 809550
246516 246946 810776
700000 800000 810777
363597 364598 810802
365242 365735 810898
454121 548113
413324 844816
awk -f over.awk SOpost.txt SOpost.txt
produces:
[810777]: [2]
[810802]: [2]
[810776]: [2]
[809550]: [2]
[810898]: [2]
[779287]: [3]
[809151]: [2]
[772617]: [3]
[779286]: [3]
Are you looking for something else?
If so, what is it? And what's wrong with the SO post suggestions?
Looks like the counting is what you're after....
1 Like
@vgersh99 that's great! Thanks, I'm getting the values in the third column and how many times are repeated in the intervals specified in the first two columns over the whole document.
For instance, in my whole big file I get the following:
[21656294]: [29]
[21820641]: [91361]
However, is there a way to tell also the ranges where this happens like in the SO post? Possibly adding something like:
print "Value", i, "is found between", ... , "and", ...
but I don't know how to refer to the intervals from your script... So, in the end I would imagine and output as such:
[21656294]: [29]
Value 21656294 is found between <start_interval> and <end_interval>
.
.
.
[21820641]: [91361]
.
.
.
Although it would create an inconvenient stdout
I might redirect it to a file so that I can inspect those intervals and see how to better move forward for the next steps of the analyses.
@overcraft
I'm not going to gold-plate the initial implementation to fit your/SO's new output format, but I'll slightly change it.
FNR==1{next}
FNR==NR && NF==3 { f3[$3];next}
{
for (i in f3)
if (i>=$1 && i<=$2) {
printf("Value %d is found between %d and %d\n", i, $1, $2)
# count[i]++
}
}
END {
# for(i in f3)
# printf("[%d]: [%d]\n", i, count[i])
}
@vgersh99 great that's awesome and I accepted it as a solution.
One thing I was wondering is that thanks to the output I can now inspect the intervals, so one thing happening is this one:
Value 21820641 is found between 535988 and 576070
which, in principle, shouldn't be based on the conditional...
Yeah, it's comparing strings - need to cast values to int
:
change:
if (i>=$1 && i<=$2) {
to either:
if (i+0>=$1+0 && i+0<=$2+0) {
or
if (int(i)>=int($1) && int(i)<=int($2)) {
1 Like
@vgersh99 makes sense! Really appreciated the help, thanks a lot.
1 Like
In my test the previous solution printed some duplicates.(?)
Had to pipe it to sort -u
The following version of over.awk
prints the data in the wanted format (I think):
FNR == 1 { next } # Skip 1st line (header)
{ sub(/\r$/, "") } # DOS text to Unix text
# Pass 1: store fields 1 (as index) and 2 (as value)
NR == FNR { f12[$1]=$2; next }
# Pass 2: process for field 3 only
NF == 3 {
count=0
for (f in f12) {
if ($3 >= f+0 && $3 <= f12[f]+0) {
printf("[%d] Value %d is found between %d and %d\n", ++count, $3, f, f12[f])
}
}
}
1 Like
@MadeInGermany well spotted! I actually, just realized that over the whole file I get the same lines stacked twice e.g
first_set
Value 21656294 is found between 21655867 and 21656781
Value 21820641 is found between 21820636 and 21820669
Value 21820641 is found between 21778502 and 21828502
Value 21820641 is found between 21820613 and 21821539
Value 21820641 is found between 21820527 and 21820756
Value 21820641 is found between 21820530 and 21820674
second_set
Value 21656294 is found between 21655867 and 21656781
Value 21820641 is found between 21820636 and 21820669
Value 21820641 is found between 21778502 and 21828502
Value 21820641 is found between 21820613 and 21821539
Value 21820641 is found between 21820527 and 21820756
Value 21820641 is found between 21820530 and 21820674
Your fix does that trick and adds some nice feature like re-starting the numbering at new instances!
I have found the bug here:
Fix: the next
must happen regardless of the NF==3
condition.
FNR==NR { if (NF==3) f3[$3]; next }
2 Likes