The value in column 4 of file1 (the count; originally written as column 3) needs to be summed per range defined in file2 (sliding window), keyed on column 2 (the start position), like this:
What overlapping problem? Using column 3, those do not overlap. Are we intended to count column 2 as well?
What I meant by "overlapping" concerns the ranges. For example,
this line N1 48 181 2 could overlap with two ranges:
N1 0 99 ?
N1 100 199 ?
so I just ignore column 3 (181) and assign it to range N1 0 99.
Are all of them N1?
No, N1 means chromosome N1; there are 50 different strings: N1, N19, Scaff01 ... Sorry, I should have provided a better sample with at least two chromosomes.
Sure you want col 3? Not the count value in col 4? And, how are the count values shared between ranges? Are they evenly distributed?
Please explain exactly how the result is computed: from what input, and by what algorithm.
The N1 ranges definitely do not overlap each other: they are evenly spaced, except the last one. Say N1 is 7550 bp long and is split into windows of 100; the last range would then be N1 7500 7550.
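The window layout described above can be sketched in awk. This is only an illustration under assumptions: the input is a hypothetical two-column list of chromosome name and length (not one of the files in this thread), the window width defaults to 100, and `make_windows` is a made-up name. The last window is cut at the chromosome length, as in the N1 7500 7550 example.

```sh
# make_windows: hypothetical helper, not from the thread.
# Reads "name length" lines on stdin and prints "name start end"
# for fixed-width windows (width = $1, default 100).
make_windows() {
    awk -v w="${1:-100}" '
    {
        for (start = 0; start < $2; start += w) {
            end = start + w - 1
            if (end > $2) end = $2   # last window is cut at the chromosome length
            print $1, start, end
        }
    }'
}
```

For example, `printf 'N1 7550\n' | make_windows | tail -n 1` prints `N1 7500 7550`.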
If I understand your question correctly, corona688.
Thanks RudiC, it should be column 4 as the "count" number. Column 3 is the "end" coordinate.
Is everything sorted? Can we depend on N1, N2, N3 being nicely grouped and coming in the same order in both file1 and file2?
The ranges don't necessarily need to be in sorted order.
This is not a final solution (range end missing, empty intervals missing), but a test of an algorithm that shows severe discrepancies from your desired result. Could you please check and explain the discrepancies?
Hi @RudiC, the first example does not work as expected.
Hi @yifangt,
The conditions of the problem do not match the logic.
Take the whole range from file2.range.
The file covers the range from 0 to 999 without gaps.
All values in file1.table are in this range.
The sum of counts in file1.table is 11,
so all 11 should appear in the output file.
But the expected result contains only 10 counts.
The question is: by what algorithm does the record N1 752 875 1 not fall within the interval 700-999?
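The consistency check used in this argument (total counts in must equal total counts out) is easy to script. A minimal sketch, assuming the count sits in column 4 of both the input table and the result; `total_counts` is a hypothetical helper name:

```sh
# total_counts: hypothetical helper, not from the thread.
# Sums one column of stdin (column number = $1, default 4).
total_counts() {
    awk -v col="${1:-4}" '{ total += $col } END { print total + 0 }'
}
```

Running `total_counts < file1.table` and then the same command on the result file (whatever it is named) should give the same number if no record was dropped.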
SUM[$1 OFS int($2/100)*100] #A very good trick to me for simple situations
...
for (s in SUM) {split (s, T, OFS)
    if ($1 == T[1] && $2 >= T[2] && $2 <= T[3])
        SUM[s] += $4    # SUM is an array, so the element must be indexed
}
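For reference, the range-lookup idea in the fragment above could be completed roughly as follows. This is a sketch under assumptions, not the code originally posted: the range file has the columns chromosome, start, end; the table has chromosome, start, end, count; ranges do not overlap each other; and `sum_by_range` and the `ORDER` array are names invented here. Ranges without any hit are printed with a sum of 0.

```sh
# sum_by_range: hypothetical wrapper, not from the thread.
# $1 = range file (chrom start end), $2 = table file (chrom start end count)
sum_by_range() {
    awk '
    NR == FNR {                      # first file: remember every range once
        key = $1 OFS $2 OFS $3
        if (!(key in SUM)) { SUM[key] = 0; ORDER[++n] = key }
        next
    }
    {                                # second file: find the range holding the start
        for (i = 1; i <= n; i++) {
            split(ORDER[i], T, OFS)
            if ($1 == T[1] && $2 + 0 >= T[2] + 0 && $2 + 0 <= T[3] + 0) {
                SUM[ORDER[i]] += $4
                break                # assumes ranges do not overlap each other
            }
        }
    }
    END { for (i = 1; i <= n; i++) print ORDER[i], SUM[ORDER[i]] }
    ' "$1" "$2"
}
```

It would be called as `sum_by_range file2.range file1.table`. Every record is compared against every range, so this is slow for large inputs; it is meant to show the logic only.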
The overlapping problem is quite complicated for me; I think it should be a separate topic.
Hi @nezbudka, the two input files were updated after the original message. Sorry for the confusion.
The question is: by what algorithm does the record N1 752 875 1 not fall within the interval 700-999?
No, I simplified this scenario to the interval 700-799 based on column 2 only (752 ignoring 875).
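The column-2-only rule can be sketched with the `int($2/100)*100` trick quoted earlier. Assumptions: the window width is 100, the input has the columns chromosome, start, end, count, and `bin_by_start` is a hypothetical name. The printed end is simply start + width - 1, so a shortened last window and empty windows are not handled here.

```sh
# bin_by_start: hypothetical helper, not from the thread.
# Reads "chrom start end count" lines on stdin; window width = $1 (default 100).
bin_by_start() {
    awk -v w="${1:-100}" '
    {
        start = int($2 / w) * w      # 752 -> 700 when w = 100; column 3 is ignored
        SUM[$1 OFS start] += $4
    }
    END {
        for (k in SUM) {             # iteration order is unspecified, hence the sort
            split(k, T, OFS)
            print T[1], T[2], T[2] + w - 1, SUM[k]
        }
    }' | sort -k1,1 -k2,2n
}
```

For example, `printf 'N1 752 875 1\n' | bin_by_start` prints `N1 700 799 1`.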