How to sum value of a column by range defined in another file awk?

yifangt · March 1, 2019, 2:01pm

I have two files, file1.table is the count table, and the other is the range condition file2.range.

file1.table
chr start end  count
N1 0 48  1
N1 48  181  2
N1 181 193  0
N1 193 326  2
N1 326 457  0
N1 457 471  1
N1 471 590  2
N1 590 604  1
N1 604 752  1
N1 752 875  1

file2.range
chr start end
N1      0     99
N1    100    199
N1    200    299
N1    300    399
N1    400    499
N1    500    599
N1    600    699
N1    700    799
N1    800    899
N1    900    999

The value in column 3 [correction: should be column 4] of file1 needs to be summed by the range defined in the file2 (sliding window), especially by column 2 (start position) of the range like this:

chr start end  count
N1      0     99    3
N1    100    199    2
N1    200    299    0
N1    300    399    0
N1    400    499    3
N1    500    599    1
N1    600    699    1
N1    700    799    0
N1    800    899    0
N1    900    999    0

There is a overlapping problem with some rows such as

N1 48  181  2 
N1 181 193  0

but I simply ignore it by choosing the start position [48] only at this moment.
Thanks a lot!

Corona688 · March 1, 2019, 2:15pm

What overlapping problem? Using column 3, those do not overlap. Are we intended to count column 2 as well?

Are all of them N1?

yifangt · March 1, 2019, 2:57pm

What overlapping problem? Using column 3, those do not overlap. Are we intended to count column 2 as well?
What I meant "overlapping" was for the range, for example:

this line N1 48 181 2 could overlap with two ranges:

N1   0    99   ?
N1 100   199   ?

so I just ignore columns 3 (181) and categorize it to range N1 0 99 .
Are all of them N1?
No, N1 means chromosome N1, so that there are 50 different strings, N1, N19, Scaff01 ... Sorry, I should provide a better sample with at least two chromosomes.

file1.table
N1    0    48    0
N1    48    181    2
N1    181    193    0
N1    193    326    2
N1    326    457    0
N1    457    471    1
N1    471    590    2
N1    590    604    1
N1    604    752    0
N1    752    875    1
N2    0    580    0
N2    580    592    1
N2    592    713    2
N2    568    627    1
N2    627    698    2
N2    698    701    3
N2    701    717    2
N2    713    724    1
N2    717    726    3

file2.range
chr start end 
N1      0     99   
N1    100    199 
N1    200    299
N1    300    399 
N1    400    499 
N1    500    599 
N1    600    699 
N1    700    799 
N1    800    899 
N1    900    999
N2      0     99 
N2    100    199 
N2    200    299 
N2    300    399 
N2    400    499 
N2    500    599 
N2    600    699 
N2    700    799 
N2    800    899 
N2    900    999

And output:

chr start end  count 
N1      0     99    3 
N1    100    199    2 
N1    200    299    0 
N1    300    399    0 
N1    400    499    3 
N1    500    599    1 
N1    600    699    1 
N1    700    799    0 
N1    800    899    0 
N1    900    999    0
N2      0     99    0 
N2    100    199    0 
N2    200    299    0 
N2    300    399    0 
N2    400    499    0 
N2    500    599    3 
N2    600    699    5 
N2    700    799    6 
N2    800    899    0 
N2    900    999    0

Corona688 · March 1, 2019, 3:29pm

We're using more than just column 3 then, the whole range of every row must be considered to know when to ignore and when not to.

Does this mean there's different ranges for every N1 as well?

RudiC · March 1, 2019, 3:43pm

Sure you want col 3? Not the count value in col 4? And, how are the count values shared between ranges? Are they evenly distributed?
Please expalin exactly how the result is computed, from what input, what algorithm.

yifangt · March 1, 2019, 3:50pm

Each N1 range is different without overlapping for sure, as they are evenly spaced except the last one. Say N1 has 7550bp long, that it is modulo-ed by 100, the last range would be N1 7500 7550.
If understand your question correctly, corona688.
Thanks RudiC, It should be column 4 as the "count" number. column 3 is the "end" coordinate.

Corona688 · March 1, 2019, 3:53pm

Is everything sorted? Can we depend on N1, N2, N3 being nicely grouped and coming in the same order in both file1 and file2? Order of the ranges doesn't necessarily need sorted.

yifangt · March 1, 2019, 3:55pm

Yes, they are grouped and sorted nicely.

RudiC · March 1, 2019, 4:12pm

This is not a final solution (range end missing, empty intervals missing), but a test of an algorithm that shows severe discrepancies to your desired result. Could you pls check and explain the descrepancies?

awk '{SUM[$1 OFS int($2/100)*100] += $4} END {for (s in SUM) print s, SUM}' OFS="\t" file1 | sort
N1	0	2
N1	100	2
N1	300	0
N1	400	3
N1	500	1
N1	600	0
N1	700	1
N2	0	0
N2	500	4
N2	600	5
N2	700	6

Corona688 · March 1, 2019, 4:26pm

The overlapping problem is ambiguous. If two rows overlap, which one wins, for a given bucket?

RudiC · March 1, 2019, 4:34pm

Try

awk '
FNR == 1        {next
                }
NR == FNR       {SUM[$1 OFS $2 OFS $3]
                }
                {for (s in SUM) {split (s, T, OFS)
                                 if ($1 == T[1] &&
                                     $2 >= T[2] &&
                                     $2 <= T[3]) SUM += $4
                                }
                }
END             {for (s in SUM)  print s, SUM     # {split (s, T, OFS) redundant, I think
                                                     #}
                }
' OFS="\t" file2 file1 | sort
N1    0    99    2
N1    100    199    2
N1    200    299    0
N1    300    399    0
N1    400    499    3
N1    500    599    1
N1    600    699    0
N1    700    799    1
N1    800    899    0
N1    900    999    0
N2    0    99    0
N2    100    199    0
N2    200    299    0
N2    300    399    0
N2    400    499    0
N2    500    599    4
N2    600    699    5
N2    700    799    6
N2    800    899    0
N2    900    999    0

nezabudka · March 2, 2019, 1:39am

Hi @RudiC, in the first example does not work as expected
Hi @yifangt,
The condition of the problem does not match the logic
Take the whole range from file2.range
The file has a range from 0 to 999 without gaps.
All values in file file1.table are in this range.
sum of counts in file 1 is equal 11
So in the output file all 11 should be presented
But in the expected result only 10 counts
It is asked by what algorithm the pattern N1 752 875 1 does not fall within the interval 700-999 ?

expected result:
N1    600    699    1   
N1    700    799    0   
N1    800    899    0   
N1    900    999    0

file1.table
N1 752 875  1

yifangt · March 4, 2019, 12:32pm

Thanks RudiC!
I like these tricks:

SUM[$1 OFS int($2/100)*100]     #A very good trick to me for simple situations
...
for (s in SUM) {split (s, T, OFS) 
    if ($1 == T[1] && $2 >= T[2] && $2 <= T[3]) 
          SUM += $4

The overlapping problem is quite complicated to me, which should be another topic I think.
Hi @nezbudka, the two input files were updated after the original message. Sorry for the confusion.
It is asked by what algorithm the pattern N1 752 875 1 does not fall within the interval 700-999 ?
No, I simplified this scenario to the interval 700-799 based on column 2 only (752 ignoring 875).

Thanks again to all of you!