Help writing awk code - Want to know how much of the data is covered by entries

Flebman · August 14, 2019, 12:07pm

Here is my data structure.

# id1    id2    len   start    end
# 9     16792   5475   4181     4232
# 11    16792   2317   1086     1137
# 11    32879   2317      8       60
# 11    32858   2317     10       52
# 11    30670   2317     17       63
# 14    12645    532      3       67
# 14    12645    532    158      222
# 14    11879    532      3      223
# 18    23847    644     64      285
# 18    30160    644     98      285
# 18    30160    644    345      477
# 18    30160    644    516      644

I want to compute the coverage of each id1 relative to its length (column len), taking the start and end values of all its entries into account. The problem is that entries can overlap, so simply summing every range would overestimate the coverage. Taking only the smallest start and the largest end does not work either, because there can be gaps between ranges where part of the length is not covered.

My expected result should be something like this

 9 --- 50 / 5475  = 0.009
11 --- 106 / 2317 = 0.046
14 --- 220 / 532  = 0.414
18 --- 481 / 644  = 0.75
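
A quick sketch of why the two shortcuts ruled out above overestimate, using the sample data. It assumes each range covers end - start positions, which is the reading that matches the 106, 220 and 481 in the expected result. The function name naive_coverage is made up for illustration.

```shell
# Recreate the sample data from the post in a file called "data".
cat > data <<'EOF'
# id1    id2    len   start    end
# 9     16792   5475   4181     4232
# 11    16792   2317   1086     1137
# 11    32879   2317      8       60
# 11    32858   2317     10       52
# 11    30670   2317     17       63
# 14    12645    532      3       67
# 14    12645    532    158      222
# 14    11879    532      3      223
# 18    23847    644     64      285
# 18    30160    644     98      285
# 18    30160    644    345      477
# 18    30160    644    516      644
EOF

# Hypothetical helper: for each id1, print the summed range lengths and
# the min-start-to-max-end span next to the sequence length.
naive_coverage() {
   awk '
   NR > 1 {                          # skip the header line
      id = $2                        # $1 is the leading "#"
      if (!(id in len)) {            # first entry for this id1
         order[n++] = id; len[id] = $4 + 0
         lo[id] = $5 + 0; hi[id] = $6 + 0
      }
      if ($5 + 0 < lo[id]) lo[id] = $5 + 0
      if ($6 + 0 > hi[id]) hi[id] = $6 + 0
      sum[id] += $6 - $5             # counts overlapping positions more than once
   }
   END {
      for (k = 0; k < n; k++) {
         id = order[k]
         printf "%d: summed=%d span=%d len=%d\n", id, sum[id], hi[id] - lo[id], len[id]
      }
   }' "$1"
}

naive_coverage data
```

For id1 18 this prints summed=668 and span=580 against a length of 644: the sum exceeds the whole length because of overlaps, and the span ignores the gaps at 285 to 345 and 477 to 516.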

Corona688 · August 14, 2019, 1:37pm

If you don't want the smallest range, and don't want the biggest range, then what do you want? The average?

rdrtx1 · August 14, 2019, 1:46pm

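awk '
NR > 1 {                                   # skip the header line
   # $1 is the leading "#", so id1 = $2, len = $4, start = $5, end = $6
   if (!id1[$2]++) {ids[idc++]=$2; len[$2]=$4;}
   # count each position from start up to (not including) end once per id1
   for (i=$5; i<$6; i++) if (!value[$2,i]++) coverage[$2]++;
}
END {
   # report in order of first appearance
   for (i=0; i<idc; i++)
      printf "%d --- %d / %d = %.3f\n", ids[i],
             coverage[ids[i]], len[ids[i]],
             (coverage[ids[i]] / len[ids[i]]);
}
' data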

Note: the first line (id 9) has only one range, 4181 to 4232, which the loop above counts as 51 positions rather than the 50 in your expected output. Check that value in the output shown.
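
For long sequences, marking every position in an array can use a lot of memory. A possible alternative, sketched here and not taken from the script above, is to sort the ranges and merge overlapping ones. It assumes the same column layout and the same end - start counting as the loop above, and the function name coverage_by_merge is made up for illustration.

```shell
# Recreate the sample data from the first post in a file called "data".
cat > data <<'EOF'
# id1    id2    len   start    end
# 9     16792   5475   4181     4232
# 11    16792   2317   1086     1137
# 11    32879   2317      8       60
# 11    32858   2317     10       52
# 11    30670   2317     17       63
# 14    12645    532      3       67
# 14    12645    532    158      222
# 14    11879    532      3      223
# 18    23847    644     64      285
# 18    30160    644     98      285
# 18    30160    644    345      477
# 18    30160    644    516      644
EOF

# Hypothetical helper: merge sorted ranges per id1 and report coverage.
coverage_by_merge() {
   # Drop the header, then sort numerically by id1 (field 2) and start (field 5).
   tail -n +2 "$1" | sort -k2,2n -k5,5n | awk '
   function report() {               # close the last run and print the id1 total
      covered += hi - lo
      printf "%d --- %d / %d = %.3f\n", cur, covered, len, covered / len
   }
   NR == 1 || $2 != cur {            # first row of a new id1
      if (NR > 1) report()
      cur = $2; len = $4 + 0; covered = 0
      lo = $5 + 0; hi = $6 + 0
      next
   }
   $5 + 0 > hi {                     # gap before this range: close the current run
      covered += hi - lo
      lo = $5 + 0; hi = $6 + 0
      next
   }
   $6 + 0 > hi { hi = $6 + 0 }       # overlapping or touching range: extend the run
   END { if (NR > 0) report() }
   '
}

coverage_by_merge data
```

On the sample data this prints 51, 106, 220 and 481 covered positions for id1 9, 11, 14 and 18. It only keeps one open range per id1 in memory, at the cost of a sort, and the output comes out ordered by id1 rather than by first appearance.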

Flebman · August 14, 2019, 1:54pm

Thanks for the help rdrtx1.
It worked great.