Unique entries based on a range of numbers.

Hi,

I have a matrix like this:

Algorithm	predicted_gene	start_point	end_point
A	x	65	85
B	x	70	80
C	x	75	85
D	x	10	20
B	y	125	130
C	y	120	140
D	y	200	210

The table has four tab-separated columns. The first column is the algorithm used for the prediction; there are four of them, A to D. The second column is the predicted target (a gene), here x or y. The third and fourth columns give the start and end of the predicted site within the gene's sequence.

I need to collapse the entries that share a gene in column 2 and whose ranges in columns 3 and 4 fall within a common range, like this:

Algorithm	predicted_gene	start_point	end_point	Number_of_algorithms_predicting_this_site
A, B, C	x	65	85	3
D	x	10	20	1
B, C	y	120	140	2
D	y	200	210	1

For example, in the first line algorithms A, B and C all predict gene x, and their predicted positions fall within the same site: 70-80 (algorithm B) and 75-85 (algorithm C) both lie inside 65-85, the position predicted by algorithm A. The last column shows how many algorithms predicted this site. By contrast, the site predicted by algorithm D for gene x does not coincide with the others, so it is listed on a separate line. The results for gene y follow the same logic.

I hope this is clear.

Thank you in advance.

Your description is not clear to me. Please explain how you arrive at the expected result shown above, and describe your algorithm.

I apologize if it was not clear, I'll modify the original post.

Here is an awk approach:

awk '
        BEGIN {
                print "Algorithm\tPrediction\tLower ragne"
        }
        function checkIDX(a)
        {
                n = split ( I[a], T, "," )
                for ( i = 1; i <= n; i++ )
                {
                        if ( T[i] == $1 )
                                F = 1
                }
                return F
        }
        NR > 1 {
                F = 0
                if ( $2 in A )
                {
                        split ( A[$2], R )
                        if ( $3 >= R[2] && $4 <= R[3] )
                        {
                                L[$2]++
                                if ( checkIDX($2) != 1 )
                                        I[$2] = I[$2] OFS $1
                        }
                        if ( $3 <= R[2] && $4 >= R[3] )
                        {
                                L[$2]++
                                if ( checkIDX($2) != 1 )
                                        I[$2] = I[$2] OFS $1
                                A[$2] = $2 "\t" $3 "\t" $4
                        }
                        if ( ( $3 > R[3] ) || ( $4 < R[2] ) )
                        {
                                print I[$2] "\t" A[$2] "\t" L[$2]
                                A[$2] = $2 "\t" $3 "\t" $4
                                L[$2] = 1
                                I[$2] = $1
                        }
                }
                if ( ! ( $2 in A ) )
                {
                        A[$2] = $2 "\t" $3 "\t" $4
                        L[$2]++
                        I[$2] = $1
                }
        }
        END {
                for ( k in I )
                {
                        print I[k] "\t" A[k] "\t" L[k]
                }
        }
' OFS=, file

Input

Algorithm       prediction
A       x       65      85
B       x       70      80
C       x       75      85
D       x       10      20
B       y       125     130
C       y       120     140
D       y       200     210

Output

Algorithm       Prediction      Lower ragne
A,B,C   x       65      85      3
B,C     y       120     140     2
D       x       10      20      1
D       y       200     210     1

Thank you, Yoda, for your time. I have edited my first post, since it was said to be unclear. The input file has four tab-delimited columns with four headers, and the output has one more column, so five in total. Could you please modify your script accordingly? Thanks.

Add your headers by editing the print statement in the BEGIN block. Please try to learn: 99.99% of the code has already been provided, and if you can't edit a small piece of header information yourself, there is little more I can say. Don't expect others to complete your task; put in a little effort.
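
For reference, the header change amounts to one print statement. A minimal sketch, assuming the five column names from the first post; it is wrapped in a shell function (print_header, a hypothetical name) only so that it can be run on its own. The tabs are written as literal \t because OFS is set to a comma on the command line of the script above.

```shell
# print_header is a hypothetical name, used only to make the snippet runnable.
print_header() {
    awk 'BEGIN {
        # Five tab-separated column names, taken from the first post.
        print "Algorithm\tpredicted_gene\tstart_point\tend_point\tNumber_of_algorithms_predicting_this_site"
    }'
}
```

Pasting this print statement over the one in the BEGIN block of the awk script changes only the header line; the data rows are unaffected.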

Perl approach:

#!/usr/bin/perl
use strict;
use warnings;

open my $input, "<", "$ARGV[0]" or die "cannot open file: $ARGV[0]";

my %ranges;
while (my $line = <$input>) {
  next if $. == 1;
  chomp $line;
  my ($alg, $pred, $lower, $upper) = split /[ \t]+/, $line;
  my $range = (grep {$lower>=(split /:/, $_)[0] && $lower<=(split /:/, $_)[1]} keys %ranges)[0];
  if ( !$range ) {
    push @{$ranges{"$lower:$upper"}{algs}}, $alg;
    $ranges{"$lower:$upper"}{pred} = $pred;
    search_and_include($lower, $upper, \%ranges);
  } else {
    push @{$ranges{$range}{algs}}, $alg;
    $ranges{$range}{pred} = $pred;
  }
}

foreach my $range (keys %ranges) {
  print "Algorithm\tpredicted_gene\tstart_point\tend_point\tNumber_of_algorithms_predicting_this_site\n";
  my $algs = join ", ", @{$ranges{$range}{algs}};
  my $algs_count = scalar @{$ranges{$range}{algs}};
  my ($lower, $upper) = split /:/, $range;
  print join "\t", $algs, $ranges{$range}{pred}, $lower, $upper, $algs_count;
  print "\n";
}

sub search_and_include {
  my ($lower_inc, $upper_inc, $ranges) = @_;
  foreach my $range (keys %ranges) {
    my ($lower, $upper) = split /:/, $range;
    if ($lower >= $lower_inc && $upper <= $upper_inc && ($lower ne $lower_inc || $upper ne $upper_inc)) {
      push @{$ranges{"$lower_inc:$upper_inc"}{algs}}, @{$ranges{$range}{algs}};
      delete $ranges{$range};
    }
  }
}

Run it like this:

./script.pl file

Hi Yoda,
I tried your script, and it works perfectly on the simplified sample I presented here. However, on my real samples it is not accurate: it does not include all the numbers in the range. I've attached an example of the input and the expected output. Could you please have a look at it?

Thank you very much in advance.

Hi,

Thank you very much for the help. Your script works well, with two problems: first, it prints the header before every row of the output; second, although it works very well on the simplified example, it does not on the real samples. I've attached two files, the input and the expected output. I'd appreciate it if you could modify the script.

Thank you very much in advance.
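
Without the attached files the cause on the real data cannot be confirmed. One thing can be checked from the awk script itself: its three tests cover a new range that lies inside the stored one, contains it, or is completely separate from it, so a range that only partially overlaps (say 60-70 against 65-85) matches none of them and is silently skipped. The following is a minimal sketch of an alternative, not a fix of either script above. It assumes that "same site" means ranges of the same gene that share at least one position, that the first line is a header, and that the columns are tab-separated. The name merge_sites is hypothetical. The header is printed once.

```shell
# merge_sites (hypothetical name): reads the 4-column table on stdin.
merge_sites() {
    # Drop the header, then sort by gene and numeric start so that
    # overlapping ranges of one gene arrive on consecutive lines.
    tail -n +2 | sort -t "$(printf '\t')" -k2,2 -k3,3n -k4,4n | awk '
        BEGIN {
            FS = OFS = "\t"
            print "Algorithm", "predicted_gene", "start_point", "end_point", "Number_of_algorithms_predicting_this_site"
        }
        # Print the finished group, if there is one.
        function emit(    i, list) {
            if (n == 0) return
            list = names[1]
            for (i = 2; i <= n; i++) list = list ", " names[i]
            print list, gene, lo, hi, n
        }
        # Record an algorithm once per group, keeping names alphabetical.
        function add_alg(name,    i, j) {
            for (i = 1; i <= n; i++)
                if (names[i] == name) return
            for (j = n; j >= 1 && (names[j] "") > (name ""); j--)
                names[j + 1] = names[j]
            names[j + 1] = name
            n++
        }
        {
            s = $3 + 0; e = $4 + 0
            if (n == 0 || $2 != gene || s > hi) {
                # New gene, or a start beyond the current site: new group.
                emit()
                n = 0; gene = $2; lo = s; hi = e
            } else if (e > hi) {
                # Partial overlap: extend the current site.
                hi = e
            }
            add_alg($1)
        }
        END { emit() }
    '
}
```

Run it as merge_sites < file. On the simplified sample it gives the same four groups as the expected output in the first post, ordered by gene and start position rather than in input order. The last column counts distinct algorithms, so an algorithm that reports two overlapping ranges for one gene is counted once.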