Perl to adjust coordinates based on repeat string

cmccabe · August 18, 2018, 12:28pm

In the file below I am trying to count the given repeats of A,T,C,G in each string of letters. Each sequence is below the > and it is possible for a string of repeats to wrap from the line above. For example, in the first line the last letter is a T and the next lines has 3 more. I think the below would work, but I am also trying to count the position range of the repeat using the range= , where the first # represents the leftmost (in the first line that is aaa) and the second # rightmost (in the first line that is taa). So using the 4T as an example the output is in the example output.

The t{4} is the repeat that can change, for example if I am after 7g then that would be g{7} ... the lower case letters in the sequence are counted along with the capital letters if they satisfy the criteria. I think both would be captured in the order there are seen, as that is important to know. For example, 4t occurs at chr2:166911127-166911130 ... even though there are 6t in that strech only the 4t satisfy the criteria and are counted. An example output is in the output for two sequences. Thank you :).

file

>hg19_ncbiRefSeq_Gene range=chr2:166911123-166911301 5'pad=25 3'pad=25 strand=- repeatMasking=none
aaattttttggatgcttgttttcagATACACCTTCACAGGAATATATACT
TTTGAATCACTTATAAAAATTATTGCAAGGGGATTCTGTTTAGAAGATTT
TACTTTCCTTCGGGATCCATGGAACTGGCTCGATTTCACTGTCATTACAT
TTGCgtaagtgccttttttgaaactttaa
>hg19_ncbiRefSeq_Gene range=chr2:166909337-166909478 5'pad=25 3'pad=25 strand=- repeatMasking=none
tttgtgtgtgaactccctattacagGTACGTCACAGAGTTTGTGGACCTG
GGCAATGTCTCGGCATTGAGAACATTCAGAGTTCTCCGAGCATTGAAGAC

example output

TTTT chr2:166911173-166911176

description

the first T is 50 in so that is added to the 166911123 and that is the new value after the : and the last T is 53 so that is added to the 166911123 and that is the new value after the -.

perl

perl -076 -nE 'chomp; s/(.+)// && say qq{>$1}; s/\s//g; say $1 while /(t{4})/gi' file

output for two sequences

tttt chr2=166911127-166911130
TTTT chr2:166911173-166911176

Aia · August 20, 2018, 12:18am

In case that it is useful.

#!/usr/bin/perl
use strict;
use warnings;

my @fasta = ();
while(<>) {
    chomp;
    if(/^>/){
        range_match(\@fasta) if @fasta;
        @fasta = ();
        push @fasta, $_;
        next;
    }
    $fasta[1] .= $_;
}
range_match(\@fasta);

sub range_match {
    my $fref = shift;
    my ($header, $seq) = @{$fref};
    my ($mark, $beginning, $end) = split /[:-]/, (split/\s+/, $header)[1];

    while($seq =~ /[tT]{4}/g) {
        my $first = "@-" + 1;
        my $last = "@+";
        printf "%s %s:%s-%s\n", $&, $mark, ($beginning + $first), ($beginning + $last);
    }
}

$ ./read_fasta.pl fasta.file
tttt range=chr2:166911127-166911130
tttt range=chr2:166911142-166911145
TTTT range=chr2:166911173-166911176
TTTT range=chr2:166911221-166911224
tttt range=chr2:166911287-166911290

cmccabe · August 20, 2018, 9:33am

Thank you @Aia, your perl is much more complete than mine. Can the header be included with each sequence and if there is no repeat in that sequence Nothing Detected results? Thank you very much :).

>hg19_ncbiRefSeq_Gene range=chr2:166911123-166911301 5'pad=25 3'pad=25 strand=- repeatMasking=none
tttt range=chr2:166911127-166911130
tttt range=chr2:166911142-166911145
TTTT range=chr2:166911173-166911176
TTTT range=chr2:166911221-166911224
tttt range=chr2:166911287-166911290
>hg19_ncbiRefSeq_Gene range=chr2:166909337-166909478 5'pad=25 3'pad=25 strand=- repeatMasking=none
Nothing Detected

Aia · August 20, 2018, 7:00pm

Two changes, then. Make the subroutine range_match() to return matches and create a display() subroutine.

#!/usr/bin/perl
use strict;
use warnings;

my @fasta = ();
my @m_results = ();

while(<>) {
    chomp;
    if(/^>/){
        if (@fasta) {
            @m_results = range_match(\@fasta);
            display();
        }
        @fasta = ();
        push @fasta, $_;
        next;
    }
    $fasta[1] .= $_;
}
@m_results = range_match(\@fasta);
display();

sub range_match {
    my $fref = shift;
    my @matches;
    my ($header, $seq) = @{$fref};
    my ($mark, $beginning, $end) = split /[:-]/, (split/\s+/, $header)[1];

    while($seq =~ /[tT]{4}/g) {
        my $first = "@-" + 1;
        my $last = "@+";
        push @matches, sprintf "%s %s:%s-%s\n", $&, $mark, ($beginning + $first), ($beginning + $last);
    }
    return @matches;
}

sub display {
    print "$fasta[0]\n";
    print @m_results ? @m_results : "Nothing Detected\n";
    @m_results = ();
}

cmccabe · August 20, 2018, 8:14pm

The line in green, specifically the tT{4} , is what I manually change to capture the different stretches of aA, tT, cC, gG stretches from 6-25 . Is there a way to loop through the below possibilities, each changing automatically? It is always these combinations of letters only the length of the repeat changes. So, if I was capturing [aA]{8}/g for 8a/A or [aA]{18}/g for 18a/A . Thank you very much for your help I really appreciate it:).

That is:

[aA] {6 7 8 9 10 11 12 13 14 15 16 17 18 19 18 20 21 22 23 24 25}
[tT] {6 7 8 9 10 11 12 13 14 15 16 17 18 19 18 20 21 22 23 24 25}
[cC] {6 7 8 9 10 11 12 13 14 15 16 17 18 19 18 20 21 22 23 24 25}
[gG] {6 7 8 9 10 11 12 13 14 15 16 17 18 19 18 20 21 22 23 24 25}

while($seq =~ /[tT]{4}/g)

Aia · August 20, 2018, 9:26pm

Maybe

#!/usr/bin/perl
use strict;
use warnings;

my @fasta = ();
my @m_results = ();
my @regexs;
my @rows = ('a', 't', 'c', 'g');
my @colums = (6 .. 25);

for my $r (@rows) {
    for my $c (@colums) {
        push @regexs, "$r\{$c\}";
    }
}

while(<>) {
    chomp;
    if(/^>/){
        if (@fasta) {
            @m_results = range_match(\@fasta);
            display();
        }
        @fasta = ();
        push @fasta, $_;
        next;
    }
    $fasta[1] .= $_;
}
@m_results = range_match(\@fasta);
display();

sub range_match {
    my $fref = shift;
    my @matches;
    my ($header, $seq) = @{$fref};
    my ($mark, $beginning, $end) = split /[:-]/, (split/\s+/, $header)[1];

    for my $regex (@regexs) {
        while($seq =~ /$regex/ig) {
            my $first = "@-" + 1;
            my $last = "@+";
            push @matches, sprintf "%s %s:%s-%s\n", $&, $mark, ($beginning + $first), ($beginning + $last);
        }
    }
    return @matches;
}

sub display {
    print "$fasta[0]\n";
    print @m_results ? @m_results : "Nothing Detected\n";
    @m_results = ();
}

cmccabe · August 21, 2018, 3:17pm

Thank you very much, works great.... very nice :).

------ Post updated at 02:17 PM ------

Looking into the data there are a couple of modifications that I am unsure how to incorporate. The streches are detected and the loop is amazing and very helpful. The math is slightly different and the repeat stretch is opposite, hopefully the examples will help to show but if strand=- then the bases in blue (rightmost) are the start or first number after the : and the repeats in green are opposite:

that is:
A or a is T or t
T or t is A or a
C or c is G or g
G or g is C or c

Thank you very much :).

input

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
catacgactttcttttttcaaacagGATATCATTATTTCCTGGAGGGTTT
TTTAGATGCACTACTATGTGGAAATAGCTCTGATGCAGGgtaagtcaata
atttgtgtgtatct
>hg19_ncbiRefSeq_SCN1A range=chr2:166895908-166896131 5'pad=25 3'pad=25 strand=- repeatMasking=none
tagattaattttgttttgatcttagGTTTTCACTGGGATCTTTACAGCAG
AAATGTTTCTGAAAATTATTGCCATGGATCCTTACTATTATTTCCAAGAA
GGCTGGAATATCTTTGACGGTTTTATTGTGACGCTTAGCCTGGTAGAACT
TGGACTCGCCAATGTGGAAGGATTATCTGTTCTCCGTTCATTTCGATTGg
taaaaaaaaaaaaaaaaagcacca
>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168546-166168551
>hg19_ncbiRefSeq_SLC2A1 range=chr1:43393251-43393504 5'pad=25 3'pad=25 strand=- repeatMasking=none
No Homopolymers Detected

desired output

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
AAAAAA chr2:166905432-166905437
aaaaaa chr2:166905467-166905472
>hg19_ncbiRefSeq_SCN1A range=chr2:166895908-166896131 5'pad=25 3'pad=25 strand=- repeatMasking=none
ttttttttttttttt range=chr2:166896110-166896124
tttttttttttttttt range=chr2:166896110-166896125
ttttttttttttttttt range=chr2:166896110-166896126
>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168545-166168550
>hg19_ncbiRefSeq_SLC2A1 range=chr1:43393251-43393504 5'pad=25 3'pad=25 strand=- repeatMasking=none
Nothing Detected

In the line with > if the strand=+ then everything is good except that the calculated range=chr2:166170446-166170451 is 1 digit off. An example is

>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168546-166168551

is

>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168545-166168550

In the line with > if the strand=- then the math is slightly different in that the first # is really the end position. An example is,

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
catacgactttcttttttcaaacagGATATCATTATTTCCTGGAGGGTTT
TTTAGATGCACTACTATGTGGAAATAGCTCTGATGCAGGgtaagtcaata
atttgtgtgtatct

is

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
AAAAAA chr2:166905432-166905437
aaaaaa chr2:166905467-166905472

If Nothing Detected then the line is as is. Thank you again :).

Aia · August 22, 2018, 12:34am

cmccabe:

In the line with > if the strand=+ then everything is good except that the calculated range=chr2:166170446-166170451 is 1 digit off. An example is
>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168546-166168551
is
>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168545-166168550

Does the sequence starts at 0 or does it start at 1?
Could you post the sequence that produces that result?

cmccabe:

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
catacgactttcttttttcaaacagGATATCATTATTTCCTGGAGGGTTT
TTTAGATGCACTACTATGTGGAAATAGCTCTGATGCAGGgtaagtcaata
atttgtgtgtatct

is

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
AAAAAA chr2:166905432-166905437
aaaaaa chr2:166905467-166905472

[...]

I suppose I am not understanding. I cannot see how the above example AAAAAA or aaaaaa can be a match on the sequence

catacgactttcttttttcaaacagGATATCATTATTTCCTGGAGGGTTT
TTTAGATGCACTACTATGTGGAAATAGCTCTGATGCAGGgtaagtcaata
atttgtgtgtatct

cmccabe · August 22, 2018, 8:44am

1-based coordinate system where the t in green is 166168510 and the t in blue is 166168623 . For a sequence that has strand=+ in the > line, the above will always be true and as you move to the right in the sequence the numbers increase from left to right as in the example below.

input

>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
tttctattcctttctctttaaatagGTCACTTTTATTTTTTAGAGGGGCA
AAATGATGCTCTGCTTTGTGGCAACAGCTCAGATGCAGGgtaagtgatgc
ttcctactgagttt

current output

>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168546-166168551

output with correct coordinates (the first T in the repeat is 35 away and there are 5 more repeat T in the sequence).

>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168545-166168550

If in the line with > strand=- , then the last t in green is 166905371 and the first c in blue is 166905484 , the same 1-based coordinate system is applied, but the opposite orientation is true for strand=- and a compliment rule is applied and the numbers decrease from left to right as in the example.

input

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
catacgactttcttttttcaaacagGATATCATTATTTCCTGGAGGGTTTTTTAGATGCACTACTATGTGGAAATAGCTCTGATGCAGGgtaagtcaata
atttgtgtgtatct

Compliment rules

Original repeat is A or a Compliment is T or t
Original repeat is T or t is Compliment is A or a
Original repeat is C or c is Compliment is G or g
Original repeat is G or g is Compliment is C or c

Since the first 6t in the repeat is at 166905472 and there are 5 more t after 166905467 . The first coordinate (166905472), is the end, and the last coordinate (166905467) is the start. To further complicate things since the stand=- the 6t is really 6a following the Compliment rules.

Since the next 6T in the repeat is at 166905437 and there are 5 more t after 166905432 . The first coordinate (166905437), is the end, and the last coordinate (166905432) is the start. To further complicate things since the stand=- the 6t is really 6A following the Compliment rules.

desired output

>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
AAAAAA chr2:166905432-166905437 --- these are the 6T in red
aaaaaa chr2:166905467-166905472  ---- the are the 6t in purple

The strand=- is a different process and unfortunately the rules that it uses only make it trickier.

I hope this helps and Thank you very much, I really appreciate all your help :).

Aia · August 22, 2018, 8:04pm

cat example.fasta 
>hg19_ncbiRefSeq_Gene range=chr2:166911123-166911301 5'pad=25 3'pad=25 strand=- repeatMasking=none
aaattttttggatgcttgttttcagATACACCTTCACAGGAATATATACT
TTTGAATCACTTATAAAAATTATTGCAAGGGGATTCTGTTTAGAAGATTT
TACTTTCCTTCGGGATCCATGGAACTGGCTCGATTTCACTGTCATTACAT
TTGCgtaagtgccttttttgaaactttaa
>hg19_ncbiRefSeq_Gene range=chr2:166909337-166909478 5'pad=25 3'pad=25 strand=- repeatMasking=none
tttgtgtgtgaactccctattacagGTACGTCACAGAGTTTGTGGACCTG
GGCAATGTCTCGGCATTGAGAACATTCAGAGTTCTCCGAGCATTGAAGAC
>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
tttctattcctttctctttaaatagGTCACTTTTATTTTTTAGAGGGGCA
AAATGATGCTCTGCTTTGTGGCAACAGCTCAGATGCAGGgtaagtgatgc
ttcctactgagttt
>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
catacgactttcttttttcaaacagGATATCATTATTTCCTGGAGGGTTTTTTAGATGCACTACTATGTGGAAATAGCTCTGATGCAGGgtaagtcaata
atttgtgtgtatct

#!/usr/bin/perl
use strict;
use warnings;

my @fasta = ();
my @m_results = ();
my @regexs;
my @rows = ('a', 't', 'c', 'g');
my @colums = (6 .. 25);

for my $r (@rows) {
    for my $c (@colums) {
        push @regexs, "$r\{$c\}";
    }
}

while(<>) {
    chomp;
    if(/^>/){
        if (@fasta) {
            @m_results = range_match(\@fasta);
            display();
        }
        @fasta = ();
        push @fasta, $_;
        next;
    }
    $fasta[1] .= $_;
}
@m_results = range_match(\@fasta);
display();

sub range_match {
    my $fref = shift;
    my @matches;
    my ($header, $seq) = @{$fref};
    my @hd_tokens = split /\s+/, $header;
    my ($mark, $beginning, $end) = split /[:-]/, $hd_tokens[1];

    if ($hd_tokens[4] =~ /=-/) {
        $seq =~ tr/[TtAaCcGg]/[AaTtGgCc]/;
        $seq = reverse $seq;
    }

    for my $regex (@regexs) {
        while($seq =~ /$regex/ig) {
            my $first = $-[0];
            my $last = $+[0] -1;
            push @matches, sprintf "%s %s:%s-%s\n", 
                $&, $mark, ($beginning + $first), ($beginning + $last);
        }
    }
    return @matches;
}

sub display {
    print "$fasta[0]\n";
    print @m_results ? @m_results : "Nothing Detected\n";
    @m_results = ();
}

Output:

 ./read_fasta.pl example.fasta
>hg19_ncbiRefSeq_Gene range=chr2:166911123-166911301 5'pad=25 3'pad=25 strand=- repeatMasking=none
aaaaaa range=chr2:166911133-166911138
aaaaaa range=chr2:166911293-166911298
>hg19_ncbiRefSeq_Gene range=chr2:166909337-166909478 5'pad=25 3'pad=25 strand=- repeatMasking=none
Nothing Detected
>hg19_ncbiRefSeq_SCN2A range=chr2:166168510-166168623 5'pad=25 3'pad=25 strand=+ repeatMasking=none
TTTTTT range=chr2:166168545-166168550
>hg19_ncbiRefSeq_SCN1A range=chr2:166905371-166905484 5'pad=25 3'pad=25 strand=- repeatMasking=none
AAAAAA range=chr2:166905432-166905437
aaaaaa range=chr2:166905467-166905472

cmccabe · August 24, 2018, 8:35am

Thank you very much, works great :).