In the file below I am trying to count a given repeat of A, T, C, or G in each sequence. Each sequence starts on the line after its > header, and a run of repeats can wrap from one line onto the next. For example, the first sequence line ends in a T and the next line starts with 3 more.
I think the command below finds the repeats, but I am also trying to report the position range of each repeat using the range= field in the header, where the first number is the position of the leftmost base (in the first sequence that is the aaa) and the second number is the position of the rightmost base (in the first sequence that is the taa). Using 4T as an example, the result I want is shown in the example output.
The t{4} in the pattern is the repeat and can change: for example, if I am after 7g then it would be g{7}. Lowercase letters in the sequence are counted along with capital letters if they satisfy the criteria. Both should be reported in the order they are seen, as that is important to know. For example, 4t occurs at chr2:166911127-166911130: even though there are 6t in that stretch, only 4t satisfy the criteria and are counted. An example is in the output for two sequences. Thank you :).
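The matching rule described above can be sketched in Python (not the Perl used in this post). This is only an illustration under stated assumptions: the function name find_repeats and its parameters are made up here, and the lines of a sequence are assumed to be joined before matching so that a wrapped run counts as one stretch.

```python
import re

def find_repeats(sequence, base, count):
    """Return (text, first, last) for each run of `count` copies of `base`.

    Positions are 1-based and inclusive. Matching ignores case and does not
    overlap, so a stretch of 6 t's yields a single 4t hit.
    """
    pattern = re.compile(f"{re.escape(base)}{{{count}}}", re.IGNORECASE)
    return [(m.group(), m.start() + 1, m.end()) for m in pattern.finditer(sequence)]

# First sequence line from the post, joined to the start of the second line
# so that the run wrapping across the line break is seen as one stretch.
joined = "aaattttttggatgcttgttttcagATACACCTTCACAGGAATATATACT" + "TTTGAATCAC"

hits = find_repeats(joined, "t", 4)
for text, first, last in hits:
    print(text, first, last)
```

On this input the hits are at positions 4-7, 19-22 and 50-53; the last one is the TTTT that wraps across the line break. Calling find_repeats(joined, "g", 7) would look for 7g instead.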
file
>hg19_ncbiRefSeq_Gene range=chr2:166911123-166911301 5'pad=25 3'pad=25 strand=- repeatMasking=none
aaattttttggatgcttgttttcagATACACCTTCACAGGAATATATACT
TTTGAATCACTTATAAAAATTATTGCAAGGGGATTCTGTTTAGAAGATTT
TACTTTCCTTCGGGATCCATGGAACTGGCTCGATTTCACTGTCATTACAT
TTGCgtaagtgccttttttgaaactttaa
>hg19_ncbiRefSeq_Gene range=chr2:166909337-166909478 5'pad=25 3'pad=25 strand=- repeatMasking=none
tttgtgtgtgaactccctattacagGTACGTCACAGAGTTTGTGGACCTG
GGCAATGTCTCGGCATTGAGAACATTCAGAGTTCTCCGAGCATTGAAGAC
example output
TTTT chr2:166911173-166911176
description
The first T is 50 bases into the sequence, so 50 is added to 166911123 and that is the new value after the : (166911173). The last T is 53 bases in, so 53 is added to 166911123 and that is the new value after the - (166911176).
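The arithmetic in the description can be sketched end to end in Python (again, not the Perl from this post). It is a rough illustration, not a definitive solution: repeat_ranges and HEADER_RANGE are hypothetical names, the coordinate is taken to be the range= start plus the 1-based position in the sequence exactly as described above, and strand=- is not treated specially.

```python
import re

# Pulls the chromosome and start coordinate out of "range=chr2:166911123-166911301".
HEADER_RANGE = re.compile(r"range=(\w+):(\d+)-\d+")

def repeat_ranges(fasta_text, base, count):
    """List 'MATCH chrom:first-last' strings in the order the repeats are seen."""
    pattern = re.compile(f"{re.escape(base)}{{{count}}}", re.IGNORECASE)
    results = []
    # Assumes ">" only appears at the start of header lines.
    for record in fasta_text.split(">")[1:]:
        header, _, body = record.partition("\n")
        found = HEADER_RANGE.search(header)
        if found is None:
            continue  # no range= field, so no coordinates to report
        chrom, range_start = found.group(1), int(found.group(2))
        # Remove line breaks so repeats that wrap across lines are contiguous.
        sequence = "".join(body.split())
        for m in pattern.finditer(sequence):
            first = range_start + m.start() + 1  # 1-based position of first base
            last = range_start + m.end()         # 1-based position of last base
            results.append(f"{m.group()} {chrom}:{first}-{last}")
    return results

# First record from the post's file.
demo = """>hg19_ncbiRefSeq_Gene range=chr2:166911123-166911301 5'pad=25 3'pad=25 strand=- repeatMasking=none
aaattttttggatgcttgttttcagATACACCTTCACAGGAATATATACT
TTTGAATCACTTATAAAAATTATTGCAAGGGGATTCTGTTTAGAAGATTT
TACTTTCCTTCGGGATCCATGGAACTGGCTCGATTTCACTGTCATTACAT
TTGCgtaagtgccttttttgaaactttaa
"""

ranges = repeat_ranges(demo, "t", 4)
for line in ranges:
    print(line)
```

Run on the first record, this prints five hits in sequence order, among them tttt chr2:166911127-166911130 and TTTT chr2:166911173-166911176.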
perl
perl -076 -nE 'chomp; s/(.+)// && say qq{>$1}; s/\s//g; say $1 while /(t{4})/gi' file
output for two sequences
tttt chr2:166911127-166911130
TTTT chr2:166911173-166911176