How to find a specific sequence pattern in a fasta file?

dineshkumarsrk · November 29, 2019, 3:56am

I need to extract the following sequence pattern from a large FASTA file, gene.fasta (which contains multiple sequences), together with 5 flanking bases on each side of the match:

AAGCZ-N16-AAGCZ
Z represents A, C or G (any base except T).
N16 represents a stretch of exactly 16 bases, each of which may be A, T, G or C.
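
As an illustration, the notation above can be translated into a regular expression one piece at a time: Z becomes the character class [ACG], N16 becomes [ACGT]{16}, and each 5-base flank becomes [ACGT]{5}. The sketch below uses Python's re module; the name PATTERN is only illustrative, and the test string is the cyclin record from the sample file.

```python
import re

# AAGCZ-N16-AAGCZ with 5 flanking bases on each side:
#   Z   -> [ACG]       (A, C or G, never T)
#   N16 -> [ACGT]{16}  (exactly 16 bases of any kind)
PATTERN = re.compile(r"[ACGT]{5}AAGC[ACG][ACGT]{16}AAGC[ACG][ACGT]{5}")

# The cyclin record from the sample gene.fasta.
cyclin = "ATGAGATAGAAAGCCATAGATAGTAGTAGATAAGCATATAGATAGATAGTAGTGATA"

match = PATTERN.search(cyclin)
print(match.group(0) if match else "no match")
# prints ATAGAAAGCCATAGATAGTAGTAGATAAGCATATAG
```

Writing the classes out explicitly, rather than using "." and "[^T]", means that ambiguity codes such as N in the input can never be part of a match.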

The FASTA file, gene.fasta, looks like this:

>dox
ATGCTATGATAGTAGTAGATAGAGAGAGAGATAGATAGAGAGATAGATAG
>cyclin
ATGAGATAGAAAGCCATAGATAGTAGTAGATAAGCATATAGATAGATAGTAGTGATA
>fyl
TAGTAGTAGATAGATAGATGCGTACTGCTGATGATAGATGATAGATAGATAGATTAGAT
>tubulin
TAGATAGAAAGCAATAGATAGAACAAGATAAGCCTAGTCGTAGATGATAGATAG

The expected output looks like this:

>org1_1
ATAGAAAGCCATAGATAGTAGTAGATAAGCATATAG
>org1_2
ATAGAAAGCAATAGATAGAACAAGATAAGCCTAGTC

I tried to extract this pattern with grep, but I do not know how to restrict Z to only three bases, nor how to express the N16 criterion on the grep command line. I also do not know how to extract the 5 flanking bases along with the pattern. Any help would be appreciated.
Thank you in advance.

RudiC · November 29, 2019, 6:06am

How about this (assuming the FASTA sequences contain only A, C, G and T, so no test for other characters is needed):

grep -oE ".{5}AAGC[^T].{16}AAGC[^T].{5}" file
ATAGAAAGCCATAGATAGTAGTAGATAAGCATATAG
ATAGAAAGCAATAGATAGAACAAGATAAGCCTAGTC
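
Note that grep matches line by line, so the command above finds the pattern only when each sequence sits on a single line, as in the sample file. If the FASTA file wraps sequences over several lines, the lines of a record have to be joined first. A minimal Python sketch of that idea follows; read_fasta is a hypothetical helper, and the tubulin record is wrapped at 30 columns here purely for illustration.

```python
import io
import re

PATTERN = re.compile(r".{5}AAGC[^T].{16}AAGC[^T].{5}")

def read_fasta(handle):
    """Yield (name, sequence) pairs, joining wrapped sequence lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

# The tubulin record from the question, wrapped at 30 columns (illustration only).
wrapped = io.StringIO(
    ">tubulin\n"
    "TAGATAGAAAGCAATAGATAGAACAAGATA\n"
    "AGCCTAGTCGTAGATGATAGATAG\n"
)

hits = []
for name, seq in read_fasta(wrapped):
    found = PATTERN.search(seq)
    if found:
        hits.append((name, found.group(0)))

for name, hit in hits:
    print(name, hit)
# prints tubulin ATAGAAAGCAATAGATAGAACAAGATAAGCCTAGTC
```

For a real file, open("gene.fasta") would replace the io.StringIO object.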

MadeInGermany · November 29, 2019, 12:40pm

If you also want to print the >id then you can do it with perl:

perl -ne '/^>.*/ and $id=$&; /.{5}AAGC[^T].{16}AAGC[^T].{5}/ and printf "%s\n%s\n",$id,$&' gene.fasta
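
The perl command prints the original record id (>cyclin, >tubulin), whereas the expected output in the question uses headers of the form >org1_1, >org1_2. Assuming "org1" is simply a fixed prefix followed by a running hit counter (the question does not say), a Python sketch producing that layout could look like this; numbered_hits and its prefix parameter are hypothetical names.

```python
import re

PATTERN = re.compile(r".{5}AAGC[^T].{16}AAGC[^T].{5}")

# The four records from gene.fasta in the question.
records = [
    ("dox", "ATGCTATGATAGTAGTAGATAGAGAGAGAGATAGATAGAGAGATAGATAG"),
    ("cyclin", "ATGAGATAGAAAGCCATAGATAGTAGTAGATAAGCATATAGATAGATAGTAGTGATA"),
    ("fyl", "TAGTAGTAGATAGATAGATGCGTACTGCTGATGATAGATGATAGATAGATAGATTAGAT"),
    ("tubulin", "TAGATAGAAAGCAATAGATAGAACAAGATAAGCCTAGTCGTAGATGATAGATAG"),
]

def numbered_hits(records, prefix="org1"):
    """Return FASTA lines with headers >prefix_1, >prefix_2, ... for every hit."""
    lines = []
    count = 0
    for _name, seq in records:
        # finditer reports all non-overlapping matches in one sequence
        for found in PATTERN.finditer(seq):
            count += 1
            lines.append(f">{prefix}_{count}")
            lines.append(found.group(0))
    return lines

output = numbered_hits(records)
print("\n".join(output))
```

On the sample data this prints the two hits under >org1_1 and >org1_2, the same layout as the expected output in the question.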

Scrutinizer · November 29, 2019, 1:38pm

Can the pattern occur only once per sequence? If not, what should happen with multiple matches per sequence?
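
The answer matters because matches can overlap: the second AAGCZ of one hit can also be the first AAGCZ of the next. grep -o and a plain regex scan report only non-overlapping matches, so the second hit would be dropped. One way to collect every hit, sketched here in Python, is to wrap the pattern in a lookahead so that the scan advances one base at a time. The sequence below is synthetic, built only to demonstrate the overlap; it is not taken from the question's data.

```python
import re

MOTIF = r".{5}AAGC[^T].{16}AAGC[^T].{5}"
plain = re.compile(MOTIF)
# A zero-width lookahead consumes nothing, so every start position is tried.
overlapping = re.compile(rf"(?=({MOTIF}))")

# Synthetic sequence: three AAGCA cores separated by 16 bases each.
flank, core, spacer = "TTTTT", "AAGCA", "T" * 16
seq = flank + core + spacer + core + spacer + core + flank

plain_hits = [m.group(0) for m in plain.finditer(seq)]
all_hits = [(m.start(), m.group(1)) for m in overlapping.finditer(seq)]

print(len(plain_hits))               # prints 1
print([pos for pos, _ in all_hits])  # prints [0, 21]
```

Whether both overlapping hits should be reported is a decision for the original poster; the sketch only shows how to get them if they are wanted.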