I need to extract the following sequence pattern from a large FASTA file, gene.fasta (which contains multiple sequences), together with 5 flanking bases on each side of every match:
AAGCZ-N16-AAGCZ
Z represents A, C or G (any base except T).
N16 represents a stretch of exactly 16 bases, each of which may be A, T, G or C.
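To make the definition concrete, here is a minimal sketch of how the core pattern (without flanks) might be written as a regular expression in Python. The names `CORE_MOTIF` and `is_core_match` are hypothetical and chosen only for illustration, and the sketch assumes upper-case sequences containing only A, C, G and T.

```python
import re

# Hypothetical name. Z becomes [ACG] (no T); N16 becomes [ACGT]{16}.
CORE_MOTIF = re.compile(r"AAGC[ACG][ACGT]{16}AAGC[ACG]")


def is_core_match(seq: str) -> bool:
    """Return True if seq is exactly one AAGCZ-N16-AAGCZ motif (no flanks)."""
    return CORE_MOTIF.fullmatch(seq) is not None
```

A T in either Z position, or a spacer that is not exactly 16 bases long, makes `is_core_match` return False.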
My FASTA file looks like this:
gene.fasta
>dox
ATGCTATGATAGTAGTAGATAGAGAGAGAGATAGATAGAGAGATAGATAG
>cyclin
ATGAGATAGAAAGCCATAGATAGTAGTAGATAAGCATATAGATAGATAGTAGTGATA
>fyl
TAGTAGTAGATAGATAGATGCGTACTGCTGATGATAGATGATAGATAGATAGATTAGAT
>tubulin
TAGATAGAAAGCAATAGATAGAACAAGATAAGCCTAGTCGTAGATGATAGATAG
The expected output should look like this:
>org1_1
ATAGAAAGCCATAGATAGTAGTAGATAAGCATATAG
>org1_2
ATAGAAAGCAATAGATAGAACAAGATAAGCCTAGTC
To find this pattern I tried the grep command, but I do not know how to restrict Z to only three bases, and I could not express the N16 requirement on the grep command line. I also do not know how to extract the 5 flanking bases on each side of the pattern. Any help would be appreciated.
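To illustrate the kind of result I am after, here is a rough Python sketch, not a tested pipeline. The function names `read_fasta` and `mine_motifs` are hypothetical, and the `org1` prefix is simply copied from the expected output above. It assumes that a hit only counts when all 5 flanking bases exist on both sides, that sequences may be wrapped over several lines, and that hits are numbered consecutively across the whole file.

```python
import re

FLANK = 5  # flanking bases required on each side of the motif

# Z -> [ACG], N16 -> [ACGT]{16}. The lookahead lets overlapping hits be reported.
PATTERN = re.compile(
    r"(?=([ACGT]{%d}AAGC[ACG][ACGT]{16}AAGC[ACG][ACGT]{%d}))" % (FLANK, FLANK)
)


def read_fasta(lines):
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def mine_motifs(lines, prefix="org1"):
    """Return FASTA-formatted output lines, one record per flanked hit."""
    out = []
    count = 0
    for _header, seq in read_fasta(lines):
        for match in PATTERN.finditer(seq):
            count += 1
            out.append(">%s_%d" % (prefix, count))
            out.append(match.group(1))
    return out
```

The function accepts any iterable of lines, so an open file handle for gene.fasta can be passed directly and the returned lines printed one per row. Motifs closer than 5 bases to either end of a sequence are skipped under the assumption stated above.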
Thank you in advance.