Parsing a fasta sequence with start and end coordinates

empyrean · April 15, 2011, 1:29am

Hi. I have separate chromosome sequences and I want to extract some regions of a chromosome based on start and end coordinates. How can I achieve this?

For example, Chr 1 is in the following format:

The region 2-10 should give me AATTCCAAA

and in a similar way 15-25 should give me AAGATTGCAT

and 27-30 should give me AGTT

How can I do this in Perl, BioPerl, awk, or any other way?

yinyuemi · April 15, 2011, 1:43am

awk -v start=2 -v end=10 -v chr=chr1 '$0~chr{getline seq; print substr(seq,start,end-start+1)}' sequence
AATTCCAAA

awk -v start=15 -v end=25 -v chr=chr1 '$0~chr{getline seq; print substr(seq,start,end-start+1)}' sequence
AAGATTGCATC
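
One caveat on the one-liner above: getline reads only the single line after the header, so it assumes the whole sequence sits on one line. If the FASTA file is wrapped across many lines, a rough sketch like the following might work instead. It joins the sequence lines of the matching record before taking the substring. The function name extract_region and its argument order are made up for illustration, and like the original it matches the header by regex, so chr1 would also match chr10.

```shell
# extract_region FILE CHR START END
# Hypothetical helper (not from the thread): prints bases START..END
# (1-based, inclusive) of the record whose header line matches CHR,
# joining wrapped sequence lines before slicing.
extract_region() {
    awk -v chr="$2" -v start="$3" -v end="$4" '
        /^>/ { keep = ($0 ~ chr); next }   # header line: is this the wanted record?
        keep { seq = seq $0 }              # append each sequence line of that record
        END  { print substr(seq, start, end - start + 1) }
    ' "$1"
}
```

It would be called as extract_region sequence chr1 15 25, where sequence is the same input file name used in the commands above.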

empyrean · April 15, 2011, 1:47am

Thanks for the reply. I am pretty new to awk programming. I have chromosome 1 in a FASTA file, so where should I give it as input?

michaelrozar17 · April 15, 2011, 1:50am

Using the cut command:

cut -c2-10 inputfile

empyrean · April 15, 2011, 1:56am

The cut command is not working properly. It cuts the whole file into lines of 10-character fragments.

yinyuemi · April 15, 2011, 1:59am

I think it should be ok if you use the whole fasta file as the input file:

awk '{CODE}' fasta

empyrean · April 15, 2011, 2:08am

No, it's not giving correct results. I have a FASTA file 300,000 bp long, but I need the sequences for some specific sites. The awk code above only gives sequence from one line, no matter how long a range you give. Also, if the start site is after the first line, we get no output for it.

yinyuemi · April 15, 2011, 2:15am

Could you upload a small part of the FASTA file as an example, along with your expected output?

michaelrozar17 · April 15, 2011, 2:23am

Yes. The given cut command is intended to work as you described. You need to change the start and end character indexes accordingly. If you only need to extract characters from certain matching lines, please provide more information.

cut -c15-25 inputfile
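
Since cut -c counts characters within each line separately, one possible workaround for a wrapped FASTA file is to drop the header and join the sequence lines before cutting. This is only a sketch: the function name cut_region is made up for illustration, and it assumes the file holds a single record, since all sequence lines are joined together.

```shell
# cut_region FILE START END
# Hypothetical helper (not from the thread): removes header lines, joins the
# remaining lines into one long line, then cuts characters START..END
# (1-based, inclusive). Assumes a single-record FASTA file.
cut_region() {
    grep -v '^>' "$1" | tr -d '\n' | cut -c"$2"-"$3"
}
```

For the example above this would be cut_region inputfile 15 25.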