Extracting DNA sequences from GenBank files using Perl

akreibich07 · June 23, 2009, 3:28pm

Hi all,

Using Perl, I need to extract DNA bases from a GenBank file for a given plant species. A sample GenBank file is here...

This is saved on my computer as NC_001666.gb. I also have a file that is saved on my computer as NC_001666.txt. This text file has a list of all the genes and their positions in the species corresponding to NC_001666 (corn). Here is a sample of how the text file is formatted...

rbcL (56874..58304)
atpB -(54618..56114)
atpE -(54208..54621)
trnM (54020..54092)
trnV -(53158..53834)

For example, if in my command prompt I give input of the program name, the species number that I want, and the specific gene from that species whose DNA sequence I want:

perl nucleotide_bases.pl NC_001666 trnM

The program would go into NC_001666.txt, find trnM, see that it has a range from 54020 to 54092 and is on the positive strand(no negative sign). The program then goes into NC_001666.gb, goes to the long list of DNA bases at the bottom and starts at position 54020 and returns all base letters through 54092 (inclusively). So for this specific trnM, the output would be:

gcctacttaactcagtggttagagtattgctttcatacggcgggagtcattggttcaaatccaatagtaggta

If a gene has a negative next to the position range (meaning it's on the negative strand of DNA), the output should be reversed, starting from the higher position, going to the lower. Also, when a negative is there, in that output, all A's should be switched to T's, and all G's to C's and vice versa.

Also, if a gene appears more than once in a text file, give an error message that it appears more than once, and end the program.

If I could get a Perl script to return this information for any species (NC_number) I want, and any gene from that species that I want, it would be a great help in the research I am conducting. Thank you all for your time, and any help on how to write this script would be appreciated.

-akreibich07

KevinADC · June 23, 2009, 4:30pm

What you want is "Beginning Perl for Bioinformatics" you can purchase on amazon.com. Also look into BioPerl

Neo · June 23, 2009, 4:53pm

Excellent thanks!

I thought this article on the BioPERL wiki was great, How Perl Saved the Human Genome Project.

KevinADC · June 24, 2009, 2:42am

Thanks, but apparently the OP did not like my suggestions, he has recently posted the same question on more perl forums without replying to this thread. Oh well, some people just want the code. :o

otheus · June 24, 2009, 4:54am

I'll try to help with some code (one of those days):

($root,$gene) = (@ARGV);
open(TXT,$root.".txt") || die "Cannot open $root.txt";
open(GB,  $root.".gb") || die "Cannot open $root.gb";

$found=0;
while (<TXT>) { 
    ($start,$stop) = m/^$gene \((\d+)..(\d+)\)/io || next;
    $found = 1; last;
}
die "Did not find gene $gene in $root.txt" unless $found;

$found = 0;
while (<DB>) { 
  chop; 
  @x=split;
  if ($x[0] >= $start) { 
      # start-logic here;
  }
}
while (<DB>) { 
  chop; 
  @x=split;
  if ($x[0] < $stop) { 
      # stop-logic here
  }
}
# print logic here

akreibich07 · June 24, 2009, 1:57pm

Thank you for the start to the code. I should be able to take it from here.

And to KevinADC...I apologize for the annoyances of asking for so much code. I am enrolled to take a more advanced Perl class that starts this next fall, but I was just attempting to jump-start my research as much as possible. I've been looking for a colleague to hire who's not busy with other work, but it's been hard (not much of this campus knows Perl). I work at UW-Madison, where Java and C++ are the languages of choice. Again, my apologies, and thank you for your patience.

-akreibich07