Perl issue - please help!

akreibich07 · May 20, 2009, 2:58pm

Hello. I've been writing some Perl code to read strings from HTML files and have run into a problem. In the HTML file, each "paragraph" describes one file on the website. I need to find every file of a certain type, in this case the ones shown in green, which means the row has bgcolor=#ddffff. I can find those rows, but my match only returns the single line that contains the color. I need my code to return the entire paragraph, because the string I actually want is inside each paragraph that contains #ddffff, usually approx. 7 lines below it. Example:

</tr>
<tr bgcolor="#ddffff"><td><a target=_top href=http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=34305><font color="green" size=-1>Lotus japonicus</font></a></td>
<td><font size=-1>&nbsp;</font></td>
<td><a target=_top href=http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=Retrieve&dopt=Overview&list_uids=15617><font size=-1>NC_002694</font></a></td>
<td><font size=-1>150519&nbsp;nt&nbsp;</font></td>
<td><a target=_top href=http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=Retrieve&dopt=Protein+Table&list_uids=15617><font size=-1>82</font></a></td>
<td><a target=_top href=http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=Retrieve&dopt=Structural+RNA+Table&list_uids=15617><font size=-1>45</font></a></td>
<td><a target=_top href=http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gene&Cmd=Search&TermToSearch=NC_002694[accn]><font size=-1>128</font></a></td>
<td><font size=-1>Mar 1 2001</font></td>
<td><font size=-1>Jan 30 2008</font></td>

This is one of the "paragraphs" I need, because it does in fact have bgcolor="#ddffff". From this paragraph, I then need to extract and print the NC_'number' that is in the middle of it. How do I do this when matching on "#ddffff" only returns the line that text is on? Any help would be great!

doutdes · May 20, 2009, 3:27pm

Well, one possible workaround is to first read the first paragraph, save it in an array, and then run the grep on that array.

If the grep returns nothing, fill the array with the next paragraph and repeat the grep.

Doing this in a for loop that goes through all the lines seems feasible.

Just a thought :rolleyes:
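
A rough sketch of that idea, not doutdes's actual code: collect the lines of one row into an array, and grep that array whenever the next row starts. The sample lines, the helper name nc_from_paragraphs, and the use of "<tr" as the paragraph boundary are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample input standing in for the real HTML file.
my @lines = (
    qq{<tr bgcolor="#ddffff"><td>Lotus japonicus</td>\n},
    qq{<td><font size=-1>NC_002694</font></td>\n},
    qq{</tr>\n},
    qq{<tr bgcolor="#ffffff"><td>Other species</td>\n},
    qq{<td><font size=-1>NC_000001</font></td>\n},
    qq{</tr>\n},
);

# Hypothetical helper: return the first NC_ accession from every
# paragraph (table row) that contains the given color.
sub nc_from_paragraphs {
    my ($color, @input) = @_;
    my (@found, @paragraph);
    # The trailing "<tr>" is a sentinel that flushes the last paragraph.
    for my $line (@input, "<tr>") {
        if ($line =~ /<tr\b/i && @paragraph) {
            # A new row starts: grep the paragraph collected so far.
            if (grep(/\Q$color\E/i, @paragraph)) {
                my ($nc) = join('', @paragraph) =~ /\b(NC_\d+)\b/;
                push @found, $nc if defined $nc;
            }
            @paragraph = ();
        }
        push @paragraph, $line;
    }
    return @found;
}

my @nc = nc_from_paragraphs('#ddffff', @lines);
print "$_\n" for @nc;
```

With the real file, the lines read from the filehandle would be passed in place of @lines.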

akreibich07 · May 20, 2009, 5:04pm

#!/usr/bin/perl
use strict;
use warnings;
my $data_file = 'genomehtml2.txt';
open DATA, '<', $data_file or die "can't open $data_file: $!";
my @array_of_lines = <DATA>;
foreach my $line (@array_of_lines)
{
    # matches only the single line that contains the color
    if ($line =~ m/#ddffff/i)
    {
        print "This line: $line\n";
    }
}
close(DATA);

This is what I have so far...and this is returning the first line of each paragraph that has "#ddffff". I just don't know where to put in the code to get the NC numbers...I also have some code I've tried using grep:

#!/usr/bin/perl
use strict;
use warnings;
my $data_file = 'genomehtml2.txt';
open DATA, '<', $data_file or die "can't open $data_file: $!";
my @array_of_lines = <DATA>;
my @grepColor = grep(/#ddffff/, @array_of_lines);
my @grepFiles = grep(/NC_/, @array_of_lines);

I'm less sure where to go with this one... any coding ideas?

quine · May 20, 2009, 7:16pm

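I believe the construct <DATA> only returns ONE line at a time when it is used in scalar context (assigned to an array, as in your code, it returns all the lines at once)...

To work through the file line by line, put it in a loop...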

$state = 0;
$line = <DATA>;
$state = 1 if $line =~ /#ddffff/i;
while (<DATA>){
    $keep_line = $_ if ($state && $_ =~ /NC_/);
    # Now do something with $keep_line to persist it...
    $state = 1 if $_ =~ /#ddffff/i;
    $state = 0 if $_ =~ /^\s*$/;   # a blank line ends the paragraph
}

Assuming blank lines between paragraphs, set a $state variable to 1 (some TRUE value) if you encounter a /#ddffff/ line... With the state set to TRUE, look for your NC_ pattern and save the line. At the next blank line, set the state back to zero so that the next paragraph will NOT be searched unless it also contains #ddffff. There's probably a more elegant way to do it, but this should get you started...
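
If the paragraphs really are separated by blank lines, as assumed here, one possibly more elegant option is Perl's paragraph mode: setting $/ to the empty string makes each read return a whole paragraph instead of a single line. This is a different technique from the state variable, shown only as a sketch. The sample text, the in-memory filehandle, and the @keep array are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample text with a blank line between paragraphs.
my $text = <<'END';
<tr bgcolor="#ddffff"><td>Lotus japonicus</td>
<td><font size=-1>NC_002694</font></td>

<tr bgcolor="#ffffff"><td>Other species</td>
<td><font size=-1>NC_000001</font></td>
END

# Read from the string as if it were a file (in-memory filehandle).
open my $fh, '<', \$text or die "can't open in-memory file: $!";

my @keep;
{
    local $/ = '';   # paragraph mode: one read returns one paragraph
    while (my $para = <$fh>) {
        next unless $para =~ /#ddffff/i;            # skip other colors
        push @keep, $1 if $para =~ /\b(NC_\d+)\b/;  # first NC_ number
    }
}
close $fh;

print "$_\n" for @keep;
```

Note that the HTML sample in the first post does not show blank lines between rows, so this only applies if the real file has them.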

KevinADC · May 20, 2009, 7:25pm

Your html sample is pretty small, but see how this works:

my @NC = ();
my $data_file = 'genomehtml2.txt';
open (my $IN, $data_file) or die "can't open $data_file $!";
OUTTER: while(<$IN>){
   if(/<tr bgcolor="#ddffff">/){
      INNER: while(<$IN>) {
         if(/\b(NC_\d+)\b/){
            push @NC, $1;
            next OUTTER;
         }
      }
   }
}
print "$_\n" for @NC;

akreibich07 · May 20, 2009, 9:03pm

Hey, thanks guys. KevinADC, I just tried your code and it worked great, but could you explain what exactly you were thinking when you put it together? I'd like to know as a learning experience. I understand most of it, but a full description would be great. Thanks!

ghostdog74 · May 20, 2009, 9:13pm

Another way:

my @NC = ();
my $f = 0;   # flag: 1 while inside a row with the target color
my $data_file = 'file';
open (my $IN, $data_file) or die "can't open $data_file $!";
while(<$IN>){
   if(/<tr bgcolor="#ddffff">/){ $f=1;}
   if ($f && /\b(NC_\d+)\b/){
     push @NC, $1;
     $f=0;   # stop searching until the next matching row
   }
}
print "$_\n" for @NC;

KevinADC · May 21, 2009, 12:09am

The code I posted is very simple if you know what a label is (OUTTER and INNER in the code I posted). If the code finds the first pattern, it enters the inner "while" and searches each line until it finds the second pattern. When it does, it pushes the match into the array and then starts again in the outer loop (next OUTTER).

It's very similar to what ghostdog posted using a binary flag ($f), but the way I did it is "flagless".
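
For anyone new to labels, here is a minimal sketch of "next LABEL" on its own, separate from the HTML problem. The ROW label and the sample numbers are made up for illustration: the inner loop looks for the first even number in each row, then jumps straight to the next pass of the outer loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up data: three rows of numbers.
my @rows = ([1, 2, 3], [4, 5, 6], [7, 8, 9]);
my @first_even;

ROW: for my $row (@rows) {
    for my $n (@$row) {
        if ($n % 2 == 0) {
            push @first_even, $n;
            next ROW;   # skip the rest of this row, go on to the next row
        }
    }
}

print "@first_even\n";   # prints "2 4 8"
```

Without the label, a plain "next" would only advance the inner loop.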