Regular Expression matching in PERL

Legend986 · February 26, 2008, 1:49am

I am trying to read a file and capture particular lines into different strings:

LENGTH: Some Content here

TEXT: Some Content Here

COMMENT: Some Content Here

I want to be able to get (LENGTH: .... ) into one array and so on... I'm trying to use Perl in slurp mode but for some reason I'm having trouble. Can someone suggest a better way?

Yogesh_Sawant · February 26, 2008, 1:58am

Instead of slurping the whole file into one array and then filtering out the lines, run a loop over the file and read one line at a time. Here you can put the lines into their respective arrays by category, with the help of a regex.
Something like this would help you:

while (<>) {
    if (m/^\s*LENGTH:\s*/) {
        push (@length_array, $_);
    }
    elsif (m/^\s*TEXT:\s*/) {
        push (@text_array, $_);
    }
    elsif (m/^\s*COMMENT:\s*/) {
        push (@comment_array, $_);
    }
}

Legend986 · February 26, 2008, 2:01am

Thanks. Well, I have done that in my php version of the same code but heard that perl is really strong when it comes to regex so wanted to try out something new. Actually the problem is something like this:

LENGTH: ......................................................
..................................................................
..................................................................

...................................................................
..................................................................

SUBJECT: .......................................................

COMMENT: .....................................................
....................................................................

As you can observe, the data that I want is not limited to one line but rather spans multiple lines. Do you have any suggestion on how to solve this problem?

Yogesh_Sawant · February 26, 2008, 2:07am

the code that i posted above won't work, since what you want is something like a multi-line regex

Legend986 · February 26, 2008, 2:09am

Yes. Incidentally, my php version was almost similar to what you posted but I was just hoping there was a multi line solution to the problem. Do you have any suggestions please?

Yogesh_Sawant · February 26, 2008, 3:28am

check if this works for you:

{
    local $/;  # undefine the input record separator (slurp mode)
    $all_lines = <INPUT_FILE>;  # Slurp the whole file into a string
}
# /s lets . match newlines; the lookahead stops at the next keyword
# (without consuming it) or at the end of the string
while ($all_lines =~ m/LENGTH:(.*?)(?=SUBJECT:|COMMENT:|\z)/gs) {
    push (@length_array, $1);
}
while ($all_lines =~ m/SUBJECT:(.*?)(?=LENGTH:|COMMENT:|\z)/gs) {
    push (@subject_array, $1);
}
while ($all_lines =~ m/COMMENT:(.*?)(?=LENGTH:|SUBJECT:|\z)/gs) {
    push (@comment_array, $1);
}

idea is to slurp the file in a string instead of in an array, and then take out the required strings from it using regex
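
The same slurp-and-match idea can also be written as a single loop that handles every keyword at once. This is only a minimal sketch, assuming each field starts with its keyword and a colon at the beginning of a line; the sample text and the %fields hash are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice this would be the slurped file.
my $all_lines = <<'END';
LENGTH: first line
second line

third line
SUBJECT: a subject
COMMENT: a comment
spanning two lines
END

my %fields;  # keyword => list of captured values

# /m makes ^ match at every line start, /s lets . cross newlines, and the
# lookahead stops each value at the next keyword or at the end of input.
while ($all_lines =~ /^(LENGTH|SUBJECT|COMMENT):\s*(.*?)(?=^(?:LENGTH|SUBJECT|COMMENT):|\z)/msg) {
    my ($key, $value) = ($1, $2);
    $value =~ s/\s+\z//;              # drop trailing whitespace and newlines
    push @{ $fields{$key} }, $value;
}

for my $key (sort keys %fields) {
    print "$key:\n$_\n---\n" for @{ $fields{$key} };
}
```

Keeping the values in a hash of arrays means a new keyword only has to be added to the two alternations, not given its own loop.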

Legend986 · February 26, 2008, 3:37am

Actually I was doing something along similar lines:

$capture[0] = "LENGTH:";
$capture[1] = "COMMENT:";
$capture[2] = "BODY:";
$capture[3] = "AVATAR:";
$capture[4] = "POST:";
$capture[5] = "SUBJECT:";
$capture[6] = "DATE:";
$capture[7] = ""; 

open(DATA, "filename.txt");
$line = <DATA>;


if($line =~ /$capture[0](.*?)$capture[1]/sgm) {
        $solution[0] = $1;
}
if($line =~ /$capture[1](.*?)$capture[2]/sgm) {
        $solution[1] = $1;
}
if($line =~ /$capture[2](.*?)$capture[3]/sgm) {
        $solution[2] = $1;
}
if($line =~ /$capture[3](.*?)$capture[4]/sgm) {
        $solution[3] = $1;
}
if($line =~ /$capture[4](.*?)$capture[5]/sgm) {
        $solution[4] = $1;
}
if($line =~ /$capture[5](.*?)$capture[6]/sgm) {
        $solution[5] = $1;
}
if($line =~ /$capture[6](.*?)$capture[7]/sgm) {
        $solution[6] = $1;
}


print trim($solution[0])."\n";
print trim($solution[1])."\n";
print trim($solution[2])."\n";
print trim($solution[3])."\n";
print trim($solution[4])."\n";
print trim($solution[5])."\n";
print trim($solution[6])."\n";

For some reason, it prints only the odd number of lines or even number of lines depending on how the ordering is. Well, I see why that is happening but not sure how to solve it... Anyways I will try to incorporate your logic now...

EDIT: Well, it works like a charm if I embed your logic into mine; I just changed the if into a while... Great! Thanks a lot for your help...
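
The odd/even behaviour most likely comes from the /g modifier. In scalar context a successful /g match remembers where it ended, and the next /g match on the same string starts from that position. Each pattern with a consuming terminator swallows the next keyword, so the following pattern cannot find its own keyword, fails, and resets the position. Below is a small sketch of that effect and of one possible fix (dropping /g and using a lookahead terminator); the sample string and variable names are made up:

```perl
use strict;
use warnings;

my $text  = "A: one B: two C: three D:";
my @pairs = (['A:', 'B:'], ['B:', 'C:'], ['C:', 'D:']);

# With /g, each successful match leaves pos($text) just after the
# terminator, so the next pattern starts searching past its own keyword.
my @with_g;
for my $pair (@pairs) {
    my ($start, $end) = @$pair;
    if ($text =~ /\Q$start\E\s*(.*?)\s*\Q$end\E/sg) {
        push @with_g, $1;
    }
}

# Without /g every match starts at the beginning of the string, and the
# lookahead leaves the terminating keyword unconsumed.
my @without_g;
for my $pair (@pairs) {
    my ($start, $end) = @$pair;
    if ($text =~ /\Q$start\E\s*(.*?)\s*(?=\Q$end\E)/s) {
        push @without_g, $1;
    }
}

print "with /g:    @with_g\n";
print "without /g: @without_g\n";
```

Changing each if into a while also works, because the loop runs until the match fails, which resets the position before the next pattern is tried.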

Legend986 · February 26, 2008, 3:43am

The final thing I was wondering was, how would I have to alter the code to make it work on something that has a lot of such patterns. I mean a series of SUBJECT, COMMENT, LENGTH::SUBJECT, COMMENT, LENGTH and each being regarded as one chunk...

Feliix1956 · May 27, 2009, 3:16pm

I know it's quite late to reply, but this is how I would do what is described here:

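#open file, read only
open(DATA, "<filename.txt") or die "Cannot open filename.txt: $!";

#open the output files for writing
open(SUBJECT, ">subject.txt") or die "Cannot open subject.txt: $!";
open(COMMENT, ">comment.txt") or die "Cannot open comment.txt: $!";
open(LENGTH, ">length.txt") or die "Cannot open length.txt: $!";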

my $filetoprint = "";

#start a run through the file
while(<DATA>)
{
 #grab next line
 my $line = $_;
 # trim line breaks from $line and return it to the variable
 chomp($line);

 # Check start of line
 if ($line =~ m/^SUBJECT(.+)/)
 {
  # set variable indicator to Subject
  $filetoprint = "Subject";
  # remove first word from $line by passing the matched portion back into it
  $line = $1;
 }

 # Check start of line
 if ($line =~ m/^LENGTH(.+)/)
 {
  # set variable indicator to Length
  $filetoprint = "Length";
  # remove first word from $line by passing the matched portion back into it
  $line = $1;
 }


 # Check start of line
 if ($line =~ m/^COMMENT(.+)/)
 {
  # set variable indicator to Comment
  $filetoprint = "Comment";
  # remove first word from $line by passing the matched portion back into it
  $line = $1;
 }

 # if a header has been matched (on this line or an earlier one), print the line to the appropriate file
 if ($filetoprint eq "Subject") {print SUBJECT "$line\n";}
 if ($filetoprint eq "Comment") {print COMMENT "$line\n";}
 if ($filetoprint eq "Length")  {print LENGTH  "$line\n";}

}

close SUBJECT;
close COMMENT;
close LENGTH ;

Hope this helps anyone with a similar problem. You can also add a "terminating" string: write a regular expression match for the desired character/string, print anything on the line leading up to the match into the output file so it isn't lost, and then set $filetoprint back to "".

To discern between one block and another, you could add a variable that you increase by 1 each time you match a new chunk indicator (for example a subject line); then you could add the number to the beginning of the line in the output file.

An advanced version might be to store the data in an array of hashes, reference the array by the number that iterates while reading the file and store the data from each line in the named part of the hash corresponding to the data type. eg in pseudo code:

if ($filetoprint eq "detail")
{
 # append the line to the "detail" element of the current hash in the array
 $arrayofhashes[$i]{detail} .= "$line\n";
}
etc

then you can count the array and print out in the format you want for webmail or forum software
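
A runnable sketch of the array-of-hashes idea, which also keeps repeated chunks apart. It assumes that a SUBJECT line marks the start of a new chunk; that choice, the sample data, and the names @records and $current are assumptions for illustration only:

```perl
use strict;
use warnings;

# Hypothetical sample input holding two chunks.
my @lines = split /\n/, <<'END';
SUBJECT: first subject
COMMENT: first comment
continued on a second line
LENGTH: 42
SUBJECT: second subject
COMMENT: second comment
LENGTH: 7
END

my @records;   # one hash per chunk
my $current;   # name of the field currently being filled

for my $line (@lines) {
    my $text = $line;
    if ($line =~ /^(SUBJECT|COMMENT|LENGTH):\s*(.*)/) {
        my ($field, $rest) = ($1, $2);
        # Assumption: a SUBJECT line starts a new chunk.
        push @records, {} if $field eq 'SUBJECT' || !@records;
        $current = $field;
        $text    = $rest;                  # keep only the text after the keyword
    }
    next unless defined $current;          # skip text before the first keyword
    $records[-1]{$current} .= "$text\n";   # append to the current field
}

# Print every field, prefixed with the chunk number.
for my $i (0 .. $#records) {
    for my $field (sort keys %{ $records[$i] }) {
        print "$i $field: $records[$i]{$field}";
    }
}
```

Here the index into @records plays the role of the chunk counter, so no separate counter variable is needed.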