Speed up this script!

I have a script that processes a fair amount of data -- say, 25-50 megs per run. I'd like ideas on speeding it up. The code is actually just a preprocessor -- I'm using another language to do the heavy lifting. But as it happens, the preprocessing takes much more time than the final processing so I'm optimizing this rather than that.

Here's the code. The basic idea is that, for each line of input (redirected to stdin), the program checks to see if the sequence number is in $mult and, if so, prints a line asking the other program to validate that sequence:

#!/usr/bin/perl -w

open(MULT, "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
my $terminator = $/;
undef $/;
$mult = <MULT>;
$/ = $terminator;

# Print application-specific code -- snipped for brevity

$total = 0;
while(<>) {
	if (m/(A\d\d\d\d\d\d) ,((-?\d+,)*-?\d+),/) {
		$nm = $1;
		$seq = $2;
		if ($mult =~ /$nm/) { # Replace this line?
			print "go(\"$nm\", [$seq]);\n";
			$total++;
		}
	} else {
		print "print(\"Error reading line: $_\");\n";
	}
}

# Print application-specific code -- snipped for brevity

The file mult.txt is a short file of about a thousand lines, each of which is guaranteed to contain at most (exactly?) one string of the form A\d\d\d\d\d\d; the rest of the line is irrelevant here.

My thought for optimizing this: make an array of the \d\d\d\d\d\d values, sort, and do a binary search rather than a regular expression at the spot marked "Replace this line?". But I'm not sure how to go about that, or even if that's the 'right' optimization. Thoughts?

Also, any suggestions on making better idiomatic use of Perl would be appreciated. I'm not at all accustomed to the language.

Create a hash of arrays - each array being one line of your mult.txt file.

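You are searching about 1000 entries with a regex for every line of stdin. A regex match against one big string is a linear scan: on average it has to pass roughly 500 entries before it finds the ID, and all 1000 when the ID is not there. A hash lookup takes about the same time no matter how many entries there are.

Here is Programming Perl's take on what you want to do: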
Hashes of Arrays (Programming Perl)
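
To make the hash idea concrete, here is a minimal sketch. It uses a plain hash keyed on the A-number rather than a full hash of arrays, since only membership matters here. The sample text, the %is_mult hash and the is_multiplicative() helper are names made up for illustration, and the in-memory filehandle just stands in for mult.txt so the snippet runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative stand-in for mult.txt; the text after each ID is made up.
my $mult_text = <<'END';
A000005 number of divisors
A000010 Euler totient
some line with no ID
A000203 sum of divisors
END

# Build the lookup table once: one hash key per A-number found.
my %is_mult;
open(my $fh, '<', \$mult_text) or die "Can't open in-memory sample: $!";
while (my $line = <$fh>) {
    $is_mult{$1} = 1 if $line =~ /(A\d{6})/;   # rest of the line is ignored
}
close($fh);

# Each membership test is now one hash lookup instead of a scan.
sub is_multiplicative {
    my ($id) = @_;
    return exists $is_mult{$id} ? 1 : 0;
}

for my $id (qw(A000010 A000040)) {
    printf "%s %s\n", $id, is_multiplicative($id) ? "listed" : "not listed";
}
```

In the real script the open() would point at mult.txt, and the line marked "Replace this line?" would become a test like exists $is_mult{$nm}.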

OK, I'll try that.

This one is not much of a speed thing, however:

my $terminator = $/;
undef $/;
$mult = <MULT>;
$/ = $terminator;

According to 'man perlvar' this is a no-no...
The proper method is to keep the change local to the smallest block possible with local($/), i.e.:

{  # Begin localization block
  local($/);
  $mult = <MULT>;
}  # End localization block
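
A small self-contained sketch of the same idea, in case it helps to see that $/ really is restored afterwards. The slurp() helper and the in-memory sample are assumptions for illustration only; in the real script the filehandle would be opened on mult.txt.

```perl
use strict;
use warnings;

# In-memory stand-in for mult.txt so the sketch runs on its own.
my $sample = "A123456\nA654321\nA024689\n";

# Hypothetical helper: read everything left on a filehandle in one go.
sub slurp {
    my ($fh) = @_;
    local $/;                 # undef only inside this sub; restored on return
    my $content = <$fh>;
    return $content;
}

open(my $fh, '<', \$sample) or die "Can't open sample: $!";
my $mult = slurp($fh);
close($fh);

# $/ is back to its default here, so line-by-line reads still work.
open(my $again, '<', \$sample) or die "Can't reopen sample: $!";
my @lines = <$again>;
close($again);

print "slurped ", length($mult), " bytes, then read ", scalar(@lines), " lines\n";
```

Because local is undone automatically when the block or sub exits, there is no need to save and restore a $terminator by hand.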

Hash it!

For a simple hash example, check out a recent thread of mine. It's simple, so hopefully easy to understand, and it is similar to your needs: Delete block of text in one file based on list in another file

Also, for better assistance, a snippet of 'mult.txt' and a snippet of your data would be very helpful in providing useful information.

-Enjoy
fh : )_~

---------- Post updated at 06:23 PM ---------- Previous update was at 12:14 AM ----------

Thought I would tweak this a bit for ya!

I am new to Perl; my first line of Perl was just over a week ago (08/26/2009).
Any comments are very welcome!

3 examples depending on what you really want/need!

Edit:

NOTE:
After some thought I felt it better to modify Example 2 for cases of dirty data...

I am ASSUMING your data looks something like:

A123456 ,789,543,MoreData
A654320 ,789,543,MoreData
A024689 ,789,543,MoreData

I am ASSUMING your mult.txt is something like this:

A123456
A654321
A024689
A987654

Example 1, As close to your original as possible without waste.

#!/usr/bin/perl

use strict;
use warnings;

my $total;
my $multfile;
my %multhash;

my @Atmp;                  # for debugging & education purposes

open($multfile, "<", "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
while (<$multfile>) {
  chomp;
  next if /^$/;            # skip blank lines
  $multhash{ $_ } = $_;    # add to hash, using element as the key & data
}
close($multfile);

@Atmp = (keys %multhash);  # for debugging & education purposes
print "@Atmp\n";           # for debugging & education purposes

$total = 0;
while(<>) {
  if (m/(A\d\d\d\d\d\d) ,((-?\d+,)*-?\d+),/) {
    if (exists $multhash{ $1 }) {
      print "go(\"$1\", [$2]);\n";
      $total++;
    }
  } else {
    print "print(\"Error reading line: $_\");\n";
  }
}
print "Total=$total\n";

Example 2, A bit cleaner

#!/usr/bin/perl

use strict;
use warnings;

my $total;
my $multfile;
my %multhash;

open($multfile, "<", "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
while (<$multfile>) {
  chomp;
  next if /^$/;            # skip blank lines
  $multhash{ $_ } = $_;    # add to hash, using element as the key & data
}
close($multfile);

$total = 0;
while(<>) {
  if (m/(A\d\d\d\d\d\d) ,((-?\d+,)*-?\d+),/ && exists $multhash{ $1 }) {
    print "go(\"$1\", [$2]);\n";
    $total++;
  }
}
print "Total=$total\n";

Example 3, Lean and mean with the need for speed!
NOTE: The regex changes!

#!/usr/bin/perl

use strict;
use warnings;

my $multfile;
my %multhash;

open($multfile, "<", "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
while (<$multfile>) {
  chomp;
  next if /^$/;            # skip blank lines
  $multhash{ $_ } = $_;    # add to hash, using element as the key & data
}
close($multfile);

while(<>) {
  # Test the match first so $1 and $2 are only used when this line matched
  print "go(\"$1\", [$2]);\n"
    if m/(A\d{6}) ,(\d+,\d+)/ && exists $multhash{ $1 };
}
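
One thing worth watching with the capture variables: a failed match does not clear $1 and $2, so within the same scope they still hold whatever the last successful match put there. The sketch below shows that, plus a hypothetical go_line() helper (my own name, not from the original script) that only touches the captures after the match has succeeded. It assumes the same simplified regex and sample data as above.

```perl
use strict;
use warnings;

my %multhash = (A123456 => 'A123456');
my $pattern  = qr/(A\d{6}) ,(\d+,\d+)/;

"A123456 ,789,543,MoreData" =~ $pattern;   # succeeds and sets $1
my $after_good = $1;

"garbage line" =~ $pattern;                # fails; $1 is NOT cleared
my $after_bad = $1;                        # still the value from the earlier match

# Hypothetical helper: returns the go(...) line, or nothing for lines
# that do not match or whose ID is not in the hash.
sub go_line {
    my ($line, $hash) = @_;
    return unless $line =~ $pattern;       # bail out before using $1/$2
    return unless exists $hash->{$1};
    return qq{go("$1", [$2]);\n};
}

for my $line ("A123456 ,789,543,MoreData\n", "garbage line\n") {
    my $out = go_line($line, \%multhash);
    print $out if defined $out;
}
```

Checking the match result before reading $1 costs nothing extra and avoids acting on leftover captures when a line is malformed.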

Hope this gets things going a bit faster for ya!

-Enjoy
fh : )_~