I have a script that processes a fair amount of data -- say, 25-50 megs per run. I'd like ideas on speeding it up. The code is actually just a preprocessor -- I'm using another language to do the heavy lifting. But as it happens, the preprocessing takes much more time than the final processing, so I'm optimizing this rather than that.
Here's the code. The basic idea is that, for each line of input (redirected to stdin), the program checks to see if the sequence number is in $mult and, if so, prints a line asking the other program to validate that sequence:
#!/usr/bin/perl -w
open(MULT, "mult.txt") or die("Can't find list of multiplicative sequences in mult.txt");
my $terminator = $/;
undef $/;
$mult = <MULT>;
$/ = $terminator;
# Print application-specific code -- snipped for brevity
$total = 0;
while (<>) {
    if (m/(A\d\d\d\d\d\d) ,((-?\d+,)*-?\d+),/) {
        $nm = $1;
        $seq = $2;
        if ($mult =~ /$nm/) { # Replace this line?
            print "go(\"$nm\", [$seq]);\n";
            $total++;
        }
    } else {
        print "print(\"Error reading line: $_\");\n";
    }
}
# Print application-specific code -- snipped for brevity
The file mult.txt is a short file of about a thousand lines, each of which is guaranteed to contain at most (exactly?) one identifier of the form A\d\d\d\d\d\d; the rest of the line is irrelevant here.
My thought for optimizing this: make an array of the \d\d\d\d\d\d values, sort it, and do a binary search rather than a regular expression at the spot marked "Replace this line?". But I'm not sure how to go about that, or even whether that's the 'right' optimization. Thoughts?
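For what it's worth, here's roughly what I imagine the alternative looking like if a hash were used instead of a sorted array -- just a sketch, not tested against my real data, and the mult.txt parsing assumes the A-numbers really do match /A\d{6}/ as described above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a hash keyed on the A-numbers found in mult.txt, one per line.
# Hash membership is an O(1) lookup, so each input line no longer
# rescans the entire slurped file with a regex.
my %mult;
open my $fh, '<', 'mult.txt' or die "Can't open mult.txt: $!";
while (my $line = <$fh>) {
    $mult{$1} = 1 if $line =~ /(A\d{6})/;
}
close $fh;

my $total = 0;
while (<>) {
    if (m/(A\d{6}) ,((?:-?\d+,)*-?\d+),/) {
        my ($nm, $seq) = ($1, $2);
        if (exists $mult{$nm}) {    # hash lookup replaces $mult =~ /$nm/
            print "go(\"$nm\", [$seq]);\n";
            $total++;
        }
    } else {
        print "print(\"Error reading line: $_\");\n";
    }
}
```

If that's right, there'd be no need to sort or binary-search at all -- the hash does the membership test directly.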
Also, any suggestions on making better idiomatic use of Perl would be appreciated. I'm not at all accustomed to the language.
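For instance, I gather the slurp at the top is more commonly written with a lexical filehandle, three-argument open, and a local'ized $/ inside a do-block, which restores the terminator automatically -- but I'm guessing at the idiom here:

```perl
use strict;
use warnings;

# Slurp mult.txt in one read. 'local $/' undefines the input record
# separator only for the duration of this do-block, so there is no
# need to save $/ in a temporary and restore it by hand afterwards.
my $mult = do {
    open my $fh, '<', 'mult.txt'
        or die "Can't find list of multiplicative sequences in mult.txt: $!";
    local $/;
    <$fh>;
};
```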