Perl Array / pattern match large CPU usage

Donkey25 · October 5, 2009, 11:01am

Hi,

I have one file in this format

20  value1
33   value2
56   value3

I have another file in this format:

34,30-SEP-09,57,100237775,33614510126,2,34
34,30-SEP-09,57,100237775,33620766654,2,34
34,30-SEP-09,108,100237775,33628458122,2,34
34,30-SEP-09,130,100237775,33635266741,2,254
34,30-SEP-09,135,100237775,33634650517,2,254
34,30-SEP-09,149,100237775,33660985888,2,34

What I want to do is for each value in the first column of the first file (lets call it x) I want to print the last column in the 2nd file if the 5th column begins with x.

So for the above, the first value in the first column of the first file is 20. None of the column4 values in the 2nd file begin with 20 so I don't want to print anything. For 33 from the first file I would want to print the value in the last column of the 2nd file for each line e.g 34,34,34,254,254,34.

I have this, it works but my CPU usage for this process is about 30%! Is there a way to make this more efficient?

#!/usr/local/bin/perl

use lib "/usr/local/include/modules";
use Getopt::Long;
use File::Copy;
use JDFunction;
use JDProcess;
use JDInput;
#use strict;
use FileHandle;
use IO;
use POSIX qw(strftime);
use Time::Local;

$file_name="/export/home/file2";
open(DATA, $file_name) || die ("Could not open file2!");
@my_data=<DATA>;
close(DATA);

$first_file="/export/home/file1";
open $FIRST_DATA, "<", $first_file or die "Could not open first data file";
while ( $line = <$FIRST_DATA> )
{
chomp ($line);
($code, $value) = split(/\t/, $line);

                foreach $line (@my_data)
                {
                chomp($line);
                ($id,$day,$ref,$valuey,$valuez,$evt_type,$subcode)=split(/,/,$line);

                        if ( $valuez =~ /^$code/ )
                        {
                        print "$code, $subcode\n";
                        }
                }

}
close $FIRST_DATA;

If you know a way to make this more efficient I'd be grateful!

Thanks

thegeek · October 5, 2009, 11:38am

It is not working straight forward.
After doing a couple of changes like, changing the extracting the first 2 chars from valuez, instead of whole valuez, and placing valuez in the regex.
Your program does not seem to take like that much process ?!
If it takes also, what is the issue ?
If it takes time, you can ask for optimization -- it is not so ?
I think you are missed something while extracting the required code or some misunderstanding is there ?

Donkey25 · October 5, 2009, 12:04pm

Hi,

Thanks for your response.

I can't extract the first 2 characters of valuez as sometimes the value in the first field of the first file can be up to 4 digits long.

The 2nd file I'm processing is quite big - around 64,000 files.

I'm not sure what you mean by asking for optimisation - is that a perl option?

Thanks again for your help

radoulov · October 5, 2009, 3:05pm

What you describe could be easily achieved with the following code, try to see if it's faster.

#! /usr/bin/env perl

use warnings;
use strict;

my %f1;

die "usage: $0 <file1> <file2>\n" unless @ARGV == 2;

my ( $f1, $f2 ) = @ARGV;

open my $F1, '<', $f1 or die "open $f1: $!\n";

$f1{ (split)[0] } = undef while <$F1>;

close $F1 or warn "close $f1: $!\n";

open my $F2, '<', $f2 or die "open $f2: $!\n";

while (<$F2>) {
    my @fields = split ',';
    grep $fields[4] =~ /^$_/, keys %f1 and print $fields[-1];
}

close $F2 or warn "close $f2: $!\n";

jp2542a · October 5, 2009, 8:22pm

Put this into a file called other.awk

(NR == FNR) {
        a[$1] = $1
        print
}

(NR != FNR) {
        split($0, b,",")
        for( i in a)
        {
        pat = "^" a
        if (  b[5] ~ pat)
         print b[7]
        }
}

Then use this command line prototype:

awk -f other.awk <value-file> <data-file>

Donkey25 · October 7, 2009, 12:00pm

Hi,

Radoulov, I'm not quite sure how that code works. I'm trying to modify it so I can also print the first values from File1 (you seem to have entered these into a hash and then use this hash to search the second file).

Can you please tell me how I can modify it to print the fields[-1] value and the keys value?

Many thanks

radoulov · October 7, 2009, 2:19pm

Like this?

#! /usr/bin/env perl

use warnings;
use strict;

my %f1;

die "usage: $0 <file1> <file2>\n" unless @ARGV == 2;

my ( $f1, $f2 ) = @ARGV;

open my $F1, '<', $f1 or die "open $f1: $!\n";

$f1{ (split)[0] } = undef while <$F1>;

close $F1 or warn "close $f1: $!\n";

open my $F2, '<', $f2 or die "open $f2: $!\n";

while (<$F2>) {
    my @fields = split ',';
    map { $fields[4] =~ /^$_/ and print $_, " ", $fields[-1] } keys %f1;
}

close $F2 or warn "close $f2: $!\n";