Perl sum really inefficient!!

Hi all,

I have a file like the following:

ID,
2,Andrew,0,1,2,3,4,2,5,6,7,7,9,3,4,5,34,3,2,1,5,6,78,89,8,7,6......................
4,James,0,6,7,0,5,6,4,7,8,9,6,46,6,3,2,5,6,87,0,341,0,5,2,5,6....................
END,

(there are more entries on each line, but to keep it simple I've left them off).

What I want to do is sum every other value after the name on each line, e.g. for Andrew I want to sum 0,2,4,5 etc., then sum the others, e.g. 1,3,2,6 etc., and then print out the ID value, the name and the two totals.

e.g. 2,Andrew,164,133

I currently have the following:

$input_file3="$results_path/count_file.csv";
open(DAT3, $input_file3) || print "Could not open count file!";
@raw_data3=<DAT3>;
close(DAT3);

foreach $line (@raw_data3)
{
chop($line);
($VAR,$Name,$S1,$F1,$S2,$F2,$S3,$F3,$S4,$F4,$S5,$F5,$S6,$F6,$S7,$F7,$S8,$F8,$S9,$F9,$S10,$F10,$S11,$F11,$S12,$F12,$S13,$F13,$S14,
$F14,$S15,$F15,$S16,$F16,$S17,$F17,$S18,$F18,$S19,$F19,$S20,$F20,$S21,$F21,$S22,$F22,$S23,$F23,$S24,$F24)=split(/,/,$line);

if ($VAR eq "ID" || $VAR eq "END")
{
`echo "ignoring this line"`
}
else
{
$suc = $S1 + $S2 + $S3 + $S4 + $S5 + $S6 + $S7 + $S8 + $S9 + $S10 +$S11 + $S12 + $S13 + $S14 + $S15 + $S16 + $S17 + $S18 + $S19
+ $S20 + $S21 + $S22 + $S23 + $S24;

$fail = $F1 + $F2 + $F3 + $F4 + $F5 +$F6 + $F7 + $F8 + $F9 + $F10 + $F11 + $F12 + $F13 + $F14 + $F15 + $F16 + $F17 + $F18 +$F19 +
$F20 + $F21 + $F22 +$F23 +$F24;

`echo "$CC,$Name,$suc,$fail" >> $tmp_path/suc_and_fail`;
}
}

The above works but it consumes a huge amount of memory and about 25% of my CPU for about 20 mins! The input files are quite big (approx 30,000 lines). Is there a more efficient way to do the above?

Thanks!

First, for future reference, please put your formatted code inside [code] tags.

Second, there are quite a few things wrong with your code.
open(DAT3, $input_file3) || print "Could not open count file!";
@raw_data3=<DAT3>;
close(DAT3);
Instead of reading the whole file at once, process it line by line. This will save you a huge amount of memory and time (since the OS won't have to allocate all that memory).

($VAR,$Name,$S1,$F1,$S2,$F2,$S3,$F3,$S4,$F4,$S5,$F5,$S6,$F6,$S7,$F7,$S8,$F8,$S9,$F9,$S10,$F10,$S11,$F11,$S12,$F12,$S13,$F13,$S14,
$F14,$S15,$F15,$S16,$F16,$S17,$F17,$S18,$F18,$S19,$F19,$S20,$F20,$S21,$F21,$S22,$F22,$S23,$F23,$S24,$F24)=split(/,/,$line);

Why don't you just split into an array? That way your code would still work if you ever need more fields, without needing a rewrite.
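Something along these lines (a minimal sketch; $id, $name, and @values are just assumed names):

my ($id, $name, @values) = split(/,/, $line);  # everything after the name lands in @values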

$suc = $S1 + $S2 + $S3 + $S4 + $S5 + $S6 + $S7 + $S8 + $S9 + $S10 +$S11 + $S12 + $S13 + $S14 + $S15 + $S16 + $S17 + $S18 + $S19
+ $S20 + $S21 + $S22 + $S23 + $S24;

$fail = $F1 + $F2 + $F3 + $F4 + $F5 +$F6 + $F7 + $F8 + $F9 + $F10 + $F11 + $F12 + $F13 + $F14 + $F15 + $F16 + $F17 + $F18 +$F19 +
$F20 + $F21 + $F22 +$F23 +$F24;

See above, with an array those could be reduced to two for loops (for maintainability)
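For example, building on the @values split sketched above (untested, and it assumes all the fields after the name are numeric):

my ($suc, $fail) = (0, 0);
for (my $i = 0; $i < @values; $i += 2) {
    $suc += $values[$i];    # 1st, 3rd, 5th, ... value after the name
}
for (my $i = 1; $i < @values; $i += 2) {
    $fail += $values[$i];   # 2nd, 4th, 6th, ... value after the name
}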

`echo "$CC,$Name,$suc,$fail" >> $tmp_path/suc_and_fail`

This way, Perl has to create a shell process which runs echo, has to open the file for appending, and close it again. If you open the file inside Perl before you start processing, write directly to it, and close it afterwards you'll probably shave off even more seconds.
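Roughly like this (reusing the variables from your own script):

# before the loop:
open(my $out, ">", "$tmp_path/suc_and_fail") or die "Could not open output file: $!";
# inside the loop, instead of the echo:
print $out "$CC,$Name,$suc,$fail\n";
# after the loop:
close($out);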

Or you can give awk a try: :rolleyes:

awk -F, '$1=="ID" || $1=="END" {next}
{
  for (i=3; i<=NF; i++) {
    if (i%2) {s1+=$i} else {s2+=$i}
  }
  print $1","$2","s1","s2; s1=s2=0
}' file

Many thanks for your response. Point noted on the code tags; your post is much more readable than mine!

How would I go about processing that file one line at a time rather than reading it all in at once?

Thanks Again

Simply put:

open my $fh, "<", "file" or die "Couldn't open file: $!";
while (my $line = <$fh>) {
    chomp $line;
    # Do whatever you have to
}
close $fh;
Or, applied to your script:

use strict;
use warnings;
my $tmp_path = 'path/to/tmp_dir';
my $results_path = 'path/to/results_dir';
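# note: $suc and $fail are declared once, outside the read loop, so the totals accumulate across all input lines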
my ($suc,$fail) = (0,0);
my $CC = 'whatever';
my $input_file3 = "$results_path/count_file.csv";
open(my $IN, "<", $input_file3) or die "Could not open count file: $!";
open(my $OUT, ">", "$tmp_path/suc_and_fail") or die "Could not open suc_and_fail file: $!"; 
while (my $line = <$IN>){
   chomp($line);
   my @t = split(/,/,$line);
   next if ($t[0] eq "ID" || $t[0] eq "END");
   for (my $i = 2; $i < $#t; $i+=2){
      $suc += $t[$i];
   }
   for (my $j = 3; $j <= $#t; $j+=2){
      $fail += $t[$j];
   }
   print $OUT "$CC,$t[1],$suc,$fail\n";
}

@kevin, your sum for fail seems different from franklin's awk result. Please confirm.

@OP, if Perl is not a must, here's an alternative in Python:

#!/usr/bin/python
cc = "whatever"
for line in open("file"):
    # skip the ID header and END footer lines
    if not (line.startswith("ID") or line.startswith("END")):
        fields = line.strip().split(",")
        tag, rest = fields[:2], fields[2:]
        # rest[0::2] is every other value starting at the first; rest[1::2] starts at the second
        print "%s,%s,%s,%s" % (cc, ','.join(tag), sum(map(int, rest[0::2])), sum(map(int, rest[1::2])))

output:

# ./test.py
whatever,2,Andrew,164,133
whatever,4,James,52,520

open my $fh, "<", "yourfile" or die "Couldn't open yourfile: $!";
while (<$fh>) {
	chomp;
	my @tmp = split(",", $_);
	next if $tmp[0] eq "ID" or $tmp[0] eq "END";
	my (@t1, @t2, $s1, $s2);
	my @temp = @tmp[2..$#tmp];
	# peel the values off in pairs: @t1 gets the 1st, 3rd, 5th, ... value and @t2 the 2nd, 4th, 6th, ...
	while ($#temp >= 0) {
		push @t1, (shift @temp);
		push @t2, (shift @temp);
	}
	map {$s1 += $_} @t1;
	map {$s2 += $_} @t2;
	print $tmp[0], ",", $tmp[1], ",", $s1, ",", $s2, "\n";
}

My code looks like it is working properly. It assumes there are 48 number fields to sum (24 odd and 24 even). But maybe I am missing something.

use strict;
use warnings;
my ($suc,$fail) = (0,0);
my $CC = 'whatever';
while (my $line = <DATA>){
   chomp($line);
   my @t = split(/,/,$line);
   next if ($t[0] eq "ID" || $t[0] eq "END");
   for (my $i = 2; $i < $#t; $i+=2){
      $suc += $t[$i];
      print "<$suc> += $t[$i]\n";
   }
   for (my $j = 3; $j <= $#t; $j+=2){
      $fail += $t[$j];
      print "<$fail> += $t[$j]\n";
   }
   print "\n\n$CC,$t[1],$suc,$fail";
}
__DATA__
2,Andrew,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,19,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47

The OP was looking for more efficient code, not less efficient.

Thanks Pludi, that's sorted it out nicely :slight_smile: