Summing column values - using Perl

I'm new to Perl programming. I have a CSV file as below.

20100221, abc_1, 200
20100221, abc_4, 350
20100221, opq_3, 200
20100221, abc_5, 220
20100221, xyz_1, 500
20100221, abc_2, 500
20100221, abc_3, 100
20100221, xyz_2, 700
20100221, opq_2, 350
20100221, xyz_3, 100
20100221, opq_1, 230

I'm trying to get the output below by adding up the values for matching names:

Output:

20100221, abc, 1370
20100221, xyz, 1300
20100221, opq, 780

I've started coding as below, but I'm not able to continue due to lack of experience.

use strict;
use warnings;

my $inputfilename  = $ARGV[0]; ### CSV file name. ###

open( my $in_fh, "<", $inputfilename ) or die "Can't open $inputfilename : $!";

while (my $line = <$in_fh>) {
  chomp($line);
  my @columns = split(',', $line, 2);

Could any of you help me accomplish this requirement?

Thanks and Regards / Lokesha

Try this

use strict;
use warnings;

my $inputfilename = "file1.txt";

open( my $in_fh, "<", $inputfilename ) or die "Can't open $inputfilename : $!";

my %store;
while (my $line = <$in_fh>) {
    chomp($line);
    # Capture the three comma-separated fields: date, name, count.
    $line =~ /^(.*?)\s*,(.*?)\s*,(.*?)\s*$/;
    my $no  = $1;
    my $cut = $2;
    my $cnt = $3;
    # Strip the trailing _<number> so abc_1, abc_2, ... all collapse to the same key.
    $cut =~ s/^(.*?)_[0-9]*$/$1/;
    $store{$no}{$cut} += $cnt;
}

foreach my $no ( keys %store ) {
    foreach my $cut ( keys %{ $store{$no} } ) {
        print "$no $cut $store{$no}{$cut}\n";
    }
}


One using awk:

awk -F, '{ A[$1 substr($2,1,4)] += $3 } END { for (i in A) print i, A[i] }' file
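Run against the sample data, that should print something along these lines (note it assumes three-letter names, since it takes a fixed four-character substring, and the order of awk's for (i in A) loop is not guaranteed):

20100221 abc 1370
20100221 xyz 1300
20100221 opq 780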

One in perl,

while (<DATA>) {
    @A = split /,|_/, $_;       # fields: date, name, index, count
    $h{"$A[0],$A[1]"} += $A[3];
}
while ( ($k, $v) = each %h ) { print "$k,$v\n"; }

__DATA__
20100221, abc_1, 200
20100221, abc_4, 350
20100221, opq_3, 200
20100221, abc_5, 220
20100221, xyz_1, 500
20100221, abc_2, 500
20100221, abc_3, 100
20100221, xyz_2, 700
20100221, opq_2, 350
20100221, xyz_3, 100
20100221, opq_1, 230
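If you want to run the same loop on the real CSV file instead of the __DATA__ block, a minimal variation, assuming the file name is passed as the first argument (as in the starter code at the top of the thread):

open my $fh, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";
while (<$fh>) {
    @A = split /,|_/, $_;       # same split as above: date, name, index, count
    $h{"$A[0],$A[1]"} += $A[3];
}
while ( ($k, $v) = each %h ) { print "$k,$v\n"; }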

Thanks to everyone for the replies and the different solutions provided.

I tried abubacker's solution and got the result below, with warning messages and one unwanted extra line:

Use of uninitialized value in substitution (s///) at GEN_Applications_Duration.pl line 17, <$in_fh> line 12.
Use of uninitialized value in hash element at GEN_Applications_Duration.pl line 18, <$in_fh> line 12.
Use of uninitialized value in hash element at GEN_Applications_Duration.pl line 18, <$in_fh> line 12.
Use of uninitialized value in addition (+) at GEN_Applications_Duration.pl line 18, <$in_fh> line 12.
 opq  0  ## Extra output line, which is unwanted ##
20100221  xyz 1300
20100221  opq 780
20100221  abc 1370

How do I eliminate the warnings and the extra output line? Also, the output should be comma-separated, as below:

20100221,  xyz, 1300

Could you please help me again?

I guess your file has some empty lines.

 
use strict;
use warnings;

my $inputfilename = "file1.txt";

open( my $in_fh, "<", $inputfilename ) or die "Can't open $inputfilename : $!";

my %store;
while (my $line = <$in_fh>) {
    chomp($line);
    if ( $line !~ /^$/ ) {       # skip empty lines
        $line =~ /^(.*?)\s*,(.*?)\s*,(.*?)\s*$/;
        my $no  = $1;
        my $cut = $2;
        my $cnt = $3;
        $cut =~ s/^(.*?)_[0-9]*$/$1/;   # abc_1, abc_2, ... -> abc
        $store{$no}{$cut} += $cnt;
    }
}

foreach my $no ( keys %store ) {
    foreach my $cut ( keys %{ $store{$no} } ) {
        print "$no $cut $store{$no}{$cut}\n";
    }
}

So please try this. If you're sure that your file does not have empty lines, then make sure every line has the proper format, e.g. 20100221, abc_3, 100.
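If you also want the comma-separated output you asked for (20100221, abc, 1370), one way, keeping the same variable names, is to trim the name and change the print inside the inner loop, for example:

foreach my $no ( keys %store ) {
    foreach my $cut ( keys %{ $store{$no} } ) {
        ( my $name = $cut ) =~ s/^\s+//;            # drop the leading space captured by the regex
        print "$no, $name, $store{$no}{$cut}\n";    # e.g. 20100221, abc, 1370
    }
}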

Wah wah!!! I'm thrilled.
Tons of thanks to abubacker, it works exactly as expected.

Adding a feature to this code: is it possible to handle multiple input files with different dates, as below?

input file name: input_20100221.csv
20100221, abc_1, 200
20100221, abc_4, 300
20100221, opq_3, 200
20100221, abc_5, 200
20100221, xyz_1, 500
20100221, abc_2, 500
20100221, abc_3, 100
20100221, xyz_2, 700
20100221, opq_2, 300
20100221, xyz_3, 100
20100221, opq_1, 200

input file name: input_20100222.csv
20100222, abc_1, 100
20100222, abc_4, 200
20100222, opq_3, 200
20100222, abc_5, 200
20100222, xyz_1, 100
20100222, abc_2, 200
20100222, abc_3, 100
20100222, xyz_2, 200
20100222, opq_2, 800
20100222, xyz_3, 600
20100222, opq_1, 700

input file name: input_20100224.csv
20100224, abc_1, 600
20100224, abc_4, 400
20100224, opq_3, 200
20100224, abc_5, 300
20100224, xyz_1, 300
20100224, abc_2, 200
20100224, abc_3, 700
20100224, xyz_2, 200
20100224, opq_2, 200
20100224, xyz_3, 900
20100224, opq_1, 800

Output should be exactly as below:

Rundate,  abc,   opq, xyz  ### header, only one time
20100221, 1300,  700, 1300
20100222,  800, 1700,  900
20100224, 2200, 1200, 1400

Cheers ~~

What if you read from stdin instead of a named file? I mean:

cat file1 file2 | perl -e '.......'
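Along those lines, here is a rough sketch that reads whatever files you name on the command line (or stdin, as above) and prints one row per date. It's untested against your real files; it assumes every non-empty line looks like date, name_n, count, and that the header columns are just the names sorted alphabetically:

use strict;
use warnings;

my %sum;      # $sum{$date}{$name} = running total for that date/name
my %seen;     # every name seen, used to build the header

while ( my $line = <> ) {             # <> reads the files named on the command line, or stdin
    chomp $line;
    next if $line =~ /^\s*$/;         # skip empty lines
    my ( $date, $name, $count ) = split /\s*,\s*/, $line;
    $name =~ s/_[0-9]+$//;            # abc_1, abc_2, ... -> abc
    $sum{$date}{$name} += $count;
    $seen{$name} = 1;
}

my @names = sort keys %seen;
print join( ", ", "Rundate", @names ), "\n";
for my $date ( sort keys %sum ) {
    print join( ", ", $date, map { $sum{$date}{$_} || 0 } @names ), "\n";
}

For example: perl sum_by_date.pl input_20100221.csv input_20100222.csv input_20100224.csv (the script name is just a placeholder). The columns won't be padded to line up like in your sample, but the numbers should come out the same.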

perl:

use strict;
use warnings;

my %hash;
while (<DATA>) {
    chomp;
    # Capture date, name (up to the underscore), and count.
    if ( /([0-9]*),\s*([^_]*)_.*,\s*([0-9]*)$/ ) {
        $hash{$1}->{$2} += $3;
    }
}
foreach my $key ( keys %hash ) {
    my %tmp = %{ $hash{$key} };
    foreach my $inkey ( keys %tmp ) {
        print $key, ", ", $inkey, ", ", $tmp{$inkey}, "\n";
    }
}
__DATA__
20100221, abc_1, 200
20100221, abc_4, 350
20100221, opq_3, 200
20100221, abc_5, 220
20100221, xyz_1, 500
20100221, abc_2, 500
20100221, abc_3, 100
20100221, xyz_2, 700
20100221, opq_2, 350
20100221, xyz_3, 100
20100221, opq_1, 230

Thanks to 'summer cherry' and 'kcoder24' for your solutions.

I've already implemented abubacker's solution. Now, adding a feature to his code: is it possible to handle multiple input files with different dates, using the same sample input files and expected output as in my post above?


Any suggestion would be a great help for me.