awk command to manipulate csv file in UNIX

Prit_Siv · January 29, 2016, 3:40pm

Hi, I am new to awk and unix programming and trying to manipulate a csv file.

My current csv file looks like this:

col1,col2,col3,col4,col5,col4,col5,col6,col7,col8
223,V,c,2,4,f,r,,y,z
223,V,c,3,2,f,r,,y,z
223,V,c,1,4,f,r,,y,z
223,V,c,4,3,f,r,,y,z
227,V,c,3,1,f,r,,y,z
227,V,c,4,1,f,r,,y,z
228,V,c,5,5,f,r,,y,z
228,V,c,2,4,f,r,,y,z
228,V,c,1,7,f,r,,y,z
228,V,c,1,8,f,r,,y,z
228,V,c,1,1,f,r,,y,z
228,V,c,1,4,f,r,,y,z
228,V,c,1,5,f,r,,y,z

Desired output csv:

col1,Sum of col4,Sum of col5
223,10,13
227,7,2
228,12,34

I tried a bunch of awk commands but nothing seemed to work

Is there any awk solution for this?

Thank you in advance.

Yoda · January 29, 2016, 4:06pm

awk -F, '
        {
                A_4[$1] += $4
                A_5[$1] += $5
        }
        END {
                for ( k in A_4 )
                        print k, A_4[k], A_5[k]
        }
' OFS=, file

Prit_Siv · January 29, 2016, 4:58pm

its not working correctly, in the middle of the file it prints the header in one of the rows, and the first line is ,0,0

my csv file has headers so i added NR==1

still no luck

Don_Cragun · January 29, 2016, 5:05pm

Yoda's suggestion looks good except for the way it handles the input and output header lines. You might want to try this slight modification:

awk -F, '
NR == 1 {
	OFS = ","
	print $1, "Sum of " $4, "Sum of " $5
	next
}
{	s4[$1] += $4
	s5[$1] += $5
}
END {	for(i in s4)
		print i, s4, s5
}' file.csv

Note that the output produced by this script could (except for the header line) have the output lines in random order. If you need the output to be sorted by increasing numeric order of the 1st field, or if you want the output to be in the same order as the input (assuming all lines with the same 1st field value are grouped together as in your sample input), the script would need changed to consistently produce the output order you want.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

RudiC · January 30, 2016, 5:14am

To retain the col1's contents' order, try

awk '
NR == 1 {print "col1,Sum of col4,Sum of col5"
         next
        }
$1 != LAST &&
NR > 2  {print LAST, SUM4, SUM5
         SUM4 = SUM5 = 0
        }
        {SUM4 += $4
         SUM5 += $5
         LAST  = $1
        }
END     {print LAST, SUM4, SUM5
        }
' FS=, OFS=, file
col1,Sum of col4,Sum of col5
223,10,13
227,7,2
228,12,34

Aia · January 30, 2016, 12:58pm

A few bits of Perl.

perl -nalF, -e '                                  # set Perl to make tokens out of lines
BEGIN{$, = $" = ","}                              # output separators
$c[$o++] = $F[0] unless $col1{$F[0]};             # create a queue order
$col1{$F[0]}->[0] += $F[3];                       # add to column 4
$col1{$F[0]}->[1] += $F[4];                       # add to column 5
END{
    $col1{$c[0]}=["Sum of col4", "Sum of col5"];  # set header
    for(@c){print $_,@{$col1{$_}}}                # display tally
}' file.csv

col1,Sum of col4,Sum of col5
223,10,13
227,7,2
228,12,34

Prit_Siv · February 3, 2016, 12:49pm

Thank you so much! Both RudiC and Aia's solutions work, howeverwith the awk solution, the second line prints

,0,0

and with the perl solution the last line prints

,0,0

Can someone explain what might be causing this?

RudiC · February 3, 2016, 1:07pm

Difficult to believe. The proposal was tested with your sample in post#1 before posted with the given result, and a repeated test now yields the same.
So - the difference must lie in

the awk version - which is yours?
the input file / file structure
correctness of copy and paste the program
Please post the conditions that you ran the program in.

---------- Post updated at 19:07 ---------- Previous update was at 19:07 ----------

One suspicion: an empty line in (at the end of?) the input file. Result with an empty line:

col1,Sum of col4,Sum of col5
223,10,13
227,7,2
228,12,34
,0,0

Aia · February 3, 2016, 8:05pm

prit siv:

Thank you so much! Both RudiC and Aia's solutions work, howeverwith the awk solution, the second line prints
,0,0
and with the perl solution the last line prints
,0,0
Can someone explain what might be causing this?

That's the result of your data having empty line(s).