Calculation in Multiple files using awk

Hi All,

I have 10 files named samp1.csv, samp2.csv, ..., samp10.csv.

Each file has the same fields:

Count, field1, field2, field3.

There is also a source.csv file with three fields: field1, field2, field3.

Now I want to find the total count, using (field1, field2, field3) as the uniqueness key.

Eg:

source.csv
a, b, c
b, c, a
d, e, f

samp1.csv
10, a, b, c
5, b, c, a

samp2.csv
12, a, b, c
4, b, c, a

samp3.csv
2, a, b, c
7, d, e, f
Output should be,

24, a, b, c
9, b, c, a
7, d, e, f

I am trying to achieve this. Can this be done using awk?

Let me know your inputs. Thanks.

Hi,

try

awk -F, '{gsub(/ */,"",$0);split($0,t,",");a[t[2]", "t[3]", "t[4]]+=$1}END{for (i in a) print a[i], i}' file1 file2 file3

Output:

24 a, b, c
7 d, e, f
9 b, c, a

HTH Chris

Hi Chris,

Can you please explain the code? :slight_smile:

O.k.

awk -F,  # make "," the field delimiter
'{gsub(/ */,"",$0); # the format of your data is not consistent
                    # -> delete all blanks
split($0,t,",");    # split the current line at each comma
                    # store the values in an array called t
a[t[2]", "t[3]", "t[4]]+=$1} # the values in t no longer contain any spaces or commas,
                    # so I construct a new string t[2]", "t[3]", "t[4] from the elements of t
                    # and use it as the key of the hash entry
                    # each time this key is encountered, the line's count ($1) is added to its value
END{for (i in a) print a[i], i}' # when all files are processed,
                    # first print the total stored in the hash,
                    # then the corresponding key
file1 file2 file3   # names of the three processed files
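For reference, here is a reproducible run of that one-liner against the sample data from the first post. The scratch directory and the trailing `sort -n` are my additions: `for (i in a)` returns keys in an unspecified order, so the sort only makes the output order deterministic.

```shell
# Recreate the three sample files in a scratch directory (illustrative setup).
cd "$(mktemp -d)"
printf '10, a, b, c\n5, b, c, a\n' > samp1.csv
printf '12, a, b, c\n4, b, c, a\n' > samp2.csv
printf '2, a, b, c\n7, d, e, f\n'  > samp3.csv

# Sum the counts per (field1, field2, field3) key.
awk -F, '{gsub(/ */,"",$0); split($0,t,","); a[t[2]", "t[3]", "t[4]]+=$1}
         END{for (i in a) print a[i], i}' samp1.csv samp2.csv samp3.csv | sort -n
# -> 7 d, e, f
#    9 b, c, a
#    24 a, b, c
```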

Hi Chris,

I have one query here. In a csv file all fields are delimited by comma(,).

But in the code you have given,

awk -F, ' '

why does it not take each individual field into the variables $0, $1, $2?

Instead, you are splitting with the split() function.

Let me know if my query is vague. :slight_smile:

Simply because it didn't work that way, so I tried something else. You can't use $2..$4 to build the hash key, as your input contains varying amounts of spaces: you would get different entries for different amounts of spaces. So first I have to delete the unnecessary spaces, but then I cannot use the old field variables any more. That's why I have to split the new version of $0 again. But I am no awk expert.
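As a side note, in POSIX awk any assignment to $0 (and gsub() on $0 counts as one) makes awk re-split the record into fields using the current FS, so the field variables actually do hold the cleaned values afterwards. A minimal sketch of that behavior:

```shell
# After gsub() rewrites $0, awk re-splits it with FS="," -- the field
# variables then contain the cleaned values, so split() is not strictly needed.
printf '10, a, b, c\n' | awk -F, '{gsub(/ */,"",$0); print $2, $3, $4}'
# -> a b c
```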

It works for me anyway. I was just curious to know.

Thanks for your help. :b:

In your sample files there are some hidden spaces (I guess you typed them carelessly); that is why Christoph Spohr's code is a little more complex.

The command below is simpler, and may work with your real data files.

awk -F , '{a[$2 FS $3 FS $4]+=$1}END{for (i in a) print a[i], i}' samp*.csv
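A quick check of that command on consistently formatted data (the scratch files are illustrative; with the original inconsistent spacing the keys would come out different):

```shell
cd "$(mktemp -d)"
printf '10,a,b,c\n5,b,c,a\n' > samp1.csv
printf '12,a,b,c\n7,d,e,f\n' > samp2.csv

# No cleanup needed when the spacing is consistent: key directly on $2-$4.
# sort -n only makes the unspecified "for (i in a)" order deterministic.
awk -F , '{a[$2 FS $3 FS $4]+=$1}END{for (i in a) print a[i], i}' samp*.csv | sort -n
# -> 5 b,c,a
#    7 d,e,f
#    22 a,b,c
```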

Hi,

Now, is it possible to check whether an array index already has a value or is empty?

In the code given below,

awk -F , '{a[$2 FS $3 FS $4]+=$1}END{for (i in a) print a[i], i}' samp*.csv

I want to check whether the index $2 FS $3 FS $4 already has a value, before summing another value into it.

Is this possible in awk?


Hi,

I have found an option to test for the index in nawk:

if ($2","$3","$4 in a) 

which checks for the existence of index in array.

Do we have something similar in awk?
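The `in` membership test is part of POSIX awk, so it is not nawk-specific; awk, nawk, gawk and mawk all support it. A minimal sketch, parenthesizing the concatenated key to keep the operand of `in` unambiguous:

```shell
printf '10,a,b,c\n5,b,c,a\n3,a,b,c\n' |
awk -F, '{
    if (($2 FS $3 FS $4) in a)          # test BEFORE summing
        print "seen before:", $2 FS $3 FS $4
    a[$2 FS $3 FS $4] += $1             # `in` itself never creates the entry
}'
# -> seen before: a,b,c
```

Note that, unlike referencing `a[key]`, the `in` test does not create the array element as a side effect.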