Sum fields of different files using awk

rogeriog.em · July 23, 2013, 11:13am

I'm trying to sum each field of the second column over many different files.
For example:

file1:                file2:
1  5                 1  5
2  6                 2  4
3  5                 3  3

To get:

file3:
1  10
2  10
3  8

I found an answer that works when there are only 2 input files:

cat file1 | awk '{n=$2; getline <"file2"; print NR " " n+$2}' > file3

But I have many files. How can I do that?

Thanks,

Yoda · July 23, 2013, 11:27am

awk '{A[$1]+=$2}END{for(k in A) print k,A[k]}' file*

RudiC · July 23, 2013, 11:28am

It depends on how your files are structured and how many of them there are. Try something like this (untested):

awk '{RES[$1]+=$2} END {for (n in RES) print n, RES[n]}' file*

RavinderSingh13 · July 23, 2013, 11:41am

Hello Yoda,

Sorry to bother you, but could you please explain the command you provided?

awk '{A[$1]+=$2}END{for(k in A) print k,A[k]}' file*

Thanks,
R. Singh

Yoda · July 23, 2013, 11:57am

awk '
        # Create an associative array: A for which value is sum of $2 and indexed by $1
        {
                A[$1] += $2
        }

        # End Block
        END {
                # For each element in associative array: A
                for ( k in A )
                        # Print index & value of the element
                        print k, A[k]
        }
# Path name expansion (aka globbing) will help open & read all files with file name prefixed: file
' file*
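
As a quick check, here is a minimal sketch that runs the same program on the two sample files from the first post. The temporary directory and the `out` variable are only there for illustration, and the trailing `sort -n` is an addition: it puts the lines back in numeric key order, since the `for (k in A)` loop does not guarantee any order.

```shell
# Recreate the sample files from the question in a scratch directory
dir=$(mktemp -d)
printf '%s\n' '1 5' '2 6' '3 5' > "$dir/file1"
printf '%s\n' '1 5' '2 4' '3 3' > "$dir/file2"

# Sum column 2 per key in column 1, then sort numerically by key
out=$(awk '{ A[$1] += $2 } END { for (k in A) print k, A[k] }' "$dir"/file* | sort -n)

printf '%s\n' "$out"
rm -r "$dir"
```

With those two files this should print `1 10`, `2 10` and `3 8`, one pair per line.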

rogeriog.em · July 23, 2013, 12:13pm

Thanks Yoda and RudiC, both commands work on the input I provided. But is there a way to do it disregarding the first column values? I mean:

file 1                file 2
2  5                 4  5
5  6                 5  4
3  5                 8  3

Output:

file 3
10
10
8

Also my real data is ordered as:

-48.000   1.2
-47.990   1.5
....
25.000    0.033
25.010    0.023

When I run these commands, the second-column values seem to be summed properly, but the lines come out in a different order. Is there a way to generate them in order, or to put them back in order afterwards?

Thank you very much.

Yoda · July 23, 2013, 12:19pm

If you want to disregard the first column, just print the sum:

print A[k]

By default, the order in which a for (i in array) loop scans an array is not defined; it is generally based upon the internal implementation of arrays inside awk.

You might have to use an indexed array to preserve the order.
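
One possible way to do that, as a rough sketch: keep a second, numerically indexed array that records each key the first time it is seen, and walk that array in the END block. The array name `order`, the counter `n`, and the sample values are made up for illustration; they are not the poster's real data.

```shell
# Hypothetical sample files, shaped like the "real data" described above
dir=$(mktemp -d)
printf '%s\n' '-48.000 1.5' '-47.990 2' '25.000 0.25' > "$dir/file1"
printf '%s\n' '-48.000 0.5' '-47.990 3' '25.000 0.5'  > "$dir/file2"

out=$(awk '
    # Remember each key the first time it appears, in arrival order
    !($1 in A) { order[++n] = $1 }
    # Accumulate the second column per key
    { A[$1] += $2 }
    END {
        # Walk the numeric index 1..n instead of using "for (k in A)"
        for (i = 1; i <= n; i++)
            print order[i], A[order[i]]
    }
' "$dir"/file*)

printf '%s\n' "$out"
rm -r "$dir"
```

If the keys are plain numbers, piping the output of the original command through `sort -n` is a simpler alternative that restores numeric order after the fact.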

rogeriog.em · July 23, 2013, 12:43pm

Oh yeah, thanks Yoda, I missed this one. Pretty obvious. Thanks for your dedication.

RavinderSingh13 · July 23, 2013, 12:46pm

Hello Yoda,

Could you please explain the use of END here? Without END the command gives a different result. I would be grateful if you could shed some light on this.

 
awk '
        # Create an associative array: A for which value is sum of $2 and indexed by $1
        {
                A[$1] += $2
        }

        # End Block
        END {
                # For each element in associative array: A
                for ( k in A )
                        # Print index & value of the element
                        print k, A[k]
        }
# Path name expansion (aka globbing) will help open & read all files with file name prefixed: file
' file*

Thanks,
R. Singh

Yoda · July 23, 2013, 1:00pm

BEGIN and END are special awk patterns.

They are usually used for startup and cleanup actions respectively.

A BEGIN rule is executed only once before the first input record is read. Likewise, an END rule is executed once only, after all the input is read.
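
A small sketch may make the difference visible. The three input lines and the `sum` variable are invented for illustration. The middle rule has no pattern, so it runs for every record; that is also what would happen to the `for` loop in the earlier command if the END keyword were dropped, which is why the result changes.

```shell
out=$(printf '%s\n' 'a 5' 'b 6' 'c 5' | awk '
    # Runs once, before any input is read
    BEGIN { print "start" }
    # No pattern: runs for every input record
    { sum += $2; print "running total:", sum }
    # Runs once, after the last record has been read
    END { print "total:", sum }
')

printf '%s\n' "$out"
```

The running total is printed three times, once per line, while `start` and `total: 16` each appear exactly once.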

I recommend reading: GNU Awk User's Guide or AWK Manual

RudiC · July 23, 2013, 4:39pm

So you want it based on line no., not on the col1 value? Try

awk '{RES[FNR]+=$2} max<FNR {max=FNR} END {for (i=1; i<=max; i++) print RES[i]}' file1 file2
10
10
8
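
The command indexes by FNR rather than NR because FNR restarts at 1 for each input file, so line 1 of every file lands in the same array slot. A short sketch using the two sample files from the question (the scratch directory and the `out` variable are only there for illustration):

```shell
# Recreate the second pair of sample files in a scratch directory
dir=$(mktemp -d)
printf '%s\n' '2 5' '5 6' '3 5' > "$dir/file1"
printf '%s\n' '4 5' '5 4' '8 3' > "$dir/file2"

# NR counts records across all files; FNR counts within the current file
out=$(cd "$dir" && awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' file1 file2)

printf '%s\n' "$out"
rm -r "$dir"
```

NR runs from 1 to 6 across both files, while FNR goes 1, 2, 3 and then 1, 2, 3 again for file2.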