Sum of two columns in a file

Jram · August 11, 2012, 3:05pm

Hi,

I need to do a sum of two columns in a file where delimiter is |^

input

1|^2|^3|^4|^50|^2|^3|^100
2|^3|^4|^6|^100|^7|^2|^50
3|^4|^2|^3|^50|^6|^3|^50
4|^2|^5|^7|^25|^2|^8|^25

Output required:

 Sum of 2 columns 5 & 8 which is 450

RudiC · August 11, 2012, 4:01pm

Try

awk 'BEGIN {FS="\|\^"} {sum+=$5+$8} END {print sum}' infile

Jram · August 12, 2012, 1:54pm

Hi RudiC,

Tried as suggested..

following is the issue

[root@host-6-113 ~]# awk 'BEGIN {FS="\|\^"} {sum+=$5+$8} END {print sum}' infile
awk: warning: escape sequence `\|' treated as plain `|'
awk: warning: escape sequence `\^' treated as plain `^'
0

---------- Post updated at 11:24 PM ---------- Previous update was at 10:50 PM ----------

Rudic,

tweaked the code you've provided.. It is giving me the desired result. Thanks !!

[root@host-6-113 ~]# awk 'BEGIN {FS="[|^]"} {sum+=$9+$15} END {print sum}' infile
450

But it looks like it is considering the delimiter |^ as two strings. Please suggest alternatives if any..

Don_Cragun · August 12, 2012, 3:54pm

Try

awk -F'[.|.][.^.]' ' {sum+=$5+$8} END {print sum}' infile

The '[!^]' is rejected by some versions of awk, but when it is accepted it is saying the field separator is a "|" or a "^". You want it to be a "|" immediately followed by a "^".

agama · August 12, 2012, 4:44pm

Given awk's ability to evaluate a number up to the first non-digit, the carrot (^) can be used as the simple field separator:

awk -F "^" '{s += $5 + $8;} END { print s; }' input-file

RudiC · August 13, 2012, 2:57am

jram:

... following is the issue

[root@host-6-113 ~]# awk 'BEGIN {FS="\|\^"} {sum+=$5+$8} END {print sum}' infile
awk: warning: escape sequence `\|' treated as plain `|'
awk: warning: escape sequence `\^' treated as plain `^'
0

---------- Post updated at 11:24 PM ---------- Previous update was at 10:50 PM ----------

[root@host-6-113 ~]# awk 'BEGIN {FS="[|^]"} {sum+=$9+$15} END {print sum}' infile
450

There does not seem to be an issue for me - awk gave warnings but treated the FS as requested. The 0 result I guess was due to 0 input, sum (0)=0; in your second post you suddenly summed columns (9 & 15), totally different to your first example (5 & 8).

Don_Cragun · August 13, 2012, 1:32pm

According to POSIX, -F "|^" or any other mechanism that sets FS to the two characters <vertical-line><circumflex> produces undefined results because <vertical-line> is special in an extended regular expression (ERE) except when it appears in a bracket expression. Also in an ERE ^ is an anchor (even when it is not the first character in the ERE) except when it appears in a bracket expression, in a collating symbol ([.^.]), or equivalence class ([=^=]). So, any of the solutions here that set FS to a string starting with "|" is not portable. And any of the solution here that set FS to a string ending with "^" is not portable. Setting FS to "[|^]" is valid, but makes both <vertical-line> and <circumflex> field separators, not the two character string "|^". (This is why the solutions that set FS to "[|^]" had to sum fields 9 and 15 instead of fields 5 and 8. Note that this solution won't work if there are any <vertical-line> or <circumflex> characters in the input that aren't adjacent pairs and it that order).

Since "^" is an anchor in an ERE, setting FS to "^" produces undefined results. (It is anchoring the empty expression following the <circumflex> to the beginning of the string. But since this isn't a replacement [where an empty expression is replaced by the previously matched expression], the standards don't specify what this means and the results are not portable.)

I'm sure the solutions given all work on one or more implementations. Several of them fail with syntax errors when using Mac OS X's awk.

To portably search for "|^" as a field delimiter and sum the numeric values in the 5th and 8th columns, I think you need something like:

awk -F "[|][[.^.]]" ' {sum+=$5+$8}
END {print "Sum of 2 columns 5 & 8 which is",sum}' input

which should work on any system that meets the POSIX awk utility requirements.

alister · August 13, 2012, 2:17pm

That's incorrect. In a POSIX ERE, ^ is only an anchor when it is the first character in an expression.

\|^ should be portable, but, sadly, some implementations botch ^ handling. Some (perhaps all?) mawk implementations definitely do not handle this correctly.

Regards,
Alister

Don_Cragun · August 13, 2012, 2:53pm

No. What you say is true for a BRE; not for an ERE. Quoting from IEEE Std. 1003.1-2008, P190, L6174-6183:

9.4.9 ERE Expression Anchoring
An ERE can be limited to matching strings that begin or end a line; this
is called ��anchoring''. The <circumflex> and <dollar-sign> special
characters shall be considered ERE anchors when used anywhere outside
a bracket expression. This shall have the following effects:

A <circumflex> ('�') outside a bracket expression shall anchor the
expression or subexpression it begins to the beginning of a string; such
an expression or subexpression can match only a sequence starting at
the first character of a string. For example, the EREs "�ab" and "(�ab)"
match "ab" in the string "abcdef", but fail to match in the string
"cdefab", and the ERE "a�b" is valid, but can never match because the
'a' prevents the expression "�b" from matching starting at the first
character.

... ... ...

alister · August 13, 2012, 3:44pm

I stand corrected. Thank you.

Also, apologies to mawk devs

Regards,
Alister

Jram · August 17, 2012, 6:28am

Hi,

Thanks for your response..

Now my file is having around 300 columns and awk is giving an error message "awk cannot have more than 199 fields" while summing up the columns 80 and 138.

Please suggest

Thanks, Jram

Don_Cragun · August 17, 2012, 1:47pm

What system are you using?

I see nothing in the POSIX standard that would allow this limitation.
Input files for awk are text files. Text files can't contain null bytes. If there are any characters in a text file, the last character in the file must be a <newline> character and no line in a text file (including its trailing <newline> character can be more than {LINE_MAX} bytes long. The minimum value for {LINE_MAX} on conforming implementations is 2,048, so in theory, I believe the standard requires awk implementations to support at least 2,049 fields (more if the implementation's value of {LINE_MAX} is larger than the minimum).

Assuming your input line are not longer than {LINE_MAX} bytes long, the following should work around your awk's limit on the number of fields:

awk -F"\n" '# Call getfield(N) to extract field N from $0 with "|^"
# as the field separator.  f1toN is a local temp variable.
function getfield(n,f1toN) {
        if(match($0,"(\([^^]*\)[.|.][.^.]){"n"}") > 0) {
                f1toN=substr($0,RSTART,RLENGTH-2)
                match(f1toN,"(\([^^]*\)[.|.][.^.]){1,"n-1"}")
                return substr(f1toN,RLENGTH+1)
        }
        return ""
}
        {sum+=(getfield(80)+getfield(138))}
END     {print "Sum of 2 columns 80 & 138 which is",sum
}' in

Setting FS to <newline> means each input line will be treated as a single field. The getfield(n) function will extract field n from that line (assuming the field delimiter is still "|^") by storing the 1st n fields in f1toN and then extracting just the nth field from the end of that. (This could be done with a single call to substr() by calculating the start and end points of the desired field using the RSTART and RLENGTH from the two calls to match(). But I'll leave that as an exercise for the reader.) Note that the "n" and "n-1", respectively, in the two calls to match() are not inside a "..." string!