Sum column values based in common identifier in 1st column.

sargotrons · October 21, 2014, 8:59am

Hi,
I have a table to be imported for R as matrix or data.frame but I first need to edit it because I've got several lines with the same identifier (1st column), so I want to sum the each column (2nd -nth) of each identifier (1st column)

The input is for example, after sorted:

K00001 1 1 4 3 5
K00001 2 4 4 3 3
K00001 8 9 1 5 2
K00006 7 2 3 2 8
K00006 3 4 6 6 3
K00006 5 1 9 7 7
K00008 1 1 2 1 1
K00011 14 18 18 12 15

I want an output as follow:

K00001 11 14 9 11 10
K00006 15 7 18 15 18
K00008 1 1 2 1 1
K00011 14 18 18 12 15

How can do this in awk? I queried some threads with similar task but all of those were about just a sum of one given column.

Please help. Thanks in advance !

RudiC · October 21, 2014, 9:24am

Any attempts from your side?

sargotrons · October 21, 2014, 9:39am

Yes I tried to modify this example I found here in this forum,

awk '{A[$1]+=$2;next}END{for(i in A){print i,A}}'

I'm pretty newbie, and mostly the way I'm learning is by googling similar examples and adapt to my objectives.

Regarding the example above I think it can't be used because

A[$1]+=$2

.

SriniShoo · October 21, 2014, 9:50am

awk '{a[$1] += $2; b[$1] += $3; c[$1] += $4; d[$1] += $5; e[$1] += $6} END {for(x in a) print x, a[x], b[x], c[x], d[x], e[x]}' file

---------- Post updated at 09:50 AM ---------- Previous update was at 09:48 AM ----------

or since the input is already sorted and if you want the sorted order in the output

awk 'a1==$1 {a2+=$2; a3+=$3; a4+=$4; a5+=$5; a6+=$6; next} {print a1, a2, a3, a4, a5, a6; a1=$1; a2=$2; a3=$3; a4=$4; a5=$5; a6=$6} END {print a1, a2, a3, a4, a5, a6}' file

Akshay_Hegde · October 21, 2014, 9:53am

@SriniShoo : One array is enough I think

Try

akshay@nio:/tmp$ cat infile
K00001 1 1 4 3 5
K00001 2 4 4 3 3
K00001 8 9 1 5 2
K00006 7 2 3 2 8
K00006 3 4 6 6 3
K00006 5 1 9 7 7
K00008 1 1 2 1 1
K00011 14 18 18 12 15

awk '
	function p(v, s)
	{
		for(i=2;i<=NF;i++)
		{
		  if(v == "push"){
			A[$1,i]+=$i
		  }
		  else{	
			if(($1,i) in  A)
			  s = (length(s) ? s OFS:"") A[$1,i]
			  delete A[$1,i]
		  }
		}
		return s
	}

 FNR==NR{ p("push"); next }

        { if(length(s=p()))print $1,s }

    ' infile infile

Resulting

K00001 11 14 9 11 10
K00006 15 7 18 15 18
K00008 1 1 2 1 1
K00011 14 18 18 12 15

OR

awk '{
	for(i=2; i<=NF; i++)
	A[$1,i]+=$i; 
	max = ( NF > max ) ? NF : max
     }
  END{
	for(i in A)
	{
	  if(i in A)
	  {
		split(i,X,SUBSEP)
		s = X[1]
		for(j=2; j<=max; j++)
		{
			if((X[1],j) in A)
			{
				s = s OFS A[X[1],j] 
				delete A[X[1],j]
			} 
		}
			
		print s
	  }
	}
     }'  infile

Resulting (if you don't care order)

K00008 1 1 2 1 1
K00006 15 7 18 15 18
K00001 11 14 9 11 10
K00011 14 18 18 12 15

sargotrons · October 21, 2014, 10:03am

Thanks, I will have a look to see i can understand the logic of the array, that seems more pratical since my actual input file has 26 columns.

Thank you both for the fast help! I might be back with some doubts about the logic or semantics of the array, so I wont indicate this topic for closure, although is solved!

RudiC · October 22, 2014, 4:39am

Try also

awk     'function SP()  {n=split ($0, T); ID=$1}
         function PR()  {printf "%s", ID; for (i=2; i<=n; i++) printf "\t%s", T; printf "\n"}

         NR==1          {SP();next}
         $1 != ID       {PR(); SP(); next}
                        {for (i=2; i<=NF; i++) T+=$i}
         END            {PR()}
        ' file
K00001  11      14      9       11      10
K00006  15      7       18      15      18
K00008  1       1       2       1       1
K00011  14      18      18      12      15

Scrutinizer · October 22, 2014, 2:20pm

Same idea, small variation:

awk '
  function pr() {
    for(i=1; i<=NF; i++) $i=T
    print
  }

  $1!=p {
    n=$0
    p=$1
    if(NR>1) pr()
    split(n,T)
    next
  }
 
  {
    for(i=2; i<=NF; i++) T+=$i
  }

  END{
    pr()
  }
' OFS='\t' file

RavinderSingh13 · October 24, 2014, 3:50pm

Hello,

One more solution on same. This solution will take care of sequences of first column, lets say the first column values are not in a sequence as follows then following may help.

Input file is as follows.

cat get_sum.txt 
K00001 1 1 4 3 5
K00001 2 4 4 3 3
K00001 8 9 1 5 2
K00006 7 2 3 2 8
K00006 3 4 6 6 3
K00006 5 1 9 7 7
K00008 1 1 2 1 1
K00011 14 18 18 12 15
K00001 2 3 4 5 6
K00008 5 4 3 2 1

Code is as follows.

awk  '{for(i=1;i<=NF;i++){if(i==1){Q=$1} else {A[Q OFS i]+=$i}}} END{for(y in A){W=y;gsub(/[[:space:]].*/,X,W);while(p<=NF){D[W]=D[W] OFS A[W OFS p];p++};while(k<=NF){delete A[W OFS k];k++};p="";k=""} { for(t in D){print t OFS D[t]}}}' get_sum.txt

EDIT: Adding a non single liner approach for same.

awk  '
{for(i=1;i<=NF;i++)
	{if(i==1)
		{Q=$1} 
	else 
		{A[Q OFS i]+=$i}}
	} 
END{
for(y in A)
{W=y;
	gsub(/[[:space:]].*/,X,W);
	while(p<=NF){
			D[W]=D[W] OFS A[W OFS p];p++
		    }
	while(k<=NF){
			delete A[W OFS k];k++};p="";k=""
		    } 
{for(t in D)
	{print t OFS D[t]}
}
}' get_sum.txt

Output will be as follows.

K00001    13 17 13 16 16                            
K00011    14 18 18 12 15                            
K00006    15 7 18 15 18                            
K00008    6 5 5 3 2

Thanks,
R. Singh