Hi,
I have a table to be imported for R as matrix or data.frame but I first need to edit it because I've got several lines with the same identifier (1st column), so I want to sum the each column (2nd -nth) of each identifier (1st column)
The input is for example, after sorted:
K00001 1 1 4 3 5
K00001 2 4 4 3 3
K00001 8 9 1 5 2
K00006 7 2 3 2 8
K00006 3 4 6 6 3
K00006 5 1 9 7 7
K00008 1 1 2 1 1
K00011 14 18 18 12 15
I want an output as follow:
K00001 11 14 9 11 10
K00006 15 7 18 15 18
K00008 1 1 2 1 1
K00011 14 18 18 12 15
How can do this in awk? I queried some threads with similar task but all of those were about just a sum of one given column.
Please help. Thanks in advance !
RudiC
October 21, 2014, 9:24am
2
Any attempts from your side?
Yes I tried to modify this example I found here in this forum,
awk '{A[$1]+=$2;next}END{for(i in A){print i,A}}'
I'm pretty newbie, and mostly the way I'm learning is by googling similar examples and adapt to my objectives.
Regarding the example above I think it can't be used because
A[$1]+=$2
.
awk '{a[$1] += $2; b[$1] += $3; c[$1] += $4; d[$1] += $5; e[$1] += $6} END {for(x in a) print x, a[x], b[x], c[x], d[x], e[x]}' file
---------- Post updated at 09:50 AM ---------- Previous update was at 09:48 AM ----------
or since the input is already sorted and if you want the sorted order in the output
awk 'a1==$1 {a2+=$2; a3+=$3; a4+=$4; a5+=$5; a6+=$6; next} {print a1, a2, a3, a4, a5, a6; a1=$1; a2=$2; a3=$3; a4=$4; a5=$5; a6=$6} END {print a1, a2, a3, a4, a5, a6}' file
1 Like
@SriniShoo : One array is enough I think
Try
akshay@nio:/tmp$ cat infile
K00001 1 1 4 3 5
K00001 2 4 4 3 3
K00001 8 9 1 5 2
K00006 7 2 3 2 8
K00006 3 4 6 6 3
K00006 5 1 9 7 7
K00008 1 1 2 1 1
K00011 14 18 18 12 15
awk '
function p(v, s)
{
for(i=2;i<=NF;i++)
{
if(v == "push"){
A[$1,i]+=$i
}
else{
if(($1,i) in A)
s = (length(s) ? s OFS:"") A[$1,i]
delete A[$1,i]
}
}
return s
}
FNR==NR{ p("push"); next }
{ if(length(s=p()))print $1,s }
' infile infile
Resulting
K00001 11 14 9 11 10
K00006 15 7 18 15 18
K00008 1 1 2 1 1
K00011 14 18 18 12 15
OR
awk '{
for(i=2; i<=NF; i++)
A[$1,i]+=$i;
max = ( NF > max ) ? NF : max
}
END{
for(i in A)
{
if(i in A)
{
split(i,X,SUBSEP)
s = X[1]
for(j=2; j<=max; j++)
{
if((X[1],j) in A)
{
s = s OFS A[X[1],j]
delete A[X[1],j]
}
}
print s
}
}
}' infile
Resulting (if you don't care order)
K00008 1 1 2 1 1
K00006 15 7 18 15 18
K00001 11 14 9 11 10
K00011 14 18 18 12 15
1 Like
Thanks, I will have a look to see i can understand the logic of the array, that seems more pratical since my actual input file has 26 columns.
Thank you both for the fast help! I might be back with some doubts about the logic or semantics of the array, so I wont indicate this topic for closure, although is solved!
RudiC
October 22, 2014, 4:39am
7
Try also
awk 'function SP() {n=split ($0, T); ID=$1}
function PR() {printf "%s", ID; for (i=2; i<=n; i++) printf "\t%s", T; printf "\n"}
NR==1 {SP();next}
$1 != ID {PR(); SP(); next}
{for (i=2; i<=NF; i++) T+=$i}
END {PR()}
' file
K00001 11 14 9 11 10
K00006 15 7 18 15 18
K00008 1 1 2 1 1
K00011 14 18 18 12 15
1 Like
Same idea, small variation:
awk '
function pr() {
for(i=1; i<=NF; i++) $i=T
print
}
$1!=p {
n=$0
p=$1
if(NR>1) pr()
split(n,T)
next
}
{
for(i=2; i<=NF; i++) T+=$i
}
END{
pr()
}
' OFS='\t' file
Hello,
One more solution on same. This solution will take care of sequences of first column, lets say the first column values are not in a sequence as follows then following may help.
Input file is as follows.
cat get_sum.txt
K00001 1 1 4 3 5
K00001 2 4 4 3 3
K00001 8 9 1 5 2
K00006 7 2 3 2 8
K00006 3 4 6 6 3
K00006 5 1 9 7 7
K00008 1 1 2 1 1
K00011 14 18 18 12 15
K00001 2 3 4 5 6
K00008 5 4 3 2 1
Code is as follows.
awk '{for(i=1;i<=NF;i++){if(i==1){Q=$1} else {A[Q OFS i]+=$i}}} END{for(y in A){W=y;gsub(/[[:space:]].*/,X,W);while(p<=NF){D[W]=D[W] OFS A[W OFS p];p++};while(k<=NF){delete A[W OFS k];k++};p="";k=""} { for(t in D){print t OFS D[t]}}}' get_sum.txt
EDIT: Adding a non single liner approach for same.
awk '
{for(i=1;i<=NF;i++)
{if(i==1)
{Q=$1}
else
{A[Q OFS i]+=$i}}
}
END{
for(y in A)
{W=y;
gsub(/[[:space:]].*/,X,W);
while(p<=NF){
D[W]=D[W] OFS A[W OFS p];p++
}
while(k<=NF){
delete A[W OFS k];k++};p="";k=""
}
{for(t in D)
{print t OFS D[t]}
}
}' get_sum.txt
Output will be as follows.
K00001 13 17 13 16 16
K00011 14 18 18 12 15
K00006 15 7 18 15 18
K00008 6 5 5 3 2
Thanks,
R. Singh