Adding the values of repeated ids

ashmit99 · July 26, 2017, 7:20am

File1 consists of two columns: first some weird ids, and second the numbers.

t|v203.1@t|k88711.1	0.1
t|v190.1@t|k90369.1	0.01
t|v203.1@t|k88711.1	0.5
t|v322.1@t|k88711.1	0.2
t|v207.1@t|k90369.1	0.11
t|v326.1@t|k85939.1	0.5
t|v207.1@t|k90369.1	0.7
t|v207.1@t|k90369.1	0.3
t|v326.1@t|k89421.1	0.33

I want to add up the values of repeated ids and print a 4-column File2.
Desired output File2:

t|v203.1@t|k88711.1	0.1	0.1+0.5	0.6
t|v190.1@t|k90369.1	0.01	0.01	0.01
t|v322.1@t|k88711.1	0.2	0.2	0.2
t|v207.1@t|k90369.1	0.11	0.11+0.7+0.3	1.11
t|v326.1@t|k85939.1	0.5	0.5	0.5
t|v326.1@t|k89421.1	0.33	0.33	0.33

In the desired output file, the first and second columns are from File1, the third column shows what is added for repeated ids, and the fourth column is the resulting sum. Repeated ids merge into one row; for example, rows 1 and 3 of File1 merge into row 1 of File2.

rdrtx1 · July 26, 2017, 11:53am

awk '
{ b[$1]=$0; c[$1]=(! c[$1] ? _ : c[$1] "+") $NF; d[$1]+=$NF; }
END { for (i in b) print b[i] "\t" c[i] "\t" d[i]; }
' infile

MadeInGermany · July 26, 2017, 3:20pm

Col#2 differs from the requirement? Because b[$1]=$0 is overwritten on every line, it ends up holding the last value for each id rather than the first.
(Also, b[] stores the $1 string twice: once as the index and again as part of its value.)
Col#2 seems redundant anyway, so it is omitted in the following solution:

awk '
{
  if ($1 in sum) {
    sum[$1]+=$2; str[$1]=(str[$1] "+" $2)
  } else {
    sum[$1]=str[$1]=$2
  }
}
END {
  for (i in sum) printf "%s\t%s\t%s\n", i, str[i], sum[i]
}
' infile

Because of the explicit if, adding col#2 is simple; it is left as an exercise.
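
One way the exercise might go, offered as a minimal sketch rather than the poster's own solution. It keeps the first value seen for each id as col#2, and it also records the order in which ids first appear, because for (i in sum) visits the keys in an unspecified order. The names sum_ids, order, and first are made up for illustration, and tab-separated input as in File1 is assumed.

```shell
# Hypothetical helper: prints id, first value, "+"-joined values, and sum,
# one row per id, in order of first appearance. Assumes tab-separated input.
sum_ids() {
  awk -F'\t' -v OFS='\t' '
    NF < 2 { next }                      # skip blank or incomplete lines
    !($1 in sum) {                       # first time this id is seen
      order[++n] = $1
      first[$1] = str[$1] = sum[$1] = $2
      next
    }
    { str[$1] = str[$1] "+" $2; sum[$1] += $2 }
    END {
      for (k = 1; k <= n; k++) {
        id = order[k]
        print id, first[id], str[id], sum[id]
      }
    }
  ' "$@"
}
```

It could be called as sum_ids File1 > File2, or fed from a pipe when no file name is given.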

drl · July 27, 2017, 7:16am

Hi.

If you can live without the output field that indicates the detailed arithmetic, then this fairly simple command to sort, categorize, and sum associated values may be useful:

datamash --sort --group=1 sum 2

Here is a complete demonstration script with output:

#!/usr/bin/env bash

# @(#) s1       Demonstrate arithmetic on group components, datamash.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C dixf datamash

FILE=${1-data1}

pl " Input data file $FILE:"
cat "$FILE"

pl " Results:"
datamash --sort --group=1 sum 2 < "$FILE"

pl " Details for datamash:"
dixf datamash

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.8 (jessie) 
bash GNU bash 4.3.30
dixf (local) 1.49
datamash (GNU datamash) 1.0.6

-----
 Input data file data1:
t|v203.1@t|k88711.1     0.1
t|v190.1@t|k90369.1     0.01
t|v203.1@t|k88711.1     0.5
t|v322.1@t|k88711.1     0.2
t|v207.1@t|k90369.1     0.11
t|v326.1@t|k85939.1     0.5
t|v207.1@t|k90369.1     0.7
t|v207.1@t|k90369.1     0.3
t|v326.1@t|k89421.1     0.33

-----
 Results:
t|v190.1@t|k90369.1     0.01
t|v203.1@t|k88711.1     0.6
t|v207.1@t|k90369.1     1.11
t|v322.1@t|k88711.1     0.2
t|v326.1@t|k85939.1     0.5
t|v326.1@t|k89421.1     0.33

-----
 Details for datamash:
datamash        command-line calculations (man)
Path    : /usr/bin/datamash
Version : 1.0.6
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help    : probably available with -h,--help
Repo    : Debian 8.8 (jessie) 
Home    : https://savannah.gnu.org/projects/datamash/ (pm)

As noted, datamash can be found in some repositories and also at the GNU site.
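
If the arithmetic detail is wanted after all, datamash's collapse operation (a comma-separated list of the values in each group) might get close. The following is a sketch under assumptions: it presumes an installed datamash that supports collapse, the function name sum_ids_datamash is made up, and the order of values inside a group after --sort may not match the input order.

```shell
# Hypothetical helper: reads tab-separated "id value" lines on stdin and prints
# id, "+"-joined values, and sum. Assumes GNU datamash with the collapse operation.
sum_ids_datamash() {
  LC_ALL=C datamash --sort --group=1 collapse 2 sum 2 |
    awk -F'\t' -v OFS='\t' '{ gsub(/,/, "+", $2); print }'   # turn 0.1,0.5 into 0.1+0.5
}
```

The awk step only rewrites the separators in the collapsed column; the grouping and summing are still done by datamash.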

Best wishes ... cheers, drl