Sum elements of 2 arrays excluding labels

Geneanalyst · March 24, 2018, 8:49pm

I'm looking for an efficient way to sum elements from 2 arrays using AWK and preserve header as well as sample names in the output array. I have Ubuntu 16.04 LTS. For example;

ARRAY 1

SAMPLE    DERIVED    ANCESTRAL
Sample1    14352    0
Sample2    14352    0
Sample3    14352    0
Sample4    9880    4472
Sample5    9786    4566
Sample6    9846    4506
Sample7    9787    4565
Sample8    9800    4552
Sample9    9764    4588
Sample10    9760    4592
Sample11    9691    4661
Sample12    9798    4554
Sample13    9740    4612

ARRAY 2:

SAMPLE    DERIVED    ANCESTRAL
Sample1    14352    0
Sample2    14352    0
Sample3    14352    0
Sample4    13674    678
Sample5    13749    603
Sample6    13701    651
Sample7    13682    670
Sample8    13677    675
Sample9    13684    668
Sample10    13674    678
Sample11    13642    710
Sample12    13679    673
Sample13    13713    639

DESIRED OUTPUT ARRAY:

SAMPLE    TOTAL DERIVED    TOTAL ANCESTRAL
Sample1    28704    0
Sample2    28704    0
Sample3    28704    0
Sample4    23554    5150
Sample5    23535    5169
Sample6    23547    5157
Sample7    23469    5235
Sample8    23477    5227
Sample9    23448    5256
Sample10    23434    5270
Sample11    23333    5371
Sample12    23477    5227
Sample13    23453    5251

RavinderSingh13 · March 24, 2018, 11:34pm

Hello Geneanalyst,

Could you please try following and let me know if this helps you.

 awk 'BEGIN{print "SAMPLE    TOTAL DERIVED    TOTAL ANCESTRAL"} FNR==NR{a[$1]=$2;b[$1]=$3;next} FNR>1{$2=$1 in a?$2+a[$1]:$2;$3=$1 in b?$3+b[$1]:$3;print}' array1  array2 | column -t

Where I am considering that array1 and array2 are Input_file(s) names which you mentioned here.

Thanks,
R. Singh

Geneanalyst · March 25, 2018, 12:32am

Thanks Ravinder, works good!

What if instead of 2 arrays one needs to sum 3 arrays; array1 array2 array3.

Scrutinizer · March 25, 2018, 3:03am

Hi, try:

awk '{split($0,F)} getline<f>0 && NR>1{$2+=F[2]; $3+=F[3]}1' f=array1 array2

Or you could try an approach like this which should work with 2 or more arrays:

awk '
  FNR==1 {
    if(NR==1)
      print
    next
  }
  {
    S[FNR]=$1
    D[$1]+=$2
    A[$1]+=$3
  }
  END {
    for(i=2; i<=FNR; i++)
      print S, D, A
  }
' array1 array2 ... arrayn

RudiC · March 25, 2018, 3:13am

Try also

awk '
BEGIN   {FMT = "%-13s%-17s%-17s\n"
         printf FMT, "SAMPLE", "TOTAL DERIVED", "TOTAL ANCESTRAL"
        }
FNR > 1 {SEQ[FNR]  = $1
         SUMD[$1] += $2
         SUMA[$1] += $3
        }
END     {for (i=2; i<=FNR; i++) printf FMT, SEQ, SUMD[SEQ], SUMA[SEQ]
        }
'  array[12]
SAMPLE       TOTAL DERIVED    TOTAL ANCESTRAL  
Sample1      28704            0                
Sample2      28704            0                
Sample3      28704            0                
Sample4      23554            5150             
Sample5      23535            5169             
Sample6      23547            5157             
Sample7      23469            5235             
Sample8      23477            5227             
Sample9      23448            5256             
Sample10     23434            5270             
Sample11     23333            5371             
Sample12     23477            5227             
Sample13     23453            5251

Geneanalyst · March 27, 2018, 12:50am

rudic:

Try also

awk '
BEGIN   {FMT = "%-13s%-17s%-17s\n"
   printf FMT, "SAMPLE", "TOTAL DERIVED", "TOTAL ANCESTRAL"
   }
FNR > 1 {SEQ[FNR]  = $1
   SUMD[$1] += $2
   SUMA[$1] += $3
   }
END     {for (i=2; i<=FNR; i++) printf FMT, SEQ, SUMD[SEQ], SUMA[SEQ]
   }
'  array[12]
SAMPLE       TOTAL DERIVED    TOTAL ANCESTRAL  
Sample1      28704            0                
Sample2      28704            0                
Sample3      28704            0                
Sample4      23554            5150             
Sample5      23535            5169             
Sample6      23547            5157             
Sample7      23469            5235             
Sample8      23477            5227             
Sample9      23448            5256             
Sample10     23434            5270             
Sample11     23333            5371             
Sample12     23477            5227             
Sample13     23453            5251

Hey RudiC,

Could you explain the various steps of your code whenever you have time.

Don_Cragun · March 28, 2018, 6:34am

Here is a copy of RudiC's code with comments added describing what each section of code is doing:

awk '	# Invoke awk and start script to be run by awk.
BEGIN	{# Before any lines are read from any input file, set the format string
	 # to be used by all print statements in this script and print the output
	 # header line.
	 FMT = "%-13s%-17s%-17s\n"
	 printf FMT, "SAMPLE", "TOTAL DERIVED", "TOTAL ANCESTRAL"
	}
FNR > 1	{# For all lines in each input file except the header line, save the
	 # sample name associated with that input line and accumulate the derived
	 # and ancestral values associated with that sample name.
	 SEQ[FNR]  = $1
	 SUMD[$1] += $2
	 SUMA[$1] += $3
	}
END	{# After all input files have been processed, for each line found in the
	 # last input file, print the sample name and the accumulated derived
	 # and ancestral totals.
	 for (i=2; i<=FNR; i++) printf FMT, SEQ, SUMD[SEQ], SUMA[SEQ]
	}
' array[12]	# End the awk script and list of files to be processed.

drl · March 29, 2018, 7:41am

Hi.

I like awk solutions. However, I also like packaged solutions. In this case GNU datamash can do the grouping and summing with:

datamash -g 1 sum 2,3

which will sum fields 2 and 3 for items in groups of field 1.

However simple as this appears, there are additional complexities. First, datamash , as with many standard utilities, likes TAB-delimited files by default. Although headers can be ignored, we can combine replacing runs of spaces with a TAB as well as deleting headers with a sed operation. So we can append all modified input files to a single input file, which is also what datamash likes.

As you can imagine, it is best and easiest when the lines for the group operation are collected together. There is a datamash option for such sorting, but your choice of group names are mixed alphabetic and numeric -- perhaps called a hybrid string. A program that can handle that is msort .

This data preparation can be combined into a loop that can handle a number of data files. Here we have added 3 additional data files as an illustration. The script uses as input all file names that begin with the string data -- data1, data2, etc.

Then we can run the command as noted above.

If we want to make the output pretty, we can add a header, and use a simple perl script called align , which aligns fields automatically, but can also be directed to align left, center, right, etc.

With all that in mind, here is a script that shows these operations and the results:

#!/usr/bin/env bash

# @(#) s2       Demonstrate grouping, summing fields, many files, datamash.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C specimen datamash msort align

FILES=${1-data*}
E=expected-output

rm all-data
pl " Input data files $FILES:"
head -n 5  $FILES

pl " Sample of file collection, TABBED, stripped header, etc.:"
for file in data*
do
  sed '1d;2,$s/  */\t/g' $file >> all-data
done 
specimen 4:4:4 all-data

pl " Expected output:"
cat $E

pl " Results:"
echo "SAMPLE    TOTAL DERIVED   TOTAL ANCESTRAL" > f1
msort -j -q -l -n 1,1 -c hybrid all-data |
datamash -g 1 sum 2,3 |
tee -a f1

pl " Beautify results:"
align -alrr f1

exit 0

producing:

$ ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-5-amd64, x86_64
Distribution        : Debian 8.9 (jessie) 
bash GNU bash 4.3.30
specimen (local) 1.17
datamash (GNU datamash) 1.2
msort 8.53
align 1.7.0

-----
 Input data files data*:
==> data1 <==
SAMPLE    DERIVED    ANCESTRAL
Sample1    14352    0
Sample2    14352    0
Sample3    14352    0
Sample4    9880    4472

==> data2 <==
SAMPLE    DERIVED    ANCESTRAL
Sample1    14352    0
Sample2    14352    0
Sample3    14352    0
Sample4    13674    678

==> data3 <==
SAMPLE    DERIVED    ANCESTRAL
Sample14        -1      -1

==> data4 <==
SAMPLE    DERIVED    ANCESTRAL
Sample14        -3      -3

==> data5 <==
SAMPLE    DERIVED    ANCESTRAL
Sample14        4       4

-----
 Sample of file collection, TABBED, stripped header, etc.:
Edges: 4:4:4 of 29 lines in file "all-data"
Sample1 14352   0
Sample2 14352   0
Sample3 14352   0
Sample4 9880    4472
   ---
Sample1 14352   0
Sample2 14352   0
Sample3 14352   0
Sample4 13674   678
   ---
Sample13        13713   639
Sample14        -1      -1
Sample14        -3      -3
Sample14        4       4

-----
 Expected output:
SAMPLE    TOTAL DERIVED    TOTAL ANCESTRAL
Sample1    28704    0
Sample2    28704    0
Sample3    28704    0
Sample4    23554    5150
Sample5    23535    5169
Sample6    23547    5157
Sample7    23469    5235
Sample8    23477    5227
Sample9    23448    5256
Sample10    23434    5270
Sample11    23333    5371
Sample12    23477    5227
Sample13        23453   5251
Sample14        0       0

-----
 Results:
Sample1 28704   0
Sample2 28704   0
Sample3 28704   0
Sample4 23554   5150
Sample5 23535   5169
Sample6 23547   5157
Sample7 23469   5235
Sample8 23477   5227
Sample9 23448   5256
Sample10        23434   5270
Sample11        23333   5371
Sample12        23477   5227
Sample13        23453   5251
Sample14        0       0

-----
 Beautify results:
SAMPLE   TOTAL DERIVED TOTAL ANCESTRAL
Sample1          28704               0
Sample2          28704               0
Sample3          28704               0
Sample4          23554            5150
Sample5          23535            5169
Sample6          23547            5157
Sample7          23469            5235
Sample8          23477            5227
Sample9          23448            5256
Sample10         23434            5270
Sample11         23333            5371
Sample12         23477            5227
Sample13         23453            5251
Sample14             0               0

Here are some details about the utilities used:

datamash        command-line calculations (man)
Path    : /usr/local/bin/datamash
Version : 1.2
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYS ...)
Help    : probably available with -h,--help
Repo    : Debian 8.9 (jessie) 
Home    : https://savannah.gnu.org/projects/datamash/ (pm)
Home    : http://www.gnu.org/software/datamash (doc)

msort   sort records in complex ways (man)
Path    : /usr/bin/msort
Version : 8.53
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYS ...)
Repo    : Debian 8.9 (jessie) 
Home    : http://www.billposer.org/Software/msort.html (pm)
Home    : http://billposer.org/Software/msort.html (doc)

align   Align columns of text. (what)
Path    : ~/p/stm/common/scripts/align
Version : 1.7.0
Length  : 270 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Home    : http://kinzler.com/me/align/ (doc)
Modules : (for perl codes)
 Getopt::Std    1.10

Best wishes ... cheers, drl

Geneanalyst · March 29, 2018, 2:37pm

drl:

Hi.