Hi.
I like awk solutions. However, I also like packaged solutions. In this case GNU datamash can do the grouping and summing with:
datamash -g 1 sum 2,3
which sums fields 2 and 3 within each group defined by field 1.
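For comparison, here is a minimal awk sketch of the same grouping and summing. It is not what the script below uses, and the function name group_sum is made up for illustration. It assumes whitespace-delimited lines with no header and integer values in fields 2 and 3.

```shell
# Sum fields 2 and 3 for each distinct value of field 1.
# Groups are printed in the order in which they first appear.
group_sum() {
  awk '
    !($1 in d) { order[++n] = $1 }   # remember first-seen order of group keys
    { d[$1] += $2; a[$1] += $3 }     # accumulate both columns per key
    END {
      for (i = 1; i <= n; i++)
        printf "%s\t%d\t%d\n", order[i], d[order[i]], a[order[i]]
    }
  '
}

# Two of the sample groups, taken from the data shown later in this post.
printf 'Sample1 14352 0\nSample4 9880 4472\nSample1 14352 0\nSample4 13674 678\n' |
  group_sum
```

Unlike datamash, this version does not need the lines of a group to be adjacent, because it keeps every running total in memory.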
Simple as this appears, there are some additional complexities. First, datamash, like many standard utilities, expects TAB-delimited files by default. Although datamash can ignore headers, a single sed operation can both delete the header line and replace runs of spaces with a TAB. We can then append all the modified input files to a single input file, which is also what datamash likes.
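As a small sketch of that sed step on its own (the function name strip_and_tab is hypothetical, and the sample lines are taken from the data shown later):

```shell
# Delete the header (line 1), then replace every run of spaces with one TAB.
# A literal TAB is built with printf because \t in a sed replacement is a
# GNU extension rather than POSIX.
strip_and_tab() {
  tab=$(printf '\t')
  sed -e '1d' -e "s/  */${tab}/g" "$@"
}

printf 'SAMPLE DERIVED ANCESTRAL\nSample4   9880  4472\n' | strip_and_tab
```

Note the two spaces before the star in the pattern: "one space, then zero or more spaces". With a single space the pattern would also match the empty string, and sed would insert a TAB between every pair of characters.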
As you can imagine, it is best and easiest when the lines for the group operation are collected together. There is a datamash option for such sorting, but your group names mix alphabetic and numeric characters (perhaps called hybrid strings), so a plain lexical sort would place Sample10 before Sample2. A program that can handle that is msort.
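If msort is not installed, one possible stand-in is the version sort of GNU sort. This is a swapped-in technique, not what the script below does. The sketch assumes GNU coreutils, and the function name hybrid_sort is made up for illustration.

```shell
# Assumption: GNU sort is available; -V ("version sort") compares the
# embedded numbers numerically. This is a stand-in for msort, not msort.
# -s keeps lines with equal keys in their original order.
hybrid_sort() {
  LC_ALL=C sort -s -k1,1V "$@"
}

printf 'Sample10 1 1\nSample2 2 2\nSample1 3 3\nSample2 4 4\n' | hybrid_sort
```

With this key the Sample2 lines come before Sample10, so the lines of each group end up adjacent and in the numeric order one would expect.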
This data preparation can be combined into a loop that can handle a number of data files. Here we have added 3 additional data files as an illustration. By default the script uses as input all files whose names begin with the string data (data1, data2, etc.).
Then we can run the command as noted above.
If we want to make the output pretty, we can add a header and use a simple perl script called align, which aligns fields automatically but can also be directed to align left, center, right, etc.
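For anyone without align, here is a rough awk sketch of the left, right, right alignment. It is not the align script and covers only this one layout. The function name align_lrr is hypothetical, and the input is assumed to be TAB-delimited.

```shell
# Left-align column 1 and right-align the remaining columns of
# TAB-delimited input, padding each column to its widest entry.
align_lrr() {
  awk -F '\t' '
    { rows[NR] = $0                          # keep every line for the second pass
      for (i = 1; i <= NF; i++)
        if (length($i) > w[i]) w[i] = length($i)
    }
    END {
      for (r = 1; r <= NR; r++) {
        n = split(rows[r], f, "\t")
        line = sprintf("%-" w[1] "s", f[1])  # column 1: left-aligned
        for (i = 2; i <= n; i++)             # other columns: right-aligned
          line = line sprintf("  %" w[i] "s", f[i])
        print line
      }
    }
  ' "$@"
}

printf 'SAMPLE\tTOTAL DERIVED\tTOTAL ANCESTRAL\nSample4\t23554\t5150\n' | align_lrr
```

The whole input is held in memory so that the column widths are known before anything is printed, which is fine for small summary tables like this one.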
With all that in mind, here is a script that shows these operations and the results:
#!/usr/bin/env bash
# @(#) s2 Demonstrate grouping, summing fields, many files, datamash.
# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C specimen datamash msort align
FILES=${1-data*}
E=expected-output
rm -f all-data    # -f: no error if the file does not exist yet
pl " Input data files $FILES:"
head -n 5 $FILES
pl " Sample of file collection, TABBED, stripped header, etc.:"
for file in $FILES
do
  # Delete the header, then turn each run of one or more spaces into a TAB.
  sed '1d;2,$s/  */\t/g' "$file" >> all-data
done
specimen 4:4:4 all-data
pl " Expected output:"
cat $E
pl " Results:"
echo "SAMPLE TOTAL DERIVED TOTAL ANCESTRAL" > f1
msort -j -q -l -n 1,1 -c hybrid all-data |
datamash -g 1 sum 2,3 |
tee -a f1
pl " Beautify results:"
align -alrr f1
exit 0
producing:
$ ./s2
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-5-amd64, x86_64
Distribution : Debian 8.9 (jessie)
bash GNU bash 4.3.30
specimen (local) 1.17
datamash (GNU datamash) 1.2
msort 8.53
align 1.7.0
-----
Input data files data*:
==> data1 <==
SAMPLE DERIVED ANCESTRAL
Sample1 14352 0
Sample2 14352 0
Sample3 14352 0
Sample4 9880 4472
==> data2 <==
SAMPLE DERIVED ANCESTRAL
Sample1 14352 0
Sample2 14352 0
Sample3 14352 0
Sample4 13674 678
==> data3 <==
SAMPLE DERIVED ANCESTRAL
Sample14 -1 -1
==> data4 <==
SAMPLE DERIVED ANCESTRAL
Sample14 -3 -3
==> data5 <==
SAMPLE DERIVED ANCESTRAL
Sample14 4 4
-----
Sample of file collection, TABBED, stripped header, etc.:
Edges: 4:4:4 of 29 lines in file "all-data"
Sample1 14352 0
Sample2 14352 0
Sample3 14352 0
Sample4 9880 4472
---
Sample1 14352 0
Sample2 14352 0
Sample3 14352 0
Sample4 13674 678
---
Sample13 13713 639
Sample14 -1 -1
Sample14 -3 -3
Sample14 4 4
-----
Expected output:
SAMPLE TOTAL DERIVED TOTAL ANCESTRAL
Sample1 28704 0
Sample2 28704 0
Sample3 28704 0
Sample4 23554 5150
Sample5 23535 5169
Sample6 23547 5157
Sample7 23469 5235
Sample8 23477 5227
Sample9 23448 5256
Sample10 23434 5270
Sample11 23333 5371
Sample12 23477 5227
Sample13 23453 5251
Sample14 0 0
-----
Results:
Sample1 28704 0
Sample2 28704 0
Sample3 28704 0
Sample4 23554 5150
Sample5 23535 5169
Sample6 23547 5157
Sample7 23469 5235
Sample8 23477 5227
Sample9 23448 5256
Sample10 23434 5270
Sample11 23333 5371
Sample12 23477 5227
Sample13 23453 5251
Sample14 0 0
-----
Beautify results:
SAMPLE TOTAL DERIVED TOTAL ANCESTRAL
Sample1 28704 0
Sample2 28704 0
Sample3 28704 0
Sample4 23554 5150
Sample5 23535 5169
Sample6 23547 5157
Sample7 23469 5235
Sample8 23477 5227
Sample9 23448 5256
Sample10 23434 5270
Sample11 23333 5371
Sample12 23477 5227
Sample13 23453 5251
Sample14 0 0
Here are some details about the utilities used:
datamash command-line calculations (man)
Path : /usr/local/bin/datamash
Version : 1.2
Type : ELF 64-bit LSB executable, x86-64, version 1 (SYS ...)
Help : probably available with -h,--help
Repo : Debian 8.9 (jessie)
Home : https://savannah.gnu.org/projects/datamash/ (pm)
Home : http://www.gnu.org/software/datamash (doc)
msort sort records in complex ways (man)
Path : /usr/bin/msort
Version : 8.53
Type : ELF 64-bit LSB executable, x86-64, version 1 (SYS ...)
Repo : Debian 8.9 (jessie)
Home : http://www.billposer.org/Software/msort.html (pm)
Home : http://billposer.org/Software/msort.html (doc)
align Align columns of text. (what)
Path : ~/p/stm/common/scripts/align
Version : 1.7.0
Length : 270 lines
Type : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Home : http://kinzler.com/me/align/ (doc)
Modules : (for perl codes)
Getopt::Std 1.10
Best wishes ... cheers, drl