I am looking for a solution to the following problem. I have a very large file that looks something like this:
Each group of three numbers on each line is a set of three probabilities that sum to one.
I want to output the maximum for each group of three. So desired output would be:
or
I've written the following kludge, which works just fine on a small subset of this data but does not scale up well (because it involves an ugly triple-nested for loop):
#!/bin/bash
length=$(awk '{n++}END{print n}' file )
width=$(awk 'NR == 1 { print NF }' file )
width=$( expr $width / 3 )
for ((i=1; i <= length; i++))
do
for ((j=1; j <= width; j++))
do
a=$(expr 1 + \( \( $j - 1 \) \* 3 \) )
b=$(expr $a + 2)
awk -v i=$i -v a=$a -v b=$b 'NR == i { for (k = a; k <= b; k++ )if ( $k > max) max = $k } END { print max}' file >> out
done
done
Does anyone know of a solution that will scale up well? Can it be accomplished with basic Unix utilities/shell scripting?
It's not the ugly triply-nested for loop that's making it slow; it's the launching of multiple external processes per iteration. It's wasteful to run awk, grep, sed, and so forth on individual lines -- they are efficient when run over batches of data, but take time to start up and exit. Imagine being allowed to say only one word per telephone call... or, in your case, having to make 10,000 phone calls but with only one word of meaning in each.
You're also using externals like expr when your shell (probably) supports better ways of doing arithmetic. expr is inefficient for the reason listed above, though some shells have nothing better.
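For example, if your shell is bash or ksh (or any POSIX shell with arithmetic expansion), the two expr calls in your inner loop can be replaced with built-in $(( )) arithmetic, which forks no external process -- a sketch, using j=2 just as an illustrative value:

```shell
# Built-in arithmetic expansion instead of forking expr twice per iteration:
j=2
a=$(( (j - 1) * 3 + 1 ))   # index of the first field in group j
b=$(( a + 2 ))             # index of the last field in group j
echo "$a $b"
```

On a tight loop this alone saves two process launches per iteration, though it still leaves the per-line awk calls in place.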
How one could do this more efficiently depends on which utilities are available and which system you have. What are they?
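For instance, if awk is available (as it almost always is), a single awk process can make one pass over the whole file and handle every line and every group at once. This is a sketch, assuming the fields really come in non-overlapping groups of three; the inline printf stands in for your file as hypothetical sample data:

```shell
# One awk process reads all the data; for each line it prints the
# maximum of each non-overlapping group of three fields.
printf '0.1 0.2 0.7 0.25 0.25 0.5\n0.6 0.2 0.2 0.1 0.1 0.8\n' |
awk '{
    out = ""
    for (i = 1; i <= NF; i += 3) {
        max = $i
        if ($(i+1) > max) max = $(i+1)
        if ($(i+2) > max) max = $(i+2)
        out = out (i == 1 ? "" : " ") max
    }
    print out
}'
```

To run it on your actual file, replace the printf pipeline with `awk '...' file`. Because only one process is ever started, the cost is a single pass over the data rather than one awk launch per line per group.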