Script for finding standard deviation

RJ17 · September 11, 2008, 9:55am

I have a CSV file that looks like

 
0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0
10,11,7,0,4,12,2,3,7,0,11,3,12,4,0,5,5,4,5,0,8,6,12,0,9,3,3,0,2,7,8
19,11,7,0,4,14,16,10,8,2,13,7,15,6,0,76,6,4,10,0,18,10,17,1,11,3,3,0,9,9,8
22,11,13,1,5,14,16,10,9,10,13,7,16,6,0,59,6,4,10,0,18,13,17,1,11,3,3,0,12,9,10
22,11,13,1,5,14,16,10,9,10,13,7,16,6,22,90,6,4,10,0,18,13,17,1,11,3,4,0,12,9,10
41,18,27,9,27,41,59,20,27,54,63,34,28,43,40,131,7,8,19,0,62,16,30,23,25,3,4,9,24,12,19
42,18,27,9,27,41,59,20,27,55,68,36,28,46,41,132,7,8,19,13,64,16,31,25,25,3,4,9,24,12,19
125,124,78,62,97,87,145,70,87,119,150,124,99,95,41,175,85,58,57,88,142,83,92,102,107,80,45,64,64,94,89
125,126,78,62,99,87,145,70,87,119,161,124,99,95,41,175,85,58,58,88,142,84,112,103,108,80,68,64,65,98,89
189,254,164,153,192,153,230,132,188,163,210,210,167,198,93,235,146,110,97,130,211,107,181,140,151,119,105,105,178,126,165
189,324,168,192,194,159,233,132,192,169,244,210,167,201,103,235,147,152,180,181,213,107,192,190,212,119,119,126,195,126,166
189,324,168,255,194,225,233,141,192,230,244,260,167,201,172,283,181,206,217,216,261,107,192,235,212,119,169,197,264,189,229
366,438,315,319,382,287,398,320,416,382,407,397,342,448,276,392,297,368,237,347,336,332,384,405,412,284,329,350,396,326,356

I need to find the standard deviation for each individual row. Here is the code I have so far. I can't get the square root to work, and I can't get any floating-point numbers.

 
for i in `cat file.csv ` 
do
     x1=0
     x2=0
     sigma=0
     IFS=, 
     for f in $i 
          do  
          let x1=$x1+$f
          let x2=$f*$f+$x2
     done 
     let x1=$x1/30
     let x2=$x2/30
     let sigma=sqrt($x2-$x1*$x1)
     echo "Mean = " $x1
     echo "Standard Deviation = " $sigma
done

jim_mcnamara · September 11, 2008, 10:09am

The shell only does integer arithmetic. You need to use awk, perl, or some other environment that supports floating-point (FP) operations.
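To make the difference concrete, here is a minimal sketch comparing the two. The function names int_div and fp_div are made up for this illustration; they are not part of anyone's script in this thread.

```shell
# Integer division in the shell: the fractional part is simply dropped.
int_div() {
    echo $(( $1 / $2 ))
}

# The same division handed to awk, which does its arithmetic in floating point.
# The values are passed in with -v so no shell quoting tricks are needed.
fp_div() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

int_div 7 2    # prints 3
fp_div 7 2     # prints 3.50
```

awk also has a built-in sqrt(), which shell arithmetic does not provide.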

RJ17 · September 11, 2008, 10:14am

OK. Can anyone help me rewrite my script above in awk or perl? awk would be preferred. Thanks.

jim_mcnamara · September 11, 2008, 10:22am

using your algorithm.... in awk which supports FP.

awk -F','  '{ sum=0; sumsq=0;
                for(i=1; i<=NF;i++) {sum+=$i; sumsq+=$i*$i}
                printf("mean=%f  stddev= %f\n", sum/NF, sqrt(sumsq - (sum*sum)) )
              } ' file.csv

joeyg · September 11, 2008, 10:38am

Depending on the accuracy required, you might consider
(a) For each of your values, multiply by 100 or 1000 before beginning any math. Then remember to scale the result back at the end: the extra digits are the ones after the decimal point. For example 3/2 = 1 in integer, but 300/2 = 150, or adjusted 1.50
(b) An approximation for square root can be done in two parts. First off, add up all odd numbers until you are greater than the starting number. For example, sqrt of 10 would give you 1+3+5+7 and those four pieces are greater than the 10 you started with, so sqrt=3 (one less) as integer. Perhaps easier to see in the following [to get the integer part]
1 sqrt = 1+3 (more), so one digit is 1
2 sqrt = 1+3 (more), so 1
3 sqrt = 1+3 (more), so 1
4 sqrt = 1+3+5 (more), so 2 (again think one less)
5 sqrt = 1+3+5 (more), so 2
...
9 sqrt = 1+3+5+7 (more), so 3
To get to the decimal part there is another strange methodology involving looking at remainders. In short, the sqrt of 5 starts off with a 2 as seen above. Adding 1+3+5 gives 9, and that is 4 too many (9-5). My last number in the 1+3+5 was a 5, and if I have 4 too many, I only needed 1 of it (5-4=1). Take the 1 and the 5 and do 1/5 = .2
Add the first 2 to the .2 and you get 2.2 vs. actual of 2.23

For 8, start with the 2 as the integer. The sum 1+3+5=9 is 1 too many (9-8). My last number was 5 again (in 1+3+5), so I only needed 4. Take that 4 and 5 to get to 4/5 = .8
Add the first 2 to this .8 and you get 2.8 vs actual 2.82

This is normally within a couple hundredths of the pure answer.

***
And I knew by the time I could write all that up, someone would have a program solution. But what the heck, if you can follow the logic of what I wrote for approximating sqrt, then you might agree it to be a cool function!
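
For anyone who wants to try the odd-number method in plain shell arithmetic, here is a rough sketch of how I read the steps above. The function name sqrt_approx and its variable names are invented for this example, and the fraction is scaled by 100 as in tip (a), so the result has two decimals. It is an approximation only, not a replacement for a real sqrt().

```shell
# sqrt_approx N: approximate sqrt(N) for a non-negative integer N
# using only integer arithmetic (hypothetical helper, for illustration).
sqrt_approx() {
    n=$1
    odd=1      # next odd number to add
    total=0    # running sum 1+3+5+...
    count=0    # how many odd numbers have been added so far
    # Keep adding odd numbers until the running sum is greater than n.
    while [ "$total" -le "$n" ]; do
        total=$(( total + odd ))
        last=$odd
        odd=$(( odd + 2 ))
        count=$(( count + 1 ))
    done
    int_part=$(( count - 1 ))          # "one less" than the number of terms
    excess=$(( total - n ))            # how far the sum overshot n
    needed=$(( last - excess ))        # part of the last odd number actually needed
    frac=$(( needed * 100 / last ))    # needed/last, scaled by 100 (integer division)
    printf '%d.%02d\n' "$int_part" "$frac"
}

sqrt_approx 5    # prints 2.20
sqrt_approx 8    # prints 2.80
```

These match the two worked examples above (2.2 for 5 and 2.8 for 8).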

RJ17 · September 11, 2008, 10:59am

jim mcnamara:

using your algorithm.... in awk which supports FP.

awk -F','  '{ sum=0; sumsq=0;
   for(i=1; i<=NF;i++) {sum+=$i; sumsq+=$i*$i}
   printf("mean=%f  stddev= %f\n", sum/NF, sqrt(sumsq - (sum*sum)) )
   } ' file.csv

This awk code throws the following error.
awk: The sqrt parameter to a math library function is not in the domain.

This means that the portion to find the average works fine but because sqrt throws an error the std deviation does not work. I think this is because sumsq - (sum*sum) is a negative number.
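
A quick way to check that suspicion is to print the two terms separately. The sketch below does that for the first row of the file; show_terms is just a name picked for this example. Since the population variance is sumsq/N - (sum/N)^2, subtracting the un-normalized sum*sum will usually swamp sumsq.

```shell
# Print sumsq, sum*sum and their difference for each comma-separated row on stdin.
show_terms() {
    awk -F',' '{ sum = 0; sumsq = 0
                 for (i = 1; i <= NF; i++) { sum += $i; sumsq += $i * $i }
                 printf "sumsq=%d sum*sum=%d diff=%d\n", sumsq, sum * sum, sumsq - sum * sum }'
}

# First row of the CSV file from the original post.
echo "0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0" | show_terms
# prints: sumsq=8 sum*sum=36 diff=-28
```

The difference is negative, so sqrt() is being handed a value outside its domain.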

jim_mcnamara · September 11, 2008, 12:27pm

You are correct. I copied your algorithm - it needs checks.

awk -F','  '{ sum=0; sumsq=0;
                for(i=1; i<=NF;i++) {sum+=$i; sumsq+=$i*$i}
                printf("mean=%f  stddev= %f\n", sum/NF, 
                sqrt( ((sumsq - (sum*sum)) < 0) ? (sumsq - (sum*sum))*-1 : sumsq - (sum*sum) ) )
              } ' file.csv

This should prevent domain errors.... the fact that there are a lot of zero values means the sum of squares can be a very small number. You could also use a function like this, defined before the main awk action block (awk does not allow a function definition inside an action)
function abs(n) { return (n < 0) ? -n : n }

awk -F','  'function abs(n) { return (n < 0) ? -n : n }
              { sum=0; sumsq=0;
                for(i=1; i<=NF;i++) {sum+=$i; sumsq+=$i*$i}
                printf("mean=%f  stddev= %f\n", sum/NF, sqrt(abs(sumsq - (sum*sum))) )
              } ' file.csv

RJ17 · September 11, 2008, 1:04pm

Jim, thanks for all your help.
Here is the final awk code I am using. It is subject to rounding errors but it is close enough for my needs.

 
awk -F','  '{ sum=0; sumsq=0;
for(i=1; i<=NF;i++) {sum+=$i; sumsq+=$i*$i} 
printf("sumsq=%f  sum=%f  mean=%f  stddev= %f\n",sumsq, sum, sum/NF, sqrt((sumsq-(NF*((sum/NF))*(sum/NF)))/NF) ) } ' test
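
If the rounding errors ever become a concern, one common alternative is a two-pass calculation per row: compute the mean first, then sum the squared deviations from it, which avoids subtracting two large, nearly equal numbers. This is only a sketch and not the code used above; row_stats is a made-up name, and the input line below is a small illustrative row rather than data from the file. Like the one-liner above, it divides by NF, so it gives the population standard deviation (divide by NF-1 instead for the sample version).

```shell
# Two-pass mean and population standard deviation for each CSV row on stdin.
row_stats() {
    awk -F',' 'NF > 0 {
        sum = 0
        for (i = 1; i <= NF; i++) sum += $i        # first pass: the mean
        mean = sum / NF
        ss = 0
        for (i = 1; i <= NF; i++) {                # second pass: squared deviations
            d = $i - mean
            ss += d * d
        }
        printf "mean=%f stddev=%f\n", mean, sqrt(ss / NF)
    }'
}

echo "2,4,4,4,5,5,7,9" | row_stats
# prints: mean=5.000000 stddev=2.000000
```

To run it on the real file, something like row_stats < test should work. Because ss is a sum of squares it can never be negative, so no abs() guard is needed.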