Standardization

Hi,

I am trying to do standardization (subtracting the mean from original value and dividing by stdev) using an automated script.
I have a file with 10000 rows. I need to initially calculate mean and stdev for each column and then using those values I have to obatin the new standardized values.. I can do this pretty easily in excel.. But I am looking for an automated script.

input

DOTR1	10.29006	10.06744	10.47105	10.05041	10.18407	9.770205	10.90548	10.75112
RCC2	6.699481	7.240353	7.263434	6.654058	6.86063	7.151931	6.796337	6.78525
HHPA6	7.31182	7.547056	8.338827	7.278408	7.545548	7.409964	7.149899	7.300342
PAX8	8.336847	8.651292	8.493323	8.5056	8.445139	8.651406	8.664237	8.56571
ACA1A	4.233111	4.320666	4.232803	4.390224	4.269969	4.314899	4.264211	4.142419
UBA7	8.196608	8.164725	7.361889	8.055019	8.882745	7.6884	7.835754	8.354209
OOA	5.098222	5.212986	5.301191	5.211401	5.13133	5.153725	5.269111	5.195991
ACX1	4.875679	5.01305	4.921618	4.930978	4.899562	4.92918	4.970339	4.986362

The mean for column 1 is 6.880 and stdev is 2.066

I will now subtract the mean from my observation and divide by stdev to (10.29006-6.880)/2.066. I will do this on all subsequent observations row-wise in column 1. For column 2 again I will find its mean and corresponding stdev and follow the same procedure.

Thanks,

In bash, you could read each record into a simple array with read -a, put the first column values in a simple array and put the numeric values in an associative array that mimics a two dimensional array by keying it with "row:col". Then you can walk the rows column by column and do your calculations and modifications. When done, you regenerate the file from the arrays for rows 0-n, columns 0-m. I suppose awk can do this, too.

Hi,

Thanks for the reply.. But I am not familiar with bash scripting as I am with awk, but anyhow i will try the algorithm you mentioned.

Thanks

You could use one dimensional arrays tricked into another dimension by concatenation: $row$col, as $col is always one digit. typeset'ing variables can prevent repeated conversion from numeric to character form and back. For problems like this, life is simpler in PERL, C/C++, JAVA.

Hello
with examples and numbers you provide i wrote this code .
my shell is bourne . save the code in a file called diya123 and chmod it to 755 :

chmod 755 diya123

save your data in a file and , i assume that program and file both are in currrent working directory , invoke the program with

./diya123 filename
case $# in
        1) ;;
        *) echo 'Usage: diya123 file' 1>&2; exit 1;;
esac

test -r "$1" || { echo "diya123: can't open '$1'" 1>&2; exit 2; }

( cat  "$1" | awk '
                { for (i=2; i<=9; i++) sum+=$i; }
                END { printf "mean"; for (i=2; i<=9; i++) printf "\t%.6f", sum/NR ; printf "\n"; }
                ' | tee /dev/tty
 cat "$1" ) |

                awk '
                        NR == 1 { for(i=2;i<=9;i++) mean=$i }
                        NR == 2 { for(i=2;i<=9;i++) obser=$i }
                        NR != 1 { for(i=2;i<=9;i++) std += ($i-mean) ** 2  }

                        END {
                                printf "stddev";
                                for(i=2;i<=9;i++) {
                                        stddev=sqrt(std/(NR-2));
                                        printf "\t%.6f", stddev;
                                }
                                printf "\n";

                                printf "sub";
                                for(i=2;i<=9;i++)
                                        printf "\t%.6f", (obser - mean)/stddev;
                                printf "\n";
                        }'

Hi mstafreshi,

Thank you so much.. I tried using your code on the example data I provided, but in the output it prints only the first row..

Here is the output

[dvaka@cabiopeds new]$ vi test_file
[diya@cab new]$ vi diya123
[diya@cab new]$ chmod 755 diya123
[diya@cab new]$ ./diya123 test_file
mean    6.880228        7.027196        7.048017        6.884512        7.0273746.883714        6.981921        7.010175
stddev  2.066795        2.005844        2.108336        1.966369        2.1192771.922232        2.183782        2.202813
sub     1.649817        1.515693        1.623571        1.610022        1.4895161.501635        1.796681        1.698258
[diya@cab new]$

How can I get the result for all 10,000 rows.

Thanks,

Hello
I don't know your bussiness and i has not work with excell !
I just test this script with one line of example you provided . can this scripts works as expected?

case $# in
        1) ;;
        *) echo 'Usage: diya123 file' 1>&2; exit 1;;
esac

test -r "$1" || { echo "diya123: can't open '$1'" 1>&2; exit 2; }

( cat  "$1" | awk '
                { for (i=2; i<=9; i++) sum+=$i; }
                END { printf "mean4col"; for (i=2; i<=9; i++) printf "\t%.6f", sum/NR; printf "\n"}
                '
 cat "$1" ) | ( # tee /dev/tty
                awk '
                        NR == 1 { print; for(i=2;i<=9;i++) mean=$i }
                        NR != 1 { for(i=2;i<=9;i++) std += ($i-mean) ** 2  }

                        END {
                                printf "stddev4col";
                                for(i=2;i<=9;i++) {
                                        stddev=sqrt(std/(NR-2));
                                        printf "\t%.6f", stddev;
                                }
                                printf "\n";
                        }'; cat "$1" )| awk '
                                NR == 1 { print; for(i=2;i<=9;i++) mean=$i }
                                NR == 2 { print; for(i=2;i<=9;i++) std=$i }

                                NR != 1 && NR != 2 {
                                        printf "sub[%s]", $1;
                                        for(i=2;i<=9;i++)
                                                printf "\t%.6f", ($i - mean)/std;
                                        printf "\n";
                                }'

What platform are you on? I usually flee sh and csh for ksh93 or bash.