Hi,
I am trying to do standardization (subtracting the mean from original value and dividing by stdev) using an automated script.
I have a file with 10000 rows. I need to initially calculate mean and stdev for each column and then using those values I have to obatin the new standardized values.. I can do this pretty easily in excel.. But I am looking for an automated script.
input
DOTR1 10.29006 10.06744 10.47105 10.05041 10.18407 9.770205 10.90548 10.75112
RCC2 6.699481 7.240353 7.263434 6.654058 6.86063 7.151931 6.796337 6.78525
HHPA6 7.31182 7.547056 8.338827 7.278408 7.545548 7.409964 7.149899 7.300342
PAX8 8.336847 8.651292 8.493323 8.5056 8.445139 8.651406 8.664237 8.56571
ACA1A 4.233111 4.320666 4.232803 4.390224 4.269969 4.314899 4.264211 4.142419
UBA7 8.196608 8.164725 7.361889 8.055019 8.882745 7.6884 7.835754 8.354209
OOA 5.098222 5.212986 5.301191 5.211401 5.13133 5.153725 5.269111 5.195991
ACX1 4.875679 5.01305 4.921618 4.930978 4.899562 4.92918 4.970339 4.986362
The mean for column 1 is 6.880 and stdev is 2.066
I will now subtract the mean from my observation and divide by stdev to (10.29006-6.880)/2.066. I will do this on all subsequent observations row-wise in column 1. For column 2 again I will find its mean and corresponding stdev and follow the same procedure.
Thanks,
In bash, you could read each record into a simple array with read -a, put the first column values in a simple array and put the numeric values in an associative array that mimics a two dimensional array by keying it with "row:col". Then you can walk the rows column by column and do your calculations and modifications. When done, you regenerate the file from the arrays for rows 0-n, columns 0-m. I suppose awk can do this, too.
Hi,
Thanks for the reply.. But I am not familiar with bash scripting as I am with awk, but anyhow i will try the algorithm you mentioned.
Thanks
You could use one dimensional arrays tricked into another dimension by concatenation: $row$col, as $col is always one digit. typeset'ing variables can prevent repeated conversion from numeric to character form and back. For problems like this, life is simpler in PERL, C/C++, JAVA.
Hello
with examples and numbers you provide i wrote this code .
my shell is bourne . save the code in a file called diya123 and chmod it to 755 :
chmod 755 diya123
save your data in a file and , i assume that program and file both are in currrent working directory , invoke the program with
./diya123 filename
case $# in
1) ;;
*) echo 'Usage: diya123 file' 1>&2; exit 1;;
esac
test -r "$1" || { echo "diya123: can't open '$1'" 1>&2; exit 2; }
( cat "$1" | awk '
{ for (i=2; i<=9; i++) sum+=$i; }
END { printf "mean"; for (i=2; i<=9; i++) printf "\t%.6f", sum/NR ; printf "\n"; }
' | tee /dev/tty
cat "$1" ) |
awk '
NR == 1 { for(i=2;i<=9;i++) mean=$i }
NR == 2 { for(i=2;i<=9;i++) obser=$i }
NR != 1 { for(i=2;i<=9;i++) std += ($i-mean) ** 2 }
END {
printf "stddev";
for(i=2;i<=9;i++) {
stddev=sqrt(std/(NR-2));
printf "\t%.6f", stddev;
}
printf "\n";
printf "sub";
for(i=2;i<=9;i++)
printf "\t%.6f", (obser - mean)/stddev;
printf "\n";
}'
Hi mstafreshi,
Thank you so much.. I tried using your code on the example data I provided, but in the output it prints only the first row..
Here is the output
[dvaka@cabiopeds new]$ vi test_file
[diya@cab new]$ vi diya123
[diya@cab new]$ chmod 755 diya123
[diya@cab new]$ ./diya123 test_file
mean 6.880228 7.027196 7.048017 6.884512 7.0273746.883714 6.981921 7.010175
stddev 2.066795 2.005844 2.108336 1.966369 2.1192771.922232 2.183782 2.202813
sub 1.649817 1.515693 1.623571 1.610022 1.4895161.501635 1.796681 1.698258
[diya@cab new]$
How can I get the result for all 10,000 rows.
Thanks,
Hello
I don't know your bussiness and i has not work with excell !
I just test this script with one line of example you provided . can this scripts works as expected?
case $# in
1) ;;
*) echo 'Usage: diya123 file' 1>&2; exit 1;;
esac
test -r "$1" || { echo "diya123: can't open '$1'" 1>&2; exit 2; }
( cat "$1" | awk '
{ for (i=2; i<=9; i++) sum+=$i; }
END { printf "mean4col"; for (i=2; i<=9; i++) printf "\t%.6f", sum/NR; printf "\n"}
'
cat "$1" ) | ( # tee /dev/tty
awk '
NR == 1 { print; for(i=2;i<=9;i++) mean=$i }
NR != 1 { for(i=2;i<=9;i++) std += ($i-mean) ** 2 }
END {
printf "stddev4col";
for(i=2;i<=9;i++) {
stddev=sqrt(std/(NR-2));
printf "\t%.6f", stddev;
}
printf "\n";
}'; cat "$1" )| awk '
NR == 1 { print; for(i=2;i<=9;i++) mean=$i }
NR == 2 { print; for(i=2;i<=9;i++) std=$i }
NR != 1 && NR != 2 {
printf "sub[%s]", $1;
for(i=2;i<=9;i++)
printf "\t%.6f", ($i - mean)/std;
printf "\n";
}'
What platform are you on? I usually flee sh and csh for ksh93 or bash.