Help with identify gradient and the coefficient of determination of a straight line

perl_beginner · April 3, 2014, 6:38am

Hi,

Do anybody experience using awk or perl command to identify gradient of a straight line and The coefficient of determination/R-square value of a chart (R2) ?

Below is my input file :

t1 2 3 5 8
t2 0 2 0 2
t3 1 1 1 1
t4 50 70 80 90
.
.
.

Desired Output :

t1 2 3 5 8 0.6986 0.999
t2 0 2 0 2 0.1279 0.2237
t3 1 1 1 1 1 #N/A
t4 50 70 80 90 3.8813 0.9426
.
.
.

From the input file, column 1 is item that I wanna to calculate their corresponding gradient and The coefficient of determination/R-square value.
Column 2, 3, 4, 5 is the value at 0, 3, 6, 10 seconds.
Column 6 and 7 in the desired output file is gradient and The coefficient of determination/R-square value of item in column 1.

As I know that microsoft excel able to calculate the gradient of a straight line and The coefficient of determination/R-square value of a scatter lot (R2)
when we display the equation of chart and display R-square value of a chart.

Because I have a long list of item wanna to calculate the gradient and The coefficient of determination/R-square value of a chart (R2).
Thus I just curious whether anybody experience to calculate it through awk/perl command.

Thanks for any advice.

blakeoft · April 3, 2014, 8:47am

I'm assuming that you want the gradient of the best fit line. Is that correct?

perl_beginner · April 3, 2014, 9:15am

Hi blakeoft,

Yup. You're right. I just wanna find the gradient of the best fit line. It might not necessary get exactly the same value as what I manual generated from microsoft excel.
As long as it able to generate the gradient of best fit line and R-square value are fine enough.

I have around 10k item wanna to calculate the gradient of best fit line and R-square value of each item.
I unable to manual do it one by one.
Thus hope that got other alternative way able to count it automatic.

Really thanks and appreciate your advice.

blakeoft · April 3, 2014, 9:44am

This will get you the slope of the least squares line.

>awk '{sumy=$2+$3+$4+$5} {sumxy=$3*3+$4*6+$5*10} {print $1,$2,$3,$4,$5,(4.0*sumxy-19*sumy)/(4.0*145-361)}' file.txt
t1 2 3 5 8 0.611872
t2 0 2 0 2 0.127854
t3 1 1 1 1 0
t4 50 70 80 90 3.88128

I'm not sure why our first slopes differs so much, but the slope of the third line is definitely zero. This awk line only works if you have four data points for each line. Otherwise, you'll need to make some modifications. Also, I don't know what the coefficient of determination is so I need to look it up. You might be able to do it yourself if you follow this example.

perl_beginner · April 3, 2014, 11:14am

Hi blakeoft,

Many thanks. I will try it out now.
Really thanks and appreciate your help.

I will find it out for the R square

blakeoft · April 3, 2014, 11:39am

Sorry if this looks awful. It seems to work though.

>awk '{sumy=$2+$3+$4+$5} {sumxy=$3*3+$4*6+$5*10} {sumy2=$2*$2+$3*$3+$4*$4+$5*$5}
{denomy=(4.0*sumy2-sumy*sumy)} {slope=(4.0*sumxy-19*sumy)/(4.0*145-361)}
{if(denomy!=0) print $0,slope,(4.0*sumxy-19*sumy)*(4.0*sumxy-19*sumy)/((4.0*145-361)*denomy) ;
else print $0,slope,"#N/A"}' file.txt

Output:

t1 2 3 5 8 0.611872 0.976082
t2 0 2 0 2 0.127854 0.223744
t3 1 1 1 1 0 #N/A
t4 50 70 80 90 3.88128 0.942596

You might want to consider looking into R (I prefer to use R Studio as an interface). It's pretty easy to use and does stuff like this. Just something to think about if you have more work along the same lines.

perl_beginner · April 3, 2014, 12:20pm

Hi blakeoft,

Really many thanks for your assist.
The awk code looks awesome to me.

I'm really appreciate your help.
I will "digest" the meaning of your awk code.
Thanks again and a lot.

It really useful for me

blakeoft · April 3, 2014, 1:18pm

I can help you digest it. I honestly meant to explain every solution that I post on this site.

Most of the stuff before "if (denomy != 0)" are just variables that are used to calculate the slopes and R squared values. I'm specifically talking about the first five sets of curly brackets. There are also several integers in the formulas that are fixed. 19 is the sum of the x values, 145 is the sum of the squares of the x values, 361 is the square of the sum of the x values and 4 is the number of data points you have (I use 4.0 to avoid integer division--I'm not sure if that's necessary). We check if denomy is equal to zero because it's the only thing responsible for division by zero and thus the #N/A. If denomy is not equal to zero, then we print the original file (that's what $0 means, and by the way, $1 means first field, $2 means second field and so on), then we print the slope, and then the R squared value. If denomy is zero, we do the same as if it weren't, except we don't print R squared, we just print "#N/A". You could put anything in place of "#N/A" that you want. Let me know if you have any questions.

perl_beginner · April 8, 2014, 4:13am

Thanks, blakeoft.

Your explanation in detail really help me a lot.
I'm very appreciate for that

Currently I'm trying to edit the code if I have more data column information add in.
I will ask your professional advice again later.
Thanks first.

perl_beginner · April 9, 2014, 10:33pm

Hi blakeoft,

Sorry for disturbing again.
Do you have any idea calculate only those fixed number in your awk code based on case by case ?
I have 80,000++ line in an input file. I would like to calculate slope and R2 of each line based on their corresponding column data.

Really thanks if you have any idea.

blakeoft · April 10, 2014, 9:33am

I'm not quite sure I understand your question. Are you wanting to be able to find the slope and R squared for each row if the x values aren't always 0, 3, 6 and 10?