Calculate Pearson correlation

Hi, I am a beginner in Bash scripting.
I am trying to calculate Pearson's correlation between two columns (2 and 3).
My file looks like this:

Transcript stable ID	Medians	GC_content
ENST00000313407	6.103269453	67.48
ENST00000367381	0	35.29
ENST00000412443	75.39943201	47.96
ENST00000418552	75.39943201	47.96
ENST00000421608	75.39943201	47.96
ENST00000423247	75.39943201	47.93
ENST00000420257	75.39943201	47.93

The output file should contain the ID and the correlation.
Any help, please?

Hello and welcome to the forums @M_A

I'm not a moderator, but have you made any attempts on your own yet?
Please post your code and the error message.

Thank you


Welcome on board

@sea

I'm not a moderator,

No, but you have all the required qualities... send an offer to the admin... I will approve...

About the thread:

I am trying to calculate Pearson's correlation between two columns 2,3

Unless they work in statistics, I doubt a UNIX geek would find that sentence appealing...
Luckily, I was once in charge of a state statistics office, so it reminded me of that time, but in those days I would have used SAS...

@Hwaida_Ali:

If you want help as suggested above, we NEED more INFORMATION:

Your OS, version and architecture

Which shell you are using, or which one you are comfortable with

Now:
Calculate Pearson's correlation: how would you do that?

Unless you show us your current code and the issues you are having with it,
you will have to describe in plain English, for non-statisticians, how it can be done!

Expect a lot of questions after your description before any attempt at an algorithm;
only then will you see some elaboration towards a solution.

So get to work!

We are all impatient to read your replies...


Bash is entirely unsuitable for this purpose: the algorithm (as described in Wikipedia) requires at least floating-point arithmetic and a square-root function. Bash only has integer arithmetic capabilities.
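To make the integer-only point concrete, here is a quick check at the prompt. It is only a sketch; it assumes a Bash shell and any POSIX awk on the PATH.

```shell
# Bash arithmetic expansion is integer-only: the fractional part is dropped.
echo $((7 / 2))

# awk does its arithmetic in floating point and has sqrt() built in.
awk 'BEGIN { print 7 / 2; print sqrt(16) }'
```

The first command prints 3, while awk prints 3.5 and then 4.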

You should probably investigate and research some alternative tools to make this possible: Python, Java, awk or a recognised statistical package. You will have a considerable learning curve for any viable solution, so you want to select a toolkit that will also be useful in the future.


Let's let the original poster answer the questions posed by our team members in the first and second replies to the original question.

On another note, this calculation is not "rocket science" and is well documented.
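For reference, this is the usual computational form of Pearson's r for n pairs (x_i, y_i), and it is the form the Ruby method below evaluates term by term:

$$
r = \frac{\sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}}
         {\sqrt{\left(\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right)\left(\sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}\right)}}
$$

It only needs five running sums (x, y, xy, x², y²) and the count n, which is why a single pass over the file is enough.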

Because I code in Ruby every day, these days, I did a simple Google search for this calculation and the results of the search are remarkable (and it does not require a "considerable learning curve" to execute, to be honest).

This Ruby method (below, see references) is 12 years old and the "learning required" in 2021 is to simply Google "Ruby Pearson correlation" and do a few minutes of study. A reasonable person familiar with basic modern computer programming could easily "get an in depth handle on it" in less than an hour.

def ruby_pearson(x,y)
  n=x.length.to_f  # float, so the divisions below are not truncated for integer input

  sumx=x.inject(0) {|r,i| r + i}
  sumy=y.inject(0) {|r,i| r + i}

  sumxSq=x.inject(0) {|r,i| r + i**2}
  sumySq=y.inject(0) {|r,i| r + i**2}

  prods=[]; x.each_with_index{|this_x,i| prods << this_x*y[i]}
  pSum=prods.inject(0){|r,i| r + i}

  # Calculate Pearson score
  num=pSum-(sumx*sumy/n)
  den=((sumxSq-(sumx**2)/n)*(sumySq-(sumy**2)/n))**0.5
  if den==0
    return 0
  end
  r=num/den
  return r
end

There are countless references on the net using various tools and libs to perform Pearson correlations in Ruby and Python. Since I am currently programming daily in Ruby, I provide Ruby references; but you could also easily find Python libs as well:

References:

https://blog.chrislowis.co.uk/2008/11/24/ruby-gsl-pearson.html

pearson

This, I agree with. But many people use Bash and shell and AWK and other tools for solving numerical problems better suited for other programming languages. Maybe the original poster has a reason they wish to do this in Bash?

This is precisely why we ask questions of the original poster to help them define their problem and understand what and why (and on what platform / computing environment they are operating).

In many cases we find that when a user has requested a solution in a specific programming language, and only that language, they are simply responding to a homework task from their professor who says "code the Pearson in Bash". Until we ask the original poster questions, and get a truthful reply, we actually have no idea what the OP is actually doing, and why.

HTH


Hi @M_A and welcome to our community.

Is this a homework assignment to use Bash to calculate Pearson? As you can see from the questions, our team does not understand why you would use Bash to do this calculation, and my guess is that you are using Bash because it is homework.

Is that correct?

Please kindly reply to the questions from our team.

Thanks.


Thanks for your comments, and sorry for not clarifying enough and for my late reply.
What I was trying to do is calculate the Pearson correlation between the median gene expression and the GC content, to help with my research.

I was doing many parts of the project using Bash scripts in Anaconda, which is why I wanted to continue using it. But this seemed challenging for me, so I have already used Python and an online tool. I am grateful for your comments; they were really helpful :slight_smile:

Also, I followed the formula and did it step by step with awk like this, and it worked for me. It's not very nice and it's very long, but it worked:

awk -F, '{print $1 "," $2 "," $3 "," $2 * $3 "," $2 * $2 "," $3 * $3}' input > output


awk -F',' '{ for (i=2;i<=NF;i++) sum[i]+=$i } END { for (i in sum) printf("%f ", sum[i])}' output> output1

sed -e 's/\s\+/,/g' output1 > output2

awk -F',' -v factor=157364 '{print $1 "," $2 "," $3 "," $4 "," $5 "," $3 * factor "," $4 * factor "," $5 *factor "," $1 * $2}' output2 > correlation_output

awk -F"," '{printf ("%s,%2.5f,%2.5f,%2.5f,%2.5f,%2.5f,%2.5f,%2.5f,%2.5f,%2.5f,%2.5f,%2.5f,%2.5f,%2.3f", $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$6-$9,$7-$10,$8-$11)}' correlation_output > correlation_parameter.txt

awk -F"," '{print $0 "," $13 * $14}' correlation_parameter.txt > correlation_parameter_modified.txt

awk -F"," '{r=$15; print $0 "," sqrt(r)}' correlation_parameter_modified.txt > last_parameter.txt

awk -F',' '{print "R=" $12/$16 }' last_parameter.txt > correlation_value.txt 

Thanks a lot


Very cool!

Do you want our team to continue to help you to refine with AWK?

Or are you happy with your Python solution?

What exact Python lib / solution did you end up using?


@neo I completely agree with you, and I am taken aback that my "Do this with more appropriate tools" is accepted as a "solution".

We do need to know what constraints there are on the OP. As a professional, I am all for the "Build on existing knowledge" techniques - the client gets a better product for less money. As a mentor, I am for the "Try it yourself, and ask for a helping hand when you need it" route.

However, when a post starts "I am a beginner in Bash", and wants a computation like this, some early guidance seems helpful. If he was a dab hand in Ruby, Python, or (!) Fortran, he would not mention Bash. If a Prof set this exact requirement, he surely is not expecting a valid result.

I see two different types of learning here: (a) How to implement (from scratch) a moderately complicated calculation in a programming language for the first time; (b) How to discover, assimilate, and (possibly) adapt a known valid solution into a specific case, also for the first time. Those can be equally daunting to a relative novice, and the likely future benefits of each kind of learning cannot be ignored.

Hi @Paul_Pedant

It's up to the OP to mark a post "solved" as they wish.

The OP is happy, marked the thread "solved" and so this thread is "solved" as far as the OP is concerned.

I asked the OP to post back with their final "solution" so please wait for their reply (that is if the OP wishes to reply).

There is no need to "overthink" things and second-guess the OP, who has already posted that they have "solved" it using Python, etc. It is their issue, not ours.

Cheers.

PS: If you are interested in "Pearson correlations" outside of this user-topic, please feel free to open a new topic and discuss your solution(s) for Pearson correlations (and feel free to reference this user-topic).

I would do a step-by-step consolidation.
E.g. the second awk, the following sed, and the following awk:

awk '
  BEGIN { FS=OFS="," }
  { print $1, $2, $3, $2*$3, $2*$2, $3*$3 }
' input > output

awk -v factor=157364 '
  BEGIN { FS=OFS="," }
  { for (i=2;i<=6;i++) sum[i]+=$i }
  END { print sum[2], sum[3], sum[4], sum[5], sum[6], sum[4]*factor, sum[5]*factor, sum[6]*factor, sum[2]*sum[3] }
' output > correlation_output

...
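Carried to its end, the consolidation can collapse into one awk pass. The following is only a sketch under some assumptions: the input is the tab-separated table from the first post with one header line, the function name pearson_awk is made up for illustration, and the row count is taken from the data instead of a hard-coded factor (157364 looks like the number of rows, but that is a guess). It also subtracts the (sum of x)² and (sum of y)² terms in the denominator, as the formula requires.

```shell
# Hypothetical helper: Pearson's r between columns 2 and 3 of a
# tab-separated file whose first line is a header.
pearson_awk() {
  awk -F'\t' '
    NR == 1 { next }                      # skip the header line
    {
      n++                                 # row count replaces the fixed factor
      sx  += $2;     sy  += $3            # sum of x, sum of y
      sxy += $2 * $3                      # sum of x*y
      sxx += $2 * $2; syy += $3 * $3      # sum of x^2, sum of y^2
    }
    END {
      num = n * sxy - sx * sy
      vx  = n * sxx - sx * sx
      vy  = n * syy - sy * sy
      # No rows, or a constant column: r is not defined.
      if (vx <= 0 || vy <= 0) { print "R=undefined"; exit 1 }
      printf "R=%.6f\n", num / sqrt(vx * vy)
    }
  ' "$1"
}
```

Usage would be something like `pearson_awk input > correlation_value.txt`. No intermediate files are needed, because the five running sums and the count are all the formula uses.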
