3 column .csv --> correlation matrix; awk, perl?

Greetings, salutations.

I have a 3 column csv file with ~13 million rows and I would like to generate a correlation matrix. Interestingly, you all previously provided a solution to the inverse of this problem. Thread title: "awk? adjacency matrix to adjacency list / correlation matrix to list" (I haven't posted 5 times yet - can't insert URL)

Brief description of data: columns 1 and 2 contain variables, column 3 contains their correlation. For simplicity and discussion, let's use this example:

a,a,1
a,b,0.8
a,c,0.5
b,a,0.8
b,b,1
b,c,0.2
c,a,0.5
c,b,0.2
c,c,1

(sidebar - forum users "birei" and "shamrock" provided excellent solutions to remove the a,b = b,a redundancy. Thread title: "awk with existence argument?")

desired output:

,a,b,c
a,1,0.8,0.5
b,0.8,1,0.2
c,0.5,0.2,1

I currently use the following python script, where the creation of the dictionary and subsequent line-by-line look up is painfully slow for large files.

import os, time

def matrix(f):
    
    if 'csv' in f and 'matrix' not in f:
        
        i = 0
        on = time.time()
        fp = open(os.path.join(os.getcwd(),f),'r')
        
        data = [x.split(',') for x in fp]
        data.pop(0)
        
        correlations = {}
        for x in data:
            correlations[x[0]] = {}
            variables = correlations.keys()
            variables.sort()
        
        for x in variables:
            for y in variables:
                if x == y:
                    correlations[x][y] = 1
                    i+=1
                else:
                    for row in data:
                        if x in row and y in row:
                            i+=1
                            correlations[x][y] = float(row[2])
        
        data = ''
        for y in variables:
            data += ',%s' %y
        data += '\n'
        
        for x in variables:
            data += '%s,' %x
            for y in variables:
                i+=1
                try:
                    data += '%s,' %correlations[x][y]
                except:
                    data += 'CORRELATION NOT FOUND!,'
            data += '\n'
        fw = open(os.path.join(os.getcwd(),f[:-4]+'_matrix.csv'),'w')
        fw.write(data)
        fw.close()
        
        s = (on - time.time()) / float(1000)
        print '%s complete: %i iterations, %s duration.' %(f,i,s)


for f in os.listdir(os.getcwd()):
    matrix(f)

I am not partial to any particular scripting language. I am, however, partial to speed ;). In my experience, awk and perl are the fastest at manipulating these .csv files, but I'm not savvy enough with either. Any suggestions would be appreciated.

Thanks in advance.

Will do, thanks bakunin.

Try this awk script...

awk -F, 'BEGIN {OFS = ","}
{
   t = t ? (l != $1 ? t OFS $1 : t) : OFS $1
   x[$1] = x[$1] ? x[$1] OFS $NF : $NF
   l = $1
} END {
   print t
   for (i in x) print i, x
}' file
1 Like

Hi Shamrock,

Thanks very much for your help with this. I must be missing something fundamental when running your script. Whether I simplify it and run line by line or by making it an executable and run just as you posted, I get the following error:

context is
	awk -F, >>>  ' <<<

Attached is the simple 3column.csv for testing (note, I had to append ".txt" in order to upload). Does it run ok for you?

Thanks in advance.

disregard the error message. I believe this was on my end and not related to shamrock's script

Let us first have some rough estimations about sizes:

How many rows/columns will this matrix have? Will there be empty matrix elements? The background is: there are some limitations which may or may not affect the solution: Unix files have maximum line lengths because text processing utilities like "sed", "awk", etc. can't handle longer ones (see MAXLINE in sys/limits.h . Shell arrays cannot have more than 1024 elements.

Then some questions: How to deal with multiple entries with the same indexes - add together, generate error, other?

My take would be to first put the lines in "normal form": the lines are of the form

keyA, keyB, value

As all keys are interchangeable, so that the following two entries would in effect be the same

x, y, value
y, x, value

and the matrix you are constructing is symetrical along the main diagonal the first step should be to sort the keys within the lines by some criteria, so that the first key in the line is consistently "lower or equal" or "higher or equal" than the second key in the line.

Than a simple sort over the first two fields will reduce the problem to a simple sort-of group-change: all lines with a given keyA will represent one row AND - because the matrix is symetrical - also one column.

I hope this helps.

bakunin

1 Like

Ok, so I was able to run this last night on a test file:

a,a,1
a,b,0.8
a,c,0.5
b,b,1
b,c,0.2
c,c,1

It returned:

,a,b,c
,0.5
,0.2
c,1

so the first line looks good, meaning 't' is correct but the subsequent line of code isn't quite right.

Running without the final x[i]:

awk -F, 'BEGIN {OFS = ","}
{
   t = t ? (l != $1 ? t OFS $1 : t) : OFS $1
   x[$1] = x[$1] ? x[$1] OFS $NF : $NF
   l = $1
} END {
   print t
   for (i in x) print i
}' file

returns:

,a,b,c
a
b
c

which is what we want for the first row and the first column.

Shamrock, do you have any thoughts on what we could do to the 'x' variable to create the correlation matrix?

Thanks very much for your help.

---------- Post updated at 10:11 AM ---------- Previous update was at 09:42 AM ----------

If we process our data in one batch, the matrix will have 5,122 rows and columns. If there are inherent line limitations that prevent this, we'll come up with ways to chunk the data into smaller batches. Presumably, there will be a handful of empty matrix elements where the original correlation analyses failed to converge.

There are no multiple entries with indexes of the same order - that is

x, y, value

will not repeat. However,

x, y, value
and
y, x, value

are both present if we do not reduce our data. shamrock and birei already provided means to remove these redundancies. Furthermore, shamrock's awk program in this thread that we are currently working on is more problematic if

x, y, value
y, x, value

are both present in the input data, even if the values are identical.

Thanks bakunin, this does indeed help, as we will pursue something along these lines if we cannot get shamrock's awk program to make the correlation matrix.

I am not sure why you are getting incorrect results as the output on my system is exactly what you want...i changed the awk script slightly so it uses if else statements instead of the conditional operator...thogh i am not sure why that would fix things on your system but it helps if you copy and paste correctly...

awk -F, '{
    OFS = ","
    if (t) {
       if (l != $1)
          t = t OFS $1
    } else t = OFS $1
    x[$1] = x[$1] ? x[$1] OFS $NF : $NF
    l = $1
} END {
    print t
    for (i in x) print i, x
}' file
1 Like

indeed :slight_smile:

Which version of awk are you using?

Thanks.