Greetings, salutations.
I have a 3 column csv file with ~13 million rows and I would like to generate a correlation matrix. Interestingly, you all previously provided a solution to the inverse of this problem. Thread title: "awk? adjacency matrix to adjacency list / correlation matrix to list" (I haven't posted 5 times yet - can't insert URL)
Brief description of data: columns 1 and 2 contain variables, column 3 contains their correlation. For simplicity and discussion, let's use this example:
a,a,1
a,b,0.8
a,c,0.5
b,a,0.8
b,b,1
b,c,0.2
c,a,0.5
c,b,0.2
c,c,1
(sidebar - forum users "birei" and "shamrock" provided excellent solutions to remove the a,b = b,a redundancy. Thread title: "awk with existence argument?")
desired output:
,a,b,c
a,1,0.8,0.5
b,0.8,1,0.2
c,0.5,0.2,1
I currently use the following python script, where the creation of the dictionary and subsequent line-by-line look up is painfully slow for large files.
import os, time
def matrix(f):
if 'csv' in f and 'matrix' not in f:
i = 0
on = time.time()
fp = open(os.path.join(os.getcwd(),f),'r')
data = [x.split(',') for x in fp]
data.pop(0)
correlations = {}
for x in data:
correlations[x[0]] = {}
variables = correlations.keys()
variables.sort()
for x in variables:
for y in variables:
if x == y:
correlations[x][y] = 1
i+=1
else:
for row in data:
if x in row and y in row:
i+=1
correlations[x][y] = float(row[2])
data = ''
for y in variables:
data += ',%s' %y
data += '\n'
for x in variables:
data += '%s,' %x
for y in variables:
i+=1
try:
data += '%s,' %correlations[x][y]
except:
data += 'CORRELATION NOT FOUND!,'
data += '\n'
fw = open(os.path.join(os.getcwd(),f[:-4]+'_matrix.csv'),'w')
fw.write(data)
fw.close()
s = (on - time.time()) / float(1000)
print '%s complete: %i iterations, %s duration.' %(f,i,s)
for f in os.listdir(os.getcwd()):
matrix(f)
I am not partial to any particular scripting language. I am, however, partial to speed ;). In my experience, awk and perl are the fastest at manipulating these .csv files, but I'm not savvy enough with either. Any suggestions would be appreciated.
Thanks in advance.
Will do, thanks bakunin.