Extract values from a matrix given the rows and columns

Hi All,

I have a huge (and it's really huge!) matrix, about 400GB in size (2 million rows by 1.5 million columns). I am trying to save space by creating a sparse representation of it.

A miniature version of the matrix looks like this (matrix.mtx):

3.4543 65.7876 54.564
2.12344 0.776565 4.563
1 4 7

So, this is what I have done until now.

  1. I obtained the important rows and columns by other means, not by processing this huge matrix. These are the rows and columns I really care about, and I have them stored in another text file called row_column.tmp.

My row_column.tmp looks like this:

2 3
1 1
1 3
2 2
3 1

So, this means that row 2, column 3 (and each of the other pairs) is really important to me, and I would like to extract the value at that position from the huge matrix and make my output file look like this:

output.mtx

2 3 4.563
1 1 3.4543
1 3 54.564
2 2 0.776565
3 1 1

The above output shows that, for each row/column pair read from row_column.tmp, I go to the main matrix file matrix.mtx, extract the value at that particular row and column, and write the value against that row and column in my output.mtx file.

The thing I need to be careful about is that I must not load the entire matrix into memory, else things will get really messy. I am using Linux with BASH.

This is what I have done but not working:

awk -F' ' '
  NR == FNR { a[$1]=NR }
  NR != FNR {
    sub("row_column.tmp", "", FILENAME)
    print a[$2], FILENAME,  $1
  }
' matrix.mtx

Yes, you can do it in awk and it's not difficult. But I very seriously doubt that awk and the shell are appropriate tools for processing 400GB files. Just try to time some very simple awk script, e.g.:

time awk 'NR % 3 { $100=$1000; print NR, $1 }' YOURFILE

I believe it may take days or weeks. I suggest using a compiled language, using the available parallel tools, and converting your file to a NoSQL database (storing numbers, not strings) before processing. Or maybe use some specialized tools/languages like MATLAB (or Octave).
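For what it's worth, here is one way the awk approach could look — read the small row_column.tmp into memory first, then stream matrix.mtx one row at a time, so memory use is proportional to the number of wanted pairs, never the matrix. This is an untested sketch against your miniature files (recreated below so it runs stand-alone), not something I have tried on anything near 400GB:

```shell
#!/bin/sh
# Miniature input files from the post, recreated so the sketch is self-contained.
cat > matrix.mtx <<'EOF'
3.4543 65.7876 54.564
2.12344 0.776565 4.563
1 4 7
EOF
cat > row_column.tmp <<'EOF'
2 3
1 1
1 3
2 2
3 1
EOF

# Pass the small pairs file FIRST so NR == FNR selects it; the big matrix
# is then streamed one row at a time and never held in memory.
awk '
  NR == FNR {                      # row_column.tmp: remember wanted pairs
    pair[++n] = $1 SUBSEP $2       # keep the original order
    want[$1, $2] = 1
    next
  }
  {                                # matrix.mtx: FNR is the row number
    for (c = 1; c <= NF; c++)
      if ((FNR, c) in want)
        val[FNR, c] = $c
  }
  END {                            # print in row_column.tmp order
    for (i = 1; i <= n; i++) {
      split(pair[i], rc, SUBSEP)
      print rc[1], rc[2], val[rc[1], rc[2]]
    }
  }
' row_column.tmp matrix.mtx > output.mtx

cat output.mtx
```

Even so, awk still has to split every 1.5-million-field line into fields, which is exactly the slowness I am warning about above.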


If it takes days or weeks, then I probably cannot afford that. I'll write my C program then and run it in parallel. MATLAB gave up in the first place with "Not Enough Memory" :frowning: This is a computational challenge, I believe, and C can handle it very well. I'll paste my C code here when I am done with it. :slight_smile:

Well, I've never dealt with data files of such sizes.
Before starting to code, try to find all possible information about processing this kind of data. Best of all, of course, would be to find someone who has really worked with huge matrices stored in text files.


This is my first time too, which is why I am facing so many computational bottlenecks. But it's fun at the end of the day :slight_smile:

One possible solution is to split the matrix file and do parallel processing on the split files.
Another, which I am currently doing, is to do away with the matrix file itself and change the source program that created the matrix file so that it creates the sparse file directly. This matrix is really, really huge.. it blew through all my disk space too :frowning:
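The first idea (split, then process chunks in parallel) could be sketched roughly as below. This is only an illustration on the miniature matrix with a chunk size of 2 lines (for the real file you would use something much larger, say 100000); `split -d` assumes GNU coreutils, and note that the output order follows row numbers rather than the order of row_column.tmp:

```shell
#!/bin/sh
# Demo of the split-and-parallelise idea on the miniature files from the post.
cat > matrix.mtx <<'EOF'
3.4543 65.7876 54.564
2.12344 0.776565 4.563
1 4 7
EOF
cat > row_column.tmp <<'EOF'
2 3
1 1
1 3
2 2
3 1
EOF

rows_per_chunk=2
split -l "$rows_per_chunk" -d matrix.mtx chunk_   # GNU split: chunk_00, chunk_01, ...

i=0
for f in chunk_*; do
    offset=$(( i * rows_per_chunk ))
    # Translate global row numbers into chunk-local ones, extract, and
    # print the global row number back out. One awk process per chunk.
    awk -v off="$offset" -v rpc="$rows_per_chunk" '
      NR == FNR {
        if ($1 > off && $1 <= off + rpc)
          want[$1 - off, $2] = $1      # local row -> global row
        next
      }
      {
        for (c = 1; c <= NF; c++)
          if ((FNR, c) in want)
            print want[FNR, c], c, $c
      }
    ' row_column.tmp "$f" > "out_$i" &
    i=$(( i + 1 ))
done
wait
cat out_* > output.mtx    # ordered by row number, not by row_column.tmp
rm -f chunk_* out_*
cat output.mtx
```

Whether this actually helps depends on how many disks/cores you have; with everything on one spindle the parallel readers may just fight over I/O.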

---------- Post updated at 04:47 PM ---------- Previous update was at 11:07 AM ----------

I just made some tweaks to my C program and made it more efficient. Instead of creating that huge matrix file, I read in the file with the row and column information:

2 3
1 1
1 3
2 2
3 1

I then wrote my program in such a way that I get the output as I have given above:

2 3 4.563
1 1 3.4543
1 3 54.564
2 2 0.776565
3 1 1

and BINGO!!! It worked pretty well: the output occupied just 600MB of disk space, and the program took just a few minutes to execute, whereas my last program, which generated the BIG matrix file, ran for the entire night. I am not posting my C code here as it won't make sense out of context, and people would not understand the entire purpose of the program.