Transpose columns to Rows : Big data

Hi,
I read a few posts on the subject and tried out a few solutions, but did not solve my problem.

http://www.unix.com/shell-programming-scripting/137953-large-file-columns-into-rows-etc-4.html

Please help. My problem is very similar to the one in the linked thread, but with a slightly different input format. The field separator is a space. The actual data matrix is a file with 2,000 rows and 600,000 columns.
Input style:

IID    PAT    MAT    SEX    PHENOTYPE    rs15286_1    rs319_1    rs80300_1    rs40777_1    rs8597_1    rs5136_1    rs60595_1    rs64968_1    rs4405_1    rs1554_1
TD-MIKV    0 0 2 1 1 0 0 1 0 1 0 1 1 0
TD-HA4Q 0 0 2 1 1 0 0 0 0 0 0 0 0 0
TD-H9ZG 0 0 2 2 0 0 0 1 0 0 0 0 0 0
TD-HAQX 0 0 2 1 0 0 0 2 0 0 0 0 0 0
TD-HA5E 0 0 2 2 0 1 1 1 0 0 0 1 1 0
TD-MGFV 0 0 2 2 1 0 0 0 0 NA 0 0 0 1
TD-HB4V 0 0 2 1 0 0 1 0 1 NA 0 1 1 0
TD-MIPE 0 0 2 2 0 0 0 1 0 0 0 0 0 0
TD-MINR 0 0 2 2 0 0 0 0 0 2 0 1 1 0

Output style

   IID TD-MIKV TD-HA4Q TD-H9ZG TD-HAQX TD-HA5E TD-MGFV TD-HB4V TD-MIPE TD-MINR
PAT 0 0 0 0 0 0 0 0 0
MAT 0 0 0 0 0 0 0 0 0
SEX 2 2 2 2 2 2 2 2 2
PHENOTYPE 1 1 2 1 2 2 1 2 2
rs15286_1 1 1 0 0 0 1 0 0 0
rs319_1 0 0 0 0 1 0 0 0 0
rs80300_1 0 0 0 0 1 0 1 0 0
rs40777_1 1 0 1 2 1 0 0 1 0
rs8597_1 0 0 0 0 0 0 1 0 0
rs5136_1 1 0 0 0 0 NA NA 0 2
rs60595_1 0 0 0 0 0 0 0 0 0
rs64968_1 1 0 0 0 1 0 1 0 1
rs4405_1 1 0 0 0 1 0 1 0 1
rs1554_1 0 0 0 0 0 1 0 0 0

awk or python preferable, since I understand them a teeny weeny bit.
Thanks in advance,
Regards
~GH

A bit ole school... but here's a possible shell solution:

#!/bin/sh

# Transpose a space-separated matrix one column at a time:
# each column is extracted with cut and written out as one row.

filein="$1"
fileout="$2"

# number of columns, taken from the first line
cols=`head -1 <"$filein" | wc -w`

count=1
while [ $count -le $cols ]; do
        if [ $count = 1 ]; then
                # first column: create/truncate the output file
                tr -s ' ' ' ' <"$filein" | cut -f$count -d' ' | tr '\012' ' ' >"$fileout"
        else
                # subsequent columns: append
                tr -s ' ' ' ' <"$filein" | cut -f$count -d' ' | tr '\012' ' ' >>"$fileout"
        fi
        echo "" >>"$fileout"
        count=`expr $count + 1`
done

In some cases, this may actually work better than reading everything into memory and processing... Anyhow... something to consider...
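
For clarity: assuming the script above is saved as transpose.sh (the name is only illustrative), it takes the input and output files as its two arguments:

chmod +x transpose.sh
./transpose.sh myfile outfile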

#deleted#

awk '{for (i=1; i<=NF; i++) a[i]=a[i](NR!=1?FS:"")$i} END {for (i=1; i in a; i++) print a[i]}' file
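
The same logic spread over several lines, which may be easier to read (functionally identical to the one-liner above):

awk '
    {
        # append each field of this record to the string for its column
        for (i = 1; i <= NF; i++)
            a[i] = a[i] (NR != 1 ? FS : "") $i
    }
    END {
        # each accumulated string is one row of the transposed matrix
        for (i = 1; i in a; i++)
            print a[i]
    }' file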

The code will have a problem if some lines don't have the same number of columns.

kurumi's code is still correct in this situation.

Strictly speaking, neither solution correctly handles irregular lengths; properly handling a file with differing row lengths would involve not making any assumptions based on a single row (if row n has more columns than row 1, truncation would occur).

In any case, nothing in the original post implies that rows vary in width. Quite the contrary, the word "matrix" is used, which indicates a rectangular array. Still, it's good that you pointed that out just in case it is a concern.

On a different note, if memory allows, when dealing with a very large dataset, my solution should be much much much faster. Only one instance of awk is needed and the file is only read once. kurumi's will fork-exec 600,000 awk processes and read the file 600,000 times.
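
For context, since the post was deleted: the per-column approach under discussion would be roughly of this shape (a reconstruction for illustration only, not kurumi's actual code):

cols=`awk 'NR==1{print NF; exit}' file`   # one awk call just to count the columns
i=1
while [ "$i" -le "$cols" ]; do
    # one awk call, and one full read of the file, per column
    awk -v c="$i" '{printf("%s ", $c)} END{print ""}' file
    i=`expr $i + 1`
done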

I wrote the code based on what the OP has provided, so please don't assume anything else. As for the 600,000 figure, I don't know where you got it from. I assume you mean 600,000 lines of records. If that's so, my code will call awk 15 times (for 15 columns of data), doing 15 * 600,000 lines of i/o, while yours will fill up memory with the whole big file. Because the OP has a big file, do you think it's advisable to load everything into memory? Think about it.

I have not assumed anything but what was stated in the original post. And yes, if there is sufficient ram to hold the data in memory, then most definitely that would be the best way to go (unless you want to wait days instead of hours).

The 600,000 figure comes from the original post, which states that the data contains 2,000 rows of 600,000 columns each (that's 600,001 awk invocations, 1 to determine the number of columns in the first row plus one per column), with each of those invocations reading the entire file.

A test run on a small (compared to 2,000x600,000) 500x1000 matrix shows that your i/o heavy approach takes 2m35s versus 5s for the ram hungry solution. So, yeah, I think it's a good idea to fill up memory when the gains are on the order of 30x faster (1 hour versus 1.25 days). And as the data set grows, so will the disparity in performance.

Obviously, if the machine does not have sufficient ram, then my approach is not feasible. If it does, it's the way to go.

Ok, so the OP has 2000 rows x 600000 columns. That is over 2GB in file size. Do you slurp more than 2GB into memory to do such stuff? What about other processes?

Hi kurumi:

What about other processes? Perhaps there are none besides the basic os services. Perhaps there are thousands. Perhaps there is a dedicated machine for this problem. Perhaps not. Perhaps the hardware available to solve this problem is an old pentium with only a few megabytes of free ram. Perhaps it's a 64-bit monster with many gigabytes of free ram. I do not know. You do not know.

What we do know is that your solution is horribly slow and inefficient even on comparatively small datasets, but it may be the best and only possible approach if there is insufficient ram. My solution is much faster, but horribly ram hungry. This type of trade-off (ram vs i/o-time) is common in algorithm design. Choice is good and the original poster can choose which best suits his situation, if indeed either of these solutions is suitable.

If I had that 64-bit monster at my disposal, I know which approach I'd choose. :wink:

Regards,
Alister

I'd be interested in timing the various solutions on your large data.

Just to update.
I have been running the solution from cjcox for 3 days now.
I am almost at the end of the run.
Will test that & report.
I will also test kurumi's solution if you can re-post it.
And alister's solution too.

All I can say is thanks for the solution... but it takes a long time!
Can you let me know how to time the run already started?
Thanks

Hello, genehunter:

I don't think you can time it after having started it, except perhaps by using ps to determine the process' start/elapsed time. You can then poll once per minute or so with a script or cronjob until the process is no longer running, at which point run date to record the current time.

For future runs, prepend 'time ' to whatever command you use to launch the job.
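
For example, a rough sketch, assuming the PID of the running job is 12345 (substitute the real PID, which ps can show you):

# elapsed time of an already-running process
ps -o etime= -p 12345

# poll once a minute until it exits, then record the finish time
while ps -p 12345 >/dev/null 2>&1; do sleep 60; done; date

# for future runs, wrap the command itself (names here are the illustrative ones from earlier)
time ./transpose.sh myfile outfile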

On a different note, I'm curious. What's the size of the data file and what are the specs of the machine being used to process it?

Regards,
Alister

The size of the data file is 1.9G.
I am running on a 12-core Xeon with openSUSE 10.2 (X86-64).
Memory: 66002864k total, 65684428k used.
Hope some of that makes sense! :confused:

awk 'NR==1{for(i=1;i<=NF;i++)col[i]=$i}{for(i=1;i<=NF;i++){ printf "%s%s",$i,FS > col[i]".txt" } }' myfile
paste -s ?.txt 

Hi, kurumi:

This is a good approach for avoiding my code's memory appetite and for also avoiding your original approach's appetite for fork-exec. However, there are a few fatal bugs that prevent it from yielding a correct matrix transposition.

  • '?.txt' would not match properly since the '?' only matches a single character but the headers shown -- from which the file names are taken -- are all multicharacter; '*.txt' is necessary.

  • The shell globbing used to build the paste command line may not (and with the data given in the original post, will not) yield the header names in the same order as they appear in the file. The files need to be named in a way that preserves the order.

  • The use of paste -s on multiple files without a terminating newline produces a one line file, when what's wanted is one line per file. cat would be correct if not for the absence of a newline at the end of each file.

  • The command line generated by all those file names may possibly exceed the system's command line length limit.

  • The AWK process will require many file descriptors (600,003 or so: one per output file in addition to stdin, stdout, and stderr). Depending on how the system is configured, this may exceed the login account's imposed descriptor limits. If that's the case, then the limits need to be bumped up (see the sketch just after this list) or AWK needs to close each descriptor after writing to it.
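
As a rough sketch, checking and (if the account's hard limit permits) raising the soft limit from the shell before launching the awk:

ulimit -n            # show the current per-process open-file limit
ulimit -n 700000     # raise it; this only succeeds if the hard limit allows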

Regards,
Alister

P.S. Please don't delete proposed solutions that have been discussed in a thread. Disappearing posts cause confusion and are an obstacle to continued dialogue and learning.


Hi genehunter:

The following code is a reworking of kurumi's approach which fixes the bugs I mentioned above and makes conservative assumptions regarding ARG_MAX and file descriptor limits. To use it, save the code to a file, make it executable, and invoke it with one argument: the name of the data file.

#!/bin/sh

infile="$1"

awk '
    BEGIN {
        # Read the header line and create one output file per column.
        # File names are zero-padded column numbers (01.col, 02.col, ...)
        # so that shell glob order matches column order.
        getline
        l=length(NF)
        for (i=1; i<=NF; i++) {
            f[i]=sprintf("%"l"s", i) ".col"
            gsub(" ", "0", f[i])
            printf("%s", $i) > f[i]
            close(f[i])
        }
    }

    {
        # Append each field of every subsequent record to its column file.
        for (i=1; i<=NF; i++) {
            printf("%s%s", FS, $i) >> f[i]
            close(f[i])
        }
    }

    END {
        # Terminate each column file with a newline.
        for (i in f) {
            printf("\n") >> f[i]
            close(f[i])
        }
    }' "$infile"

for f in *.col; do
    echo "$f"
done | xargs cat > outfile
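
For example, assuming the script is saved as transpose2.sh (the name is only illustrative):

chmod +x transpose2.sh
./transpose2.sh myfile

Note that it writes its numbered *.col working files and the final result, outfile, into the current directory.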

If the file descriptor limit were sufficiently large, then all the close() calls could be deleted. And if ARG_MAX were sufficiently large, a simple "cat *.col > outfile" would suffice.

Regards,
Alister