Large file - columns into rows etc

I have done a couple of searches on this and have found many threads, but I don't think I've found one that is useful to me - probably because I have only a very basic grasp of Perl and beginner-level shell, so adapting a script that has already been posted may be beyond my capabilities...

Anyway - I have a huge file (247 columns, over 500,000 lines). What I want to do ultimately is transpose the entire file, so that the columns become rows and the rows become columns. Is there an easy way to do this in Perl and/or shell? If so, how?

Cheers.

perl -F, -lane 'for ( 0 .. $#F ) { $rows[$_] .= $F[$_] }; eof && print map "$_\n", @rows' data

if the file "data" contains CSV data, like this:

1,2,3,4,5,6,7
a,b,c,d,e,f,g

The code above will output something like this:

1a
2b
3c
4d
5e
6f
7g

Whatever separator your data uses can be swapped in for the comma after the "-F" switch in the code.
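For a tab-separated file, for example, the same one-liner would just use a tab pattern instead:

perl -F'\t' -lane 'for ( 0 .. $#F ) { $rows[$_] .= $F[$_] }; eof && print map "$_\n", @rows' data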

This should work on arbitrarily large files.

Hope That Helps

P.S. You are talking about 123,500,000 cells in a 247 by 500,000 matrix, so memory could become a problem for @rows, particularly if you are on 32-bit. We are building up the result in the @rows array and waiting until the end to print it out. I can work on a streaming solution if you get the old "Out of Memory!" error.
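In case it comes to that, here is a rough, untested sketch of the streaming approach I have in mind - it trades speed for memory by re-reading the file once per column, and it assumes whitespace-delimited input:

#!/usr/bin/perl -w
use strict;

my $file = shift or die "usage: $0 datafile\n";

# Work out the number of columns from the first line
open(my $fh, '<', $file) or die "cannot open $file: $!";
my @first = split ' ', scalar <$fh>;
close $fh;

# One pass over the file per output line (i.e. per input column),
# so only one transposed row is ever held in memory at a time
for my $col (0 .. $#first) {
    open($fh, '<', $file) or die "cannot open $file: $!";
    my @row;
    while (<$fh>) {
        my @f = split ' ';
        push @row, $f[$col];
    }
    close $fh;
    print join(" ", @row), "\n";
}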

Yes - I am getting the out of memory error... but I dunno why. I can easily open the original file on a 32-bit OS, but when I transpose it, all hell breaks loose. I tried to grep out a single line from the new data file, but I got this:

grep: line too long

What does that mean? That my data file is all on one single line?

Cheers

It could. There are many ways to find out if that is the case. Here is one way:

shell_prompt-> head -1  datafile

If that kicks back the whole file, there are no line endings.
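A couple of quick Perl one-liners will also tell you how many lines the file has and how long the longest one is (datafile being whatever your file is called):

perl -ne 'END { print "$. lines\n" }' datafile
perl -ne 'chomp; $max = length if length > $max; END { print "longest line: $max characters\n" }' datafile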

Mmmmm... this is where I get stupid. Typing that in doesn't give me anything...

You can use the Perl module Array::Transpose.

If the file is space-separated:

perl -M'Array::Transpose' -wlane '
    push @in, [ @F ];
    END {
        print "@$_" for transpose(\@in);
    }
' infile.txt


Thanks - I'll give this a go. Though, my file is actually tab-delimited - will this matter?

No, it will not matter; with -a, Perl splits on whitespace by default, and that includes tabs.
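If you want to double-check how the fields are being split, a quick one-liner like this (with infile.txt as your file) prints each distinct field count and how many lines have it - you should see a single count if every line has the same number of columns:

perl -lane '$c{scalar @F}++; END { print "$_ fields: $c{$_} lines" for sort { $a <=> $b } keys %c }' infile.txt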


If the first example I gave runs the box out of memory with only one array, using Array::Transpose should run the box out of memory twice as quickly, yes?

After a brief read of the source, Array::Transpose uses two named variables to do the lifting.

Here is the same code with a modification: the biggest array, @in, holds just one reference per input row, so its size equals the number of rows:

#!/usr/bin/perl -w

use Array::Transpose;

open(FH, "<", "infile.txt") or die "something wrong: $!";

# Build @in as one array reference per input line
while (<FH>) {
    chomp;
    push @in, [ split /\s+/ ];
}
close(FH);

# Print the transposed matrix, one space-separated row per line
foreach (transpose(\@in)) {
    print "@{$_}\n";
}


I think this has worked. I'm still getting the "out of memory" error and the same grep errors, but when I count the rows and columns, they seem to be right.

Dunno how else to check whether the file has been transposed correctly, though, if I can't view it.

The algorithm in Array::Transpose is better than the one I was putting together; I would suggest using that. The out of memory error is still there because there is still a single array holding all the data at once.
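If you want to spot-check the result without opening the whole file, one option is to compare a few individual cells between the two files. A rough sketch (whitespace-delimited files and 1-based row/column numbers assumed):

#!/usr/bin/perl -w
use strict;

# Cell (row R, column C) of the original should equal cell (row C, column R)
# of the transposed file.
my ($orig, $trans, $R, $C) = @ARGV;
die "usage: $0 original transposed ROW COL\n" unless defined $C;

sub cell {
    my ($file, $r, $c) = @_;
    open(my $fh, '<', $file) or die "cannot open $file: $!";
    while (<$fh>) {
        next unless $. == $r;
        my @f = split ' ';
        return $f[$c - 1];
    }
    die "no line $r in $file\n";
}

my $v1 = cell($orig,  $R, $C);
my $v2 = cell($trans, $C, $R);
print "original($R,$C)=$v1  transposed($C,$R)=$v2  ",
      ($v1 eq $v2 ? "OK" : "MISMATCH"), "\n";

Save it as, say, check_cell.pl and run perl check_cell.pl original.dat transposed.dat 12345 100 for a handful of row/column pairs.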

One way around the memory problem is to break the run into smaller chunks of data:

bash> head -100000 datafile | tail -100000 | perl -e '....' > out.1
bash> head -200000 datafile | tail -100000 | perl -e '....' > out.2
bash> head -300000 datafile | tail -100000 | perl -e '....' > out.3
bash> head -400000 datafile | tail -100000 | perl -e '....' > out.4
bash> head -500000 datafile | tail -100000 | perl -e '....' > out.5

Then paste the out files together side by side (each out.N holds the same 247 rows, just for a different block of the original lines):

bash> paste -d' ' out.1 out.2 out.3 out.4 out.5 > combined.out

This will break the data into five more manageable chunks. Let us know if you are still getting the "Out of Memory!" error; you can always break the data file into 50,000-line segments, and so on...

HTH,

Scott

I will try this. One question though, which could be slightly stupid:

For example: head -100000 datafile | tail -100000 | perl -e '....' > out.1

I know head -100000 = first 100000 lines of the file and
tail -100000 = last 100000 lines in the file...
but what lines are specified if you put both like you have done?

Piping head -N into tail -M gives you the last M of the first N lines, i.e. lines N-M+1 through N. You are correct that, for the first chunk, the tail is redundant - you could simply do this:

bash> head -100000 datafile | perl -e '....' > out.1


I use this idiom frequently, and I like starting the pattern of pipelines in a way that can stay constant. It's a preference thing.

You could also cat the first one.

Actually, that didn't work either... still getting the same error. Though, considering my lack of knowledge in the area, I'm sure I'm doing something wrong.

Just curious to know whether anyone else has a solution to this.... I think I've tried everything but nothing seems to work... and as I don't know much about writing scripts, the use of layman's terms would be greatly appreciated.

Another problem I'm having is that I want to use this data file with a Linux-based program, but it says it can only find 10 columns, even though when I do a count on the file there is the correct number.

Any ideas? Dunno what else to do...


I've had a few PMs asking for a better description of the data and what exactly I need, so here is an example of 6 columns * 6 rows....

This is a genetic data file....

       ind1  ind2  ind3  ind4  ind5  ind6
rs1    AA    AG    GG    GA    AA    GG
rs2    CT    TT    TT    --    CC    TC
rs3    AG    AA    --    GG    GA    GA
rs4    TT    CT    --    TT    TC    --
rs5    GG    --    GA    AA    GG    AG
rs6    CG    CG    CC    GG    --    GC

I would like the output to be like this:

ind1 A A C T A G T T G G C G
ind2 A G T T A A C T 0 0 C G
ind3 G G T T 0 0 0 0 G A C C
ind4 G A 0 0 G G T T A A G G
ind5 A A C C G A T C G G 0 0
ind6 G G T C G A 0 0 A G G C

Hope that helps a bit.... I thought that transposing the original file and then doing some other data manipulation with shell/awk to end up with the end product above would suffice, but obviously that's not working.

Hi, try this:

nawk '{for(i=1;i<=NF;i++) if(NR==1) A[i]=$i" ";else B[i-1]=B[i-1]$i}
      END{for(i=1;i<=NF-1;i++) {gsub(/./,"& ",B[i]); print A[i],B[i]}}' infile
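Roughly the same idea in Perl, in case that is easier to tinker with - a sketch only, assuming whitespace-delimited input; it splits each genotype into its two alleles, writes '--' out as '0 0' as in your example, and re-reads the file once per individual so memory stays small:

#!/usr/bin/perl -w
use strict;

my $file = shift or die "usage: $0 genotype_file\n";

# The header line holds the individual names (ind1 ind2 ...)
open(my $fh, '<', $file) or die "cannot open $file: $!";
my @inds = split ' ', scalar <$fh>;
close $fh;

# One pass over the file per individual: pick out that individual's genotype
# from every rs line, split it into two alleles, and turn '--' into '0 0'
for my $i (0 .. $#inds) {
    open($fh, '<', $file) or die "cannot open $file: $!";
    <$fh>;                               # skip the header line
    print $inds[$i];
    while (<$fh>) {
        my @f  = split ' ', $_, $i + 3;  # only split as far as the field we need
        my $gt = $f[$i + 1];             # field 0 is the rs label
        $gt =~ tr/-/0/;                  # missing genotype '--' becomes '00'
        print " ", join(" ", split //, $gt);
    }
    close $fh;
    print "\n";
}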

So - I set this running, though it's been 24 hours and it's still running - is that normal for such a large file?

Edit: to check whether this was working, I ran it on a smaller file, but my out file is 0 KB - which indicates nothing has worked... What could I be doing wrong?

Try this:

#!/bin/ksh

#set -x

typeset TEMP=.tmp.$$.dat

function transpose_file
{
#set -x
  typeset file=$1
  typeset -Z3 i=0  ## -Zn: n is the number of digits used for the column index, so with n=3 the maximum number of columns is 999

  cat $file | while read -A fields
  do
    fld_cnt=${#fields[@]}	## Number of fields in the current record
    for ((i=0 ; i< ${fld_cnt} ; i++))
    do
      ## Print the value of every field in a separate file
      ## You can tweak the value here, before printing it out to the file
      print -n -R "${fields[i]} " >> ${TEMP}.$i
    done
  done

  ## Append a newline to each of the temp files (this assumes every record has the same number of fields)
  for ((i=0 ; i< ${fld_cnt} ; i++))
  do
    print >> ${TEMP}.$i
  done
  
  ## cat all the temp files together
  cat ${TEMP}.*

  rm ${TEMP}.*
}


file=${1:-input.dat}
output=${2:-output.dat}

transpose_file $file > $output

Note: If the script does not run with ksh, try ksh93 (some systems ship ksh as the older ksh88).

It took 50 seconds to transpose a file with 247x500 records, so the extrapolated estimate for 500K records would be around 13-14 hours.
If you need better performance, try implementing this same logic in C.

I would, however, not recommend feeding 500K columns into any process. Also, I believe most standard shell commands will not be able to handle lines that long.

Perhaps you should address the problem in a different way... Why do you really need that kind of file format?
Can't you feed in the data as id-value pairs?
For example,

ind1 A 
ind1 A 
ind1 C 
ind1 T 
ind1 A 
ind1 G 
...
ind2 A 
ind2 G 
ind2 T
ind2 T 
...
indN G
indN G 
indN T 
indN T
...

Or:

KEY:ind1 
A 
A 
C 
T 
A 
G 
...
KEY:indN
G
G 
T 
T
... 

So you would get ~247x500K rows worth of data, but each line will be of a manageable size.

Hi,

Thanks for the reply. I'm not sure if anyone is familiar with it, but I need this particular format for the program PLINK. The file format is for genome-wide SNP data, one row per individual, i.e.:

FAMID INID FID MID SEX AFF rs1a rs1b rs2a rs2b rs3a rs3b..... rs500Ka rs500Kb
n1
n2
n3
--
n247

PLINK needs all the information in one file; it will not work if it is split up the way you have suggested. I am unable to figure out a way to transpose the data I got from the genotyping people into the form this program needs.

Does that help?