I have done a couple of searches on this and found many threads, but I don't think any of them are useful to me - probably because I have only a very basic comprehension of Perl and beginner's shell, so trying to manipulate a script already posted may be beyond my capabilities...
Anyway - I have a huge file (247 columns, over 500,000 lines). What I want to do ultimately is transpose this entire file, so that the columns become rows and the rows become columns. Is there an easy way to do this in Perl and/or shell? If so, how?
Whatever separator your data uses can be swapped in for the comma after the "-F" switch in the code.
This should work on arbitrarily large files.
Hope That Helps
P.S. You are talking about 123,500,000 cells in a 247 by 500,000 matrix, so memory could become a problem for the @rows array, particularly if you are on a 32-bit system. We are building up the result in the @rows array and waiting until the end to print it out. I can work on a streaming solution if you get the old "Out of Memory!" error.
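For reference, the build-it-all-in-memory approach described above can be sketched in awk (a hypothetical reconstruction, not the exact code from earlier in the thread; the space separator is an assumption):

```shell
# Read every cell into one in-memory array, then print it transposed at
# the end -- this is the pattern that exhausts memory on a 247 x 500,000
# file, since all 123.5 million cells are held at once.
printf 'a b c\nd e f\n' > /tmp/demo_in.txt

awk -F' ' '
{
    for (col = 1; col <= NF; col++)
        cell[col, NR] = $col      # whole matrix kept in memory
    rows = NR
    cols = NF
}
END {
    for (col = 1; col <= cols; col++)
        for (row = 1; row <= rows; row++)
            printf "%s%s", cell[col, row], (row < rows ? " " : "\n")
}' /tmp/demo_in.txt
```

On the two-line demo file this prints "a d", "b e", "c f"; on the real file it is the cell array that runs the box out of memory.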
Yes - I am getting the out of memory error... but I dunno why. I can easily open the original file on a 32-bit system, but when I transpose it, all hell breaks loose. I tried to grep out a single line from the new data file, but I got this:
grep: line too long
What does that mean? That my .dat file is all on one single line?
If the first example I gave runs the box out of memory with only one array, using Array::Transpose should run the box out of memory twice as quickly, yes?
After a brief read of the source, Array::Transpose uses two named variables to do the lifting.
The algorithm in Array::Transpose is better than the one I was putting together, so I would suggest using that. The out of memory error is still there, though, because there is still a single array holding all the data at once.
One way around this problem is to break the run into smaller chunks of data
This will break the data into five more manageable chunks. Let us know if you are still getting the "Out of Memory!" error. You can always break the data file into 50,000-line segments, and so on...
I will try this. One question though, which could be slightly stupid:
For example: head -100000 | tail -100000 | perl -e '....' > out.1
I know head -100000 = first 100000 lines of the file and
tail -100000 = last 100000 lines in the file...
but what lines are specified if you put both like you have done?
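To illustrate how the two commands combine, here is a scaled-down sketch with 10 lines standing in for the 100,000-line chunks:

```shell
# tail operates on whatever head passes through, so the pair selects a
# window: head -N keeps lines 1..N, then tail -M keeps the last M of
# those, i.e. lines N-M+1 .. N.
seq 1 10 > /tmp/demo_lines.txt

head -6 /tmp/demo_lines.txt | tail -2   # lines 5 and 6
# When both counts are equal, as in "head -100000 | tail -100000",
# tail keeps everything head emitted, so you just get lines 1..100000.
head -6 /tmp/demo_lines.txt | tail -6   # lines 1 through 6
```

So to grab the second 100,000-line chunk of the real file you would use head -200000 file | tail -100000, and so on for each chunk.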
Actually, that didn't work either... still getting the same error. Though, considering my lack of knowledge in the area, I'm sure I'm doing something wrong.
Just curious to know whether anyone else has a solution to this... I think I've tried everything but nothing seems to work - and as I don't know much about writing scripts, the use of layman's terms would be greatly appreciated.
Another problem I'm having is that I want to use this data file with a Linux-based program, but it says it can only find 10 columns, even though when I do a count on the file, the correct number is there.
Any ideas? Dunno what else to do...
---------- Post updated at 06:33 PM ---------- Previous update was at 03:19 PM ----------
I've had a few PMs asking for a better description of the data and what exactly I need, so here is an example of 6 columns * 6 rows...
This is a genetic data file....
ind1 ind2 ind3 ind4 ind5 ind6
rs1 AA AG GG GA AA GG
rs2 CT TT TT -- CC TC
rs3 AG AA -- GG GA GA
rs4 TT CT -- TT TC --
rs5 GG -- GA AA GG AG
rs6 CG CG CC GG -- GC
I would like the output to be like this:
ind1 A A C T A G T T G G C G
ind2 A G T T A A C T 0 0 C G
ind3 G G T T 0 0 0 0 G A C C
ind4 G A 0 0 G G T T A A G G
ind5 A A C C G A T C G G 0 0
ind6 G G T C G A 0 0 A G G C
Hope that helps a bit... I thought that transposing the original file and then doing some further data manipulation with shell/awk to end up with the end product above would suffice, but obviously that's not working.
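A hedged awk sketch of the transformation described above - transpose the matrix, drop the rs labels, split each two-letter genotype into two alleles, and turn each missing "--" call into "0 0". It holds the whole file in memory, so it suits the small example rather than the full 247 x 500,000 file:

```shell
# Build the example genotype file from the post above.
cat > /tmp/geno.txt <<'EOF'
ind1 ind2 ind3 ind4 ind5 ind6
rs1 AA AG GG GA AA GG
rs2 CT TT TT -- CC TC
rs3 AG AA -- GG GA GA
rs4 TT CT -- TT TC --
rs5 GG -- GA AA GG AG
rs6 CG CG CC GG -- GC
EOF

awk '
NR == 1 {                         # header row: individual IDs
    for (i = 1; i <= NF; i++) id[i] = $i
    n = NF
    next
}
{                                 # data rows: rs label, then one genotype per individual
    for (i = 2; i <= NF; i++) {
        g = ($i == "--") ? "00" : $i                 # missing call -> 0 0
        geno[i-1] = geno[i-1] " " substr(g, 1, 1) " " substr(g, 2, 1)
    }
}
END {                             # one output line per individual
    for (i = 1; i <= n; i++) print id[i] geno[i]
}' /tmp/geno.txt
```

On the example data this reproduces the six desired output lines shown above, starting with "ind1 A A C T A G T T G G C G".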
So - I set this running, though it's been 24 hours and it's still running - is that normal for such a large file?
Edit: to check whether this was working, I ran it on a smaller file, but my outfile is 0 KB - which indicates nothing has worked... what could I be doing wrong???
#!/bin/ksh
#set -x
typeset TEMP=.tmp.$$.dat
function transpose_file
{
#set -x
typeset file=$1
typeset -Z3 i=0 ## -Zn, n is the order of maximum number of columns. So if n is 3 here, max number of columns can be only 999
while read -A fields
do
fld_cnt=${#fields[@]} ## Number of fields in the current record
for ((i=0 ; i< ${fld_cnt} ; i++))
do
## Print the value of every field to a separate file
## You can tweak the value here, before printing it out to the file
print -n -R "${fields[i]} " >> ${TEMP}.$i
done
done < "$file"
## Append a newline to each of the temp files (here is an assumption that number of fields is same for each record)
for ((i=0 ; i< ${fld_cnt} ; i++))
do
print >> ${TEMP}.$i
done
## cat all the temp files together
cat ${TEMP}.*
rm ${TEMP}.*
}
file=${1:-input.dat}
output=${2:-output.dat}
transpose_file $file > $output
Note: If script does not run with ksh, try using ksh93 (some systems keep ksh exec as the older ksh88 version).
It took 50 seconds to transpose a file with 247x500 records, so the extrapolated estimate for 500K records would be around 13-14 hours.
If you need better performance, try implementing this same logic in C.
I would, however, not recommend feeding 500K columns to any process. Also, I believe most standard shell commands will not be able to handle that long a line.
Perhaps, you should address the problem in a different way... Why do you really need that kind of a file format?
Can't you feed in data in an id-value kind of a pair?
For example,
ind1 A
ind1 A
ind1 C
ind1 T
ind1 A
ind1 G
...
ind2 A
ind2 G
ind2 T
ind2 T
...
indN G
indN G
indN T
indN T
...
Or:
KEY:ind1
A
A
C
T
A
G
...
KEY:indN
G
G
T
T
...
So you would get ~247x500K rows worth of data, but each line will be of a manageable size.
Thanks for the reply. I'm not sure if anyone is familiar with it, but I need this particular format for the program PLINK. The file format is for genome-wide SNP data for each individual (one row = one individual).
PLINK needs all the information in one file; it will not work if the data is separated the way you have suggested. I am unable to figure out a way to transpose the data I got from the genotyping people into the form this program needs.