Matrix to 3 col sorted

Hello experts, I have matrices sorted by position; there are 400k rows and 3000 columns.

ID      CHR     POS     M1      M2      M3      M4      M5
ID1     1       1       4.6     2.6     2.1     3.5     4.2
ID2     1       100     3.6     2.9     3.2     2.6     2.5
ID3     1       1000    4.1     2.9     2.4     2.8     2.5
ID4     1       2000    4.2     2.4     2.8     3.7     2.8
ID5     1       3000    4.6     2.9     3.7     3.5     2.7
ID6     1       4000    5.2     3       3.2     4.2     3.7
ID7     1       5000    5.8     2.2     2.8     3.1     2.7

I'm looking for sorted long-format output, grouped by marker (the M columns) and ordered by position within each marker:

M1 ID1__1__1 4.6
M1 ID2__1__100 3.6
M1 ID3__1__1000 4.1
M1 ID4__1__2000 4.2
M1 ID5__1__3000 4.6
M1 ID6__1__4000 5.2
M1 ID7__1__5000 5.8
M2 ID1__1__1 2.6
M2 ID2__1__100 2.9
M2 ID3__1__1000 2.9
M2 ID4__1__2000 2.4
M2 ID5__1__3000 2.9
M2 ID6__1__4000 3
M2 ID7__1__5000 2.2
M3 ID1__1__1 2.1
M3 ID2__1__100 3.2
M3 ID3__1__1000 2.4
M3 ID4__1__2000 2.8
M3 ID5__1__3000 3.7
M3 ID6__1__4000 3.2
M3 ID7__1__5000 2.8
M4 ID1__1__1 3.5
M4 ID2__1__100 2.6
M4 ID3__1__1000 2.8
M4 ID4__1__2000 3.7
M4 ID5__1__3000 3.5
M4 ID6__1__4000 4.2
M4 ID7__1__5000 3.1

I have two working methods:

  1. This one requires a final sort, which for 400000x3000 data points will be very slow or run out of memory.
awk -F"\t" 'NR==1{split($0,lines);}NR>1{for(i=4;i<length(lines);i++) print lines[i],$1"__"$2"__"$3,$i}' tmp | sort -k1,1
  2. This one is very, very slow because it has to read the entire matrix into memory before making the splits.
awk -F"\t" '{a[NR]=$0}END{split(a[1],lines); for(i=4;i<=length(lines);i++) { for(j=2;j<=length(a);j++) { split(a[j],haps); print lines[i],haps[1]"__"haps[2]"__"haps[3],haps[i]}}}' tmp
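For reference, here is a self-contained toy run of the first method (the two-marker sample file and the names `tmp`/`outfile` are illustrative only, standing in for the real 400k x 3000 matrix):

```shell
# Toy reproduction of method 1 on a tiny sample matrix.
printf 'ID\tCHR\tPOS\tM1\tM2\nID1\t1\t1\t4.6\t2.6\nID2\t1\t100\t3.6\t2.9\n' > tmp
awk -F'\t' '
NR==1 { nl = split($0, lines); next }
{ for (i = 4; i <= nl; i++) print lines[i], $1 "__" $2 "__" $3, $i }
' tmp | sort -k1,1 > outfile
```

On this sample the output is the four long-format rows, M1 first, then M2. It is the final `sort` over 1.2 billion such rows that becomes the bottleneck at full scale.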

Is there a more efficient way to get the output I need? Thanks in advance.

rm -rf output
mkdir output
awk '
NR==1 { for (i=1; i<=NF; i++) a[i]=$i }
NR > 1 {
   for (i=4; i<=NF; i++) print "touch output/\"" a[i], $1 "__" $2 "__" $3, $i "\"";
}
' infile | sh

ls -1 output > outfile

Interesting idea! Let the filesystem do part of the sorting, and let ls sort the rest.
More fine-tuned, so ls has less to sort:

outdir=./output
mkdir -p $outdir
awk '
NR==1 {
  for (i=4; i<=NF; i++) { line[i]=$i; printf "mkdir -p \"%s\"\n",$i }
  next
}
{
  for (i=4; i<=NF; i++) printf ">\"%s/%s__%s__%s %s\"\n",line[i],$1,$2,$3,$i
}
' infile > tmpfile
(
cd $outdir
sh tmpfile
for i in *
do
  ls -1 "$i" | sed "s#^#${i} #"
done
) | tee outfile

Me too, I found this idea creative at least, but 1.2E9 directory entries, albeit spread over 3000 directories, might be somewhat, hmm, demanding on the file system...
On the other hand, I thought of awk distributing the columns into 3000 files, one column each, and then concatenating them, but opening and appending 3000 files 400k times would not be too fast either. Right now, I'm lost...
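For what it's worth, the "one file per column" idea can be sketched as below (the `cols` directory and zero-padded file names are my own assumptions; gawk multiplexes file descriptors, so keeping 3000 output files open in one run is feasible there, while other awks may hit the open-file limit):

```shell
# Sketch: fan each marker column out to its own file, then concatenate
# in column order. Sample data and names are illustrative only.
rm -rf cols && mkdir cols
printf 'ID\tCHR\tPOS\tM1\tM2\nID1\t1\t1\t4.6\t2.6\nID2\t1\t100\t3.6\t2.9\n' > tmp
awk -F'\t' '
NR==1 { for (i = 4; i <= NF; i++) name[i] = $i; next }
{
    key = $1 "__" $2 "__" $3
    for (i = 4; i <= NF; i++)
        # ">" opens each file once per run and appends thereafter
        print name[i], key, $i > ("cols/" sprintf("%04d", i))
}' tmp
cat cols/* > outfile   # glob order == column order thanks to zero-padding
```

Because awk keeps each output file open for the whole run, this avoids the 400k open/append cycles per file; the final cat needs no sort at all.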

Whatever strategy you choose, given your expected output you will get an output file nearly 5 times bigger than your input file.

input file:
rows = 400k
columns = 3000
total number of data points = 400k x 3000 = 1.2 billion

output file:
rows = (400k - 1) x (3000 - 3)
"-1" because of the header row
"-3" because the first 3 columns contain the labels you want to repeat
columns = 5 (I count the data items, not the formatting and "__" stuff)
total number of data points = 5 x (400k - 1) x (3000 - 3) = 5.99... billion

Indeed, you will be repeating the data of the first 3 columns for each and every subsequent column (so (3 + 2) x (3000 - 3) items per input row) rather than having them once for all of them (the "+ 2" because you also add the header name and the value of that column).

So, simply because of your prerequisites and expectations, you will have to write more data, and thus you will need the corresponding number of write I/O operations, whatever strategy you choose.

If you have a huge amount of data to write, then there is an incompressible amount of time needed to do it.
Of course I/O operations are quicker in RAM or on SSD than on a standard hard drive, but still...

It would be cheaper to save the transposed matrix and query that instead.
You would then have the data twice, by rows and by columns... of course that costs you 1.2 billion data points more.
But 1.2 billion x 2 = 2.4 billion is still less than half of the 5.9 billion data points!
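A naive transpose can be sketched in awk as below, purely to illustrate the idea; it holds the whole matrix in memory, so for the real 400k x 3000 case a chunked or external transpose would be needed (file names are made up):

```shell
# Naive in-memory transpose sketch: only illustrative for small inputs,
# since the whole matrix is kept in the cell[] array.
printf 'ID\tCHR\tPOS\tM1\tM2\nID1\t1\t1\t4.6\t2.6\nID2\t1\t100\t3.6\t2.9\n' > tmp
awk -F'\t' '
{ for (i = 1; i <= NF; i++) cell[i, NR] = $i; rows = NR; cols = NF }
END {
    for (i = 1; i <= cols; i++) {
        line = cell[i, 1]
        for (j = 2; j <= rows; j++) line = line "\t" cell[i, j]
        print line
    }
}' tmp > transposed
```

With the matrix stored column-wise like this, emitting one marker's block is a sequential read of one row instead of a scan over all 400k records.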

Anyway, processing such an amount of data with flat files is inappropriate: this is what databases have been designed for...
My 2 cents...

🙂

PS// The reality might be even worse, since I didn't take into account the size of each data item, just their number...


Hi senhia83,
maybe a little bit faster than your second solution:

awk 'NR==1 {C=NF;Y=$0;Z=$4;next}{X[NR]=$0;print Z,$1"__"$2"__"$3,$4}END{for(i=5;i<C;++i){$0=Y;Z=$i;j=1;while(++j<=NR){$0=X[j];print Z,$1"__"$2"__"$3,$i}}}' file

or:

awk 'NR==1 {C=NF;Y=$0;Z=$4;next}
{X[NR]=$0;print Z,$1"__"$2"__"$3,$4}
END{
 for(i=5;i<C;++i){
   $0=Y;Z=$i;j=1;
   while(++j<=NR){
     $0=X[j];print Z,$1"__"$2"__"$3,$i
   }
 }
}' file

Regards.


Thanks! The filesystem ended up crashing with the files approach... but a very innovative idea nonetheless!

Your initial approach can be optimized a little

awk -F"\t" 'NR==1{nl=split($0,lines);next} {for(i=4;i<=nl;i++) print lines[i],$1"__"$2"__"$3,$i}' tmp | sort -k1.2,1n

The .2 offset omits the first character of field #1 when making the sort key, and the n flag compares the remaining digits numerically.
I am not sure that all sort versions retain the input order of field #2 for equal keys; some might need sort -k1.2,1n -k2,2, which is certainly slower again.
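A quick illustration of why the offset matters, on made-up marker names: a plain -k1,1 key sorts M11 before M2 lexically, while -k1.2,1n compares the digits after the "M" numerically.

```shell
# Lexical vs numeric sort keys on marker-style names (sample data made up).
printf 'M1 a\nM11 b\nM2 c\n' | sort -k1,1     # lexical: M1, M11, M2
printf 'M1 a\nM11 b\nM2 c\n' | sort -k1.2,1n  # numeric after "M": M1, M2, M11
```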

And here is an awk solution with minimal memory consumption

awk -F"\t" '
NR==1{nl=split($0,lines);next}
{for(i=4;i<=nl;i++) out[i]=(((i in out) ? (out[i] RS) : "") $1 "__" $2 "__" $3 " " $i)}
END {for (i=4;i<=nl;i++) {o=out[i];gsub(RS,(RS lines[i] OFS),o);print lines[i],o}}
' tmp

And last but not least: read the file 3000 times. Sounds stupid, but it is fast enough if your OS has a big buffer cache.

#!/bin/bash
inputfile=tmp
IFS=$'\t' read -a lines < $inputfile
for ((i=4; i<=${#lines[@]}; i++))
do
  awk -F "\t" -v col=$i -v name="${lines[i-1]}" 'NR>1 {print name,($1 "__" $2 "__" $3),$col}' $inputfile
done
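The multi-pass script above can be tried end to end on a toy sample like this (sample data and file names are made up; it needs bash for the array and the arithmetic for-loop):

```shell
#!/bin/bash
# Self-contained toy run of the multi-pass approach: one awk pass per
# marker column, output already grouped so no final sort is needed.
printf 'ID\tCHR\tPOS\tM1\tM2\nID1\t1\t1\t4.6\t2.6\nID2\t1\t100\t3.6\t2.9\n' > tmp
IFS=$'\t' read -r -a header < tmp
for ((i = 4; i <= ${#header[@]}; i++)); do
    awk -F'\t' -v col="$i" -v name="${header[i-1]}" \
        'NR > 1 { print name, $1 "__" $2 "__" $3, $col }' tmp
done > outfile
```

Each pass streams the file with O(1) memory, and the kernel's page cache should serve passes 2..N from RAM if the file fits.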