Hello experts, I have matrices sorted by position, there are 400k rows, 3000 columns.
ID CHR POS M1 M2 M3 M4 M5
ID1 1 1 4.6 2.6 2.1 3.5 4.2
ID2 1 100 3.6 2.9 3.2 2.6 2.5
ID3 1 1000 4.1 2.9 2.4 2.8 2.5
ID4 1 2000 4.2 2.4 2.8 3.7 2.8
ID5 1 3000 4.6 2.9 3.7 3.5 2.7
ID6 1 4000 5.2 3 3.2 4.2 3.7
ID7 1 5000 5.8 2.2 2.8 3.1 2.7
I`m looking for a sorted long-format output by position, grouped by markers (Ms)
M1 ID1__1__1 4.6
M1 ID2__1__100 3.6
M1 ID3__1__1000 4.1
M1 ID4__1__2000 4.2
M1 ID5__1__3000 4.6
M1 ID6__1__4000 5.2
M1 ID7__1__5000 5.8
M2 ID1__1__1 2.6
M2 ID2__1__100 2.9
M2 ID3__1__1000 2.9
M2 ID4__1__2000 2.4
M2 ID5__1__3000 2.9
M2 ID6__1__4000 3
M2 ID7__1__5000 2.2
M3 ID1__1__1 2.1
M3 ID2__1__100 3.2
M3 ID3__1__1000 2.4
M3 ID4__1__2000 2.8
M3 ID5__1__3000 3.7
M3 ID6__1__4000 3.2
M3 ID7__1__5000 2.8
M4 ID1__1__1 3.5
M4 ID2__1__100 2.6
M4 ID3__1__1000 2.8
M4 ID4__1__2000 3.7
M4 ID5__1__3000 3.5
M4 ID6__1__4000 4.2
M4 ID7__1__5000 3.1
I have two working methods
- This step requires a final sorting which for 400000x3000 datapoints will be very row or run out of memory.
awk -F"\t" 'NR==1{split($0,lines);}NR>1{for(i=4;i<length(lines);i++) print lines,$1"__"$2"__"$3,$i}' tmp | sort -k1,1
- This step is very very slow as it requires to take the entire matrix into memory before making splits.
awk -F"\t" '{a[NR]=$0}END{split(a[1],lines); for(i=4;i<=length(lines);i++) { for(j=2;j<=length(a);j++) { split(a[j],haps); print lines,haps[1]"__"haps[2]"__"haps[3],haps}}}' tmp
Is there a more efficient way to get the output I need? Thanks in advance.