Match child with parents and form matrix

jalaj841 · January 31, 2015, 11:30pm

thank you for letting me join this forum, lots of learning opportunities looks like.
Myself a biologist, very new into unix, so please excuse if I use incorrect language. I am using cygwin on windows, it can run perl , awk , sed etc.

I have 2 files, the first sample sheet, tells which parent and children are in which sample. Parents are represented as P1, P2, P3 and corresponding children groups are represented as P1/P2 , P2/P3 etc.

index,line,sample
1,p1,s1
2,p2,s2
3,p1/p2,s3
4,p1/p2,s4
5,p1/p2,s5
6,p1/p2,s6
7,p1/p3,s7
8,p1/p3,s8
9,p1/p3,s9
10,p1/p3,s10
11,p2/p3,s11
12,p2/p3,s12
13,p2/p3,s13
14,p2/p3,s14
15,p3,s15

The second file contains data, having sample number, variable name and value. The parents always can be aa,tt,gg,cc (same character repeated twice)

sample,var,value
s1,v1,aa
s1,v2,tt
s1,v3,aa
s1,v4,gg
s2,v1,tt
s2,v2,aa
s2,v3,aa
s2,v4,gg
s3,v1,at
s3,v3,aa
s3,v4,tt
s4,v1,tt
s4,v2,at
s4,v3,aa
s4,v4,gt
s5,v1,aa
s5,v2,tt
s5,v3,aa
s5,v4,gt
s6,v1,aa
s6,v2,aa
s6,v3,aa
s6,v4,tt
s7,v1,aa
s7,v2,aa
s7,v3,at
s7,v4,ag
s8,v1,aa
s8,v2,tt
s8,v3,at
s8,v4,ag
s9,v1,aa
s9,v2,at
s9,v3,tt
s9,v4,gg
s10,v1,aa
s10,v2,at
s10,v3,aa
s10,v4,ag
s11,v1,aa
s11,v2,aa
s11,v3,tt
s11,v4,gg
s12,v1,tt
s12,v2,tt
s12,v3,tt
s12,v4,ag
s13,v1,aa
s13,v2,at
s13,v3,aa
s13,v4,ag
s14,v1,at
s14,v2,aa
s14,v3,at
s14,v4,aa
s15,v1,aa
s15,v2,aa
s15,v3,tt
s15,v4,aa

I am only interested in variables in which a pair of parents dont match. If parents have same value, that variable is not considered in the output, also if one/both parents are absent for a variable, I dont want to consider that one.

What I need to do is create new files for all sets of children with same parents, and assign the variables values a (if matching first parent) , b (if matching second parent) and m (mixture of both) . If data is missing in child variable, hyphen (-) can be used.

So my desired output are 3 files, all in matrix form.

file p1_p2

    s3  s4  s5 s6 
v1  m   b   a   a
v2  -   m   a   b


file p1_p3

    s7  s8  s9 s10 
v2  b   a   m   m
v3  m   m   b   a
v4  m   m   a   m


file p2_p3

    s11  s12  s13 s14 
v1   b    a    b   m 
v3   b    b    a   m
v4   a    m    m   b

I`m ready to answer questions that you may have. please guide me to achieve the output.

disedorgue · February 1, 2015, 7:59pm

Hi,

Here, an awk command file:

$ cat matrx.awk 
BEGIN{FS=",";T="null";I=-1}
FNR == NR {
	if ( $2 !~ /\// ) {
		P[$3]=$2
		Q[$2]=$3
	}
	else {
		C[$3]=$2
	}
	next
}
{
	if ( P[$1] ) {
		V[P[$1]$1$2]=$3
	}
	else {
		split(C[$1],A,"/")
		T == "null" ? T=C[$1] : 0 
		if ( V[P[Q[A[1]]]Q[A[1]]$2] != V[P[Q[A[2]]]Q[A[2]]$2] ) {
			if ( C[$1] == T ) {
				I++
			}
			else {
				J=0
				print T":"
				while (J <= I){
					split(E[J],li,":")
					L[li[1]li[2]]=li[3]
					if(V1[li[1]]!=1){
						V1[li[1]]=1
						V2[G++]=li[1]
					}
					if(M[li[2]]!=1){
						M[li[2]]=1
						B[D++]=li[2]
					}
					J++
				}
				for(Y=0;Y<D;Y++){
					K=K"\t"B[Y]
				}
				print K
				for(Z=0;Z<G;Z++){
					K=V2[Z]
					for(Y=0;Y<D;Y++){
						if(L[V2[Z]B[Y]]) {
							K=K"\t"L[V2[Z]B[Y]]
						}
						else{
							K=K"\t-"
						}
					}
					print K
				}
				I=0
				T=C[$1]
				K=""
				split("",L)
				split("",B)
				split("",V2)
				split("",V1)
				split("",M)
				G=D=0
			}
			V[P[Q[A[1]]]Q[A[1]]$2] == $3 ? X="a" : V[P[Q[A[2]]]Q[A[2]]$2] == $3 ? X="b" : X="m"
			E=$2":"$1":"X
		}
	}
}
END{
	print T":"
	J=0
	while (J <= I){
		split(E[J],li,":")
		L[li[1]li[2]]=li[3]
		if(V1[li[1]]!=1){
			V1[li[1]]=1
			V2[G++]=li[1]
		}
		if(M[li[2]]!=1){
			M[li[2]]=1
			B[D++]=li[2]
		}
		J++
	}
	for(Y=0;Y<D;Y++){
		K=K"\t"B[Y]
	}
	print K
	for(Z=0;Z<G;Z++){
		K=V2[Z]
		for(Y=0;Y<D;Y++){
			if(L[V2[Z]B[Y]]) {
				K=K"\t"L[V2[Z]B[Y]]
			}
			else{
				K=K"\t-"
			}
		}
		print K
	}
}

For this code work fine, we must use as this:

$ awk -F,  'FNR == NR && !/\// {T[$3]=1;next};T[$1] {print}' file1 file2 | awk -f matrx.awk file1 - file2
p1/p2:
	s3	s4	s5	s6
v1	m	b	a	a
v2	-	m	a	b
p1/p3:
	s7	s8	s9	s10
v2	b	a	m	m
v3	m	m	b	a
v4	m	m	a	m
p2/p3:
	s11	s12	s13	s14
v1	b	a	b	m
v3	b	b	a	m
v4	a	m	m	b

Syntax: file1 - file2 is correct, hyphen (-) is output of awk commande before pipe (|)
Regards.

jalaj841 · February 2, 2015, 11:06am

wow..unbelievable, thank you from my heart..never thought it will require so much time and effort..

I will do testing and then confirm results...meanwhile if you can also post a short description of the steps, it will help me learn specially and others also in this forum..

---------- Post updated at 12:06 PM ---------- Previous update was at 12:30 AM ----------

Hi disedorgue,

I found a couple of potential issues, would you please consider these scenarios.

If I use the files

20,p5,s20
21,p5/p6,s21
22,p5/p6,s22
23,p6,s23

and

s20,v5,gg
s20,v6,tt
s21,v5,tt
s21,v6,gg
s22,v7,tt
s22,v5,gg
s23,v5,tt

I get the output

p5/p6:
        s21     s22
v5      b       a
v6      m       -

since v6 is absent in p6, output should just be the first line only

        s21     s22
v5      b       a

I tested some more with the following set

1,p1,s1
16,p4,s16
17,p1/p4,s17
18,p1/p4,s18
19,p1/p4,s19

and

s1,v1,aa
s1,v2,tt
s1,v3,aa
s1,v4,gg
s16,v1,aa
s16,v2,aa
s16,v3,tt
s16,v4,aa
s17,v2,tt
s17,v3,tt
s17,v4,ag
s18,v2,at
s18,v3,aa
s19,v1,aa
s19,v2,aa
s19,v3,at
s19,v4,aa

It gives me

p1/p4:
        s17     s18     s19
v2      m       m       m
v3      m       m       m
v4      m       -       m

It should rather be

    s17 s18 s19  
v2   a   m   b
v3   b   a   m
v4   m   -   b

disedorgue · February 2, 2015, 1:31pm

First issue, is normal, because the script not check this case, to correct this:
replace:

if ( V[P[Q[A[1]]]Q[A[1]]$2] != V[P[Q[A[2]]]Q[A[2]]$2] )

by:

if ( V[P[Q[A[1]]]Q[A[1]]$2] && V[P[Q[A[2]]]Q[A[2]]$2] && V[P[Q[A[1]]]Q[A[1]]$2] != V[P[Q[A[2]]]Q[A[2]]$2] )

Second issue is strange, because this case work fine at home, it return:

$ awk -F,  'FNR == NR && !/\// {T[$3]=1;next};T[$1] {print}' matrx_4 matrx_5 | awk -f matrx.awk matrx_4 - matrx_5
p1/p4:
	s17	s18	s19
v2	a	m	b
v3	b	a	m
v4	m	-	b

Regards.

jalaj841 · February 2, 2015, 5:09pm

Thank you disedorgue, does your code require any sorting of the first or second files?
I`m running this on a huge dataset and getting weird output for some pair of parents.

disedorgue · February 2, 2015, 5:57pm

The first file is stocking in memory (array P,Q and C in code)
Only lines sample number parent of the second file is stocking in memory (Array V in code) and so, only these lines must be at begin of the second file.
If you launch only the awk commande before pipe, you see it return only lines of second file that represent parents.

EDIT: each children pair in second file must be consecutive.

jalaj841 · February 2, 2015, 9:20pm

Can I do this with code? Since the data file has 800 million records, it is impossible to do this manually, sorry for so many questions.

disedorgue · February 3, 2015, 6:34am

The data file with 800 million records is too big to stock in memory.
How many parents in data file ?
An idea of the maximum size of a matrix ?

Regards.

jalaj841 · February 3, 2015, 8:15am

Number of parents is 234 now, but it will grow,, the data file has much redundant data where parents and children have many redundant/extra variables which will not appear in output.. the max size of a matrix will be 8,000 x 100 .. I was monitoring data for 40 parents,,, it was taking 400 Mb memory on top which is not much..I have access to university cluster having 300 gb memory,,so I can run the data there if needed..

disedorgue · February 3, 2015, 7:08pm

Here another awk program that load all in memory:

$ cat newmatrx.awk
BEGIN{FS=","}
FNR == NR {
	if ( $2 !~ /\// ) {
		P[$3]=$2
		Q[$2]=$3
	}
	else {
		C[$3]=$2
	}
	next
}
{
	if ( P[$1] ) {
		V[P[$1]$1$2]=$3
	}
	else {
		split(C[$1],A,"/")
		if ( V[P[Q[A[1]]]Q[A[1]]$2] && V[P[Q[A[2]]]Q[A[2]]$2] && V[P[Q[A[1]]]Q[A[1]]$2] != V[P[Q[A[2]]]Q[A[2]]$2] ) {
			W[C[$1]]++
			V[P[Q[A[1]]]Q[A[1]]$2] == $3 ? X="a" : V[P[Q[A[2]]]Q[A[2]]$2] == $3 ? X="b" : X="m"
			E[W[C[$1]]C[$1]]=$2":"$1":"X
		}
	}
}
END{
	for (i in W){
		print i":"
		J=1
		while (J <= W){
			split(E[J""i],li,":")
			L[li[1]li[2]]=li[3]
			if(V1[li[1]]!=1){
				V1[li[1]]=1
				V2[G++]=li[1]
			}
			if(M[li[2]]!=1){
				M[li[2]]=1
				B[D++]=li[2]
			}
			J++
		}
		for(Y=0;Y<D;Y++){
			K=K"\t"B[Y]
		}
		print K
		for(Z=0;Z<G;Z++){
			K=V2[Z]
			for(Y=0;Y<D;Y++){
				if(L[V2[Z]B[Y]]) {
					K=K"\t"L[V2[Z]B[Y]]
				}
				else{
					K=K"\t-"
				}
			}
			print K
		}
		K=""
		split("",L)
		split("",B)
		split("",V2)
		split("",V1)
		split("",M)
		G=D=0
	}
}

And you must execute this as old version:

awk -F,  'FNR == NR && !/\// {T[$3]=1;next};T[$1] {print}' file1 file2 | awk -f newmatrx.awk file1 - file2

Regards.

jalaj841 · February 4, 2015, 5:22pm

Thank you ! this is running now..I will get back to you if I find some issues.. thank you again....