Venn Data Maker

jacobs.smith · August 18, 2016, 1:06pm

Hi,

My input is like this

head input.txt
Set1,Set2,Set3
g1,g2,g3
g2,g1,g3,
g4,g5,g5
g1,g1,g1,
g2,g1,g1,
g6,g7,g8
,g7,g8
,,g8

My output file should be

Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

Logic

First get all unique genenames (g1,g2.....g8).
Then look if that particular gene is present in any of the columns in the input file.
If it is present, print 1. If absent, print 0.

Special Notes

Please note that the each of the columns (Set1,Set2,Set3) in input.txt can have missing values(last two records in input.txt).
The columns are not always three. My actual input file has 7. So I want the column counts to be dynamic.

Thanks

Yoda · August 18, 2016, 1:15pm

An approach using gawk:-

gawk -F, '
        NR == 1  {
                print "Name," $0
        }
        NR > 1 {
                for ( i = 1; i <= NF; i++ )
                {
                        if ( $i )
                                T[$i]
                        R[$i FS i]
                }
        }
        END {
                n = asorti(T)
                for ( i = 1; i <= n; i++ )
                {
                        for ( j = 1; j <= NF; j++ )
                        {
                                if ( ( T FS j ) in R )
                                        S = S ? S FS 1 : T FS 1
                                else
                                        S = S ? S FS 0 : T FS 0
                        }
                        print S
                        S = ""
                }
        }
' file

RavinderSingh13 · August 18, 2016, 1:51pm

Hello jacobs.smith,

If you are not bothered about sequence of field 1st as per your Input_file then following may help you in same.

awk -F, 'NR==1{print "Name," $0;R=NF} NR>1{for(i=1;i<=NF;i++){A[$i,i]++;if($i){C[$i]}}} END{for(i in C){for(j=1;j<=R;j++){Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};print Q;Q=""}}'  Input_file

Output will be as follows.

Name,Set1,Set2,Set3
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0

In case you need output into same order as per Input-file(sorted order) then following may help you in same.

awk -F, 'NR==1{print "Name," $0;R=NF} NR>1{for(i=1;i<=NF;i++){A[$i,i]++;if($i){C[$i]}}} END{for(i in C){for(j=1;j<=R;j++){Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};print Q;Q=""}}' Input_file  | sort -k1

Output will be as follows.

Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

EDIT: Adding a non-one liner form of solutions here.
Solution1:

awk -F, 'NR==1{
                print "Name," $0;
                R=NF
              }
         NR>1 {
                for(i=1;i<=NF;i++){
                                        A[$i,i]++;
                                        if($i){
                                                C[$i]
                                              }
                                  }
              }
         END  {
                for(i in C)       {
                                        for(j=1;j<=R;j++){
                                                                Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};
                                                                print Q;
                                                                Q=""
                                                         }
              }
         ' Input_file

Solution2:

awk -F, 'NR==1{
                print "Name," $0;
                R=NF
              }
         NR>1 {
                for(i=1;i<=NF;i++){
                                        A[$i,i]++;
                                        if($i){
                                                C[$i]
                                              }
                                  }
              }
         END  {
                for(i in C)       {
                                        for(j=1;j<=R;j++){
                                                                Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};
                                                                print Q;
                                                                Q=""
                                                         }
              }
         ' Input_file  | sort -k1

Thanks,
R. Singh

RudiC · August 18, 2016, 4:40pm

Try also

awk '
NR==1   {print "Name", $0
         next
        }
        {for (i=1; i<=3; i++)   {T[$i]
                                 R[$i,i] = 1
                                }
        }
END     {delete T[""]
         for (t in T) print t, R[t,1]+0, R[t,2]+0, R[t,3]+0
        }
' FS=, OFS=, file
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

jacobs.smith · August 19, 2016, 1:01pm

Thank you folks - @Yoda @RavinderSingh13 and @Rudic.

Can I also get the intersection list into another file?

Intersectionlist.txt
Set1_unique=2
Set2_unique=1
Set3_unique=2
Set12_common=1
Set13_common=0
Set23_common=1
Set123_common=1

Thanks

RudiC · August 19, 2016, 1:12pm

Do we need to guess what an "intersection" is? Any attempt from your side?

jacobs.smith · August 19, 2016, 1:20pm

My apologies Rudic.

If the value is "1" in any set, that means a presence of value and it should be counted.

If the value is "0" in any set, that means an absent and it should not be counted.

Ex:

Name, set1, set2, set3
g1,0,0,1
g2,0,0,1
g3,1,1,0

Gene1 and Gene2 are present only in set3. So set3_unique=2.

Gene3 is present in both set1 and set2. So set12_common=1

Please ask me more questions and I will be glad to reply.

Also - the number of lines in the intersectionlist.txt should be equal to = (2^(number of sets))-1

Thanks.

RudiC · August 19, 2016, 1:24pm

And the second question?

jacobs.smith · August 19, 2016, 1:33pm

I am loving your questions.

Glad to learn.

Here is a way I tried. But it has two disadvantages.

One - I can only do three sets. But my actual input has 7 and even more.

Two - I cannot write the first column saying unique or common.

for i in 100 010 001 110 101 011 111; do awk -F"," 'NR>1 {print $2$3$4}' 1 | grep $i | wc -l;done

Thanks

RudiC · August 19, 2016, 1:36pm

How about

awk '
NR==1   {print "Name", $0
         next
        }
        {for (i=1; i<=3; i++)   {T[$i]
                                 R[$i,i] = 1
                                }
        }
END     {delete T[""]
         for (t in T)   {print t, R[t,1]+0, R[t,2]+0, R[t,3]+0
                         TMP = R[t,1] * 100 + R[t,2] * 10 + R[t,3]
                         if (TMP == 111) Set123++
                         if (TMP == 110) Set12++
                         if (TMP == 101) Set13++
                         if (TMP == 11)  Set23++
                         if (TMP == 100) Set1++
                         if (TMP == 10)  Set2++
                         if (TMP == 1)   Set3++
                        }
         print "Set1_unique="   0+Set1
         print "SDet2_unique="  0+Set2
         print "Set3_unique="   0+Set3
         print "Set12_common="  0+Set12
         print "Set13_common="  0+Set13
         print "Set23_common="  0+Set23
         print "Set123_common=" 0+Set123
        }
' FS=, OFS=, file
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
Set1_unique=2
SDet2_unique=1
Set3_unique=2
Set12_common=1
Set13_common=0
Set23_common=1
Set123_common=1

jacobs.smith · August 19, 2016, 1:41pm

Thanks Rudic.

But I would like to make it dynamic.

The post example has 3 sets. But my actual input file has numerous sets.

Can you please share any comments on how I can edit your solution?

RudiC · August 19, 2016, 1:50pm

That's not that easy, as the number of combinations grows dramatically with increasing set count.

RavinderSingh13 · August 19, 2016, 3:25pm

Hello jacobs.smith,

Let's say we have following Input_file(which willbe created by your 1st requirement, so I have edited it to test it more).

cat Input_file
Name,Set1,Set2,Set3,Set4,Set5,Set6,Set7,Set8
g5,0,1,1,1,0,1,1,0
g6,1,0,0,0,0,0,0,0
g7,0,1,0,0,0,0,0,1
g8,0,0,1,1,1,1,0,1
g1,1,1,1,0,1,1,1,0
g2,1,1,0,1,1,1,1,1
g3,0,0,1,0,0,0,1,1
g4,1,0,0,1,0,0,1,1

Then following code may help you in same.

awk -F, 'NR==1{
		next
              } 
              {
		for(i=2;i<=NF;i++){
					for(j=i+1;j<=NF;j++){
								if($i==$j && $i!=0 && $j!=0){
												S["Set"(i-1)(j-1)"_common"]++;
											    };
                                                            }
                                  }
              } 
              {
                for(q=2;q<=NF;q++){
					if($q==1)           {
								num=q-1;
								E++
							    }
                                  };
                if(E==1)          {
					Y["Set"num"_unique"]++
				  };
		E=""
              } 
         END  {
		for(i in S){
				print i OFS S
			   }
                for(u in Y){
				print u OFS Y
                           }
              }
         '   Input_file

Then output will be as follows:

Set28_common 2
Set27_common 3
Set18_common 2
Set45_common 2
Set35_common 2
Set36_common 3
Set34_common 2
Set78_common 3
Set25_common 2
Set26_common 3
Set24_common 2
Set17_common 3
Set16_common 2
Set15_common 2
Set68_common 2
Set58_common 2
Set23_common 2
Set14_common 2
Set13_common 1
Set12_common 2
Set67_common 3
Set57_common 2
Set56_common 3
Set48_common 3
Set47_common 3
Set46_common 3
Set38_common 2
Set37_common 3
Set1_unique 1

Now you could make All in All command as follows, which you could run with original Input_file(posted in POST#1)

awk -F, 'NR==1{
                print "Name," $0;
                R=NF
              }
         NR>1 {
                for(i=1;i<=NF;i++){
                                        A[$i,i]++;
                                        if($i){
                                                C[$i]
                                              }
                                  }
              }
         END  {
                for(i in C)       {
                                        for(j=1;j<=R;j++){
                                                                Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};
                                                                print Q;
                                                                Q=""
                                                         }
              }
         ' Input_file   |   awk -F, 'NR==1{
		next
              } 
              {
		for(i=2;i<=NF;i++){
					for(j=i+1;j<=NF;j++){
								if($i==$j && $i!=0 && $j!=0){
												S["Set"(i-1)(j-1)"_common"]++;
											    };
                                                            }
                                  }
              } 
              {
                for(q=2;q<=NF;q++){
					if($q==1)           {
								num=q-1;
								E++
							    }
                                  };
                if(E==1)          {
					Y["Set"num"_unique"]++
				  };
		E=""
              } 
         END  {
		for(i in S){
				print i OFS S
			   }
                for(u in Y){
				print u OFS Y
                           }
              }
         '

Output will be as follows(as per Input_file into your POST#1).

Set23_common 2
Set13_common 1
Set12_common 2
Set1_unique 2
Set3_unique 2
Set2_unique 1

Please let me know if this helps you, will be glad.

Thanks,
R. Singh

jacobs.smith · August 19, 2016, 3:30pm

ravindersingh13:

Hello jacobs.smith,

Let's say we have following Input_file(which willbe created by your 1st requirement, so I have edited it to test it more).

cat Input_file
Name,Set1,Set2,Set3,Set4,Set5,Set6,Set7,Set8
g5,0,1,1,1,0,1,1,0
g6,1,0,0,0,0,0,0,0
g7,0,1,0,0,0,0,0,1
g8,0,0,1,1,1,1,0,1
g1,1,1,1,0,1,1,1,0
g2,1,1,0,1,1,1,1,1
g3,0,0,1,0,0,0,1,1
g4,1,0,0,1,0,0,1,1

Then following code may help you in same.

awk -F, 'NR==1{
		next
   } 
   {
		for(i=2;i<=NF;i++){
					for(j=i+1;j<=NF;j++){
								if($i==$j && $i!=0 && $j!=0){
												S["Set"(i-1)(j-1)"_common"]++;
											    };
   }
   }
   } 
   {
   for(q=2;q<=NF;q++){
					if($q==1)           {
								num=q-1;
								E++
							    }
   };
   if(E==1)          {
					Y["Set"num"_unique"]++
				  };
		E=""
   } 
   END  {
		for(i in S){
				print i OFS S
			   }
   for(u in Y){
				print u OFS Y
   }
   }
   '   Input_file

Then output will be as follows:

Set28_common 2
Set27_common 3
Set18_common 2
Set45_common 2
Set35_common 2
Set36_common 3
Set34_common 2
Set78_common 3
Set25_common 2
Set26_common 3
Set24_common 2
Set17_common 3
Set16_common 2
Set15_common 2
Set68_common 2
Set58_common 2
Set23_common 2
Set14_common 2
Set13_common 1
Set12_common 2
Set67_common 3
Set57_common 2
Set56_common 3
Set48_common 3
Set47_common 3
Set46_common 3
Set38_common 2
Set37_common 3
Set1_unique 1

Now you could make All in All command as follows, which you could run with original Input_file(posted in POST#1)

awk -F, 'NR==1{
   print "Name," $0;
   R=NF
   }
   NR>1 {
   for(i=1;i<=NF;i++){
   A[$i,i]++;
   if($i){
   C[$i]
   }
   }
   }
   END  {
   for(i in C)       {
   for(j=1;j<=R;j++){
   Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};
   print Q;
   Q=""
   }
   }
   ' Input_file   |   awk -F, 'NR==1{
		next
   } 
   {
		for(i=2;i<=NF;i++){
					for(j=i+1;j<=NF;j++){
								if($i==$j && $i!=0 && $j!=0){
												S["Set"(i-1)(j-1)"_common"]++;
											    };
   }
   }
   } 
   {
   for(q=2;q<=NF;q++){
					if($q==1)           {
								num=q-1;
								E++
							    }
   };
   if(E==1)          {
					Y["Set"num"_unique"]++
				  };
		E=""
   } 
   END  {
		for(i in S){
				print i OFS S
			   }
   for(u in Y){
				print u OFS Y
   }
   }
   '

Output will be as follows(as per Input_file into your POST#1).

Set23_common 2
Set13_common 1
Set12_common 2
Set1_unique 2
Set3_unique 2
Set2_unique 1

Please let me know if this helps you, will be glad.

Thanks,
R. Singh

You are a legend R.Singh. Thank you!

RavinderSingh13 · August 19, 2016, 3:38pm

Thank you jsscob.smith, it is all Almigthy's grace on me. Glad it helped you, did you test all permutations and cominations here? As I haven't done that much testing in it, please confirm.

Also thank you for asking this question, please keep asking questions(trying by yourself too), wherever is WILL there is a PATH.

Thanks,
R. Singh

RudiC · August 19, 2016, 3:39pm

So, with 7 sets there should be 127 lines, no? And the sum of individual set counts should be equal to the No. of lines?

Should g2,1,1,0,1,1,1,1,1 from RavinderSingh13's example be in Set1245678 or in Set12 , Set14 , Set15 , ..., Set78 ?

jacobs.smith · August 19, 2016, 5:10pm

Hi R.Singh,

I checked it with the input file.

But the number of lines in the output.txt doesn't reach to be 127.

I guess, it is printing only the values where there is a common or unique set.

However, I would like to see all combination values.

Thanks

RudiC · August 19, 2016, 6:32pm

OK, try this:

awk '
NR==1   {print "Name", $0
         CC = NF
         for (i=1; i<2^CC; i++) Set = 0
         next
        }
        {for (i=1; i<=CC; i++)  {T[$i]
                                 R[$i,i] = 1
                                }
        }
END     {delete T[""]
         for (t in T)   {printf "%s", t
                         for (i=1; i<=CC; i++)  printf ",%s", R[t,i]+0
                         printf RS
                         TMP = 0
                         for (i=1; i<=CC; i++)  TMP = TMP + 2^(i-1)*R[t,i]
                         Set[TMP]++
                        }
         for (i=1; i<2^CC; i++) {printf "Set"; for (j=0; j<CC; j++) if (int(i/2^j)%2) printf "%d", j+1; printf "=%d%s", Set, RS }
        }
' FS=,  file
Name,Set1,Set2,Set3,Set4,S5,S6
g1,1,1,1,1,1,1
g2,1,1,0,0,1,1
g3,0,0,1,0,1,1
g4,1,0,0,1,0,0
g5,1,1,1,1,1,1
g6,1,1,0,0,1,1
g7,0,1,0,1,1,0
g8,0,0,1,1,0,0
Set1=0
Set2=0
Set12=0
Set3=0
Set13=0
Set23=0
Set123=0
Set4=0
Set14=1
Set24=0
Set124=0
Set34=1
.
.
.
Set145=0
Set245=1
Set1245=0
.
.
.
Set256=0
Set1256=2
Set356=1
Set1356=0
.
.
.
Set23456=0
Set123456=2

Lines: 63 ( = 2^6 -1 )
Sum (set-values): 8 (8 different genes)

A bit complicated as awk doesn't provide binary operations nor print formats.

RudiC · August 21, 2016, 1:14pm

Another small refinement:

awk '
NR==1   {print "Name," $0
         CC = NF
         for (i=1; i<2^CC; i++) Set = 0
         next
        }

        {for (i=1; i<=CC; i++)  {T[$i]
                                 R[$i,i] = 1
                                }
        }

END     {delete T[""]
         for (t in T)   {TMP = 0
                         for (i=1; i<=CC; i++)  {printf "%s,%s%s", i==1?t:_, R[t,i]+0, i==CC?RS:_
                                                 TMP = TMP + 2^(i-1)*R[t,i]
                                                }
                         Set[TMP]++
                        }
         for (i=1; i<2^CC; i++) {TMP = 0
                                 for (j=0; j<CC; j++) if (int(i/2^j)%2) TMP = TMP * 10 + j+1
                                 printf "Set%d_%s=%d" RS, TMP, TMP<10?"unique":"common", Set |  "sort -k1.4,1n"
                                }
        }
' FS=, file
Name,Set1,Set2,Set3,S4,S5
g1,1,1,1,1,1
g2,1,1,0,1,0
g3,0,0,1,1,0
g4,1,0,0,1,1
g5,0,1,1,1,1
g6,1,0,0,1,1
g7,0,1,0,1,1
g8,0,0,1,0,1
Set1_unique=0
Set2_unique=0
Set3_unique=0
Set4_unique=0
Set5_unique=0
Set12_common=0
Set13_common=0
Set14_common=0
Set15_common=0
Set23_common=0
Set24_common=0
Set25_common=0
Set34_common=1
Set35_common=1
Set45_common=0
Set123_common=0
Set124_common=1
Set125_common=0
Set134_common=0
Set135_common=0
Set145_common=2
Set234_common=0
Set235_common=0
Set245_common=1
Set345_common=0
Set1234_common=0
Set1235_common=0
Set1245_common=0
Set1345_common=0
Set2345_common=1
Set12345_common=1