Venn Data Maker

Hi,

My input is like this

head input.txt
Set1,Set2,Set3
g1,g2,g3
g2,g1,g3,
g4,g5,g5
g1,g1,g1,
g2,g1,g1,
g6,g7,g8
,g7,g8
,,g8

My output file should be

Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

Logic

  1. First get all unique genenames (g1,g2.....g8).
  2. Then look if that particular gene is present in any of the columns in the input file.
  3. If it is present, print 1. If absent, print 0.

Special Notes

  1. Please note that the each of the columns (Set1,Set2,Set3) in input.txt can have missing values(last two records in input.txt).
  2. The columns are not always three. My actual input file has 7. So I want the column counts to be dynamic.

Thanks

1 Like

An approach using gawk:-

gawk -F, '
        NR == 1  {
                print "Name," $0
        }
        NR > 1 {
                for ( i = 1; i <= NF; i++ )
                {
                        if ( $i )
                                T[$i]
                        R[$i FS i]
                }
        }
        END {
                n = asorti(T)
                for ( i = 1; i <= n; i++ )
                {
                        for ( j = 1; j <= NF; j++ )
                        {
                                if ( ( T FS j ) in R )
                                        S = S ? S FS 1 : T FS 1
                                else
                                        S = S ? S FS 0 : T FS 0
                        }
                        print S
                        S = ""
                }
        }
' file
1 Like

Hello jacobs.smith,

If you are not bothered about sequence of field 1st as per your Input_file then following may help you in same.

awk -F, 'NR==1{print "Name," $0;R=NF} NR>1{for(i=1;i<=NF;i++){A[$i,i]++;if($i){C[$i]}}} END{for(i in C){for(j=1;j<=R;j++){Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};print Q;Q=""}}'  Input_file 

Output will be as follows.

Name,Set1,Set2,Set3
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0

In case you need output into same order as per Input-file(sorted order) then following may help you in same.

awk -F, 'NR==1{print "Name," $0;R=NF} NR>1{for(i=1;i<=NF;i++){A[$i,i]++;if($i){C[$i]}}} END{for(i in C){for(j=1;j<=R;j++){Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};print Q;Q=""}}' Input_file  | sort -k1

Output will be as follows.

Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

EDIT: Adding a non-one liner form of solutions here.
Solution1:

awk -F, 'NR==1{
                print "Name," $0;
                R=NF
              }
         NR>1 {
                for(i=1;i<=NF;i++){
                                        A[$i,i]++;
                                        if($i){
                                                C[$i]
                                              }
                                  }
              }
         END  {
                for(i in C)       {
                                        for(j=1;j<=R;j++){
                                                                Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};
                                                                print Q;
                                                                Q=""
                                                         }
              }
         ' Input_file

Solution2:

awk -F, 'NR==1{
                print "Name," $0;
                R=NF
              }
         NR>1 {
                for(i=1;i<=NF;i++){
                                        A[$i,i]++;
                                        if($i){
                                                C[$i]
                                              }
                                  }
              }
         END  {
                for(i in C)       {
                                        for(j=1;j<=R;j++){
                                                                Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};
                                                                print Q;
                                                                Q=""
                                                         }
              }
         ' Input_file  | sort -k1
 

Thanks,
R. Singh

1 Like

Try also

awk '
NR==1   {print "Name", $0
         next
        }
        {for (i=1; i<=3; i++)   {T[$i]
                                 R[$i,i] = 1
                                }
        }
END     {delete T[""]
         for (t in T) print t, R[t,1]+0, R[t,2]+0, R[t,3]+0
        }
' FS=, OFS=, file
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

Thank you folks - @Yoda @RavinderSingh13 and @Rudic.

Can I also get the intersection list into another file?

Intersectionlist.txt
Set1_unique=2
Set2_unique=1
Set3_unique=2
Set12_common=1
Set13_common=0
Set23_common=1
Set123_common=1

Thanks

Do we need to guess what an "intersection" is? Any attempt from your side?

My apologies Rudic.

If the value is "1" in any set, that means a presence of value and it should be counted.

If the value is "0" in any set, that means an absent and it should not be counted.

Ex:

Name, set1, set2, set3
g1,0,0,1
g2,0,0,1
g3,1,1,0

Gene1 and Gene2 are present only in set3. So set3_unique=2.

Gene3 is present in both set1 and set2. So set12_common=1

Please ask me more questions and I will be glad to reply.

Also - the number of lines in the intersectionlist.txt should be equal to = (2^(number of sets))-1

Thanks.

And the second question?

I am loving your questions.

Glad to learn.

Here is a way I tried. But it has two disadvantages.

One - I can only do three sets. But my actual input has 7 and even more.

Two - I cannot write the first column saying unique or common.

for i in 100 010 001 110 101 011 111; do awk -F"," 'NR>1 {print $2$3$4}' 1 | grep $i | wc -l;done

Thanks

How about

awk '
NR==1   {print "Name", $0
         next
        }
        {for (i=1; i<=3; i++)   {T[$i]
                                 R[$i,i] = 1
                                }
        }
END     {delete T[""]
         for (t in T)   {print t, R[t,1]+0, R[t,2]+0, R[t,3]+0
                         TMP = R[t,1] * 100 + R[t,2] * 10 + R[t,3]
                         if (TMP == 111) Set123++
                         if (TMP == 110) Set12++
                         if (TMP == 101) Set13++
                         if (TMP == 11)  Set23++
                         if (TMP == 100) Set1++
                         if (TMP == 10)  Set2++
                         if (TMP == 1)   Set3++
                        }
         print "Set1_unique="   0+Set1
         print "SDet2_unique="  0+Set2
         print "Set3_unique="   0+Set3
         print "Set12_common="  0+Set12
         print "Set13_common="  0+Set13
         print "Set23_common="  0+Set23
         print "Set123_common=" 0+Set123
        }
' FS=, OFS=, file
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
Set1_unique=2
SDet2_unique=1
Set3_unique=2
Set12_common=1
Set13_common=0
Set23_common=1
Set123_common=1
1 Like

Thanks Rudic.

But I would like to make it dynamic.

The post example has 3 sets. But my actual input file has numerous sets.

Can you please share any comments on how I can edit your solution?

That's not that easy, as the number of combinations grows dramatically with increasing set count.

1 Like

Hello jacobs.smith,

Let's say we have following Input_file(which willbe created by your 1st requirement, so I have edited it to test it more).

cat Input_file
Name,Set1,Set2,Set3,Set4,Set5,Set6,Set7,Set8
g5,0,1,1,1,0,1,1,0
g6,1,0,0,0,0,0,0,0
g7,0,1,0,0,0,0,0,1
g8,0,0,1,1,1,1,0,1
g1,1,1,1,0,1,1,1,0
g2,1,1,0,1,1,1,1,1
g3,0,0,1,0,0,0,1,1
g4,1,0,0,1,0,0,1,1

Then following code may help you in same.

awk -F, 'NR==1{
		next
              } 
              {
		for(i=2;i<=NF;i++){
					for(j=i+1;j<=NF;j++){
								if($i==$j && $i!=0 && $j!=0){
												S["Set"(i-1)(j-1)"_common"]++;
											    };
                                                            }
                                  }
              } 
              {
                for(q=2;q<=NF;q++){
					if($q==1)           {
								num=q-1;
								E++
							    }
                                  };
                if(E==1)          {
					Y["Set"num"_unique"]++
				  };
		E=""
              } 
         END  {
		for(i in S){
				print i OFS S
			   }
                for(u in Y){
				print u OFS Y
                           }
              }
         '   Input_file

Then output will be as follows:

Set28_common 2
Set27_common 3
Set18_common 2
Set45_common 2
Set35_common 2
Set36_common 3
Set34_common 2
Set78_common 3
Set25_common 2
Set26_common 3
Set24_common 2
Set17_common 3
Set16_common 2
Set15_common 2
Set68_common 2
Set58_common 2
Set23_common 2
Set14_common 2
Set13_common 1
Set12_common 2
Set67_common 3
Set57_common 2
Set56_common 3
Set48_common 3
Set47_common 3
Set46_common 3
Set38_common 2
Set37_common 3
Set1_unique 1

Now you could make All in All command as follows, which you could run with original Input_file(posted in POST#1)

awk -F, 'NR==1{
                print "Name," $0;
                R=NF
              }
         NR>1 {
                for(i=1;i<=NF;i++){
                                        A[$i,i]++;
                                        if($i){
                                                C[$i]
                                              }
                                  }
              }
         END  {
                for(i in C)       {
                                        for(j=1;j<=R;j++){
                                                                Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};
                                                                print Q;
                                                                Q=""
                                                         }
              }
         ' Input_file   |   awk -F, 'NR==1{
		next
              } 
              {
		for(i=2;i<=NF;i++){
					for(j=i+1;j<=NF;j++){
								if($i==$j && $i!=0 && $j!=0){
												S["Set"(i-1)(j-1)"_common"]++;
											    };
                                                            }
                                  }
              } 
              {
                for(q=2;q<=NF;q++){
					if($q==1)           {
								num=q-1;
								E++
							    }
                                  };
                if(E==1)          {
					Y["Set"num"_unique"]++
				  };
		E=""
              } 
         END  {
		for(i in S){
				print i OFS S
			   }
                for(u in Y){
				print u OFS Y
                           }
              }
         ' 

Output will be as follows(as per Input_file into your POST#1).

Set23_common 2
Set13_common 1
Set12_common 2
Set1_unique 2
Set3_unique 2
Set2_unique 1

Please let me know if this helps you, will be glad.

Thanks,
R. Singh

1 Like

You are a legend R.Singh. Thank you!

Thank you jsscob.smith, it is all Almigthy's grace on me. Glad it helped you, did you test all permutations and cominations here? As I haven't done that much testing in it, please confirm.

Also thank you for asking this question, please keep asking questions(trying by yourself too), wherever is WILL there is a PATH.

Thanks,
R. Singh

So, with 7 sets there should be 127 lines, no? And the sum of individual set counts should be equal to the No. of lines?

Should g2,1,1,0,1,1,1,1,1 from RavinderSingh13's example be in Set1245678 or in Set12 , Set14 , Set15 , ..., Set78 ?

Hi R.Singh,

I checked it with the input file.

But the number of lines in the output.txt doesn't reach to be 127.

I guess, it is printing only the values where there is a common or unique set.

However, I would like to see all combination values.

Thanks

OK, try this:

awk '
NR==1   {print "Name", $0
         CC = NF
         for (i=1; i<2^CC; i++) Set = 0
         next
        }
        {for (i=1; i<=CC; i++)  {T[$i]
                                 R[$i,i] = 1
                                }
        }
END     {delete T[""]
         for (t in T)   {printf "%s", t
                         for (i=1; i<=CC; i++)  printf ",%s", R[t,i]+0
                         printf RS
                         TMP = 0
                         for (i=1; i<=CC; i++)  TMP = TMP + 2^(i-1)*R[t,i]
                         Set[TMP]++
                        }
         for (i=1; i<2^CC; i++) {printf "Set"; for (j=0; j<CC; j++) if (int(i/2^j)%2) printf "%d", j+1; printf "=%d%s", Set, RS }
        }
' FS=,  file
Name,Set1,Set2,Set3,Set4,S5,S6
g1,1,1,1,1,1,1
g2,1,1,0,0,1,1
g3,0,0,1,0,1,1
g4,1,0,0,1,0,0
g5,1,1,1,1,1,1
g6,1,1,0,0,1,1
g7,0,1,0,1,1,0
g8,0,0,1,1,0,0
Set1=0
Set2=0
Set12=0
Set3=0
Set13=0
Set23=0
Set123=0
Set4=0
Set14=1
Set24=0
Set124=0
Set34=1
.
.
.
Set145=0
Set245=1
Set1245=0
.
.
.
Set256=0
Set1256=2
Set356=1
Set1356=0
.
.
.
Set23456=0
Set123456=2

Lines: 63 ( = 2^6 -1 )
Sum (set-values): 8 (8 different genes)

A bit complicated as awk doesn't provide binary operations nor print formats.

2 Likes

Another small refinement:

awk '
NR==1   {print "Name," $0
         CC = NF
         for (i=1; i<2^CC; i++) Set = 0
         next
        }

        {for (i=1; i<=CC; i++)  {T[$i]
                                 R[$i,i] = 1
                                }
        }

END     {delete T[""]
         for (t in T)   {TMP = 0
                         for (i=1; i<=CC; i++)  {printf "%s,%s%s", i==1?t:_, R[t,i]+0, i==CC?RS:_
                                                 TMP = TMP + 2^(i-1)*R[t,i]
                                                }
                         Set[TMP]++
                        }
         for (i=1; i<2^CC; i++) {TMP = 0
                                 for (j=0; j<CC; j++) if (int(i/2^j)%2) TMP = TMP * 10 + j+1
                                 printf "Set%d_%s=%d" RS, TMP, TMP<10?"unique":"common", Set |  "sort -k1.4,1n"
                                }
        }
' FS=, file
Name,Set1,Set2,Set3,S4,S5
g1,1,1,1,1,1
g2,1,1,0,1,0
g3,0,0,1,1,0
g4,1,0,0,1,1
g5,0,1,1,1,1
g6,1,0,0,1,1
g7,0,1,0,1,1
g8,0,0,1,0,1
Set1_unique=0
Set2_unique=0
Set3_unique=0
Set4_unique=0
Set5_unique=0
Set12_common=0
Set13_common=0
Set14_common=0
Set15_common=0
Set23_common=0
Set24_common=0
Set25_common=0
Set34_common=1
Set35_common=1
Set45_common=0
Set123_common=0
Set124_common=1
Set125_common=0
Set134_common=0
Set135_common=0
Set145_common=2
Set234_common=0
Set235_common=0
Set245_common=1
Set345_common=0
Set1234_common=0
Set1235_common=0
Set1245_common=0
Set1345_common=0
Set2345_common=1
Set12345_common=1
1 Like