Kindly check it: Camparison of files only column1 of 2 files

manigrover · July 26, 2012, 11:49pm

Hi all,

I have 2 files in which i have to find commom entries in column 1 an dif soemthing is common write other data of both files in front of it mentioned.

Gene symbol and disease name

column 1    column2
ARFGEF2 CAD
DDEF2 CAD
PSCD3 CAD
PSCD4 CAD
CAMK1 CAD,HT,HT,HT,HT,HT,HT,HT,HT,HT,HT,HT,HT
HSP90AA1 CAD,CAD,CAD,T2D,T2D
KDR CAD,CD,CD
VEGF CAD,CAD,CAD,CAD,T2D,T2D,T2D
CTNNA3 CAD,HT,T2D
PTPRM CAD,T2D
RAC2 CAD,CAD,T1D,T1D
SMAD3 CAD,T2D,T2D,T2D,T2D,T2D,T2D,T2D
SORBS1 CAD,CAD,CAD
CD36 CAD
IRS1 CAD,CAD,CAD
IRS2 CAD,CAD,CAD,CAD
MTFMT CAD,CAD,CAD,T1D,T1D,T1D
SARS CAD
GNPDA2 CAD
NANS CAD
SRD5A1 CAD

Second file with 3 colu

second file is like this

Gene symbol drug drug

F2        Lepirudin    Refludan
FCGR2A,FCGR2B,FCGR2C,EGFR,FCGR3B,C1R,C1QA,C1QB,C1QC,FCGR3A,C1S,FCGR1A        Cetuximab    Erbitux
Not Available        Dornase Alfa    Pulmozyme
IL2RA,IL2RB,IL2RG        Denileukin diftitox    Ontak
C1S,C1R,C1QA,C1QB,C1QC,TNF,TNFRSF1B,FCGR1A,FCGR3A,FCGR2A,FCGR2B,FCGR2C,LTA,FCGR3B        Etanercept    Enbrel
F2        Bivalirudin    Angiomax
GNRHR        Leuprolide    Eligard
IFNAR2,IFNAR1        Peginterferon alfa-2a    Pegasys
PLG,FGA,PLAUR,SERPINE1        Alteplase    Activase (Genentech Inc

The expected out put contain

common entries of first column                        disease from first file                         drug from second file                drug from second file

prabampm · July 27, 2012, 12:10am

I didnt get u properly.. can u explain a bit..

manigrover · July 27, 2012, 12:49am

hi

thanks for ur reply.

I have to compare entries in coumn 1 o ffirst file with entries in column 1 of secoond file both are gene names in capitalised letters.

And, if there is a gene naem for exaple IFNAR1 is common in first column of both files then I have to write the entries of second file and and first file in front of it. so that I will get 4 columns in out put file

first column: common gene entry
secoond column: disease name in front of common entry in first file(CAD,HT)
third column drug in second filein front of common entry
fourth column drug in second file in front of common entry

Please let me know if u need any other information.

Mani

rangarasan · July 27, 2012, 1:26am

Hi,

Try this,

join -11 -21 file1 file2

Cheers,
Ranga:-)

manigrover · July 27, 2012, 1:33am

Hi

thanksf orr reply.
The output is completely blank!

bash-3.2$ join -11 -21 SARAIN_FILE drugbankin_file >sara-drugbankout
bash-3.2$ join -11 -21 SARAIN_FILE drugbankin_file
bash-3.2$ join -11 -21 SARAIN_FILE drugbankin_file
bash-3.2$ join -11 -21 SARAIN_FILE drugbankin_file
bash-3.2$ join -11 -21 SARAIN_FILE drugbankin_file
bash-3.2$

i150371485 · July 27, 2012, 1:40am

Sorry i didnt get you. can you exaplain with some examples. I didnt get the examples which you specified above..

manigrover · July 27, 2012, 1:48am

Alright for example

1 file contains data in 3 columns

column1         col2     col3     
1,2,3                c          d 
4                     g          h 
6,7                  k           c

second file contain 2 columns

1  x
2  u
3  s
4  l

output shuld be check common entreis in column 1 of both files and write the data before it like this as mentione din file 1 and file 2

1  x    c  d 
2  u   c  d 
3  s   c   d 
4  l    g h

rangarasan · July 27, 2012, 2:11am

Hi,
You said col1 value on both the files are same thats why i provided the join suggestion.
Try this one,

awk 'FNR==NR{if($1 ~ /,/){split($1,b,",");for(i in b){a=$2" "$3;}}else{a[$1]=$2" "$3;}next;}{if(a[$1]){print $0,a[$1];}}' file1  file2

Make sure that the col1 should have unique values.
Cheers,
Ranga:-)

manigrover · July 27, 2012, 3:17am

cheers!This has done gud wrk but not all the common entries are matched

for example
if the first file contain more than one time common entries it wont show the data present in front of all 1 digits.

so if the input files are like this

1,2,3                c          d 
1,4                     g          h 
1,2, 6,7                  k           c

and

1  x
2  u
3  s
4  l

output shuld contain all the data in same column with , mentioned between two words of same column with common entry

 1 x    c,g,k  d,h,c  
2  u    c,k      d,c
3  s    c       d 
4  l    g        h

rangarasan · July 27, 2012, 1:47pm

Hi,

Try this one,

awk 'FNR==NR{a[$1]=$2;next;}{if($1 ~ /,/){split($1,b,",");for(i in b){if(a){print i,$2,$3,a;}}}else{if(a[$1]){print $0,a[$1];}}}' file2  file1

Again i assuming file2's col1 has unique values.
I hope it helps you.
Cheers,
Ranga:-)

manigrover · July 30, 2012, 2:37am

Hi all,

In one of the other files which I have to compare data is arrange dlike this Please let me knw the scripting

Mani Grover    1   2   3

Sumit Grover Saxena  1  3  4
  
Mani Grover    3   4  5

Harsh Grover     4  5  6

Mani Grover    4 5 6

Sumit saxena    5  6  8

the second file contain data

Mani Grover    z34  

Sushant Grover   z65  

Harsh Grover    z44

Rinku Saxena    x4y

I want the output shuld be

Mani Grover   z34     1,3,4   2,5,4   3 ,5,6

Harsh Grover     z44  4  5  6

elixir_sinari · July 30, 2012, 2:44am

It would help if you could state the modus operandi of arriving at the output.

manigrover · July 30, 2012, 6:58am

Hi

Thanks for reply.

Yes, first entries in first column( which can be combination of words like Mani Grover) of one file has to be matched with entries in first column of second file

If , there is common entries in both files
then, it should write the other entries the column 2,3,4 separated by comma in front of it in arrangement of columns.

And, it should also match data in column 2 in front of common entry wnd write before it.

Like if Mani Grover is common then output is mentioned in the above part of questions.

Mani

---------- Post updated at 05:58 AM ---------- Previous update was at 02:00 AM ----------

Hi all,

In one of the other files which I have to compare data is arrange dlike this Please let me knw the scripting

Mani Grover    1   2   3

Sumit Grover Saxena  1  3  4
  
Mani Grover    3   4  5

Harsh Grover     4  5  6

Mani Grover    4 5 6

Sumit saxena    5  6  8

the second file contain data

Mani Grover    z34  

Sushant Grover   z65  

Harsh Grover    z44

Rinku Saxena    x4y

I want the output shuld be

Mani Grover   z34     1,3,4   2,5,4   3 ,5,6

Harsh Grover     z44  4  5  6

[/quote]

raj_saini20 · July 30, 2012, 7:34am

Try this

awk 'NR==FNR{
		a[$1]=$0
	}

NR!=FNR{
	n=split($1,b,",");
	for(i=1;i<=n;i++)
		{	if(a[b])	
				{
					print a[b]" "$2" "$3
				}
		}
	}' file2 file1

And output is

1  x c d
2  u c d
3  s c d
4  l g h

manigrover · July 30, 2012, 7:40am

Hi

I tried this but ourput is not accodring to expected
[

bash-3.2$ awk 'NR==FNR{
>                 a[$1]=$0
>         }
> 
> NR!=FNR{
>         n=split($1,b,",");
>         for(i=1;i<=n;i++)
>                 {       if(a[b])
>                                 {
>                                         print a[b]" "$2" "$3
>                                 }
>                 }
>         }' saragenename4 TTDtargets2.txt >sararoughttddrugs5
bash-3.2$

The output is just like

its just repeating the common entries of first columns i think and that also doesnt cotain dat of column2,3, it's blank in fornt of 1,2....

raj_saini20 · August 1, 2012, 7:02am

try this (copy in file and run)


awk '
NR==FNR&&NF>1{
		x=$1;
		for(i=2;i<NF;i++)
				{
					x=substr(x,1)":"substr($i,1);
				};
		a[x]=$NF
	}
(NR!=FNR&&NF>1){
		x=$1;
		for(i=2;i<NF-2;i++)
				{
					x=substr(x,1)":"substr($i,1);
				};
		if(a[x])
				{
					n=split(a[x],b,":");
					if(n==1)
						{
							a[x]=substr(a[x],1)":"substr($NF-2,1)":"substr($NF-1,1)":"substr($NF,1);
						}
					else
						{
							f1=substr(b[2],1)","substr($NF-2,1);
							f2=substr(b[3],1)","substr($NF-1,1);
							f3=substr(b[4],1)","substr($NF,1);
							a[x]=substr(b[1],1)":"substr(f1,1)":"substr(f2,1)":"substr(f3,1);
						}
				}
	}END{
		while(getline < "file2")
			{
				x=$1;
				for(i=2;i<NF;i++)
					{
						x=substr(x,1)":"substr($i,1);
					};
				if(a[x])
					{
						n1=split(a[x],b,":");
						n2=split(x,c,":");
						if(n1>1)
							{
								for(j=1;j<=n2;j++)
									{
										printf c[j]" ";
									};
								print b[1]" "b[2]" "b[3]" "b[4];
							}
					}
			}
	}
' file2 file1

and output is

Mani Grover z34 1,3,4 2,4,5 3,5,6
Harsh Grover z44 4 5 6

Here assumption is made that in file2 the names entries are unique