Match look up file and find result

manigrover · September 10, 2012, 6:29am

Hi

I have a lookup file with seven words:

CD
HT
CAD
HT
T1D
T2D
BD

Another file contains data like this:

CHRM1    P11229    Pirenzepine    DAP000492    Peptic ulcer disease    Approved T2D
CHRM1    P11229    Glycopyrrolate    DAP001116    Anesthetic    Approved T2D
CHRM1    P11229    Clidinium    DAP001117    Abdominal/stomach pain    Approved T2D
CHRM1    P11229    Dicyclomine    DAP001118    Irritable bowel syndrome    Approved T2D
CHRM1    P11229    Ethopropazine    DAP001119    Parkinson's disease    Approved T2D
CHRM1    P11229    Cycrimine    DAP001120    Parkinson's disease    Approved T2D
CHRM1    P11229    Benztropine    DAP001121    Parkinson's disease    Approved T2D
CHRM1    P11229    Trihexyphenidyl    DAP001122    Parkinson's disease    Approved T2D
CHRM1    P11229    Propantheline    DAP001123    Excessive sweating (hyperhidrosis)    Approved T2D
CHRM1    P11229    Oxyphenonium    DAP001124    Spasm    Approved T2D
CHRM1    P11229    Biperiden    DAP001125    Parkinson's disease    Approved T2D
CHRM1    P11229    Talsaclidine isomer    DCL000268    Alzheimer's disease    Discontinued T2D
CHRM1    P11229    Sabcomeline hydrochloride    DCL000279    Cardiovascular diseases    Phase IIa T2D
CHRM1    P11229    Talsaclidine fumarate    DCL000303    Alzheimer's disease    Discontinued T2D
CHRM1    P11229    Xanomeline tartrate    DCL000328    Alzheimer's disease    Phase II T2D
CHRM1    P11229    GSK573719    DCL000381    Chronic Obstructive Pulmonary Disease (COPD)    Phase II T2D
CHRM1    P11229    GSK961081    DCL000397    Chronic Obstructive Pulmonary Disease (COPD)    Phase II completed T2D
CHRM1    P11229    GSK1034702    DCL000402    Schizophrenia, Dementia    Phase I completed T2D
CHRM1    P11229    Darotropium    DCL000514    COPD    Suspended in Phase II in GSK 2009 Report T2D
CHRM1    P11229    Darotropium + 642444    DCL000515    COPD    Phase III T2D
CHRM1    P11229    Revatropate    DCL000957    Chronic obstructive pulmonary disease    Discontinued in Phase I T2D
FLT1    P17948    Sorafenib    DAP000006    Advanced renal cell carcinoma    Launched CAD
FLT1    P17948    Sorafenib    DAP000006    Hepatocellular carcinoma, NSCLC, melanoma    Phase III CAD
FLT1    P17948    Sorafenib    DAP000006    Myelodyspalstic syndrome, AML, head & neck cancer, breast, colon, ovarian, pancreatic cancer    Phase II CAD
FLT1    P17948    Ranibizumab    DAP001260    Age-related macular degeneration    Approved CAD
FLT1    P17948    Ranibizumab    DAP001260    Diabetic macular edema and retinal vein occlusion    Phase III CAD
FLT1    P17948    Telbermin    DCL001016    Diabetic foot ulcers    Discontinued in Phase II CAD
KDR    P35968    Sunitinib    DAP000005    Advanced renal cell carcinoma    Launched CAD,CD,CD
KDR    P35968    Sunitinib    DAP000005    Advanced renal cell carcinoma    Phase II CAD,CD,CD
KDR    P35968    Pazopanib HCl    DAP001550    Renal cell carcinoma    Approved CAD,CD,CD
KDR    P35968    CYC116    DCL000010    Solid Tumors    Terminated in Phase I CAD,CD,CD
KDR    P35968    XL999    DCL000011    Advanced Malignancies    Phase I CAD,CD,CD
KDR    P35968    CT-322    DCL000096    Cancer/Tumors    Phase I CAD,CD,CD
KDR    P35968    CT-322    DCL000096    Macular Degeneration    Preclinical CAD,CD,CD
KDR    P35968    XL647    DCL000263    Cancer    Phase I completed CAD,CD,CD
KDR    P35968    XL647    DCL000263    Carcinoma, Non-Small-Cell Lung    Phase II completed CAD,CD,CD
KDR    P35968    XL880    DCL000265    Solid Tumors    Phase I CAD,CD,CD
KDR    P35968    XL880    DCL000265    Gastric Cancer, Renal Cell Carcinoma, Squamous Cell Cancer of the Head and Neck    Phase II CAD,CD,CD
KDR    P35968    SU-6668    DCL000342    Advanced solid tumours    Discontinued CAD,CD,CD

I am using the following code:

awk -F'\t' 'FNR==NR{a[$0]=1;next} {
gsub(/Approved */,"",$6)
n=split($6,b,",")
$6=""
for(i=1;i<=n;i++)
 if(b[i] in a)
  print $0, "Approved" > "file_" b[i] ".txt"
}' OFS='\t' lookupfile mainfile

I get seven files, but the output does not contain all the data from the second input file.

For example, part of the output for the T2D file is:

CHRM1    P11229    Pirenzepine    DAP000492    Peptic ulcer disease        Approved
CHRM1    P11229    Glycopyrrolate    DAP001116    Anesthetic        Approved
CHRM1    P11229    Clidinium    DAP001117    Abdominal/stomach pain        Approved
CHRM1    P11229    Dicyclomine    DAP001118    Irritable bowel syndrome        Approved
CHRM1    P11229    Ethopropazine    DAP001119    Parkinson's disease        Approved
CHRM1    P11229    Cycrimine    DAP001120    Parkinson's disease        Approved
CHRM1    P11229    Benztropine    DAP001121    Parkinson's disease        Approved
CHRM1    P11229    Trihexyphenidyl    DAP001122    Parkinson's disease        Approved
CHRM1    P11229    Propantheline    DAP001123    Excessive sweating (hyperhidrosis)        Approved
CHRM1    P11229    Oxyphenonium    DAP001124    Spasm        Approved
CHRM1    P11229    Biperiden    DAP001125    Parkinson's disease        Approved

But, the expected output is

CHRM1    P11229    Pirenzepine    DAP000492    Peptic ulcer disease    Approved 
CHRM1    P11229    Glycopyrrolate    DAP001116    Anesthetic    Approved 
CHRM1    P11229    Clidinium    DAP001117    Abdominal/stomach pain    Approved 
CHRM1    P11229    Dicyclomine    DAP001118    Irritable bowel syndrome    Approved 
CHRM1    P11229    Ethopropazine    DAP001119    Parkinson's disease    Approved 
CHRM1    P11229    Cycrimine    DAP001120    Parkinson's disease    Approved 
CHRM1    P11229    Benztropine    DAP001121    Parkinson's disease    Approved 
CHRM1    P11229    Trihexyphenidyl    DAP001122    Parkinson's disease    Approved 
CHRM1    P11229    Propantheline    DAP001123    Excessive sweating (hyperhidrosis)    Approved 
CHRM1    P11229    Oxyphenonium    DAP001124    Spasm    Approved 
CHRM1    P11229    Biperiden    DAP001125    Parkinson's disease    Approved 
CHRM1    P11229    Talsaclidine isomer    DCL000268    Alzheimer's disease    Discontinued 
CHRM1    P11229    Sabcomeline hydrochloride    DCL000279    Cardiovascular diseases    Phase IIa 
CHRM1    P11229    Talsaclidine fumarate    DCL000303    Alzheimer's disease    Discontinued 
CHRM1    P11229    Xanomeline tartrate    DCL000328    Alzheimer's disease    Phase II 
CHRM1    P11229    GSK573719    DCL000381    Chronic Obstructive Pulmonary Disease (COPD)    Phase II 
CHRM1    P11229    GSK961081    DCL000397    Chronic Obstructive Pulmonary Disease (COPD)    Phase II completed 
CHRM1    P11229    GSK1034702    DCL000402    Schizophrenia, Dementia    Phase I completed 
CHRM1    P11229    Darotropium    DCL000514    COPD    Suspended in Phase II in GSK 2009 Report 
CHRM1    P11229    Darotropium + 642444    DCL000515    COPD    Phase III 
CHRM1    P11229    Revatropate    DCL000957    Chronic obstructive pulmonary disease    Discontinued in Phase I

So the output shows only those lines that contain the word "Approved" on the right-hand side, but the others should be there too.

---------- Post updated 09-10-12 at 05:29 AM ---------- Previous update was 09-09-12 at 11:56 PM ----------

Hi

I could probably get the result by editing the word "Approved", but then I would have to list many other words in the following code to make it worthwhile:

awk -F'\t' 'FNR==NR{a[$0]=1;next} {
gsub(/Approved */,"",$6)
n=split($6,b,",")
$6=""
for(i=1;i<=n;i++)
 if(b[i] in a)
  print $0, "Approved" > "file_" b[i] ".txt"
}' OFS='\t' lookupfile mainfile

CarloM · September 10, 2012, 7:31am

If your ID tags are always the last thing in $6 and with no embedded spaces then you could split on space and take the last element. i.e. something like:

{
m=split($6,c," ");
$6=c[m];
n=split($6,b,",")
...
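
The split-and-take-the-last-element idea can be tried in isolation. This is only a sketch: the sample value in `field` is made up for illustration, and it assumes the tag list is always the last space-separated word of the field.

```shell
# Hypothetical sample value for field 6: a status followed by the ID tags.
field='Suspended in Phase II in GSK 2009 Report CAD,CD'

tags=$(printf '%s\n' "$field" | awk '{
    m = split($0, c, " ")      # c[1..m] are the space-separated words
    n = split(c[m], b, ",")    # c[m] is the last word, i.e. the tag list
    for (i = 1; i <= n; i++)
        print b[i]             # one tag per line
}')
printf '%s\n' "$tags"
```

The status text can contain any number of words; only the last word is treated as the tag list.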

manigrover · September 10, 2012, 7:48am

Hi

Thanks for the reply. I tried the following but I am getting an error, and I don't know how to solve it:

bash-3.2$ awk -F'\t' 'FNR==NR{a[$0]=1;next} {
n=split($6,c," ");
$6=c;
n=split($6,b,",")
for(i=1;i<=n;i++)
 if(b[i] in a)
  print $0, "Approved" > "file_" b[i] ".txt"
}' OFS='\t' lookupfie sarattdnewdruggene4.txt
awk: cmd. line:2: (FILENAME=sarattdnewdruggene4.txt FNR=1) fatal: attempt to use array `c' in a scalar context

raj_saini20 · September 10, 2012, 8:11am

This error occurs because you are trying to assign an array to a scalar (a non-array):

$6=c;
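
A small sketch of the difference, using a made-up input line: `split()` returns the number of pieces and fills the array, so a single piece has to be picked with an index such as `c[m]`.

```shell
out=$(echo 'Phase II T2D' | awk '{
    m = split($0, c, " ")   # m is the element count; c is the array itself
    print m, c[m]           # c[m] is a plain string, safe to assign or print
}')
echo "$out"
```

Writing `$6 = c[m]` instead of `$6 = c` avoids the "array in a scalar context" error.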

---------- Post updated at 05:41 PM ---------- Previous update was at 05:37 PM ----------

If your ID tags don't contain spaces, then try this (not tested):

awk 'FNR==NR{a[$0]=1;next} {
n=split($NF,b,",")
$NF=""
for(i=1;i<=n;i++)
 if(b[i] in a)
  print  > "file_" b[i] ".txt"
}' OFS='\t' lookupfile mainfile

manigrover · September 10, 2012, 8:29am

Hi Raj

Thanks for the reply.

It's giving correct results, but the only issue is, as you said, the spacing.

When I paste the result into Excel, the spaces between words split them into separate columns.

The data below comes out with 9 or more columns, but it should have just 6 columns.
For example, for the first row:

column 1 for CHRM1
column 2 for P11229
column 3 for Pirenzepine
column 4 for DAP000492
column 5 for Peptic ulcer disease (not 3 different columns)
column 6 for Approved

CHRM1    P11229    Pirenzepine    DAP000492    Peptic ulcer disease    Approved 
CHRM1    P11229    Glycopyrrolate    DAP001116    Anesthetic    Approved 
CHRM1    P11229    Clidinium    DAP001117    Abdominal/stomach pain    Approved 
CHRM1    P11229    Dicyclomine    DAP001118    Irritable bowel syndrome    Approved 
CHRM1    P11229    Ethopropazine    DAP001119    Parkinson's disease    Approved 
CHRM1    P11229    Cycrimine    DAP001120    Parkinson's disease    Approved 
CHRM1    P11229    Benztropine    DAP001121    Parkinson's disease    Approved 
CHRM1    P11229    Trihexyphenidyl    DAP001122    Parkinson's disease    Approved 
CHRM1    P11229    Propantheline    DAP001123    Excessive sweating (hyperhidrosis)    Approved 
CHRM1    P11229    Oxyphenonium    DAP001124    Spasm    Approved 
CHRM1    P11229    Biperiden    DAP001125    Parkinson's disease    Approved 
CHRM1    P11229    Talsaclidine isomer    DCL000268    Alzheimer's disease    Discontinued 
CHRM1    P11229    Sabcomeline hydrochloride    DCL000279    Cardiovascular diseases    Phase IIa 
CHRM1    P11229    Talsaclidine fumarate    DCL000303    Alzheimer's disease    Discontinued 
CHRM1    P11229    Xanomeline tartrate    DCL000328    Alzheimer's disease    Phase II 
CHRM1    P11229    GSK573719    DCL000381    Chronic Obstructive Pulmonary Disease (COPD)    Phase II 
CHRM1    P11229    GSK961081    DCL000397    Chronic Obstructive Pulmonary Disease (COPD)    Phase II completed 
CHRM1    P11229    GSK1034702    DCL000402    Schizophrenia, Dementia    Phase I completed 
CHRM1    P11229    Darotropium    DCL000514    COPD    Suspended in Phase II in GSK 2009 Report 
CHRM1    P11229    Darotropium + 642444    DCL000515    COPD    Phase III 
CHRM1    P11229    Revatropate    DCL000957    Chronic obstructive pulmonary disease    Discontinued in Phase I

raj_saini20 · September 10, 2012, 9:28am

Try your code with this modification:

awk -F'\t' 'FNR==NR{a[$0]=1;next} {
n=split($6,b,",")
n1=split(b[1],c," ")
x=b[1]
b[1]=c[n1]
$6=""
for(i=1;i<=n;i++)
 if(b[i] in a)
  print $0, "Approved" > "file_" b[i] ".txt"
}' OFS='\t' lookupfile mainfile

CarloM · September 10, 2012, 10:02am

manigrover:

Hi

Thanks for the reply. I tried the following but I am getting an error, and I don't know how to solve it:

bash-3.2$ awk -F'\t' 'FNR==NR{a[$0]=1;next} {
n=split($6,c," ");
$6=c;
n=split($6,b,",")
for(i=1;i<=n;i++)
 if(b[i] in a)
  print $0, "Approved" > "file_" b[i] ".txt"
}' OFS='\t' lookupfie sarattdnewdruggene4.txt
awk: cmd. line:2: (FILENAME=sarattdnewdruggene4.txt FNR=1) fatal: attempt to use array `c' in a scalar context

It should have been $6=c[m] (although that's actually an unnecessary step anyway):

carlo@host:/tmp -> cat x.awk
awk -F'\t' 'FNR==NR{a[$0]=1;next} {
   m=split($6,c," ")
   n=split(c[m],b,",")
   $6=""
   for(i=1;i<=n;i++)
      if(b[i] in a)
         print $0, "Approved" "::" "file_" b[i] ".txt"
}' OFS='\t' lookupfile mainfile2
carlo@host:/tmp -> ./x.awk
CHRM1   P11229  Pirenzepine     DAP000492       Peptic ulcer disease            Approved::file_T2D.txt
CHRM1   P11229  Glycopyrrolate  DAP001116       Anesthetic              Approved::file_T2D.txt
CHRM1   P11229  Clidinium       DAP001117       Abdominal/stomach pain          Approved::file_T2D.txt
CHRM1   P11229  Dicyclomine     DAP001118       Irritable bowel syndrome                Approved::file_T2D.txt
CHRM1   P11229  Ethopropazine   DAP001119       Parkinson's disease             Approved::file_T2D.txt
CHRM1   P11229  Cycrimine       DAP001120       Parkinson's disease             Approved::file_T2D.txt
CHRM1   P11229  Benztropine     DAP001121       Parkinson's disease             Approved::file_T2D.txt
CHRM1   P11229  Propantheline   DAP001123       Excessive sweating (hyperhidrosis)              Approved::file_T2D.txt
CHRM1   P11229  Oxyphenonium    DAP001124       Spasm           Approved::file_T2D.txt
CHRM1   P11229  Biperiden       DAP001125       Parkinson's disease             Approved::file_T2D.txt
CHRM1   P11229  Talsaclidine isomer     DCL000268       Alzheimer's disease             Approved::file_T2D.txt
CHRM1   P11229  Sabcomeline hydrochloride       DCL000279       Cardiovascular diseases         Approved::file_T2D.txt
CHRM1   P11229  Talsaclidine fumarate   DCL000303       Alzheimer's disease             Approved::file_T2D.txt
CHRM1   P11229  GSK573719       DCL000381       Chronic Obstructive Pulmonary Disease (COPD)            Approved::file_T2D.txt
CHRM1   P11229  GSK961081       DCL000397       Chronic Obstructive Pulmonary Disease (COPD)            Approved::file_T2D.txt
CHRM1   P11229  GSK1034702      DCL000402       Schizophrenia, Dementia         Approved::file_T2D.txt
CHRM1   P11229  Darotropium     DCL000514       COPD            Approved::file_T2D.txt
CHRM1   P11229  Darotropium + 642444    DCL000515       COPD            Approved::file_T2D.txt
CHRM1   P11229  Revatropate     DCL000957       Chronic obstructive pulmonary disease           Approved::file_T2D.txt
FLT1    P17948  Sorafenib       DAP000006       Advanced renal cell carcinoma           Approved::file_CAD.txt
FLT1    P17948  Sorafenib       DAP000006       Hepatocellular carcinoma, NSCLC, melanoma               Approved::file_CAD.txt
FLT1    P17948  Sorafenib       DAP000006       Myelodyspalstic syndrome, AML, head & neck cancer, breast, colon, ovarian, pancreatic cancer            Approved::file_CAD.txt
FLT1    P17948  Ranibizumab     DAP001260       Age-related macular degeneration                Approved::file_CAD.txt
FLT1    P17948  Ranibizumab     DAP001260       Diabetic macular edema and retinal vein occlusion               Approved::file_CAD.txt
FLT1    P17948  Telbermin       DCL001016       Diabetic foot ulcers            Approved::file_CAD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CAD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CAD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CD.txt
KDR     P35968  Sunitinib       DAP000005       Advanced renal cell carcinoma           Approved::file_CD.txt
...etc...

(outputting to terminal for testing purposes...)

Don_Cragun · September 10, 2012, 9:42pm

The way I understand what was desired, the text in the last field ("Approved", "Discontinued", "Phase II", etc.) should appear in the output files rather than putting "Approved" in that spot no matter what the status is.

First a few notes:

The entries you specified for lookupfile contain the string "HT" twice. I assume either that there are only seven different disease codes, or that one of the "HT" entries is a typo that needs to be corrected.
The text of mainfile given in the 1st message in this thread has spaces between fields rather than tab characters. To make things work correctly with fields that contain spaces, the input field separators have to be something other than spaces. Given that the script supplied in that message used -F'\t' , I replaced all sequences of multiple adjacent space characters with a single tab character and wrote the results to a file named mainfile2 and am using that as the 2nd input file rather than the original mainfile given.
There are several entries in mainfile that end with "CAD,CD,CD". These entries will produce two lines of output in file_CD.txt for every input line that contains an occurrence of this string.
The version of awk I'm using (from Mac OS X Lion) will not accept a concatenation of strings when specifying an output file. I think this is a bug in OS X's awk utility, but I don't know how common this restriction is. Therefore, the script below computes the name of the output file before using it in the print statement.
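
The space-to-tab conversion described in the notes above could be done along these lines. This is a sketch rather than the exact command that was used: it assumes that any run of two or more spaces is a field separator and that no field contains two adjacent spaces.

```shell
# One sample line using runs of four spaces as separators, as in the posted data.
line='KDR    P35968    Pazopanib HCl    DAP001550    Renal cell carcinoma    Approved CAD,CD,CD'

# Replace every run of two or more spaces with a single tab.
converted=$(printf '%s\n' "$line" | awk '{ gsub(/  +/, "\t"); print }')

# Count the tab-separated fields in the converted line.
nfields=$(printf '%s\n' "$converted" | awk -F'\t' '{ print NF }')
echo "$nfields"
```

For the whole file the same awk program would be run as `awk '{ gsub(/  +/, "\t"); print }' mainfile > mainfile2`. Single spaces inside fields such as "Pazopanib HCl" are left alone.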

I believe the awk script below will do what was wanted:

awk -F'\t' 'FNR==NR {a[$0]
	next
}
 {	m = split($6, c, " ")
	n = split(c[m], b, ",")
	$6 = substr($6, 1, length($6) - length(c[m]) - 1)
	for(i = 1; i <= n; i++)
		if(b[i] in a) {
			file = "file_" b[i] ".txt"
			print > file
		}
}' OFS='\t' lookupfile mainfile2
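
If the duplicate lines caused by repeated tags such as "CAD,CD,CD" are not wanted, one possible variant is sketched below. It is an illustration, not part of the script above: the `seen` array and the small sample files are assumptions added here, and it runs in a scratch directory so it does not touch real files.

```shell
# Work in a scratch directory so the sample files do not clobber real ones.
dir=$(mktemp -d) && cd "$dir" || exit 1

printf '%s\n' CD CAD T2D > lookupfile
# Tab-separated sample rows (a small subset of the posted data).
printf 'CHRM1\tP11229\tRevatropate\tDCL000957\tCOPD\tDiscontinued in Phase I T2D\n' > mainfile2
printf 'KDR\tP35968\tSunitinib\tDAP000005\tAdvanced renal cell carcinoma\tLaunched CAD,CD,CD\n' >> mainfile2

awk -F'\t' '
FNR == NR { a[$0]; next }          # first file: remember each lookup tag
{
    m = split($6, c, " ")          # c[m] is the trailing tag list
    n = split(c[m], b, ",")
    # Strip the tag list and the space before it, keeping the status text.
    $6 = substr($6, 1, length($6) - length(c[m]) - 1)
    split("", seen)                # portable way to empty the array
    for (i = 1; i <= n; i++)
        if ((b[i] in a) && !(b[i] in seen)) {
            seen[b[i]]             # skip repeated tags such as CD,CD
            file = "file_" b[i] ".txt"
            print > file
        }
}' OFS='\t' lookupfile mainfile2

cat file_CD.txt
```

With this change each input line is written at most once to each output file, and the status text ("Launched", "Discontinued in Phase I") is kept in column 6.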