Find matches and write the data before it

Priyanka_Chopra · November 7, 2012, 12:30am

Hi all

I am here for help once again.

I have two files.

The first file has a single column, like this:

F2
B2
CAD
KGM
HTC
CSP

The second file has 5 columns. Its first column sometimes contains entries from the first file, separated by spaces, along with other entries:

F2 XYZ CDT CAD          it is part of agriculture    it is part of university   it is part of ...             it is used for.... 

KGM HTC CSP      it is part of agriculture    it is part of university   it is part of ...             it is used for....

If there is a match, I have to separate the entries like this, keeping the 5 columns:

F2  it is part of agriculture    it is part of university   it is part of ...             it is used for.... 
CAD  it is part of agriculture    it is part of university   it is part of ...             it is used for.... 


KGM it is part of agriculture    it is part of university   it is part of ...             it is used for.... 

HTC  it is part of agriculture    it is part of university   it is part of ...             it is used for.... 

CSP  it is part of agriculture    it is part of university   it is part of ...             it is used for....

Please help me out.

summer_cherry · November 7, 2012, 1:43am

gawk '{
if(NR==FNR){
	_[$1] = 1
}
else{
	for(i=1;i<=NF;i++){
		if(_[$i] == 1){
			for(j=i;j<=NF;j++){
				printf "%s ", $j	# explicit format, so a "%" in the data is printed literally
			}
			print ""
		}
	}
}
}
' a b

Priyanka_Chopra · November 8, 2012, 12:59am

Thank you very much, dear.

It seems like good code, but it's not working completely: my output looks like this if F2 or HTC matches:

F2 XYZ CDT CAD          it is part of agriculture    it is part of university   it is part of ...             it is used for.... 

KGM HTC CSP      it is part of agriculture    it is part of university   it is part of ...             it is used for...

But I want to remove the other, non-matched entries from the first column, so that the output will be:

F2  it is part of agriculture    it is part of university   it is part of ...             it is used for.... 
CAD  it is part of agriculture    it is part of university   it is part of ...             it is used for.... 


KGM it is part of agriculture    it is part of university   it is part of ...             it is used for.... 

HTC  it is part of agriculture    it is part of university   it is part of ...             it is used for.... 

CSP  it is part of agriculture    it is part of university   it is part of ...             it is used for....

That is, only the matched entry should appear in the first column of the output.

Please guide me if possible.

rangarasan · November 8, 2012, 1:51am

Hi,

Try this one,

awk 'BEGIN{FS=OFS="\t";}NR==FNR{a[$0]=1;next;}{n=split($1,f," ");for(i=1;i<=n;i++){p=f[i];if(a[p]==1){print p,$2,$3,$4,$5;}}}' file1 file2

Assumptions:

The field separator is a tab (\t).
The number of fields is fixed (5 fields).

Cheers,
Ranga
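
The second assumption above can be relaxed. What follows is only a minimal sketch, assuming file2 really is tab-separated and that its first column holds space-separated names; the sample rows and the names `keys`, `names` and `rest` are invented for illustration. Instead of printing exactly `$2` to `$5`, it re-joins every column after the first, however many there are.

```shell
# Hypothetical sample data; the real files may look different.
tmp=$(mktemp -d)
printf 'F2\nB2\nCAD\n' > "$tmp/file1"
printf 'F2 XYZ CAD\tagriculture\tuniversity\tused for\n' > "$tmp/file2"

out=$(awk '
BEGIN { FS = OFS = "\t" }
NR == FNR { keys[$1] = 1; next }      # first file: remember each name
{
    n = split($1, names, " ")         # names packed into the first column
    rest = ""
    for (j = 2; j <= NF; j++)         # keep all remaining columns, not just 5
        rest = rest OFS $j
    for (i = 1; i <= n; i++)
        if (names[i] in keys)         # print one row per matched name
            print names[i] rest
}' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$out"
rm -rf "$tmp"
```

With the sample rows above, this prints one line for F2 and one for CAD, each followed by the three remaining columns; XYZ is dropped because it is not in file1.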

Priyanka_Chopra · November 8, 2012, 2:05am

Hi

Thanks for the reply.

But this time the output file is completely blank!

Also, the second input file actually has more than 5 columns, so what I want is to write whatever follows the matched entry exactly as it is, in the same columns as the input.

I also checked the previous output file: it has no columns at all; instead, the entries of the 5 columns appear row-wise.

Regarding tab separation, the entries look like this. Here each colour represents one column, so the input file has 8 columns.

FCGR2A FCGR2B FCGR2C EGFR FCGR3B C1R C1QA C1QB C1QC FCGR3A C1S FCGR1A Cetuximab Erbitux FCGR2A FCGR2B FCGR2C EGFR FCGR3B C1R C1QA C1QB C1QC FCGR3A C1S FCGR1A Cetuximab binds to the epidermal growth factor receptor (EGFr) on both normal and tumor cells. EGFr is over-expressed in many colorectal cancers. Cetuximab competitively inhibits the binding of epidermal growth factor (EGF) and TGF alpha, thereby reducing their effects on cell growth and metastatic spread. Epidermal growth factor receptor binding FAB. Cetuximab is composed of the Fv (variable; antigen-binding) regions of the 225 murine EGFr monoclonal antibody specific for the N-terminal portion of human EGFr with human IgG1 heavy and kappa light chain constant (framework) regions. For treatment of EGFR-expressing metastatic colorectal cancer in patients who are refractory to other irinotecan-based chemotherapy regimens. Cetuximab is also indicated for treatment of squamous cell carcinoma of the head and neck in conjucntion with radiation therapy. Used in the treatment of colorectal cancer, cetuximab binds specifically to the epidermal growth factor receptor (EGFr, HER1, c-ErbB-1) on both normal and tumor cells. EGFr is over-expressed in many colorectal cancers. Cetuximab competitively inhibits the binding of epidermal growth factor (EGF) and other ligands, such as transforming growth factor-alpha. Binding of cetuximab to the EGFr blocks phosphorylation and activation of receptor-associated kinases, resulting in inhibition of cell growth, induction of apoptosis, decreased matrix metalloproteinase secretion and reduced vascular endothelial growth factor production.

So if FCGR2A is present in the first file, then the output will be:

FCGR2A Cetuximab Erbitux FCGR2A FCGR2B FCGR2C EGFR FCGR3B C1R C1QA C1QB C1QC FCGR3A C1S FCGR1A Cetuximab binds to the epidermal growth factor receptor (EGFr) on both normal and tumor cells. EGFr is over-expressed in many colorectal cancers. Cetuximab competitively inhibits the binding of epidermal growth factor (EGF) and TGF alpha, thereby reducing their effects on cell growth and metastatic spread. Epidermal growth factor receptor binding FAB. Cetuximab is composed of the Fv (variable; antigen-binding) regions of the 225 murine EGFr monoclonal antibody specific for the N-terminal portion of human EGFr with human IgG1 heavy and kappa light chain constant (framework) regions. For treatment of EGFR-expressing metastatic colorectal cancer in patients who are refractory to other irinotecan-based chemotherapy regimens. Cetuximab is also indicated for treatment of squamous cell carcinoma of the head and neck in conjucntion with radiation therapy. Used in the treatment of colorectal cancer, cetuximab binds specifically to the epidermal growth factor receptor (EGFr, HER1, c-ErbB-1) on both normal and tumor cells. EGFr is over-expressed in many colorectal cancers. Cetuximab competitively inhibits the binding of epidermal growth factor (EGF) and other ligands, such as transforming growth factor-alpha. Binding of cetuximab to the EGFr blocks phosphorylation and activation of receptor-associated kinases, resulting in inhibition of cell growth, induction of apoptosis, decreased matrix metalloproteinase secretion and reduced vascular endothelial growth factor production.

manigrover · November 8, 2012, 6:40am

hmmm seems complex!

pamu · November 8, 2012, 7:36am

Considering your inputs from post 1, this should work:

awk 'NR==FNR{X[$1]=$0;next}{n=split($1,P," ");sub($1,"",$0);for(i=1;i<=n;i++){if(X[P[i]]){print P[i],$0}}}' file1 FS="  +" file2

If not, please provide real inputs from your files.

pamu

Priyanka_Chopra · November 8, 2012, 7:41am

No, it didn't work: the output is just the first file.

The sample I provided above is taken exactly from the real file.

Shall I attach the file?

Let me know, but it's the same as the sample I provided above.

pamu · November 8, 2012, 7:45am

I don't know what you are trying.

Please check..

$ cat file1
F2
B2
CAD
KGM
HTC
CSP

$ cat file2
F2 XYZ CDT CAD          it is part of agriculture    it is part of university   it is part of ...             it is used for....

KGM HTC CSP      it is part of agriculture    it is part of university   it is part of ...             it is used for....

$ awk 'NR==FNR{X[$1]=$0;next}{n=split($1,P," ");sub($1,"",$0);for(i=1;i<=n;i++){if(X[P[i]]){print P[i],$0}}}' file1 FS="  +" file2
F2           it is part of agriculture    it is part of university   it is part of ...             it is used for....
CAD           it is part of agriculture    it is part of university   it is part of ...             it is used for....
KGM       it is part of agriculture    it is part of university   it is part of ...             it is used for....
HTC       it is part of agriculture    it is part of university   it is part of ...             it is used for....
CSP       it is part of agriculture    it is part of university   it is part of ...             it is used for....

Is this what you want..?

I hope this helps:)

pamu

Priyanka_Chopra · November 8, 2012, 8:16am

Hi

There seems to be some error.

I am finally attaching both input files, BD (first) and 1diseasedrug (second), and the output file (see).

Please check them.

These are just parts, as the files are big. I also got one error:

bash-3.2$ awk 'NR==FNR{X[$1]=$0;next}{n=split($1,P," ");sub($1,"",$0);for(i=1;i<=n;i++){if(X[P[i]]){print P[i],$0}}}' BD FS="  +" diseasedrugbank >see
awk: (FILENAME=diseasedrugbank FNR=471) fatal: Unmatched ( or \(: /DRD2 ADRA1A  Droperidol      DHBP    DRD2 ADRA1A     The exact mechanism of action is unknown, however, droperidol causes a CNS depression at subcortical levels of the brain, midbrain, and brainstem reticular formation. It may antagonize the actions of glutamic acid within the extrapyramidal system. It may also inhibit cathecolamine receptors and the reuptake of neurotransmiters and has strong central antidopaminergic action and weak central anticholinergic action. It can also produce ganglionic blockade and reduced affective response. The main actions seem to stem from its potent Dopamine(2) receptor antagonism with minor antagonistic effects on alpha-1 adrenergic receptors as well.      A butyrophenone with general properties similar to those of haloperidol. It is used in conjunction with an opioid analgesic such as fentanyl to maintain the patient in a calm state of neuroleptanalgesia with indifference to surroundings but still able to cooperate with the surgeon. It is also used as a premedicant, as an antiemetic, and for the control of agitation in acute psychoses. (From Martindale, The Extra/
bash-3.2$

So the output is not as expected for these files.
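
The fatal error above comes from `sub($1,"",$0)`: awk treats the first argument of `sub` as a regular expression, so a first field containing an unbalanced "(" is rejected as an invalid pattern. A minimal sketch of one way around this, assuming tab-separated columns, is to cut the first field off by its length with `substr` instead of matching it as a pattern. The sample rows and the names `keys`, `names` and `rest` are made up for illustration.

```shell
# Hypothetical sample data; the real files may look different.
tmp=$(mktemp -d)
printf 'DRD2\nEGFR\n' > "$tmp/file1"
# The first column deliberately contains an unbalanced "(".
printf 'DRD2 (x ADRA1A\tDroperidol\tDopamine(2) receptor\n' > "$tmp/file2"

out=$(awk '
NR == FNR { keys[$1] = 1; next }        # first file: remember each name
{
    n = split($1, names, " ")           # names packed into the first column
    rest = substr($0, length($1) + 1)   # drop column 1 by length, no regex
    for (i = 1; i <= n; i++)
        if (names[i] in keys)
            print names[i] rest         # rest still starts with the tab
}' "$tmp/file1" FS='\t' "$tmp/file2")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Because `rest` is copied straight from `$0`, the remaining columns and their original separators are kept exactly as they were in the input.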

pamu · November 8, 2012, 8:25am

From your sample file, it looks like your file2 has no consistent pattern that would give the required result.

Still, try using "\t" as the field separator:

awk 'NR==FNR{X[$1]=$0;next}{n=split($1,P," ");sub($1,"",$0);for(i=1;i<=n;i++){if(X[P[i]]){print P[i],$0}}}' file1 FS="\t" file2

Priyanka_Chopra · November 16, 2012, 5:49am

Hi Pamu

This seemed to be good code, and I applied it to many other files. But I suddenly realised that it didn't work for many of them, and my hard work has gone to waste!

That is because it was matching only the first entry of the second file.

Kindly help me, as I have to run it again on all those files.

I have attached one of those files.

Maybe it happened because the first column of the second file contains entries separated by commas.

Sorry for the inconvenience.

Kindly guide me.

Priyanka_Chopra · November 16, 2012, 5:55am

It doesn't work even on these ones:

First file: pHAMRGKBT2D

Second file: Pharmgkbdrugdisease3.txt
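
If the guess about comma-separated entries is right, splitting the first column on spaces alone would leave the commas attached to the names, so only some entries could match. The attached files are not shown here, so the following is only a sketch under that assumption: the columns are taken to be tab-separated, and the sample rows and the names `keys`, `names` and `rest` are invented. It splits the first column on any run of commas and spaces.

```shell
# Hypothetical sample data; the real files may look different.
tmp=$(mktemp -d)
printf 'KGM\nCSP\n' > "$tmp/file1"
printf 'KGM, HTC,CSP\tpart of agriculture\tused for\n' > "$tmp/file2"

out=$(awk '
NR == FNR { keys[$1] = 1; next }        # first file: remember each name
{
    n = split($1, names, /[ ,]+/)       # accept commas, spaces, or both
    rest = substr($0, length($1) + 1)   # everything after the first column
    for (i = 1; i <= n; i++)
        if (names[i] in keys)
            print names[i] rest
}' "$tmp/file1" FS='\t' "$tmp/file2")
printf '%s\n' "$out"
rm -rf "$tmp"
```

With the sample rows above, KGM and CSP each get their own output line and HTC is skipped because it is not in file1.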