Fetch entries in front of specific word till next word

Priyanka_Chopra · November 2, 2012, 5:10am

Hi all

I have the following file, which I have to edit for research purposes:

Drug: KRP-104 QD Drug: Placebo Drug: Metformin|Drug: Placebo Drug: Metformin|Drug: KRP-104 BID Drug: Placebo Drug: Metformin    Phase 2
Drug: Dapagliflozin    Phase 1
Drug: MK-3102|Drug: Matching placebo to MK-3102|Drug: Basal medication    Phase 3
Dietary Supplement: Vitamin C|Drug: glyburide    Phase 1
Drug: Insulin glargine new formulation (HOE901)|Drug: Insulin glargine (HOE901)    Phase 3
Drug: Pioglitazone|Drug: Placebo|Drug: Pioglitazone|Drug: Placebo    Phase 4
Drug: Metformin HCl and Colesevelam Placebo|Drug: Metformin HCl tablets and Colesevelam tablets|Drug: Colesevelam placebo|Drug: Colesevelam    Phase 3
Drug: Insulin-Levemir|Drug: Exenatide-Bayetta|Drug: Insulin-Levemir and Exenatide-Bayetta|Device: SenseWear Pro3 armband|Device: DexCom CGM    Phase 4
Drug: exenatide once weekly|Drug: metformin|Drug: sitagliptin|Drug: pioglitazone    Phase 3
Drug: intensive insulin group|Drug: Oral AntiDiabetic Drug (glimepiride and metformin)    Phase 4
Drug: LY2189265|Drug: Sulfonylureas (SU)|Drug: Biguanides|Drug: Thiazolidinedione (TZD)|Drug: alpha-glucosidase inhibitor (a-GI)|Drug: Glinides    Phase 3
Drug: Insulin glargine new formulation (HOE901)|Drug: Insulin glargine (HOE901)    Phase 3
Drug: placebo|Drug: exenatide|Drug: exenatide    Phase 3
Drug: Vildagliptin (LAF237)|Drug: Voglibose|Drug: Vildagliptin and Voglibose    Phase 4
Drug: pioglitazone|Drug: insulin glargine    Phase 4
Drug: GSK189075 oral tablets|Drug: metformin tablets    Phase 1
Drug: Vildagliptin|Drug: Metformin|Drug: Vildagliptin + Metformin    Phase 3
Drug: Insulin glargine plus insulin analogues    Phase 4
Drug: Glipizide|Drug: Metformin    Phase 4
Drug: vildagliptin|Drug: Metformin Comparator    Phase 3
Drug: Dapagliflozin|Drug: Placebo matching Dapagliflozin    Phase 3
Drug: vildagliptin|Drug: Gliclazide    Phase 3
Drug: GSK1614235|Drug: Sitagliptin|Other: Placebo    Phase 1
Drug: Pioglitazone (Actos)|Drug: Anti-diabetic agent other than pioglitazone or rosiglitazone    Phase 1
Drug: Vildagliptin 100 mg qd|Drug: Metformin 1500 mg daily    Phase 3
Drug: Alogliptin and glimepiride|Drug: Alogliptin and glimepiride|Drug: Alogliptin and metformin|Drug: Alogliptin and metformin    Phase 2|Phase 3

I have to separate the entries into a different file that contains only the drug names, each with its phase next to it. In other words, the text after each Drug: up to the next Drug: should go on its own row.

For the first row, the expected output is:

KRP-104 QD  Phase 2
Placebo         Phase 2
Metformin     Phase 2
Placebo         Phase 2
Metformin      Phase 2
KRP-104 BID    Phase 2
 Placebo            Phase 2
Metformin    Phase 2

pamu · November 2, 2012, 5:32am

Assuming you want only the last phase of each drug, try:

awk -F "Drug:" '{gsub("\\|","",$0);n=split($NF,P," +");for(i=2;i<NF;i++){print $i,P[n-1],P[n]};print $NF}' file

ctsgnb · November 2, 2012, 5:55am

Since the last line of your example shows that there can be more than one phase:

sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/    /@/g" yourfile | awk -F"@" '{n=split($1,d,"|");m=split($2,p,"|");for(i=1;i<=m;i++) for(j=1;j<=n;j++) {print d[j]":"p[i]}}'

or in awk only (and handling the case of a line like Drug: Drug1|Drug: Drug2|Drug: Drug3 Phase 1|Phase 421|Phase 69 )

awk -F"    " '{
s="Drug: "
sub(s,z)
gsub("[|]*"s,"|")
n=split($1,d,"|")
m=split($2,p,"|")
for(i=1;i<=m;i++)
    for(j=1;j<=n;j++)
        print d[j]":"p[i]
}' yourfile
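
For anyone following along, here is a commented sketch of the same idea (one output row per drug per phase). It assumes the two columns are separated by a tab or by at least two spaces, and the function name drug_phase_pairs is made up for this example, not something from the posts above.

```shell
# Sketch: print every drug/phase combination, one per row.
# Assumption: the drug column and the phase column are separated
# by a tab or by two or more spaces.
drug_phase_pairs() {
    awk -F'\t|  +' '{
        sub(/^Drug: /, "", $1)          # drop the leading label
        gsub(/[|]Drug: /, "|", $1)      # drop the label after each pipe
        n = split($1, d, "|")           # d[1..n] = drug names
        # p[1..m] = phases; a line without a phase column prints nothing
        m = (NF > 1) ? split($NF, p, "|") : 0
        for (i = 1; i <= m; i++)
            for (j = 1; j <= n; j++)
                print d[j] ":" p[i]
    }' "$@"
}

printf 'Drug: A|Drug: B    Phase 2|Phase 3\n' | drug_phase_pairs
# prints:
# A:Phase 2
# B:Phase 2
# A:Phase 3
# B:Phase 3
```

The function reads the files given as arguments, or standard input when none are given.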

Priyanka_Chopra · November 4, 2012, 9:29pm

Hi all

Thanks for the replies.

But it doesn't seem to be working properly. I think I have to explain a bit more.

My input file is like this:

Drug: Dapagliflozin    Phase 1
Drug: MK-3102|Drug: Matching placebo to MK-3102|Drug: Basal medication    Phase 3
Dietary Supplement: Vitamin C|Drug: glyburide    Phase 1
Drug: Insulin glargine new formulation (HOE901)|Drug: Insulin glargine (HOE901)    Phase 3
Drug: Pioglitazone|Drug: Placebo|Drug: Pioglitazone|Drug: Placebo    Phase 4
Drug: Metformin HCl and Colesevelam Placebo|Drug: Metformin HCl tablets  and Colesevelam tablets|Drug: Colesevelam placebo|Drug: Colesevelam     Phase 3
Drug: Insulin-Levemir|Drug: Exenatide-Bayetta|Drug: Insulin-Levemir and  Exenatide-Bayetta|Device: SenseWear Pro3 armband|Device: DexCom CGM     Phase 4
Drug: exenatide once weekly|Drug: metformin|Drug: sitagliptin|Drug: pioglitazone    Phase 3
Drug: intensive insulin group|Drug: Oral AntiDiabetic Drug (glimepiride and metformin)    Phase 4
Drug: LY2189265|Drug: Sulfonylureas (SU)|Drug: Biguanides|Drug:  Thiazolidinedione (TZD)|Drug: alpha-glucosidase inhibitor (a-GI)|Drug:  Glinides    Phase 3
Drug: Insulin glargine new formulation (HOE901)|Drug: Insulin glargine (HOE901)    Phase 3
Drug: placebo|Drug: exenatide|Drug: exenatide    Phase 3
Drug: Vildagliptin (LAF237)|Drug: Voglibose|Drug: Vildagliptin and Voglibose    Phase 4
Drug: pioglitazone|Drug: insulin glargine    Phase 4
Drug: GSK189075 oral tablets|Drug: metformin tablets    Phase 1
Drug: Vildagliptin|Drug: Metformin|Drug: Vildagliptin + Metformin    Phase 3
Drug: Insulin glargine plus insulin analogues    Phase 4
Drug: Glipizide|Drug: Metformin    Phase 4
Drug: vildagliptin|Drug: Metformin Comparator    Phase 3
Drug: Dapagliflozin|Drug: Placebo matching Dapagliflozin    Phase 3
Drug: vildagliptin|Drug: Gliclazide    Phase 3
Drug: GSK1614235|Drug: Sitagliptin|Other: Placebo    Phase 1
Drug: Pioglitazone (Actos)|Drug: Anti-diabetic agent other than pioglitazone or rosiglitazone    Phase 1
Drug: Vildagliptin 100 mg qd|Drug: Metformin 1500 mg daily    Phase 3
Drug: Alogliptin and glimepiride|Drug: Alogliptin and glimepiride|Drug:  Alogliptin and metformin|Drug: Alogliptin and metformin    Phase 2|Phase  3

I have to split each line so that the drugs are separated, each with its phase next to it.

Here is an example for line 2:

Drug: MK-3102|Drug: Matching placebo to MK-3102|Drug: Basal medication Phase 3

expected output

MK-3102                                               Phase 3
Matching placebo to MK-3102            Phase 3
Basal medication                                   Phase 3

For the last line, the input and expected output are:

Drug: Alogliptin and glimepiride|Drug: Alogliptin and glimepiride|Drug: Alogliptin and metformin|Drug: Alogliptin and metformin Phase 2|Phase 3

Alogliptin and glimepiride         Phase 2|Phase  3
Alogliptin and glimepiride          Phase 2|Phase  3
Alogliptin and metformin           Phase 2|Phase  3
Alogliptin and metformin           Phase 2|Phase  3

So anything between Drug: and the | symbol goes on its own row, with the line's phase in the second column.

Ideally I don't want duplicates like the repeated rows above, but if they are there I can manage. Without duplicates it would be:

Alogliptin and glimepiride         Phase 2|Phase  3
Alogliptin and metformin           Phase 2|Phase  3

I can use another program for that later on, but the separation is the difficult part.

pamu · November 5, 2012, 1:08am

A slight modification to ctsgnb's code:

sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/    /@/g" file | awk -F"@" '{n=split($1,d,"|");for(j=1;j<=n;j++) {print d[j]":"p,$2}}'

Priyanka_Chopra · November 5, 2012, 1:33am

Hi all,

Thanks for the reply.

There still seems to be some error, as the output looks like this:

Drug: Ramipril: 
Drug: Placebo: 
Drug: Placebo    Phase 3: 
Drug: Etanercept: 
Drug: Placebo    Phase 1: 
Phase 2: 
Drug: 1,25-dihydroxy-vitamin D3 (calcitriol): 
Drug: placebo    Phase 2: 
Drug: Pro insulin peptide: 
Drug: Pro insulin peptide: 
Drug: Saline    Phase 1: 
Phase 2: 
Procedure: Islet transplant: 
Drug: Deoxyspergualin: 
Drug: Antithymocyte globulin: 
Drug: Daclizumab or basiliximab: 
Drug: Sirolimus: 
Drug: Tacrolimus: 
Drug: Etanercept    Phase 2: 
Drug: TAK-329: 
Drug: TAK-329: 
Drug: Insulin: 
Drug: Placebo    Phase 1: 
Drug: Exenatide: 
Drug: Rapid and long acting insulin: 
Drug: long acting insulin + rapid acting + 1.25 mcg Exenatide    Phase 4: 
Drug: Insulin glargine (HOE901): 
Drug: NPH insulin    Phase 3: 
Procedure: Islet transplant: 
Drug: Belatacept: 
Drug: Basiliximab: 
Drug: Mycophenolate Mofetil    Phase 2: 
Drug: Insulin glargine new formulation (HOE901): 
Drug: Insulin glargine (HOE901)    Phase 2: 
Drug: Insulin glargine new formulation (HOE901): 
Drug: Insulin glargine (HOE901)    Phase 3: 
Drug: Insulin glargine new formulation (HOE901): 
Drug: Insulin glargine (HOE901) (Lantus)    Phase 3: 
Drug: insulin detemir: 
Drug: insulin NPH: 
Drug: insulin aspart    Phase 3: 
Procedure: Islet Transplant:

pamu · November 5, 2012, 1:39am

Maybe there is a difference between your actual input file and the one you posted.

Try this:

sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/ *Phase/@Phase/" file | awk -F"@" '{n=split($1,d,"|");for(j=1;j<=n;j++) {print d[j]":"p,$2}}'

Priyanka_Chopra · November 5, 2012, 3:04am

Hi all,

Thanks for the continued help, but this time my output file is completely blank!

I am attaching here my sample input file.

Kindly check it.

RudiC · November 5, 2012, 3:16am

The first line in your file is a killer - it does not obey any rules. You might want to deal with it separately. Try this - works on linux/mawk:

awk     '{gsub(/\n/,"") 
          Ar[++i]=$1}   
          /Phase/ {for (j in Ar)  print Ar[j], $NF; delete Ar; i=0}
        ' RS="\|?[A-Za-z ]*: " FS="   *" OFS="\t" file

What it does is split the file at "Drug: ", "Other: ", etc., collect the first field of each record until phase information shows up, and then print all collected first fields together with that phase info.
The exotic record and field separators will not work on all awk implementations.
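
If the regex record separator is not available, a line-oriented sketch along the same lines might look like this. It assumes each record sits on one line, that the items are joined by "|", and that the phase column (if any) follows a tab or at least two spaces. It also skips names repeated within a line, since that was mentioned as preferable. The function name split_interventions is made up for this example.

```shell
# Sketch: one intervention per row, with that line's phase column beside it.
# Assumptions: one record per line; "Label: name" items joined by "|";
# the phase column, when present, comes after a tab or 2+ spaces.
split_interventions() {
    awk -F'\t|  +' -v OFS='\t' '{
        phase = (NF > 1) ? $NF : ""              # blank when the line has no phase
        n = split($1, item, "|")
        for (j = 1; j <= n; j++) {
            name = item[j]
            sub(/^[A-Za-z ]+: */, "", name)      # strip "Drug: ", "Device: ", ...
            if (seen[NR, name]++) continue       # skip repeats within this line
            print name, phase
        }
    }' "$@"
}

printf 'Drug: Pioglitazone|Drug: Placebo|Drug: Pioglitazone|Drug: Placebo    Phase 4\n' | split_interventions
# prints two tab-separated rows:
# Pioglitazone    Phase 4
# Placebo         Phase 4
```

Because it only uses -F, split and sub, it should not depend on a particular awk implementation, but a name that itself contains two consecutive spaces would be cut short under these assumptions.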

RudiC · November 5, 2012, 3:28am

With your new sample file, use FS="\t" in lieu of FS="   *"

pamu · November 5, 2012, 4:21am

There are a few lines where the phase is not present.

So I reused the phase from the previous line.

sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/ *Phase/@Phase/" file | awk -F"@" '{if($2){s=$2};n=split($1,d,"|");for(j=1;j<=n;j++) {print d[j]":"p,s}}'

If you want to leave it blank instead, use the version below.

sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/ *Phase/@Phase/" file | awk -F"@" '{n=split($1,d,"|");for(j=1;j<=n;j++) {print d[j]":"p,$2}}'
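
The two behaviours (reuse the previous phase, or leave it blank) can also be put behind a switch. This is only a sketch, assuming tab-separated columns as in the attached sample; the carry variable and the function name with_phase are illustrative, not part of the commands above.

```shell
# Sketch: with_phase 1 reuses the last seen phase for lines that lack one;
# with_phase 0 leaves the phase blank. Assumes a tab between the two columns.
with_phase() {
    awk -F'\t' -v OFS='\t' -v carry="$1" '{
        if (NF > 1) last = $NF                       # remember the latest phase seen
        phase = (NF > 1) ? $NF : (carry ? last : "")
        n = split($1, item, "|")
        for (j = 1; j <= n; j++) {
            sub(/^[A-Za-z ]+: */, "", item[j])       # strip "Drug: " and similar labels
            print item[j], phase
        }
    }'
}

# The second line has no phase, so it inherits "Phase 3" when carry is 1.
printf 'Drug: A\tPhase 3\nDrug: B|Drug: C\n' | with_phase 1
```

With carry set to 0 the rows for B and C end in a tab with nothing after it.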

Priyanka_Chopra · November 5, 2012, 4:43am

Hi all,

Thank you for your support, but the output on my system is just as irregular as it was!

Drug: Ramipril: 
Drug: Placebo: 
Drug: Placebo    Phase 3: 
Drug: Etanercept: 
Drug: Placebo    Phase 1: 
Phase 2: 
Drug: 1,25-dihydroxy-vitamin D3 (calcitriol): 
Drug: placebo    Phase 2: 
Drug: Pro insulin peptide: 
Drug: Pro insulin peptide: 
Drug: Saline    Phase 1: 
Phase 2: 
Procedure: Islet transplant: 
Drug: Deoxyspergualin: 
Drug: Antithymocyte globulin: 
Drug: Daclizumab or basiliximab: 
Drug: Sirolimus: 
Drug: Tacrolimus: 
Drug: Etanercept    Phase 2: 
Drug: TAK-329: 
Drug: TAK-329: 
Drug: Insulin: 
Drug: Placebo    Phase 1: 
Drug: Exenatide: 
Drug: Rapid and long acting insulin: 
Drug: long acting insulin + rapid acting + 1.25 mcg Exenatide    Phase 4: 
Drug: Insulin glargine (HOE901): 
Drug: NPH insulin    Phase 3: 
Procedure: Islet transplant: 
Drug: Belatacept: 
Drug: Basiliximab: 
Drug: Mycophenolate Mofetil    Phase 2: 
Drug: Insulin glargine new formulation (HOE901): 
Drug: Insulin glargine (HOE901)    Phase 2: 
Drug: Insulin glargine new formulation (HOE901): 
Drug: Insulin glargine (HOE901)    Phase 3: 
Drug: Insulin glargine new formulation (HOE901): 
Drug: Insulin glargine (HOE901) (Lantus)    Phase 3: 
Drug: insulin detemir: 
Drug: insulin NPH:

pamu · November 5, 2012, 4:50am

Have you tried my code?

I already mentioned in my previous post that there are a few lines in your input file which don't have a phase. Please look at that post.

Priyanka_Chopra · November 5, 2012, 5:02am

Hello Pamu,

Yes I checked. Thanks for your help.

But my expected output is like this. If a line looks like this:

Drug: MK-3102|Drug: Matching placebo to MK-3102|Drug: Basal medication Phase 3

expected output is

MK-3102                                               Phase 3
Matching placebo to MK-3102            Phase 3
Basal medication                                   Phase 3

For the last line, the input and expected output are:

Drug: Alogliptin and glimepiride|Drug: Alogliptin and glimepiride|Drug: Alogliptin and metformin|Drug: Alogliptin and metformin Phase 2|Phase 3

Alogliptin and glimepiride         Phase 2|Phase  3
Alogliptin and glimepiride          Phase 2|Phase  3
Alogliptin and metformin           Phase 2|Phase  3
Alogliptin and metformin           Phase 2|Phase  3

So anything between Drug: and the | symbol goes on its own row, with the line's phase in the second column.

Ideally I don't want duplicates like the repeated rows above, but if they are there I can manage. Without duplicates it would be:

Alogliptin and glimepiride         Phase 2|Phase  3
Alogliptin and metformin           Phase 2|Phase  3

I can use another program for that later on, but the separation is the difficult part.

If there is no phase, the second column should stay blank for every name on that line.
For example, for drug: MK01 drug:VV09

  MK01
   VV09

So next to these two names there is just blank space, without any phase. That's what I expect.

pamu · November 5, 2012, 5:11am

Please check.

$ cat file
Drug: Pro insulin peptide|Drug: Pro insulin peptide|Drug: Saline        Phase 1|Phase 2
Drug: SAR161271|Drug: Insulin glargine HOE901   Phase 1|Phase 2
Drug: Insulin glargine HOE901|Drug: Insulin glargine - New formulation HOE901   Phase 1
Drug: insulin glargine (HOE901)|Drug: insulin glargine- new formulation (HOE901)        Phase 1
Drug: Insulin glargine|Drug: Insulin detemir    Phase 4
Drug: Angiotensin II receptor antagonists (Candesartan)|Drug: Placebo
Drug: VIAject|Drug: Regular Human Insulin       Phase 3
Drug: Fenofibrate|Drug: Inert lactose placebo   Phase 3

$ sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/ *Phase/@Phase/" file | awk -F"@" '{n=split($1,d,"|");for(j=1;j<=n;j++) {print d[j]"\t"p,$2}}'
Pro insulin peptide      Phase 1|Phase 2
Pro insulin peptide      Phase 1|Phase 2
Saline           Phase 1|Phase 2
SAR161271        Phase 1|Phase 2
Insulin glargine HOE901          Phase 1|Phase 2 #both phases are present
Insulin glargine HOE901  Phase 1
Insulin glargine - New formulation HOE901                Phase 1
insulin glargine (HOE901)        Phase 1
insulin glargine- new formulation (HOE901)               Phase 1
Insulin glargine         Phase 4
Insulin detemir          Phase 4
Angiotensin II receptor antagonists (Candesartan) #here is a blank space
Placebo
VIAject  Phase 3
Regular Human Insulin            Phase 3
Fenofibrate      Phase 3
Inert lactose placebo            Phase 3

I don't know what you are trying to do.

Priyanka_Chopra · November 5, 2012, 5:39am

Thanks a lot, it's working now.

You're a magician!