Hi all,
I have the following file, which I need to edit for research purposes:
Drug: KRP-104 QD Drug: Placebo Drug: Metformin|Drug: Placebo Drug: Metformin|Drug: KRP-104 BID Drug: Placebo Drug: Metformin Phase 2
Drug: Dapagliflozin Phase 1
Drug: MK-3102|Drug: Matching placebo to MK-3102|Drug: Basal medication Phase 3
Dietary Supplement: Vitamin C|Drug: glyburide Phase 1
Drug: Insulin glargine new formulation (HOE901)|Drug: Insulin glargine (HOE901) Phase 3
Drug: Pioglitazone|Drug: Placebo|Drug: Pioglitazone|Drug: Placebo Phase 4
Drug: Metformin HCl and Colesevelam Placebo|Drug: Metformin HCl tablets and Colesevelam tablets|Drug: Colesevelam placebo|Drug: Colesevelam Phase 3
Drug: Insulin-Levemir|Drug: Exenatide-Bayetta|Drug: Insulin-Levemir and Exenatide-Bayetta|Device: SenseWear Pro3 armband|Device: DexCom CGM Phase 4
Drug: exenatide once weekly|Drug: metformin|Drug: sitagliptin|Drug: pioglitazone Phase 3
Drug: intensive insulin group|Drug: Oral AntiDiabetic Drug (glimepiride and metformin) Phase 4
Drug: LY2189265|Drug: Sulfonylureas (SU)|Drug: Biguanides|Drug: Thiazolidinedione (TZD)|Drug: alpha-glucosidase inhibitor (a-GI)|Drug: Glinides Phase 3
Drug: Insulin glargine new formulation (HOE901)|Drug: Insulin glargine (HOE901) Phase 3
Drug: placebo|Drug: exenatide|Drug: exenatide Phase 3
Drug: Vildagliptin (LAF237)|Drug: Voglibose|Drug: Vildagliptin and Voglibose Phase 4
Drug: pioglitazone|Drug: insulin glargine Phase 4
Drug: GSK189075 oral tablets|Drug: metformin tablets Phase 1
Drug: Vildagliptin|Drug: Metformin|Drug: Vildagliptin + Metformin Phase 3
Drug: Insulin glargine plus insulin analogues Phase 4
Drug: Glipizide|Drug: Metformin Phase 4
Drug: vildagliptin|Drug: Metformin Comparator Phase 3
Drug: Dapagliflozin|Drug: Placebo matching Dapagliflozin Phase 3
Drug: vildagliptin|Drug: Gliclazide Phase 3
Drug: GSK1614235|Drug: Sitagliptin|Other: Placebo Phase 1
Drug: Pioglitazone (Actos)|Drug: Anti-diabetic agent other than pioglitazone or rosiglitazone Phase 1
Drug: Vildagliptin 100 mg qd|Drug: Metformin 1500 mg daily Phase 3
Drug: Alogliptin and glimepiride|Drug: Alogliptin and glimepiride|Drug: Alogliptin and metformin|Drug: Alogliptin and metformin Phase 2|Phase 3
I need to write the entries to a separate file that contains only the drug names, each followed by its phase. The words after each Drug: up to the next Drug: should go on their own row, with the phase after them.
For the first row, for example, the expected output is:
KRP-104 QD Phase 2
Placebo Phase 2
Metformin Phase 2
Placebo Phase 2
Metformin Phase 2
KRP-104 BID Phase 2
Placebo Phase 2
Metformin Phase 2
pamu
November 2, 2012, 5:32am
2
Assuming you want only the last phase for each drug, try:
awk -F "Drug:" '{gsub("\\|","",$0);n=split($NF,P," +");for(i=2;i<NF;i++){print $i,P[n-1],P[n]};print $NF}' file
ctsgnb
November 2, 2012, 5:55am
3
Since the last line of your example shows that there can be more than one phase:
sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/ /@/g" yourfile | awk -F"@" '{n=split($1,d,"|");m=split($2,p,"|");for(i=1;i<=m;i++) for(j=1;j<=n;j++) {print d[j]":"p[i]}}'
or in awk only (also handling a line like Drug: Drug1|Drug: Drug2|Drug: Drug3 Phase 1|Phase 421|Phase 69):
awk -F" " '{
s="Drug: "
sub(s,z)
gsub("[|]*"s,"|")
n=split($1,d,"|")
m=split($2,p,"|")
for(i=1;i<=m;i++)
for(j=1;j<=n;j++)
print d[j]":"p[i]
}' yourfile
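For illustration, here is a minimal sketch of the same "every drug with every phase" idea, run on the example line above. It is not the exact code from this post: the function name `drug_phase_pairs` is made up, and it assumes that the phase list starts at the first " Phase " in the line and that every entry is labelled `Drug: `.

```shell
# Hypothetical helper: print every drug paired with every phase.
drug_phase_pairs() {
  awk '{
    cut = index($0, " Phase ")        # assumed start of the phase list
    if (cut == 0) next                # no phase on this line: nothing to pair
    drugs  = substr($0, 1, cut - 1)
    phases = substr($0, cut + 1)
    gsub(/Drug: /, "", drugs)         # strip every "Drug: " label
    n = split(drugs, d, "|")
    m = split(phases, p, "|")
    for (i = 1; i <= m; i++)          # outer loop over phases ...
      for (j = 1; j <= n; j++)        # ... inner loop over drugs
        print d[j] ":" p[i]
  }'
}

printf '%s\n' 'Drug: Drug1|Drug: Drug2|Drug: Drug3 Phase 1|Phase 421|Phase 69' | drug_phase_pairs
```

With three drugs and three phases this prints nine rows, from Drug1:Phase 1 to Drug3:Phase 69. Lines without any phase are skipped in this sketch.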
Hi all,
Thanks for the replies.
But it doesn't seem to be working properly. I think I have to explain a bit more.
My input file looks like this:
Drug: Dapagliflozin Phase 1
Drug: MK-3102|Drug: Matching placebo to MK-3102|Drug: Basal medication Phase 3
Dietary Supplement: Vitamin C|Drug: glyburide Phase 1
Drug: Insulin glargine new formulation (HOE901)|Drug: Insulin glargine (HOE901) Phase 3
Drug: Pioglitazone|Drug: Placebo|Drug: Pioglitazone|Drug: Placebo Phase 4
Drug: Metformin HCl and Colesevelam Placebo|Drug: Metformin HCl tablets and Colesevelam tablets|Drug: Colesevelam placebo|Drug: Colesevelam Phase 3
Drug: Insulin-Levemir|Drug: Exenatide-Bayetta|Drug: Insulin-Levemir and Exenatide-Bayetta|Device: SenseWear Pro3 armband|Device: DexCom CGM Phase 4
Drug: exenatide once weekly|Drug: metformin|Drug: sitagliptin|Drug: pioglitazone Phase 3
Drug: intensive insulin group|Drug: Oral AntiDiabetic Drug (glimepiride and metformin) Phase 4
Drug: LY2189265|Drug: Sulfonylureas (SU)|Drug: Biguanides|Drug: Thiazolidinedione (TZD)|Drug: alpha-glucosidase inhibitor (a-GI)|Drug: Glinides Phase 3
Drug: Insulin glargine new formulation (HOE901)|Drug: Insulin glargine (HOE901) Phase 3
Drug: placebo|Drug: exenatide|Drug: exenatide Phase 3
Drug: Vildagliptin (LAF237)|Drug: Voglibose|Drug: Vildagliptin and Voglibose Phase 4
Drug: pioglitazone|Drug: insulin glargine Phase 4
Drug: GSK189075 oral tablets|Drug: metformin tablets Phase 1
Drug: Vildagliptin|Drug: Metformin|Drug: Vildagliptin + Metformin Phase 3
Drug: Insulin glargine plus insulin analogues Phase 4
Drug: Glipizide|Drug: Metformin Phase 4
Drug: vildagliptin|Drug: Metformin Comparator Phase 3
Drug: Dapagliflozin|Drug: Placebo matching Dapagliflozin Phase 3
Drug: vildagliptin|Drug: Gliclazide Phase 3
Drug: GSK1614235|Drug: Sitagliptin|Other: Placebo Phase 1
Drug: Pioglitazone (Actos)|Drug: Anti-diabetic agent other than pioglitazone or rosiglitazone Phase 1
Drug: Vildagliptin 100 mg qd|Drug: Metformin 1500 mg daily Phase 3
Drug: Alogliptin and glimepiride|Drug: Alogliptin and glimepiride|Drug: Alogliptin and metformin|Drug: Alogliptin and metformin Phase 2|Phase 3
I have to split each line so that the drugs are separated, each with the phase written after it.
Here is an example for line 2:
Drug: MK-3102|Drug: Matching placebo to MK-3102|Drug: Basal medication Phase 3
expected output
MK-3102 Phase 3
Matching placebo to MK-3102 Phase 3
Basal medication Phase 3
For the last line, the expected output is:
Drug: Alogliptin and glimepiride|Drug: Alogliptin and glimepiride|Drug: Alogliptin and metformin|Drug: Alogliptin and metformin Phase 2|Phase 3
Alogliptin and glimepiride Phase 2|Phase 3
Alogliptin and glimepiride Phase 2|Phase 3
Alogliptin and metformin Phase 2|Phase 3
Alogliptin and metformin Phase 2|Phase 3
So anything between Drug: and the | symbol goes on its own row, with the phase from the end of the line in the second column.
I would rather not have duplicates like the ones above, but if they are there I can manage. It would be good not to have them, though:
Alogliptin and glimepiride Phase 2|Phase 3
Alogliptin and metformin Phase 2|Phase 3
I can use another program for that later on, but the separation itself is a bit difficult.
pamu
November 5, 2012, 1:08am
6
A slight modification to ctsgnb's code:
sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/ /@/g" file | awk -F"@" '{n=split($1,d,"|");for(j=1;j<=n;j++) {print d[j]":"p,$2}}'
Hi all,
Thanks for the reply.
There still seems to be some error, as the output looks like this:
Drug: Ramipril:
Drug: Placebo:
Drug: Placebo Phase 3:
Drug: Etanercept:
Drug: Placebo Phase 1:
Phase 2:
Drug: 1,25-dihydroxy-vitamin D3 (calcitriol):
Drug: placebo Phase 2:
Drug: Pro insulin peptide:
Drug: Pro insulin peptide:
Drug: Saline Phase 1:
Phase 2:
Procedure: Islet transplant:
Drug: Deoxyspergualin:
Drug: Antithymocyte globulin:
Drug: Daclizumab or basiliximab:
Drug: Sirolimus:
Drug: Tacrolimus:
Drug: Etanercept Phase 2:
Drug: TAK-329:
Drug: TAK-329:
Drug: Insulin:
Drug: Placebo Phase 1:
Drug: Exenatide:
Drug: Rapid and long acting insulin:
Drug: long acting insulin + rapid acting + 1.25 mcg Exenatide Phase 4:
Drug: Insulin glargine (HOE901):
Drug: NPH insulin Phase 3:
Procedure: Islet transplant:
Drug: Belatacept:
Drug: Basiliximab:
Drug: Mycophenolate Mofetil Phase 2:
Drug: Insulin glargine new formulation (HOE901):
Drug: Insulin glargine (HOE901) Phase 2:
Drug: Insulin glargine new formulation (HOE901):
Drug: Insulin glargine (HOE901) Phase 3:
Drug: Insulin glargine new formulation (HOE901):
Drug: Insulin glargine (HOE901) (Lantus) Phase 3:
Drug: insulin detemir:
Drug: insulin NPH:
Drug: insulin aspart Phase 3:
Procedure: Islet Transplant:
pamu
November 5, 2012, 1:39am
8
There might be a difference between your actual input file and the sample you posted.
Try this:
sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/ *Phase/@Phase/" file | awk -F"@" '{n=split($1,d,"|");for(j=1;j<=n;j++) {print d[j]":"p,$2}}'
Hi all,
Thanks for the continued help, but this time my output file is completely blank!
I am attaching my sample input file here.
Please check it.
RudiC
November 5, 2012, 3:16am
10
The first line in your file is a killer - it does not obey any rules. You might want to deal with it separately. Try this - works on Linux/mawk:
awk '{gsub(/\n/,"")
Ar[++i]=$1}
/Phase/ {for (j in Ar) print Ar[j], $NF; delete Ar; i=0}
' RS="\|?[A-Za-z ]*: " FS=" *" OFS="\t" file
What it does is split the file at "Drug: ", "Other: ", etc., collect all first fields until phase information shows up, and then print all collected first fields together with the respective phase info.
The exotic record and field separators will not work on all awk implementations.
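If the regex record separator is a problem on your awk, here is a rough alternative sketch that keeps the default record separator and only uses split(), sub() and match(). It rests on assumptions taken from the posted sample: entries are separated by "|", each entry starts with a label such as "Drug: " or "Device: ", and the phase list, if any, sits at the end of the line. The function name `split_interventions` is made up for this example.

```shell
# Hypothetical helper: one "name<TAB>phase" row per intervention.
split_interventions() {
  awk '{
    line = $0
    phase = ""
    # Cut a trailing phase list such as "Phase 3" or "Phase 2|Phase 3".
    if (match(line, /[ \t]+Phase [0-9]+([|]Phase [0-9]+)*[ \t]*$/)) {
      phase = substr(line, RSTART)
      line = substr(line, 1, RSTART - 1)
      gsub(/^[ \t]+|[ \t]+$/, "", phase)
    }
    n = split(line, item, "|")
    for (j = 1; j <= n; j++) {
      name = item[j]
      sub(/^[ \t]*[A-Za-z ]+: */, "", name)   # drop "Drug: ", "Device: ", ...
      sub(/[ \t]+$/, "", name)
      print name "\t" phase                   # phase stays empty if missing
    }
  }'
}

printf '%s\n' 'Drug: MK-3102|Drug: Matching placebo to MK-3102|Drug: Basal medication Phase 3' | split_interventions
```

For that line it prints MK-3102, Matching placebo to MK-3102 and Basal medication, each followed by a tab and Phase 3. A first line with several "Drug:" labels and no "|" between them would still need separate handling.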
RudiC
November 5, 2012, 3:28am
12
With your new sample file, use FS="\t" in lieu of FS=" *".
pamu
November 5, 2012, 4:21am
13
There are a few lines where the phase is not present.
So for those lines I reuse the phase from the previous line.
sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/ *Phase/@Phase/" file | awk -F"@" '{if($2){s=$2};n=split($1,d,"|");for(j=1;j<=n;j++) {print d[j]":"p,s}}'
If you want to leave the phase blank on those lines instead, use the command below.
sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/ *Phase/@Phase/" file | awk -F"@" '{n=split($1,d,"|");for(j=1;j<=n;j++) {print d[j]":"p,$2}}'
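To make the difference between the two commands easier to see, here is a small sketch of just the "reuse the previous phase" step. It is simplified on purpose: it assumes the input is already in name@phase form with one name per line (in the commands above the whole drug list is in the first field), and the function name `fill_phase` and its 1/0 argument are made up for illustration.

```shell
# Hypothetical helper: pass 1 to reuse the previous phase, 0 to leave it blank.
fill_phase() {
  awk -F '@' -v carry="$1" '{
    if ($2 != "") last = $2                 # remember the most recent phase
    phase = (carry == 1 && $2 == "") ? last : $2
    print $1 "\t" phase
  }'
}

printf '%s\n' 'Insulin detemir@Phase 4' 'Placebo' 'VIAject@Phase 3' | fill_phase 1
```

With 1, the Placebo row gets Phase 4 from the line before it; with 0, the phase column for Placebo stays empty.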
Hi all,
Thank you for your support, but the output on my system is just as irregular as it was!
Drug: Ramipril:
Drug: Placebo:
Drug: Placebo Phase 3:
Drug: Etanercept:
Drug: Placebo Phase 1:
Phase 2:
Drug: 1,25-dihydroxy-vitamin D3 (calcitriol):
Drug: placebo Phase 2:
Drug: Pro insulin peptide:
Drug: Pro insulin peptide:
Drug: Saline Phase 1:
Phase 2:
Procedure: Islet transplant:
Drug: Deoxyspergualin:
Drug: Antithymocyte globulin:
Drug: Daclizumab or basiliximab:
Drug: Sirolimus:
Drug: Tacrolimus:
Drug: Etanercept Phase 2:
Drug: TAK-329:
Drug: TAK-329:
Drug: Insulin:
Drug: Placebo Phase 1:
Drug: Exenatide:
Drug: Rapid and long acting insulin:
Drug: long acting insulin + rapid acting + 1.25 mcg Exenatide Phase 4:
Drug: Insulin glargine (HOE901):
Drug: NPH insulin Phase 3:
Procedure: Islet transplant:
Drug: Belatacept:
Drug: Basiliximab:
Drug: Mycophenolate Mofetil Phase 2:
Drug: Insulin glargine new formulation (HOE901):
Drug: Insulin glargine (HOE901) Phase 2:
Drug: Insulin glargine new formulation (HOE901):
Drug: Insulin glargine (HOE901) Phase 3:
Drug: Insulin glargine new formulation (HOE901):
Drug: Insulin glargine (HOE901) (Lantus) Phase 3:
Drug: insulin detemir:
Drug: insulin NPH:
pamu
November 5, 2012, 4:50am
15
Have you tried my code?
I already mentioned in my previous post that there are a few lines in your input file that don't have a phase. Please look at my previous post.
Hello Pamu,
Yes, I checked. Thanks for your help.
But my expected output is like this. If a line looks like this:
Drug: MK-3102|Drug: Matching placebo to MK-3102|Drug: Basal medication Phase 3
expected output is
MK-3102 Phase 3
Matching placebo to MK-3102 Phase 3
Basal medication Phase 3
For the last line, the expected output is:
Drug: Alogliptin and glimepiride|Drug: Alogliptin and glimepiride|Drug: Alogliptin and metformin|Drug: Alogliptin and metformin Phase 2|Phase 3
Alogliptin and glimepiride Phase 2|Phase 3
Alogliptin and glimepiride Phase 2|Phase 3
Alogliptin and metformin Phase 2|Phase 3
Alogliptin and metformin Phase 2|Phase 3
So anything between Drug: and the | symbol goes on its own row, with the phase from the end of the line in the second column.
I would rather not have duplicates like the ones above, but if they are there I can manage. It would be good not to have them, though:
Alogliptin and glimepiride Phase 2|Phase 3
Alogliptin and metformin Phase 2|Phase 3
I can use another program for that later on, but the separation itself is a bit difficult.
If there is no phase, the space after each name should stay blank.
For example, for drug: MK01 drug:VV09
MK01
VV09
So these two names are followed by blank space, without any phase. That's what I expect.
pamu
November 5, 2012, 5:11am
17
Please check.
$ cat file
Drug: Pro insulin peptide|Drug: Pro insulin peptide|Drug: Saline Phase 1|Phase 2
Drug: SAR161271|Drug: Insulin glargine HOE901 Phase 1|Phase 2
Drug: Insulin glargine HOE901|Drug: Insulin glargine - New formulation HOE901 Phase 1
Drug: insulin glargine (HOE901)|Drug: insulin glargine- new formulation (HOE901) Phase 1
Drug: Insulin glargine|Drug: Insulin detemir Phase 4
Drug: Angiotensin II receptor antagonists (Candesartan)|Drug: Placebo
Drug: VIAject|Drug: Regular Human Insulin Phase 3
Drug: Fenofibrate|Drug: Inert lactose placebo Phase 3
$ sed "s/^Drug: //;s/Drug: /|/g;s/ *||* */|/g;s/ *Phase/@Phase/" file | awk -F"@" '{n=split($1,d,"|");for(j=1;j<=n;j++) {print d[j]"\t"p,$2}}'
Pro insulin peptide Phase 1|Phase 2
Pro insulin peptide Phase 1|Phase 2
Saline Phase 1|Phase 2
SAR161271 Phase 1|Phase 2
Insulin glargine HOE901 Phase 1|Phase 2 #both phases are present
Insulin glargine HOE901 Phase 1
Insulin glargine - New formulation HOE901 Phase 1
insulin glargine (HOE901) Phase 1
insulin glargine- new formulation (HOE901) Phase 1
Insulin glargine Phase 4
Insulin detemir Phase 4
Angiotensin II receptor antagonists (Candesartan) #here is a blank space
Placebo
VIAject Phase 3
Regular Human Insulin Phase 3
Fenofibrate Phase 3
Inert lactose placebo Phase 3
I don't know what you are trying to do.
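About the duplicates mentioned earlier (Pro insulin peptide appears twice in the output above): one possible way to drop them is to filter the output rows afterwards. This is only a sketch; the function name `drop_duplicate_rows` is made up, and note that it removes repeated rows across the whole output, not just within one input line, which may or may not be what is wanted.

```shell
# Hypothetical helper: keep only the first occurrence of each output row.
drop_duplicate_rows() {
  awk '!seen[$0]++'
}

printf '%s\t%s\n' \
  'Pro insulin peptide' 'Phase 1|Phase 2' \
  'Pro insulin peptide' 'Phase 1|Phase 2' \
  'Saline'              'Phase 1|Phase 2' | drop_duplicate_rows
```

Here the second Pro insulin peptide row is dropped and the order of the remaining rows is kept. It can be put at the end of the sed | awk pipeline.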
Thanks a lot, it's working now.
You are a magician!