Hi all,
I have an input file like this
Now
I have to remove duplicates only in the first column; nothing should change in the second and third columns. The output would be:
Please let me know scripting regarding this
awk '++a[$1] > 1{$1=""}1' inputfile
Do you use a template for creating new threads? These always contain "Request to check" in the title and end with "Please let me know scripting regarding this"...
Wow, I really didn't know awk was that powerful/flexible.
Hi
I checked, and I am getting the result properly
for example
the output is
I have to completely remove those entries in the first column which are completely identical to each other.
Expected output? I think you've got what you asked for.
The expected output is something like this, in which all other columns stay as they are and only duplicate entries in the first column are removed, with no other change at all. Sorry, not all entries in the first column were removed, and the entries in the other columns are shifting to the left, which should not happen in the expected output.
Does this work for you?
awk '{
  for(i=1;i<=NF;i++)
  {
    if(FNR==1)
    {
      count[$i,i]++
      continue
    }
    count[$i,i]++
    if(count[$i,i]==1)
      break
    else
      $i=""
  }
}1' inputfile
Thanks for the help!
But there are still some errors in the output.
Data is mixed up between columns: there is no clear indication of separation, as there was previously.
I just have to remove duplicates in the first column; I must not change anything else, not even a single space.
#!/usr/bin/python
import sys
if len(sys.argv) < 2:
    print "usage:",sys.argv[0],"<file_path>"
    sys.exit(69)
f = open(sys.argv[1], 'r')
lines = f.readlines()
count = 0
index = 0
for item in lines:
    if count != 0:
        left = lines[count].split()
        right = lines[count-1].split()
        while left[index] == right[index]:
            index += 1
        print ' '.join(left[index:])
        index = 0
    else:
        print lines[count].rstrip()
    count += 1
What about this?
To use this code, do the following:
chmod +x duplicate.py
./duplicate.py path_to_file
Edit: very basic error checking. It was a rush job, and I'm not very good with Python!
Hi
Thanks for the reply.
But there are still errors: the data is mixed up and not in the proper columns.
If the input is
Serine/threonine protein kinase 12 AZD1152 DCL000452 Myeloid Leukemia Phase I/II
Serine/threonine protein kinase 12 AZD1152 DCL000452 Acute Myeloid Leukemia, Haematological malignancies Phase II
Serine/threonine protein kinase 12 MK-5108 DCL000572 Cancer; Neoplasms; Tumors Phase I
Serine/threonine protein kinase 12 TAK-901 DCL000657 Advanced malignancies Phase I
Serine/threonine protein kinase 12 AT-9283 DCL001068 Adult solid tumours, NHL, AML, ALL, CML, MDS and myelofibrosis Phase I/II
Serine/threonine protein kinase 12 CYC-116 DCL001070 Advanced solid tumours Terminated in Phase I
Serine/threonine protein kinase 12 GSK1070916 DCL001072 Advanced solid tumours Phase I
Serine/threonine protein kinase 12 PF-03814735 DCL001076 Advanced solid tumours Phase I
Serine/threonine protein kinase 12 PHA-739358 DCL001078 CML that relapsed after imatinib or BCR-ABL-targeted therapy; Metastatic Hormone Refractory Prostate Cancer (MHRPC) Phase II
Serine/threonine protein kinase 12 VX-689 DCL001083 Cancer; Neoplasms; Tumors Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform SF1126 DCL000228 Solid Tumors Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform TG100-115 DCL000246 Angioedema, Myocardial infarction Phase I/II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform XL147 DCL000262 Endometrial Cancer Phase II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform XL765 DCL000264 Solid tumours; non-small-cell lung cancer; malignant gliomas Phase I/II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform AZD6482 DCL000476 Thrombosis Phase II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform LY294002 DCL000600 Cancer Discontinued in Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform PI3K alpha DCL000601 Cancer Discontinued in Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform BEZ235 DCL001085 Advanced solid tumours; Advanced breast cancer Phase I/II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform BGT226 DCL001086 Solid tumours; Advanced breast cancer; Cowden's syndrome Phase I/II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform BKM120 DCL001087 Metastatic Breast Cancer Phase I/II
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform GDC0941 DCL001088 Advanced solid tumours; non-Hodgkin's lymphoma Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform GSK1059615 DCL001089 Advanced solid tumours; metastatic breast cancer; endometrial cancer; lymphoma Terminated in Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform PX-866 DCL001090 Advanced solid tumours Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform CAL-101 DCL001091 Chronic lymphocytic leukaemia; acute myeloid leukaemia; non-Hodgkin's lymphoma Phase I
Phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit, gamma isoform GDC-0980 DCL001189 Advanced solid tumours, non-Hodgkin's lymphoma Phase I
Hexokinase D Lonidamine DCL000153 Benign Prostatic Hyperplasia, Prostate Disorders Terminated in Phase III
Hexokinase D PSN-101 DCL000201 Diabetes Mellitus Type 1 and 2 Phase I
Hexokinase D AZD1656 DCL000457 Type 2 Diabetes Mellitus Phase II
Hexokinase D AZD6370 DCL000475 Type 2 Diabetes Phase I completed
Hexokinase D R7201 DCL000614 Type 2 diabetes Phase II
Hexokinase D AZD5658 DCL001154 Obesity, Diabetes Phase I
mRNA of Clusterin OGX-011 DCL000186 Prostate Cancer, Breast Cancer, Lung Cancer Phase III
Kinesin-like protein KIF11 ARRY-520 DCL000053 Cancer/Tumors Phase I
Kinesin-like protein KIF11 Ispinesib DCL000139 Pediatric Phase I
Kinesin-like protein KIF11 Ispinesib DCL000139 Head and Neck Cancer, Renal Cell Carcinoma, Ovarian Cancer, Solid Tumors Phase II
Kinesin-like protein KIF11 Ispinesib DCL000139 Lung Cancer Phase II completed
Kinesin-like protein KIF11 SB-743921 DCL000224 Non-Hodgkin's Lymphoma, Cancer/Tumors Phase I/II
Kinesin-like protein KIF11 4SC-205 DCL001129 Solid tumour and malignant lymphoma Phase I
Neurotensin receptor type 1 CGX-1160 DCL000084 Acute or Chronic Pain Phase I completed
Neurotensin receptor type 1 Meclinertant DCL000163 Colorectal Cancer, Prostate Cancer, Schizophrenia, Schizoaffective Disorders, Psychosis, Depression, Lung Cancer Discontinued in Phase III
Ribosomal protein S6 kinase XL418 DCL000009 Solid Tumors Suspended in Phase I
Interstitial collagenase BMS 275291 DCL000003 Non-small Cell Lung Cancer, Hormone-refractory Prostate Cancer, Kaposi's Sarcoma Discontinued in Phase III
Interstitial collagenase Prinomastat DCL000004 Brain Cancer Discontinued in Phase III
Interstitial collagenase Prinomastat DCL000004 Lung Cancer, Prostate Cancer Trial halted
Interstitial collagenase Marimastat DCL000005 Pancreatic Cancer, Lung Cancer Discontinued in Phase III
Interstitial collagenase BB-3644 DCL000014 Cancer/Tumors Discontinued in Phase I
Interstitial collagenase XL784 DCL001039 Diabetic nephropathy Discontinued in Phase II
Interstitial collagenase Batimastat DPR000163 Cancers Discontinued in Phase I
Integrin beta
the output I get is
AZD1152 DCL000452 Myeloid Leukemia Phase I/II
Acute Myeloid Leukemia, Haematological malignancies Phase II
MK-5108 DCL000572 Cancer; Neoplasms; Tumors Phase I
AT-9283 DCL001068 Adult solid tumours, NHL, AML, ALL, CML, MDS and myelofibrosis Phase I/II
CYC-116 DCL001070 Advanced solid tumours Terminated in Phase I
ENMD-2076 DCL001071 Ovarian Cancer, Fallopian Cancer, Peritoneal Cancer Phase II
PF-03814735 DCL001076 Advanced solid tumours Phase I
PHA-739358 DCL001078 CML that relapsed after imatinib or BCR-ABL-targeted therapy; Metastatic Hormone Refractory Prostate Cancer (MHRPC) Phase II
VX-689 DCL001083 Cancer; Neoplasms; Tumors Phase I
Toll-like receptor 3 HspE7 (TLR3 agonist adjuvant) DCL000129 Anal intraepithelial neoplasia Discontinued in Phase I/II
Human Papillomavirus (HPV) Infections Discontinued in Phase I/II
5-hydroxy-tryptamine 3B receptor Cilansetron DCL000087 Irritable Bowel Syndrome (IBS), Diarrhea Phase III, Positive phase III results
Serine/threonine-protein kinase Chk2 XL844 DCL000017 Advanced solid tumours or lymphoma Suspended in Phase I
Fibronectin AS1409 DCL000055 Kidney Cancer, Melanoma Phase I
Plasma kallikrein Ecallantide DCL000108 Hereditary angioedema Approved
Glucosylceramidase Isofagomine tartrate DCL000138 Metabolic Disease Phase II
Protein kinase C gamma type Midostaurin DCL000165 Breast & colorectal cancer Phase I
Colon, breast, CLL, AML, GIST, solid tumours & non-Hodgkin's lymphoma Phase II
Alpha-galactosidase A Migalastat DCL000166 Fabry Disease Phase III
Calcitonin gene-related peptide 1 Olcegepant DCL000187 Migraine and Cluster Headaches Discontinued in Phase I/II
Cizolirtine DCL000753 Neuropathic pain Phase II
Heat shock protein HSP 90 Alvespimycin hydrochloride DCL000035 Ovarian Cancer, Refractory Hematological Malignancies Phase I
Refractory acute myelogenous leukemia; HER2-positive Metastatic Breast Cancer and Leukaemia Terminated in Phase II
AT13387 DCL000057 Cancer/Tumors Phase I
CNF1010 DCL000089 Solid Tumors, Chronic Myelogenous Leukemia Terminated in Phase I
IPI-504 DCL000137 Gastrointestinal Stromal Tumors Phase I
Non-small Cell Lung Cancer Phase I/II
Solid Tumors Phase Ib
Prostate Cancer Phase II
SNX-5422 DCL000231 Hematological Malignancies Phase I
STA-9090 DCL000236 Solid Tumors Phase I
Tanespimycin DCL000242 Breast Cancer, Melanoma Phase II
Multiple Myeloma Suspended in Phase III
Cathepsin G Dermolastin DCL000019 Chronic Obstructive Pulmonary Disease Halted in Phase I
Atopic Dermatitis, Alpha 1 Antitrypsin Deficiency Phase II
Emphysema Halted in Phase I
Integrin alpha-5 JSM 6427 DCL000012 Macular Degeneration Phase I
mRNA of Myb proto-oncogene protein LR3001 DCL000154 Myeloid Leukemia Phase II
Lysosomal alpha-glucosidase Celgosivir DCL000082 Hepatitis C Phase II
Glucobay DCL000309 Diabetes Mellitus Type 2 Phase IV
Basic fibroblast growth factor receptor 1 FGF-1 DCL000113 Peripheral Vascular Disease, Ulcers Phase I
Severe Coronary Heart Disease Phase II
SU-6668 DCL000342 Advanced solid tumours Discontinued
Phospholipase A2, membrane associated Varespladib DCL000258 Coronary Artery Disease, Atherosclerosis Phase II
Interleukin-2 receptor subunit beta Medusa IL-2 DCL000164 Cancer/Tumors Phase I/II
Histidine decarboxylase BF-Derm1 DCL000066 Skin Infections/Disorders Phase II
Tissue kallikrein Dermolastin DCL000019 Chronic Obstructive Pulmonary Disease Halted in Phase I
Atopic Dermatitis, Alpha 1 Antitrypsin Deficiency Phase II
Emphysema Halted in Phase I
Atrial natriuretic peptide receptor B CD-NP DCL000081 Myocardial infarction, Heart Disease Phase Ia
P-glycoprotein LY335979 DCL000157 Acute Myeloid Leukemia Phase III completed
Integrin beta-7 RhuMAb Beta7 DCL000622 Ulcerative colitis Phase I
Vedolizmab DCL000662 Ulcerative colitis, Crohn's d
and there is one error message after printing all entries:
MSX-122 DCL000173 Late-stage Solid Tumors Suspended in Phase I
KRH-2731 DPR000144 HIV Infection Preclinical
C-C chemokine receptor type 2 CCX915 DCL000080 Multiple Sclerosis Phase I
INCB3284 DCL000135 Rheumatoid Arthritis Discontinued in Phase II
Obese Insulin-resistant Subjects Discontinued in Phase IIa
INCB8696 DCL000546 Multiple scierosis Phase I
INCB-3284 DCL000845 Rheumatoid arthritis Discontinued in Phase I
MLN1202 DCL000883 Multiple Sclerosis Phase II completed
Metastatic Cancer; Unspecified Adult Solid Tumor, Protocol Specific Phase II
MCP-1 DPR000072 Rheumatoid arthritis Preclinical
RS-504393 DPR000102 Chronic obstructive pulmonary disease Preclinical
Traceback (most recent call last):
File "./duplicate.py", line 20, in ?
while left[index] == right[index]:
IndexError: list index out of range
bash-3.2$
Kindly check it.
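The traceback above is consistent with the comparison loop running past the end of a list, which would happen whenever every word of the shorter line matches the other line, for example when two consecutive lines are identical. A minimal Python 3 sketch of a bounded comparison follows. It is not the posted script: `common_prefix_len` is a hypothetical helper name, and the sample line is taken from the input above and split on whitespace as the script does.

```python
def common_prefix_len(left, right):
    """Return how many leading items of the two lists are equal."""
    matched = 0
    # zip() stops at the end of the shorter list, so no index can go out of range.
    for a, b in zip(left, right):
        if a != b:
            break
        matched += 1
    return matched


# Two identical consecutive lines: every word matches, yet nothing is raised.
previous = "Kinesin-like protein KIF11 Ispinesib DCL000139".split()
current = "Kinesin-like protein KIF11 Ispinesib DCL000139".split()
print(common_prefix_len(current, previous))
```

With an unbounded `while left[index] == right[index]` loop, the same pair of lines would step one position past the last word and raise `IndexError`.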
Hmm, I think the problem here is that the sample data we are using to build our code is not formatted the same way as the raw data you are parsing. Can you please upload your data sets?
Hmm,
here is the attached dataset!
Mani
Yep, as I thought, it's tab-delimited.
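Since the file is tab-delimited, one way to touch only the first column is to split each line at its first tab and nowhere else. The following is a minimal Python 3 sketch, not the code posted in this thread. The function name `blank_repeated_first_column` is illustrative, and the sample rows are shortened from the data above with tabs placed where the columns are assumed to be. It empties the first field when it equals the first field of the previous line and keeps the tab, so the remaining columns do not shift left.

```python
def blank_repeated_first_column(lines, sep="\t"):
    """Yield each line with its first field emptied when it repeats the previous line's."""
    previous_key = None
    for line in lines:
        body = line.rstrip("\n")
        # partition() splits at the first separator only; later fields are untouched.
        key, found_sep, rest = body.partition(sep)
        if key == previous_key:
            # Keep the separator so the other columns stay where they were.
            yield found_sep + rest
        else:
            yield body
        previous_key = key


# Assumed sample: three tab-separated columns per row.
sample = [
    "Hexokinase D\tLonidamine\tDCL000153\n",
    "Hexokinase D\tPSN-101\tDCL000201\n",
    "mRNA of Clusterin\tOGX-011\tDCL000186\n",
]
for out in blank_repeated_first_column(sample):
    print(out)
```

Because only the text before the first tab is compared and removed, a first field that contains spaces, such as `Hexokinase D`, is treated as a single column.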
Can something be done?
Yep, working on it now. I think I've found the problem: it seems you have some duplicate lines. I will post some code soon.
The data is already in order, so all you need to do is run the following command on the data set, and then my code should work. Give it a try.
cat <data_file.txt> | uniq > <output_file>.txt
Hi, did this work for you?
Hi
Thanks for the reply.
It's still showing some errors, but thanks for your patience.
bash-3.2$ uniq -c sarattdnewdruggene.txt >sarattdnewdruggene3.txt
bash-3.2$ cat <sarattdnewdruggene.txt> | uniq > <sarattdnewdruggene4>.txt
bash: syntax error near unexpected token `|'
bash-3.2$ cat <sarattdnewdruggene.txt> | uniq > <sarattdnewdruggene4.txt>
bash: syntax error near unexpected token `|'
bash-3.2$ cat <sarattdnewdruggene.txt> | uniq | <sarattdnewdruggene4.txt>
bash: syntax error near unexpected token `|'
bash-3.2$ cat <sarattdnewdruggene.txt> | uniq <sarattdnewdruggene4.txt>
bash: syntax error near unexpected token `|'
bash-3.2$ cat <sarattdnewdruggene.txt> uniq <sarattdnewdruggene4.txt>
bash: syntax error near unexpected token `newline'
bash-3.2$
The command below should work:
cat sarattdnewdruggene.txt | uniq > sarattdnewdruggene4.txt
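For reference, plain `uniq` without options drops a line only when it is identical to the line directly before it, so it affects fully duplicated rows and never looks at the first column on its own. A rough Python 3 sketch of that behaviour follows, as an illustration only; the function name `collapse_adjacent_duplicates` and the sample rows are assumptions, not part of the thread.

```python
from itertools import groupby


def collapse_adjacent_duplicates(lines):
    """Keep one copy of each run of identical adjacent lines."""
    # groupby() groups consecutive equal items; the group key is the line itself.
    return [line for line, _run in groupby(lines)]


rows = [
    "Kinesin-like protein KIF11\tIspinesib\tDCL000139",
    "Kinesin-like protein KIF11\tIspinesib\tDCL000139",
    "Kinesin-like protein KIF11\tSB-743921\tDCL000224",
]
for row in collapse_adjacent_duplicates(rows):
    print(row)
```

Rows that share a first column but differ anywhere else are all kept, so a file with no fully identical adjacent lines comes out unchanged.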
Hi
Using this, the output is exactly the same as the input, with no change.
I have got one more file like that.
If I have input like this
And I want output in which only the repetition in the first column is removed. Here the second column moves to the left after the removal, but I don't want that; I want the second column to remain as it is, in the second column.
Can you please post your data source?
With your test data
CHRM1 P11229 Pirenzepine DAP000492 Peptic ulcer disease Approved T2D
CHRM1 P11229 Glycopyrrolate DAP001116 Anesthetic Approved T2D
CHRM1 P11229 Clidinium DAP001117 Abdominal/stomach pain Approved T2D
CHRM1 P11229 Dicyclomine DAP001118 Irritable bowel syndrome Approved T2D
CHRM1 P11229 Ethopropazine DAP001119 Parkinson's disease Approved T2D
CHRM1 P11229 Cycrimine DAP001120 Parkinson's disease Approved T2D
CHRM1 P11229 Benztropine DAP001121 Parkinson's disease Approved T2D
CHRM1 P11229 Trihexyphenidyl DAP001122 Parkinson's disease Approved T2D
CHRM1 P11229 Propantheline DAP001123 Excessive sweating (hyperhidrosis) Approved T2D
CHRM1 P11229 Oxyphenonium DAP001124 Spasm Approved T2D
CHRM1 P11229 Biperiden DAP001125 Parkinson's disease Approved T2D
CHRM1 P11229 Talsaclidine isomer DCL000268 Alzheimer's disease Discontinued T2D
CHRM1 P11229 Sabcomeline hydrochloride DCL000279 Cardiovascular diseases Phase IIa T2D
CHRM1 P11229 Talsaclidine fumarate DCL000303 Alzheimer's disease Discontinued T2D
CHRM1 P11229 Xanomeline tartrate DCL000328 Alzheimer's disease Phase II T2D
CHRM1 P11229 GSK573719 DCL000381 Chronic Obstructive Pulmonary Disease (COPD) Phase II T2D
CHRM1 P11229 GSK961081 DCL000397 Chronic Obstructive Pulmonary Disease (COPD) Phase II completed T2D
CHRM1 P11229 GSK1034702 DCL000402 Schizophrenia, Dementia Phase I completed T2D
CHRM1 P11229 Darotropium DCL000514 COPD Suspended in Phase II in GSK 2009 Report T2D
CHRM1 P11229 Darotropium + 642444 DCL000515 COPD Phase III T2D
CHRM1 P11229 Revatropate DCL000957 Chronic obstructive pulmonary disease Discontinued in Phase I T2D
FLT1 P17948 Sorafenib DAP000006 Advanced renal cell carcinoma Launched CAD
FLT1 P17948 Sorafenib DAP000006 Hepatocellular carcinoma, NSCLC, melanoma Phase III CAD
FLT1 P17948 Sorafenib DAP000006 Myelodyspalstic syndrome, AML, head & neck cancer, breast, colon, ovarian, pancreatic cancer Phase II CAD
FLT1 P17948 Ranibizumab DAP001260 Age-related macular degeneration Approved CAD
FLT1 P17948 Ranibizumab DAP001260 Diabetic macular edema and retinal vein occlusion Phase III CAD
my code resulted in this
CHRM1 P11229 Pirenzepine DAP000492 Peptic ulcer disease Approved T2D
Glycopyrrolate DAP001116 Anesthetic Approved T2D
Clidinium DAP001117 Abdominal/stomach pain Approved T2D
Dicyclomine DAP001118 Irritable bowel syndrome Approved T2D
Ethopropazine DAP001119 Parkinson's disease Approved T2D
Cycrimine DAP001120 Parkinson's disease Approved T2D
Benztropine DAP001121 Parkinson's disease Approved T2D
Trihexyphenidyl DAP001122 Parkinson's disease Approved T2D
Propantheline DAP001123 Excessive sweating (hyperhidrosis) Approved T2D
Oxyphenonium DAP001124 Spasm Approved T2D
Biperiden DAP001125 Parkinson's disease Approved T2D
Talsaclidine isomer DCL000268 Alzheimer's disease Discontinued T2D
Sabcomeline hydrochloride DCL000279 Cardiovascular diseases Phase IIa T2D
Talsaclidine fumarate DCL000303 Alzheimer's disease Discontinued T2D
Xanomeline tartrate DCL000328 Alzheimer's disease Phase II T2D
GSK573719 DCL000381 Chronic Obstructive Pulmonary Disease (COPD) Phase II T2D
GSK961081 DCL000397 Chronic Obstructive Pulmonary Disease (COPD) Phase II completed T2D
GSK1034702 DCL000402 Schizophrenia, Dementia Phase I completed T2D
Darotropium DCL000514 COPD Suspended in Phase II in GSK 2009 Report T2D
+ 642444 DCL000515 COPD Phase III T2D
Revatropate DCL000957 Chronic obstructive pulmonary disease Discontinued in Phase I T2D
FLT1 P17948 Sorafenib DAP000006 Advanced renal cell carcinoma Launched CAD
Hepatocellular carcinoma, NSCLC, melanoma Phase III CAD
Myelodyspalstic syndrome, AML, head & neck cancer, breast, colon, ovarian, pancreatic cancer Phase II CAD
Ranibizumab DAP001260 Age-related macular degeneration Approved CAD
Diabetic macular edema and retinal vein occlusion Phase III CAD
using this code
#!/usr/bin/python
import sys
if len(sys.argv) < 2:
    print "usage:",sys.argv[0],"<file_path>"
    sys.exit(69)
f = open(sys.argv[1], 'r')
lines = f.readlines()
count = 0
index = 0
for item in lines:
    if count != 0:
        left = lines[count].split()
        right = lines[count-1].split()
        while left[index] == right[index]:
            index += 1
        print ' '.join(left[index:])
        index = 0
    else:
        print lines[count].rstrip()
    count += 1