Hello
I have to extract certain entries from a big file that contains many drugs and their descriptions.
I want to extract only the drug name, which appears as
#BEGIN_DRUGCARD DB00001
(meaning the first drug description starts here), then DB00002 in the same way, and so on,
and within each description I have to extract
# Drug_Target_1_Name:
# Drug_Target_1_GenBank_ID_Gene:
# Drug_Target_1_GenBank_ID_Protein:
or targets 2, 3, and so on, if they are also present.
So in the output file
I would get
#BEGIN_DRUGCARD DB00001 Drug_Target_1_Name (the whole name as mentioned)
# Drug_Target_1_GenBank_ID_Gene:
# Drug_Target_1_GenBank_ID_Protein:
And then
#BEGIN_DRUGCARD DB00001 with as many targets as are mentioned, each with the GenBank ID of its gene and protein
Please let me know of any program that can do this, if possible. I have attached a sample file; kindly check it.
Thanks
Mani
Try this:
awk '/^#BEGIN_/||/^# Drug_Target_[1-9]/' infile
Hello
Thanks for the reply and the help with the script. After running the script mentioned above, I am getting the following result:
awk '/^#BEGIN_/||/^# Drug_Target_[1-9]/' infile
#BEGIN_DRUGCARD DB00001
# Drug_Target_1_Cellular_Location:
# Drug_Target_1_Chromosome_Location:
# Drug_Target_1_Drug_References:
# Drug_Target_1_Essentiality:
# Drug_Target_1_GenAtlas_ID:
# Drug_Target_1_GenBank_ID_Gene:
# Drug_Target_1_GenBank_ID_Protein:
# Drug_Target_1_GeneCard_ID:
# Drug_Target_1_Gene_Name:
# Drug_Target_1_Gene_Sequence:
# Drug_Target_1_General_Function:
# Drug_Target_1_General_References:
# Drug_Target_1_HGNC_ID:
# Drug_Target_1_HPRD_ID:
# Drug_Target_1_ID:
# Drug_Target_1_Locus:
# Drug_Target_1_Molecular_Weight:
# Drug_Target_1_Name:
# Drug_Target_1_Number_of_Residues:
# Drug_Target_1_PDB_ID:
# Drug_Target_1_Pathway:
# Drug_Target_1_Pfam_Domain_Function:
# Drug_Target_1_Protein_Sequence:
# Drug_Target_1_Reaction:
# Drug_Target_1_Signals:
# Drug_Target_1_Specific_Function:
# Drug_Target_1_SwissProt_ID:
# Drug_Target_1_SwissProt_Name:
# Drug_Target_1_Synonyms:
# Drug_Target_1_Theoretical_pI:
# Drug_Target_1_Transmembrane_Regions:
#BEGIN_DRUGCARD DB00002
# Drug_Target_10_Cellular_Location:
# Drug_Target_10_Chromosome_Location:
# Drug_Target_10_Drug_References:
# Drug_Target_10_Essentiality:
# Drug_Target_10_GenAtlas_ID:
# Drug_Target_10_GenBank_ID_Gene:
# Drug_Target_10_GenBank_ID_Protein:
# Drug_Target_10_GeneCard_ID:
# Drug_Target_10_Gene_Name:
# Drug_Target_10_Gene_Sequence:
# Drug_Target_10_General_Function:
# Drug_Target_10_General_References:
# Drug_Target_10_HGNC_ID:
# Drug_Target_10_HPRD_ID:
# Drug_Target_10_ID:
# Drug_Target_10_Locus:
# Drug_Target_10_Molecular_Weight:
# Drug_Target_10_Name:
# Drug_Target_10_Number_of_Residues:
# Drug_Target_10_PDB_ID:
# Drug_Target_10_Pathway:
# Drug_Target_10_Pfam_Domain_Function:
# Drug_Target_10_Protein_Sequence:
# Drug_Target_10_Reaction:
# Drug_Target_10_Signals:
# Drug_Target_10_Specific_Function:
# Drug_Target_10_SwissProt_ID:
# Drug_Target_10_SwissProt_Name:
# Drug_Target_10_Synonyms:
# Drug_Target_10_Theoretical_pI:
# Drug_Target_10_Transmembrane_Regions:
# Drug_Target_11_Cellular_Location:
# Drug_Target_11_Chromosome_Location:
# Drug_Target_11_Drug_References:
# Drug_Target_11_Essentiality:
# Drug_Target_11_GenAtlas_ID:
# Drug_Target_11_GenBank_ID_Gene:
# Drug_Target_11_GenBank_ID_Protein:
# Drug_Target_11_GeneCard_ID:
# Drug_Target_11_Gene_Name:
# Drug_Target_11_Gene_Sequence:
# Drug_Target_11_General_Function:
# Drug_Target_11_General_References:
# Drug_Target_11_HGNC_ID:
# Drug_Target_11_HPRD_ID:
# Drug_Target_11_ID:
# Drug_Target_11_Locus:
But I want the output to contain the entries given after the GenBank gene ID, the GenBank protein ID, and the protein name,
so the output can be
DRUGCARD DB00001 Drug_Target_1_GenBank_ID_Gene:0000 (whatever number)
# Drug_Target_1_GenBank_ID_Protein: (whatever ID)
# Drug_Target_1_Gene_Name: (the name mentioned)
And if I can get these entries in separate columns, it will be very easy to recognise and arrange the whole list of all drug cards.
Please let me know if you have any idea.
Thanks
Mani
How about this then:
awk '
{ gsub(/\n/, "") }                                 # join each header with its value lines
/^#*BEGIN_/ { sub(/^#*BEGIN_/, ""); N = $0; next } # remember "DRUGCARD DBnnnnn"
/^ Drug_Target_[0-9]+_(GenBank_ID_(Gene|Protein)|Gene_Name):/ {
    sub(/^ /, "")
    if (N != "") { print N, $0; N = "" }           # first hit shares the drug card line
    else print "# " $0
}' RS='\n#' infile
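For the separate-columns request, a line-by-line variant may be easier to adapt. This is only a sketch under assumptions: it assumes each "# Field:" header is followed by its value on the next non-empty line (the thread does not show the values), and the sample file name, the IDs, and the gene names below are made up for illustration. The function name emit_row is likewise just a placeholder.

```shell
# Hypothetical sample data; every ID and name here is invented.
cat > sample_drugcards.txt <<'EOF'
#BEGIN_DRUGCARD DB00001

# Drug_Target_1_GenBank_ID_Gene:
GENE0001

# Drug_Target_1_GenBank_ID_Protein:
PROT0001

# Drug_Target_1_Gene_Name:
ABC1

# Drug_Target_1_ID:
54

#BEGIN_DRUGCARD DB00002

# Drug_Target_10_GenBank_ID_Gene:
GENE0010

# Drug_Target_10_GenBank_ID_Protein:
PROT0010

# Drug_Target_10_Gene_Name:
XYZ10
EOF

result=$(awk -v OFS='\t' '
BEGIN {
    # the three fields asked for in the thread
    want["GenBank_ID_Gene"] = 1
    want["GenBank_ID_Protein"] = 1
    want["Gene_Name"] = 1
}
function emit_row() {
    # print one row for the current target, then reset the collected values
    if (target != "")
        print drug, target, val["GenBank_ID_Gene"], val["GenBank_ID_Protein"], val["Gene_Name"]
    target = ""
    split("", val)
}
/^#BEGIN_DRUGCARD/ { emit_row(); drug = $2; field = ""; next }
/^# Drug_Target_[0-9]+_/ {
    t = $2; sub(/^Drug_Target_/, "", t)               # e.g. "1_GenBank_ID_Gene:"
    num = t; sub(/_.*/, "", num)                      # target number
    f = t; sub(/^[0-9]+_/, "", f); sub(/:$/, "", f)   # field name without the colon
    if (num != target) { emit_row(); target = num }
    field = f
    next
}
/^#/ { field = ""; next }                             # any other header ends the current field
NF && (field in want) && !(field in val) { val[field] = $0 }  # first non-empty value line
END { emit_row() }
' sample_drugcards.txt)

printf '%s\n' "$result"
```

This prints one tab-separated row per target: drug ID, target number, GenBank gene ID, GenBank protein ID, gene name. It does not depend on a multi-character RS, so it should behave the same in gawk, mawk, and other POSIX awks. If a value spans several lines, only the first non-empty line is kept.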