Retrieve many entries using awk

Hi all

I have a problem similar to someone else's: I am using awk to retrieve certain entries, and the output is not what I expect.

From the attached sample of a big file, I have to retrieve the following headings as columns from each drug card, as there are many drug cards.

# Drug_Target_.*_Gene_Name

# Brand_Name

# Generic_Name

# Drug_Type

# Indication

# Mechanism_Of_Action

Using the code below, my output is a bit random and weird, even though some parts are correct: the mechanism keeps repeating again and again,

a row sometimes starts on the next line rather than fitting into columns,

and the brand name sometimes goes wrong!

I am trying to retrieve them with:

awk 'k>0 {if (a[k] && k==2) {print a[1]":"a[2]":"a[3]":"a[4]":"a[5]":"a]6]; a[1]=a[2]=a[3]=a[4]=a[5]=a[6]"";} a[k]=a[k]?a[k]","$0:$0; k=0;} /^# / {k=1;} /^# Generic/ {k=1;} /^# Brand_Name/ {k=2;}  /^# Drug_Type/ {k=3;} /^# Indication/ {k=5;} /^# Mechanism_ Of_ Action / {k=6;} END {if (a[1]) print a[1]":"a[2]":"a[3]":"a[4]":"a[5]":"a[6];}' drugbank.txt >drugbanknew.txt

Can anybody check my code on their system and let me know what is wrong? Many thanks.

Not sure I understand what you want to achieve. Pls post desired output.

BTW, your script above contains a few typos...

Hi Rudi

I have to fetch the bold headings mentioned in my first post as columns in my expected output.
This input file is a sample of my big input file, which contains many drug cards.

So my output should look like the following six columns:

F2 lepirudin refludan approved "sentence under indication" "sentence under mechanism of action"

As my real input file is big, there will be many rows like this with different names, and with my current code these rows are overlapping!

see my post #2.

Sorry, I was not able to find your post #2. Could you please explain which one it is?

Many thanks for this.

Post #2

Not sure I understand what you want to achieve. Pls post desired output. 

You have sample input, but we would like to see sample output. What would you like to get from your script?

Hi

For the sample input attached above, the expected output is 6 columns like this:

F2    Lepirudin    Refludan    Approved    For the treatment of heparin-induced thrombocytopenia     Lepirudin forms a stable non-covalent complex with alpha-thrombin, thereby abolishing its ability to cleave fibrinogen and initiate the clotting cascade. The inhibition of thrombin prevents the blood clotting cascade.

Here the 6 columns represent the following headings from my input file:

# Drug_Target_.*_Gene_Name

# Brand_Name

# Generic_Name

# Drug_Type

# Indication

# Mechanism_Of_Action

In the same way, I have a big file with many drug cards, each containing these entries, which I have to fetch into 6 columns.

I have also posted the code I am using.

Try this:

awk 'BEGIN {cnt = split ("# Drug_Target_.*_Gene_Name|# Brand_Name|# Generic_Name|# Drug_Type|# Indication|# Mechanism_Of_Action", SA, "|")}
     {for (i=1; i<=cnt; i++) if (match ($1, SA[i])) Out[i]=$2}
     END {for (i=1; i<=cnt; i++) printf "%-28s", SA[i];  printf "\n";
          for (i=1; i<=cnt; i++) printf "%-28s", Out[i]; printf "\n" }
    ' FS="\n" RS="\n\n" /tmp/Input\ file.txt
# Drug_Target_.*_Gene_Name  # Brand_Name                # Generic_Name              # Drug_Type                 # Indication                # Mechanism_Of_Action       
F2                          Refludan                    Lepirudin                   Approved                    For the treatment of heparin-induced thrombocytopeniaLepirudin form
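For readers wondering how the script above sees the file: with FS="\n" and a blank line as record separator, each blank-line-separated block becomes one record, so $1 is the heading line and $2 is the value line under it. Here is a minimal sketch of that idea. Note that it swaps in RS="" (the POSIX "paragraph mode") for RS="\n\n", because multi-character RS values are not supported by every awk; the two-heading input is a made-up excerpt, and the real file layout is assumed rather than confirmed in this thread.

```shell
# Assumed layout: a heading line, its value on the next line, then a blank line.
out=$(printf '# Brand_Name\nRefludan\n\n# Generic_Name\nLepirudin\n' |
    awk 'BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one record per block
         { print NR ": heading=[" $1 "] value=[" $2 "]" }')
printf '%s\n' "$out"
```

Each block is printed as its own numbered record, which is what lets the script compare $1 against the list of headings.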

Hi

Thanks for reply.

The above command gives me only one line of output from my big input file, like this:

bash-3.2$ awk 'BEGIN {cnt = split ("# Drug_Target_.*_Gene_Name|# Brand_Name|# Generic_Name|# Drug_Type|# Indication|# Mechanism_Of_Action", SA, "|")}
>      {for (i=1; i<=cnt; i++) if (match ($1, SA[i])) Out[i]=$2}
>      END {for (i=1; i<=cnt; i++) printf "%-28s", SA[i];  printf "\n";
>           for (i=1; i<=cnt; i++) printf "%-28s", Out[i]; printf "\n" }
>     ' FS="\n" RS="\n\n" drugbank.txt
# Drug_Target_.*_Gene_Name  # Brand_Name                # Generic_Name              # Drug_Type                 # Indication                # Mechanism_Of_Action       
CFTR                        Kalydeco                    Ivacaftor                   Approved                    For the treatment of cystic fibrosis (CF) in patients age 6 years and older who have a G551D mutation in the CFTR gene.Cystic fibrosis is caused by any one of several defects in a protein, cystic fibrosis transmembrane conductance regulator, which regulates fluid flow within cells and affects the components of sweat, digestive fluids, and mucus. The defect, which is caused by a mutation in the individual's DNA, can be in any of several locations along the protein, each of which interferes with a different function of the protein. One mutation, G551D, lets the CFTR protein reach the epithelial cell surface, but doesn't let it transport chloride through the ion channel. Ivacaftor is a potentiator of the CFTR protein. The CFTR protein is a chloride channel present at the surface of epithelial cells in multiple organs. Ivacaftor facilitates increased chloride transport by potentiating the channel-open probability (or gating) of the G551D-CFTR protein.
bash-3.2$ 

I tried another similar file and it also gave me only one result.

It seems to me the error is related to processing the whole file with many similar entries!

This is EXACTLY what you have requested, there is NO error. Your sample file had one single record only, you did not provide a representative sample file, no hint was given on how records can be separated, identified, or checked for completeness, nor how data fields are interrelated. Pls use the code example in my post and extend/improve it to your expanded requirement.

I looked into the foreach and continue commands to retrieve all entries, but it doesn't work!

I didn't understand the symbol "%-28s", so I am unable to proceed further. Your guidance will be appreciated.
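On the "%-28s" question: it is a printf format specifier. The s means "print a string", 28 is the minimum field width, and the minus sign left-justifies the text and pads it with spaces on the right. A small sketch (the brackets are only there to make the padding visible):

```shell
# %-28s : left-justified, padded to 28 characters
# %28s  : right-justified, padded to 28 characters
# %s    : no padding at all
out=$(awk 'BEGIN { printf "[%-28s][%28s][%s]\n", "F2", "F2", "F2" }')
printf '%s\n' "$out"
```

The width is only a minimum: a string longer than 28 characters is printed in full, not truncated, which is why long Indication and Mechanism texts run into each other in the output above.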

I made changes like this

awk 'BEGIN {cnt = split ("# Drug_Target_.*_Gene_Name|# Brand_Name|# Generic_Name|# Drug_Type|# Indication|# Mechanism_Of_Action", SA, "|")}
     foreach (cnt){{for (i=1; i<=cnt; i++) if (match ($1, SA[i])) Out[i]=$2}
     END {for (i=1; i<=cnt; i++) printf "%-28s", SA[i];  printf "\n";
          for (i=1; i<=cnt; i++) printf "%-28s", Out[i]; printf "\n" }}
    ' FS="\n" RS="\n\n"
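A note on the attempt above: awk has no foreach statement, and none is needed, because the main rule already runs once for every input record. The earlier script printed a single row because Out[i] is overwritten by each later card and only printed once, in END. One possible way to extend it is to print a row whenever a card ends and then reset the values. The sketch below is only an illustration under loud assumptions: the thread never says how cards are delimited, so the #BEGIN_DRUGCARD and #END_DRUGCARD lines are hypothetical markers, the emit function is a helper introduced here, and the sample data is shortened from the posts above.

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'
#BEGIN_DRUGCARD

# Brand_Name
Refludan

# Drug_Target_1_Gene_Name
F2

# Drug_Type
Approved

# Generic_Name
Lepirudin

# Indication
For the treatment of heparin-induced thrombocytopenia

# Mechanism_Of_Action
Lepirudin forms a stable non-covalent complex with alpha-thrombin, thereby abolishing its ability to cleave fibrinogen and initiate the clotting cascade.

#END_DRUGCARD

#BEGIN_DRUGCARD

# Brand_Name
Kalydeco

# Drug_Target_1_Gene_Name
CFTR

# Drug_Type
Approved

# Generic_Name
Ivacaftor

# Indication
For the treatment of cystic fibrosis (CF) in patients age 6 years and older who have a G551D mutation in the CFTR gene.

# Mechanism_Of_Action
Ivacaftor is a potentiator of the CFTR protein.

#END_DRUGCARD
EOF

out=$(awk '
BEGIN {
    OFS = "\t"
    # Wanted headings, in output column order
    n = split("Drug_Target_.*_Gene_Name|Brand_Name|Generic_Name|Drug_Type|Indication|Mechanism_Of_Action", pat, "|")
}
# Print one tab-separated row for the current card, then reset its values
function emit(    i, line) {
    if (!seen) return
    line = val[1]
    for (i = 2; i <= n; i++) line = line OFS val[i]
    print line
    for (i = 1; i <= n; i++) val[i] = ""
    seen = 0
}
/^#END_DRUGCARD/ { emit(); cur = 0; next }   # hypothetical end-of-card marker
!NF              { cur = 0; next }           # a blank line ends the current value
/^# / {                                      # heading line: is it one we want?
    cur = 0
    for (i = 1; i <= n; i++)
        if ($0 ~ ("^# " pat[i] "[ \t]*$")) { cur = i; break }
    next
}
cur {                                        # value line under a wanted heading
    val[cur] = (val[cur] == "" ? $0 : val[cur] "," $0)
    seen = 1
}
END { emit() }                               # in case the last card has no end marker
' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

Repeated values under the same heading (for example several gene names or brand names) are joined with commas, as in the very first script of this thread. If your real file marks card boundaries differently, only the /^#END_DRUGCARD/ pattern should need changing.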