Blocks into table

Please help: I have a huge file with blocks of data that I need to convert to a tabular format.

Sample input:

[Term]
id: GO:0000017
name: alpha-glucoside transport
namespace: biological_process
def: "The directed movement of alpha-glucosides into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. Alpha-glucosides are glycosides in which the sugar group is a glucose residue, and the anomeric carbon of the bond is in an alpha configuration." [GOC:jl, http://www.biochem.purdue.edu/, ISBN:0198506732]
is_a: GO:0042946 ! glucoside transport

[Term]
id: GO:0000018
name: regulation of DNA recombination
namespace: biological_process
def: "Any process that modulates the frequency, rate or extent of DNA recombination, a DNA metabolic process in which a new genotype is formed by reassortment of genes resulting in gene combinations different from those that were present in the parents." [GOC:go_curators, ISBN:0198506732]
subset: gosubset_prok
is_a: GO:0051052 ! regulation of DNA metabolic process
relationship: regulates GO:0006310 ! DNA recombination


Each block starts with [Term] and has the fields id, name, namespace, etc.
Note that the def sometimes carries over to the next line in the input; I would like that to be in a column in the output.

I want to have a table with the four columns

id name namespace def

Desired output (tab delimited possibly)

id name namespace def
GO:0000017 alpha-glucoside transport biological_process "The directed movement of alpha-glucosides into, out of or within a cell, or between cells, by means of some agent such as a transporter or pore. Alpha-glucosides are glycosides in which the sugar group is a glucose residue, and the anomeric carbon of the bond is in an alpha configuration." [GOC:jl, http://www.biochem.purdue.edu/, ISBN:0198506732]
GO:0000018 regulation of DNA recombination biological_process "Any process that modulates the frequency, rate or extent of DNA recombination, a DNA metabolic process in which a new genotype is formed by reassortment of genes resulting in gene combinations different from those that were present in the parents." [GOC:go_curators, ISBN:0198506732]

With this specification, I'm not sure how we can help you. How are we supposed to determine if a def: field carries over to the next line? The def contents seem to contain quoted and unquoted colons, so how can we guess whether or not the line following a def: line is the start of a new field or a continuation of the previous line?

What does "I would like that to be in a column in the output" mean? Do you want a fifth column in your output file that indicates that your input file had a multi-line definition for "def"?

You said your desired output should be "tab delimited possibly", but your sample output is space delimited (and has spaces within fields), not tab delimited.

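awk '
# Strip the leading "tag: " from the current line and return the remainder.
# The anchored sub() removes only the tag, even if the value contains ": ".
function p() {
    sub(/^[^:]*: */, "")
    return $0
}

# Print the header once, before any data row.
NR == 1        { print "id", "name", "namespace", "def" }

/^id:/         { id = p() }
/^name:/       { name = p() }
/^namespace:/  { namespace = p() }

# def is the last of the four wanted fields in each block, so print the row here.
/^def:/        { def = p(); print id, name, namespace, def }
' OFS="\t" file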
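
Because names and definitions contain spaces, it may be worth checking that every output row really has four tab-separated columns. A minimal sketch; count_cols is a hypothetical helper name, not something from this thread:

```shell
# count_cols (hypothetical name): print each distinct number of
# tab-separated fields seen on stdin, one number per line.
count_cols() {
    awk -F'\t' '{ seen[NF] = 1 } END { for (n in seen) print n }' | sort -n
}

# A row with spaces inside its fields still counts as four columns.
printf 'GO:0000017\talpha-glucoside transport\tbiological_process\t"def text"\n' | count_cols
```

Piping the real result through count_cols should print a single 4 if every row, header included, is well formed; any other number points at a row with a missing or extra tab.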

Appreciating Don Cragun's comments, especially on columns spilling into the following lines: this will do what you asked for (except for fields spilling over), and you can select any fields and their order by modifying the Head parameter:

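awk     'function prline ()     {for (i=1; i<=n; i++) printf "%s\t", Z[S[i]":"]   # one row, fields in Head order
                                 printf "\n"
                                 split ("", Z)}                                   # drop values of the finished block

         NR==1          {gsub (FS, OFS, Head)                   # first line (assumed to be the first [Term]): print the header
                         print Head
                         n=split (Head,S)
                         for (i=1; i<=n; i++) SRCH[S[i]":"]=i   # wanted tags, e.g. "id:"
                         next}

         $1 in SRCH     {Z[$1]=$0; sub (/^[^:]*: /,"", Z[$1])}  # keep the value without its tag

         /^\[Term\]/    {prline ()}
         END            {prline ()}
        ' OFS="\t" Head="name id is_a" file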
name    id    is_a
alpha-glucoside transport    GO:0000017    GO:0042946 ! glucoside transport    
regulation of DNA recombination    GO:0000018    GO:0051052 ! regulation of DNA metabolic process
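
For def lines that spill onto the next line, here is one possible sketch. It rests on an assumption that was never confirmed in this thread: a new field always starts with a tag made of letters, digits and underscores followed by ": ", and any other non-empty line inside a [Term] block continues the previous field. A continuation line that happens to start with something like "Note: " would be misread as a new field. The name obo_to_tsv is hypothetical.

```shell
# obo_to_tsv (hypothetical name): print id, name, namespace and def of every
# [Term] block as one tab-separated row, joining continuation lines.
obo_to_tsv() {
    awk '
    # Print the finished block, if any, then forget its values.
    function emit() {
        if (n) print v["id"], v["name"], v["namespace"], v["def"]
        split("", v); n = 0; tag = ""
    }
    BEGIN { OFS = "\t"; print "id", "name", "namespace", "def" }

    # A stanza header such as [Term] or [Typedef] closes the previous block.
    /^\[[A-Za-z]+\][ \t]*$/ { emit(); interm = ($0 ~ /^\[Term\]/); next }

    # Ignore everything outside [Term] blocks (file header, other stanzas).
    !interm { next }

    # ASSUMPTION: "tag: value" starts a new field.
    /^[A-Za-z_][A-Za-z0-9_]*: / {
        tag = $0; sub(/:.*/, "", tag)
        if (tag in v) { tag = ""; next }       # keep only the first occurrence
        v[tag] = $0; sub(/^[^:]*: */, "", v[tag])
        n++; next
    }

    # ASSUMPTION: any other non-empty line continues the previous field.
    NF && tag != "" { v[tag] = v[tag] " " $0 }

    END { emit() }
    ' "$@"
}

# Usage: obo_to_tsv file > table.tsv
```

Continuation lines are joined with a single space, so a multi-line def ends up in one column. Repeated tags such as a second is_a are ignored, since only the first value of each tag is kept.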