bash: make egrep return a placeholder string when the search pattern is NOT found

Hello all,

After spending hours searching the web, I decided to create an account here. This is my first post, and I hope one of the experts can help.

I need to resolve a grep / sed / xargs / awk problem.

My input file is just like this:

----------------------------------

root@Ubuntu-12:~# cat myfile 
article1
data.........x
colour....blue
number.........15
name...smith
month...................july

article2
colour....yellow
number.........423489
something....x
month...................january

article3
colour....orange
number.........7
name....jason
month...................may
value.....4
much
more
lines
root@Ubuntu-12:~#

----------------------------------

This is the code I currently use (example):

grep "^article[0-9]$" -A5 myfile | while read x ; do echo "$x" | egrep "article|colour|number|name|month" | \
awk -F . '{print $NF}' ; done | xargs -L5 | \
awk 'BEGIN {printf("%15s %15s %15s %15s %15s\n" ,"Article", "Colours", "Numbers", "Names", "Month")} {printf("%15s %15s %15s %15s %15s\n", $1, $2, $3, $4, $5)}'

Unfortunately the output looks like this:

        Article         Colours         Numbers           Names           Month
       article1            blue              15           smith            july
       article2          yellow          423489         january        article3
         orange               7           jason             may                

As we can see, the format is broken because we are egrep'ping for 5 values. This works for "article1", but the "name..." line is missing in "article2". Therefore "article3" ends up as the 5th column of row 2 rather than the 1st column of row 3.

So xargs passes misaligned rows to awk, which shifts the table:

grep "^article[0-9]$" -A5 myfile | while read x ; do echo "$x" | egrep "article|colour|number|name|month" | awk -F . '{print $NF}' ; done | xargs -L5
article1 blue 15 smith july
article2 yellow 423489 january article3
orange 7 jason may
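
The shift can be reproduced in isolation. This is only a minimal sketch with the values typed in by hand: xargs -L5 groups every five input lines and has no notion of which field a line belongs to, so one missing line pulls the next block's first value into the current row. The function name group_by_five is made up for illustration.

```shell
# xargs -L5 joins each run of five input lines into one output line.
# It counts lines only, so a block with four values borrows a fifth
# value from whatever comes next.
group_by_five() {
    xargs -L5
}

# article1 supplies five values; article2 supplies only four.
printf '%s\n' article1 blue 15 smith july \
              article2 yellow 423489 january \
              article3 orange 7 jason may | group_by_five
```

The second output row ends with article3, which is the same misalignment as in the table above.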

------------------------------------

Now the question: is there a way to make egrep, when it searches for 5 strings but finds only 4, substitute a replacement word such as "missing" for the absent one? This would keep xargs -L5 happy, and awk would keep the table format.

Or is there a more efficient way of doing this?

The input text file is just an example for a much larger file with hundreds of thousands of lines.

What is the expected output, based on the data provided?

Ps. The input data looks pretty random to me. Is there a formal file structure? Can you explain it?
There is no way that someone can write code to process the sample input provided - it's full of abstracts and random comments.

This is how I would have approached it:

awk -F . '
    function dump( )
    {
        if( stuff["article"] )
            printf( "%s %s %s %s %s\n", stuff["article"], stuff["colour"], stuff["number"], stuff["name"], stuff["month"] );
        else
            printf( "Article Colours Numbers Names Month\n" );
        delete stuff;
    }

    /^article/ {
        dump( );
        stuff["article"] = $NF;
        next;
    }

    { stuff[$1] = $NF; }

    END { dump(); }

' input-file

EDIT: Crossed with Methyl; I made the assumption that the 'article' could be treated as a division. Of course, if that assumption is wrong it all goes out the window.

The expected output is:

Article         Colours         Numbers           Names           Month
article1        blue            15                smith           july
article2        yellow          423489            --MISSING--     january        
article3        orange          7                 jason           may

The only pattern in the big input file is that the word "article" starts the block of text I am interested in. Then, within the next 5 lines after the word "article", there should be the words colour, number, name and month. But sometimes some of the 4 words I am looking for don't exist.

The ideal solution would create an output table and replace the missing word(s) with a replacement word, just to highlight that it does not exist.

A small tweak to the previously posted script should do what you want:

awk -F . '
    function dump( )
    {
        if( stuff["article"] )
            printf( "%10s %10s %10s %10s %10s\n", stuff["article"], stuff["colour"], stuff["number"], stuff["name"], stuff["month"] );
        else
            printf( "%10s %10s %10s %10s %10s\n", "Article", "Colours", "Numbers", "Names", "Month" );
        stuff["article"] = "";
        stuff["colour"] = stuff["number"] = stuff["name"] = stuff["month"]  = "+MISSING+";
    }
    /^article/ {
        dump( );
        stuff["article"] = $NF;
        next;
    }

    { stuff[$1] = $NF; }

    END { dump(); }

' input-file
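
Note that the script above accepts a key anywhere between one article line and the next, while the original post says the four keys should appear within the 5 lines after the article line. If that window has to be enforced, something like the following might work. It is only a sketch under assumptions: the function name article_table, the awk variables window and placeholder, and the pattern ^article[0-9]+$ (one or more digits) are illustrative choices, not taken from the thread.

```shell
# Sketch: accept colour/number/name/month only within `window` lines
# after an article line; anything missing keeps the placeholder.
# article_table, window and placeholder are illustrative names.
article_table() {
    awk -F . -v window=5 -v placeholder="+MISSING+" '
        function reset() {
            colour = number = name = month = placeholder
        }
        function dump() {
            # Print the collected row, if an article has been seen.
            if (article != "")
                printf fmt, article, colour, number, name, month
        }
        BEGIN {
            fmt = "%-10s %-10s %-10s %-10s %s\n"
            printf fmt, "Article", "Colours", "Numbers", "Names", "Month"
            reset()
        }
        # A new article line flushes the previous row and restarts the window.
        /^article[0-9]+$/ { dump(); reset(); article = $0; start = NR; next }
        # Inside the window, remember the value after the last dot.
        article != "" && NR - start <= window {
            if ($1 == "colour") colour = $NF
            else if ($1 == "number") number = $NF
            else if ($1 == "name") name = $NF
            else if ($1 == "month") month = $NF
        }
        END { dump() }
    ' "$@"
}
```

It would be called as article_table myfile. Since it reads the file once in a single awk process, it should also cope with the large input mentioned in the question.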

The following approach leverages awk's multiline record abilities (assumes each article block is delimited by at least one blank line) and shamelessly pilfers agama's solution. :wink:

BEGIN {
    # Paragraph mode: blank lines separate records, each line is one field
    RS=""; FS="\n"; fmt="%-10s %-10s %-10s %-10s %-10s\n"
    printf fmt, "Article", "Colours", "Numbers", "Names", "Month"
}

{
    # Start every record with placeholders, then overwrite the keys present
    a["colour"] = a["number"] = a["name"] = a["month"] = "+MISSING+"
    for (i=1; i<=NF; i++) {
        split($i, b, /\.+/)
        if (b[1] in a)
            a[b[1]] = b[2]
    }
    printf fmt, $1, a["colour"], a["number"], a["name"], a["month"]
}

Regards,
Alister
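
For anyone unfamiliar with the multiline record feature used above, here is a minimal sketch of just that part. Setting RS to the empty string makes awk treat each blank-line-separated block as one record, and FS="\n" turns every line of the block into a field, so $1 is the article line and NF is the number of lines in the block. The function name list_blocks is made up for this illustration.

```shell
# Sketch: paragraph mode only, without any of the table logic.
# RS="" -> one record per blank-line-separated block
# FS="\n" -> one field per line within the block
list_blocks() {
    awk 'BEGIN { RS = ""; FS = "\n" } { printf "%s has %d lines\n", $1, NF }' "$@"
}
```

Running list_blocks myfile prints one line per article block, which is a quick way to check that the blocks in the real file are in fact separated by blank lines before relying on this approach.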


That's awesome. Thank you so much.

@ agama, there was just one little typo:

is:

printf( "%10s %10s %10s %10s %10s\n", "Article", "Colours", "Numbers", "Names", :Month" );

should be:

printf( "%10s %10s %10s %10s %10s\n", "Article", "Colours", "Numbers", "Names", "Month" );