regular expression matching whole words

Storms · May 25, 2012, 7:09pm

Hi

Consider the file

this is a good line

when running

grep '\b(good|great|excellent)\b' file5

I expect it to match the line, but it doesn't... what am I doing wrong?
(Ultimately this regex will be in an awk script; I'm just using grep to test it.)

Thanks,

Storms

agama · May 25, 2012, 7:19pm

For grep to work with regular expressions you need to enable it (preferred) or use egrep:

grep -E "(good|great|excellent)" filename
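For what it's worth, the word anchors from the original pattern can stay once -E is on. The following is only a sketch: it assumes a grep that accepts \b (a GNU extension, not POSIX), and the temporary file and its three lines are made up for illustration.

```shell
# Throwaway sample file (hypothetical contents, for illustration only).
tmp=$(mktemp)
printf '%s\n' 'this is a good line' 'these are goods' 'nothing here' > "$tmp"

# Without -E the pattern is a BRE: ( | ) are literal characters, so no line matches.
# "|| true" keeps the script going, since grep exits non-zero when nothing matches.
bre_count=$(grep -c '\b(good|great|excellent)\b' "$tmp" || true)

# With -E the pattern is an ERE: ( ) group and | alternates; \b is a GNU extension.
ere_matches=$(grep -E '\b(good|great|excellent)\b' "$tmp")

echo "BRE match count: $bre_count"
echo "ERE matches: $ere_matches"
rm -f "$tmp"
```

With these three sample lines, only "this is a good line" should be reported by the -E form; "these are goods" is skipped because of the trailing \b.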

Storms · May 25, 2012, 7:27pm

Sorry for my denseness, but how can I get it to work in the awk script? The following doesn't seem to match the line:

if ($0 ~ /^.*\b(good|two|three)\b.*$/) { print "match" }

agama · May 25, 2012, 8:25pm

The \b escape pattern doesn't work in my version of awk. I prefer match() to the ~ syntax, but either should work:

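awk '
    {
        # Operator form: ~ tests the line against a regex literal.
        if( $0 ~ /[[:space:]](foo|bar|goo)[[:space:]]/ )
            print $0;

        # Function form: match() is given the same regex as a string here.
        # Both tests are shown together, so a matching line prints twice.
        if( match( $0, "[[:space:]](foo|bar|goo)[[:space:]]" ) )
            print;
    }
'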

Note that the leading ^.* and trailing .*$ are unneeded. The leading space implies that none of these words can be at the beginning of the line, while the trailing space implies that they may not be the last word on the line. If you need either, change it to something like:

if( match( $0, "[[:space:]]*(foo|bar|goo)[[:space:]]*" ) )

to indicate that zero or more space characters may precede/follow the word.

Storms · May 25, 2012, 8:40pm

Thanks for that. After your reply I did some further googling and found that \y works in place of \b in awk. I'm using this to match whole words...

so it matches good, but not goodd:

if (match($0, /\y(good|excellent|three)\y/)) { print "match", $0 }

Scrutinizer · May 26, 2012, 1:34am

grep works with regular expressions (BRE) by default. Did you mean extended regular expressions (ERE), which support alternation (|) and are enabled with the -E switch?

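Regarding "zero or more space characters may precede/follow the word": that will not fly, since "may" allows too much liberty. With zero spaces required on either side, a word like "goods" would match too. And what about punctuation? What constitutes a word?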
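The objection is easy to check from the shell. Below is a rough sketch, not anything posted in the thread: the names loose, strict and check are made up for illustration, and the "strict" pattern (a space or a line edge on each side) is just one possible tightening that still ignores punctuation.

```shell
# Two candidate patterns (hypothetical variable names, for illustration only).
loose='[[:space:]]*(good|great|excellent)[[:space:]]*'          # zero or more spaces
strict='(^|[[:space:]])(good|great|excellent)([[:space:]]|$)'   # space or line edge

# check PATTERN TEXT: print "match" or "no match" for one line of text.
check() {
    printf '%s\n' "$2" |
        awk -v re="$1" '{ print (($0 ~ re) ? "match" : "no match") }'
}

check "$loose"  'a box of goods'          # substring hit, not a whole word
check "$strict" 'a box of goods'
check "$strict" 'good'                    # word sits at both line edges
check "$strict" 'that was good, thanks'   # comma after the word
```

The first and third calls should report a match and the other two should not. The last call is the punctuation problem in action: the comma after "good" is neither a space nor the end of the line.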

Storms wrote:

Thanks for that. After your reply I did some further googling and found that \y works in place of \b in awk. I'm using this to match whole words...

so it matches good, but not goodd:

if (match($0, /\y(good|excellent|three)\y/)) { print "match", $0 }

\y is a GNU extension and will not work across awks. An alternative would be to use \< and \> instead:

gawk '/\<(good|excellent|three)\>/{ print "match", $0 }'

But this isn't universal either.
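One way to get close to \y without GNU extensions is to spell the boundary out by hand: "start of line or a non-word character" before the word, and "a non-word character or end of line" after it. A minimal sketch, assuming an awk that understands POSIX character classes and assuming a word character means a letter, digit or underscore; the function name whole_word_lines and the sample input are made up for illustration.

```shell
# whole_word_lines: print input lines that contain good, great or excellent
# as a whole word. The bracket expressions stand in for the word boundary.
whole_word_lines() {
    awk '/(^|[^[:alnum:]_])(good|great|excellent)([^[:alnum:]_]|$)/'
}

printf '%s\n' 'this is a good line' 'goods arrived' 'Great? no, great!' 'so_good_it_hurts' |
    whole_word_lines
```

With this input, the first and third lines should print. "goods arrived" fails on the trailing s, and "so_good_it_hurts" fails because the underscore counts as part of the word here. Unlike \y, this pattern consumes the neighbouring characters, which matters if the match position or length is used afterwards.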

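A universal awk approach would be something like this, I guess (split each line on runs of whitespace and punctuation, then compare every field against the word list):

awk -F'[[:space:][:punct:]]+' '{for(i=1;i<=NF;i++)if($i~/^(good|great|excellent)$/){print; next}}'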

A special case would perhaps need to be made for the underscore character...
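The underscore case could be handled by turning the separator around: instead of listing what separates words, list what belongs to a word and split on everything else. This is only a sketch under that assumption (letters, digits and underscore are word characters); the function name match_fields and the sample lines are made up for illustration.

```shell
# match_fields: print input lines where some field is exactly one of the words.
# Fields are split on runs of characters that are neither alphanumeric nor "_",
# so "good_times" stays a single field and does not count as "good".
match_fields() {
    awk -F'[^[:alnum:]_]+' '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^(good|great|excellent)$/) { print; next }
    }'
}

printf '%s\n' 'a good, solid plan' 'good_times ahead' 'excellent!' | match_fields
```

Here the first and last lines should print, and "good_times ahead" should not.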