grep fixed string with regex

teresaejunior · November 8, 2010, 4:24am

Hello, all! Maybe the title is badly formulated, you can help me with that...!

I'm using the GNU grep, and I need to make sure that grep will extract only what I tell it to.

I have the following regular expression: [a-z_][a-z0-9_-]*[$]?

Well, I need to make sure I grep only a word which may start with a lowercase letter or underline, the following characters may contain numbers and dashes as well, and the last character can be a dollar sign.

Would you help me with this command?

$ echo "u_s9Ae-u" | grep "[a-z_][a-z0-9_-]*[$]\?" # should not grep "A"
u_s9Ae-u

$ echo "u_s9Ae-u" | grep -x "[a-z_][a-z0-9_-]*[$]\?"
u_s9Ae-u

$ echo "u_s9Ae-u" | grep -o "[a-z_][a-z0-9_-]*[$]\?"
u_s9
e-u

Tried many different commands with no success...

Best regards!
Teresa and Junior

bakunin · November 8, 2010, 4:43am

It usually pays to simply re-read the definition you start with when constructing regexps. In this case:

1.) only a word
If we use the old IBM definition of a "word" (a word is a sequence of non-blank characters, separated by blanks), we end up with something like:

grep '[<b><tab>]*[^<b><tab>][^<b><tab>]*[<b><tab>]*'

We search for optional blank/tab characters, followed by one or more non-blanks/non-tabs, followed by optional tabs/blanks.

2.) which may start with a lowercase letter or underline,
Ok, we fine-tune our definition of the word:

grep '[<b><tab>]*[^_a-z][^<b><tab>]*[<b><tab>]*'

We search for optional blank/tab characters, followed by one underline or lowercase letter, then more non-blanks/non-tabs, followed by optional tabs/blanks.

3.) the following characters may contain numbers and dashes as well,
More fine-tuning on what we mean by "word" here; I think it is self-explanatory now:

grep '[<b><tab>]*[^_a-z][-_a-z0-9][-_a-z0-9]*[<b><tab>]*'

4.) and the last character can be a dollar sign.
still more fine-tuning:

grep '[<b><tab>]*[^_a-z][-_a-z0-9][-_a-z0-9]*\$*[<b><tab>]*'

We could probably drop the trailing "[<b><tab>]*" now, because it might be superfluous; you will have to decide that by running the regexp against your data. Replace "<b>" and "<tab>" with literal blanks/tabs when you enter the code. I only used these placeholders to make them visible.
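
Typing a literal tab inside quotes can be awkward, so here is a minimal sketch of two ways around it. The variable names and the sample input are made up for illustration: the first variant builds the tab with printf, the second uses the POSIX character class [[:blank:]], which stands for a blank or a tab.

```shell
# Build a real tab character with printf so it can be placed in the pattern.
tab=$(printf '\t')

# Two sample lines: one with a tab-separated word pair, one with no blank at all.
input=$(printf 'foo\tbar\nfoobar')

# Variant 1: literal blank and tab inside the bracket expression.
with_literal=$(printf '%s\n' "$input" | grep -c "[ ${tab}]")

# Variant 2: the POSIX character class [[:blank:]] (blank or tab).
with_class=$(printf '%s\n' "$input" | grep -c '[[:blank:]]')

# Both counts refer to the number of lines containing a blank or tab.
echo "literal: $with_literal, class: $with_class"
```

Either form can be dropped in wherever "[<b><tab>]" appears in the patterns above.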

I hope this helps.

bakunin

teresaejunior · November 8, 2010, 4:56am

Thank you, bakunin! But it still greps the "A", or I'm doing something wrong... The idea is: we prompt the user for a string, and then we check if it matches the criteria. So the echo thing is actually used, and the pipe later:

$ echo " u_sA9e-u " | grep '[ ]*[^_a-z][-_a-z0-9]*\$*[ ]*'
 u_sA9e-u 

$ echo "u_sA9e-u" | grep '[ \t]*[^_a-z][-_a-z0-9]*\$*[ \t]*'
u_sA9e-u

$ echo "u_sA9e-u" | grep -w '[ \t]*[^_a-z][-_a-z0-9]*\$*[ \t]*'

$ echo "u_sA9e-u" | grep -x '[ \t]*[^_a-z][-_a-z0-9]*\$*[ \t]*'

$ echo "u_sA9e-u" | grep -o '[ \t]*[^_a-z][-_a-z0-9]*\$*[ \t]*'
A9e-u

And I tried a bunch of different commands again with no luck... Would you try it there?

Best regards!
Teresa e Junior

Scrutinizer · November 8, 2010, 5:02am

@Bakunin, we should also take into account words at the start (^) or the end ($) of the line. Using [<b><tab>]* with no further anchors means that it may match part of a word too, since we are allowing the number of occurrences on both sides to be zero.

So I think we need something like this:

grep -E '([[:space:]]|^)[a-z_][a-z0-9_-]*[$]?([[:space:]]|$)'

We cannot use GNU word boundaries (\b) here, since dashes are part of the allowed character set.
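
A small sketch of the \b problem, using a made-up sample line. With GNU grep, \b treats the position between a letter and a dash as a word boundary, so a search for "foo" also hits the dash-joined word "foo-bar", whereas explicit whitespace or line anchors do not.

```shell
# "foo" occurs only as part of the dash-joined word "foo-bar".
line='foo-bar baz'

# \b sees a boundary between "o" and "-", so the line is counted as a match.
# (grep exits with status 1 when nothing matches, hence the "|| true".)
with_b=$(printf '%s\n' "$line" | grep -c '\bfoo\b') || true

# Explicit delimiters (whitespace or start/end of line) reject the partial word.
with_anchors=$(printf '%s\n' "$line" | grep -cE '([[:space:]]|^)foo([[:space:]]|$)') || true

echo "word boundary: $with_b, explicit anchors: $with_anchors"
```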

teresaejunior · November 8, 2010, 5:06am

$ echo "u_sA9e-u" | grep -E "([[:space:]]|^)[a-z_][a-z0-9_-]*[$]?([[:space:]]|$)"
u_sA9e-u

Any ideas?

Scrutinizer · November 8, 2010, 5:15am

Strange, when I do this I get:

$ echo "u_sA9e-u" | grep -E "([[:space:]]|^)[a-z_][a-z0-9_-]*[$]?([[:space:]]|$)"
$ echo "u_sx9e-u" | grep -E "([[:space:]]|^)[a-z_][a-z0-9_-]*[$]?([[:space:]]|$)"
u_sx9e-u

Are you sure you are using GNU grep?

teresaejunior · November 8, 2010, 5:30am

Hello!

I have the following alias: alias grep='grep --color=auto'. The only difference I see between the two commands below is that the output of the one with the 'x' is colored, while the other is printed in black and white. To bypass the alias I tried "\grep", but the result doesn't change.

$ echo "u_sA9e-u" | grep -E "([[:space:]]|^)[a-z_][a-z0-9_-]*[$]?([[:space:]]|$)"
u_sA9e-u
$ echo "u_sx9e-u" | grep -E "([[:space:]]|^)[a-z_][a-z0-9_-]*[$]?([[:space:]]|$)"
u_sx9e-u

$ apt-cache show grep
Description: GNU grep, egrep and fgrep
 'grep' is a utility to search for text in files; it can be used from the
 command line or in scripts.  Even if you don't want to use it, other packages
 on your system probably will.
 .
 The GNU family of grep utilities may be the "fastest grep in the west".
 GNU grep is based on a fast lazy-state deterministic matcher (about
 twice as fast as stock Unix egrep) hybridized with a Boyer-Moore-Gosper
 search for a fixed string that eliminates impossible text from being
 considered by the full regexp matcher without necessarily having to
 look at every character. The result is typically many times faster
 than Unix grep or egrep. (Regular expressions containing backreferencing
 will run more slowly, however.)

Though I noticed the following behavior:

$ var=Aa90
$ echo ${var//[a-z]/}
90
$ echo ${var//['a-z']/}
a90

---------- Post updated at 08:30 AM ---------- Previous update was at 08:26 AM ----------

$ bash --posix
bash-4.1$ echo "u_sA9e-u" | grep -E "([[:space:]]|^)[a-z_][a-z0-9_-]*[$]([[:space:]]|$)"
u_sA9e-u
bash-4.1$
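
One possible explanation for the range behavior above, not confirmed anywhere in this thread, is locale collation: in some locales a range such as [a-z] is evaluated in collation order and may include uppercase letters. A minimal sketch, assuming that is the cause, which forces the C locale for the grep call only. In the C locale [a-z] covers just the ASCII lowercase letters.

```shell
# Force the C locale for this one grep call, so that [a-z] means exactly
# the ASCII lowercase letters regardless of the session's locale settings.
# (grep exits with status 1 when nothing matches, hence the "|| true".)
result=$(echo "u_sA9e-u" |
  LC_ALL=C grep -E '([[:space:]]|^)[a-z_][a-z0-9_-]*[$]?([[:space:]]|$)') || true

# An empty result means the string containing "A" was rejected.
echo "match: [${result}]"

# The same range applied to the test value from the post: in the C locale
# only the lowercase "a" is picked out of "Aa90".
lower=$(printf 'Aa90\n' | LC_ALL=C grep -o '[a-z]' | tr -d '\n')
echo "lowercase characters found: $lower"
```

If the output changes once LC_ALL=C is set, the locale is the likely culprit; if it does not, the cause lies elsewhere.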

Scrutinizer · November 8, 2010, 6:15am

Try:

echo "u_sA9e-u" | grep -E '([[:space:]]|^)[a-z_][a-z0-9_-]*[$]?([[:space:]]|$)'

teresaejunior · November 8, 2010, 6:20am

Thank you!

$ echo "u_sA9e-u" | grep -oE '([[:space:]]|^)[a-z_][a-z0-9_-]*[$]?([[:space:]]|$)'
$ echo "u_s9e-u" | grep -oE '([[:space:]]|^)[a-z_][a-z0-9_-]*[$]?([[:space:]]|$)'
u_s9e-u
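
Since the goal stated earlier in the thread is to check a single string entered by the user, a whole-line match may be enough. The following is only a sketch: is_valid_name is a hypothetical helper name, the candidate strings are made up, and LC_ALL=C is an assumption added to keep [a-z] restricted to ASCII lowercase letters.

```shell
# Hypothetical helper: succeeds only if the whole string fits the pattern.
is_valid_name() {
  # -x: the pattern must match the entire line; -q: no output, exit status only.
  # LC_ALL=C (an assumption) keeps [a-z] limited to ASCII lowercase letters.
  printf '%s\n' "$1" | LC_ALL=C grep -qxE '[a-z_][a-z0-9_-]*[$]?'
}

# Try a few made-up candidates and report the verdict for each.
for candidate in 'u_s9e-u' 'u_sA9e-u' 'machine$' '9lives'; do
  if is_valid_name "$candidate"; then
    echo "accepted: $candidate"
  else
    echo "rejected: $candidate"
  fi
done
```

Because -x anchors the pattern at both ends of the line, no whitespace alternation is needed in this case.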

bakunin · November 9, 2010, 4:14am

teresaejunior:

Thank you, bakunin! But it still greps the "A", or I'm doing something wrong...
$ echo " u_sA9e-u " | grep '[ ]*[^_a-z][-_a-z0-9]*\$*[ ]*'
 u_sA9e-u 
And I tried a bunch of different commands again with no luck... Would you try it there

Sorry, my mistake. Somehow a superfluous caret slipped in. Instead of

grep '[ ]*[^_a-z][-_a-z0-9]*\$*[ ]*'

it should read:

grep '[ ]*[_a-z][-_a-z0-9]*\$*[ ]*'

Inside a bracket expression the caret means a logical NOT, so the meaning was in fact inverted ("[ab]" finds all "a"s and "b"s, "[^ab]" finds anything except these).
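
A quick way to see the difference between the two bracket forms, as a sketch on a made-up four-letter input. grep -o prints every match on its own line, and tr joins the matches back together for display.

```shell
# [ab] matches any single "a" or "b".
plain=$(echo "abcd" | grep -o '[ab]' | tr -d '\n')

# [^ab] matches any single character that is neither "a" nor "b".
negated=$(echo "abcd" | grep -o '[^ab]' | tr -d '\n')

echo "[ab] picks: $plain"
echo "[^ab] picks: $negated"
```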

Of course Scrutinizer's comment about words at the beginning or end of a line is correct. I didn't intend to give a complete solution, just a hint on how to construct such regexps for yourself. See also below.

In the case of your echo statement you already made sure there is a leading and a trailing blank. Therefore you should make the word delimiters in the regexp mandatory (that is, drop the "*" after them), which makes it work:

echo " u_sA9e-u " | grep '[ ][_a-z][-_a-z0-9][-_a-z0-9]*\$*[ ]'

Scrutinizer's solution is correct, but it makes use of the extended regular expression syntax of GNU grep. This may or may not be a problem in your case. Standard POSIX greps won't understand the pipe symbol ("|") as logical OR. If this doesn't matter in your case, you should go for this solution, as it is the most comprehensive one possible.

If you need to port the solution to different Unixes, you won't want to rely on this extended regular expression syntax, because some "grep"s only support the standard (POSIX) features. In that case the solution would be to grep for words at the beginning/end of a line separately. The difference is that the anchoring leading/trailing spaces are replaced by the beginning-of-line ("^") and end-of-line ("$") symbols.

search in the middle of the line:

grep '[ ][_a-z][-_a-z0-9][-_a-z0-9]*\$*[ ]'

search the beginning of the line:

grep '^[_a-z][-_a-z0-9][-_a-z0-9]*\$*[ ]'

search the end of the line:

grep '[ ][_a-z][-_a-z0-9][-_a-z0-9]*\$*$'
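
The three searches above can also be run in a single pass, because grep accepts several -e options and selects a line if any of the patterns matches. This is a sketch only: match_word is a hypothetical helper name, the fourth pattern (a word alone on its line) and LC_ALL=C are additions not found in the patterns above, and the sample input is made up.

```shell
# The word pattern from the three commands above, kept in one place.
word='[_a-z][-_a-z0-9][-_a-z0-9]*\$*'

# Hypothetical helper: reads lines on stdin, prints those containing a word.
match_word() {
  # Each -e adds one basic regular expression; any match selects the line.
  # LC_ALL=C is an assumption, added to keep [a-z] to ASCII lowercase letters.
  LC_ALL=C grep -e "[ ]${word}[ ]" \
                -e "^${word}[ ]" \
                -e "[ ]${word}\$" \
                -e "^${word}\$"   # assumption: word alone on the line
}

# Made-up input: the second and fourth lines contain no valid word.
selected=$(printf 'ab cd\nu_sA9e-u\nFoo bar\nFOO BAR\n' | match_word)
printf '%s\n' "$selected"
```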

I hope this helps.

bakunin

Scrutinizer · November 9, 2010, 4:28am

Hi Bakunin,

The OP stated at the beginning that she was using GNU grep. However, please note that I suggested a general egrep solution. The POSIX specification states that egrep is deprecated and that grep -E should be used instead (see grep: rationale). As such, grep -E uses EREs and certainly understands alternation, so IMO it should work with any POSIX-compliant grep.

S.

bakunin · November 9, 2010, 4:43am

I had overlooked the OP's requirement, but I see it now that you mention it. The information about EREs in POSIX was news to me, thanks for the info. I stand corrected.

bakunin