Question about REGEX Patterns and Case Sensitivity?

mrm5102 · October 15, 2012, 4:18pm

Hello All,

I'm in the middle of a script and I'm doing some checks with REGEX (i.e. using the '[[' ).

I'm wondering if this example is correct or if it's just a coincidence. I thought that if I did not use "shopt -s nocasematch",
at least the first one should print "FALSE", but it prints "TRUE"..?

For Example:

#!/bin/bash

MY_VAR="HELLO"

### This prints "TRUE"
PATTERN_1="^[a-z]*"
if [[ $MY_VAR =~ $PATTERN_1 ]]
 then
    echo "TRUE"
else
    echo "FALSE"
fi

echo "-------------------------"

### This prints "FALSE"
PATTERN_2="^[A-z]*"
if [[ $MY_VAR =~ $PATTERN_2 ]]
 then
    echo "TRUE"
else
    echo "FALSE"
fi

echo "-------------------------"

### This prints "TRUE"
PATTERN_3="[a-Z]*"
if [[ $MY_VAR =~ $PATTERN_3 ]]
 then
    echo "TRUE"
else
    echo "FALSE"
fi

The OUTPUT:

TRUE
-------------------------
FALSE
-------------------------
TRUE

I remember being told before that the pattern "[A-z]" is NOT the same as doing "[A-Za-z]" like it would be in Perl...
So I'm wondering why the pattern "[a-Z]", which is the last if statement in the code above, returns "TRUE", when
the 2nd if statement above "[A-z]" returns "FALSE"...?

I tried changing the Variable "$MY_VAR" from all upper case to all lowercase, but I still get the same output...
And lastly, if I include the "shopt -s nocasematch" they all return "TRUE"...

If anyone has any thoughts/suggestions that would be great!

FYI:
Bash Version: 4.1.10

Thanks in Advance,
Matt

spacebar · October 15, 2012, 5:25pm

I tested your code in bash version 4.1.10(4), and with the shell option nocasematch either set or not set (checked with shopt -p) the first test prints 'TRUE'. The reason, at least the way I understand it, is that '*' means 0 or more matches, so the pattern can match even when no lowercase letters are present.
Anyway, I would recommend using one of the POSIX character classes:

[[:alpha:]] matches alphabetic characters. In the C/POSIX locale this is equivalent to [A-Za-z].
[[:lower:]] matches lowercase alphabetic characters. In the C/POSIX locale this is equivalent to [a-z].
[[:upper:]] matches uppercase alphabetic characters. In the C/POSIX locale this is equivalent to [A-Z].
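
To make the "0 or more matches" point concrete, here is a minimal sketch that prints what the regex actually matched. The helper name show_match is made up for illustration, and LC_ALL=C is an assumption added so that [a-z] means the plain ASCII range:

```shell
#!/bin/bash

# Assumption: force the C locale so [a-z] is the ASCII range a..z.
LC_ALL=C

# show_match is a hypothetical helper: it reports what a regex matched.
show_match() {
    local input=$1 pattern=$2
    if [[ $input =~ $pattern ]]; then
        # BASH_REMATCH[0] holds the part of the input that the regex matched.
        echo "TRUE (matched: '${BASH_REMATCH[0]}')"
    else
        echo "FALSE"
    fi
}

show_match "HELLO" '^[a-z]*'   # TRUE (matched: ''): '*' accepts zero letters
show_match "HELLO" '^[a-z]+'   # FALSE: '+' needs at least one lowercase letter
show_match "hello" '^[a-z]+'   # TRUE (matched: 'hello')
```

The first call succeeds with an empty match, which is why the original test prints "TRUE" for "HELLO". Swapping '*' for '+' is one way to require at least one lowercase letter at the start.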

Don_Cragun · October 15, 2012, 9:34pm

Assuming you're running on a system with a code set based on ASCII (i.e., not an IBM or Amdahl [if you remember them] mainframe), then [a-z] is a range expression that matches the 26 lowercase alphabetic characters; [A-z] is a range expression that matches the 52 uppercase and lowercase alphabetic characters plus the [ , \ , ] , ^ , _ , and ` characters, which sit between Z and a in ASCII; and [a-Z] is a range expression that is either treated as an error or as a request to match the empty set (depending on your implementation) because a follows Z in ASCII.
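
A quick way to check the [A-z] claim is to test single characters against the range. This is only a sketch: in_range is a made-up helper, and LC_ALL=C is an assumption so that ranges follow ASCII code order (other locales may order ranges differently). [a-Z] is left out on purpose because its behavior depends on the implementation.

```shell
#!/bin/bash

# Assumption: force the C locale so ranges follow ASCII code order.
LC_ALL=C

# in_range is a hypothetical helper: it succeeds when the single
# character in $1 matches the bracket expression in $2.
in_range() {
    local char=$1 re="^${2}\$"
    [[ $char =~ $re ]]
}

for c in 'Z' '_' '^' 'a' '5'; do
    if in_range "$c" '[A-z]'; then
        echo "'$c' is in [A-z]"
    else
        echo "'$c' is not in [A-z]"
    fi
done
```

In the C locale this reports 'Z', '_', '^' and 'a' as inside the range and '5' as outside, so [A-z] is wider than [A-Za-z].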

mrm5102 · October 16, 2012, 10:02am

Hey Spacebar, thanks for the reply...

Sorry, I probably should have mentioned what I'm trying to do. Duhh, sorry about that...
Basically, I'm trying to "verify" some user input in the script. The user should enter some text. Then I check that text in the script to
make sure that the user's input "BEGINS" with an ALL lowercase string. I'll give the "[:lower:]" Character Class a try.
Maybe that will work...

Hey Don Cragun, thanks for your reply.

Is this the info you're talking about, for what character encoding I'm using..? For the second one below, I ran the "file" command
on one of my 'test' scripts to see what its encoding was...

# echo $LANG
en_US.utf8

# file -bi test_bashScript
text/x-shellscript; charset=us-ascii

Also, you're saying the "[A-z]" range should work? I thought that every time I tried using it, no matter the input,
it would return the same value, "true" or "False", I forget exactly which. But I do remember that it always had the same
result every time...

Basically, I just want to make sure that the entire "first" string that the user enters is in all lowercase...

And I'm just VERY confused why the following test (below) returns TRUE when the input string is "HELLO" (all uppercase)...??

#!/bin/bash

MY_VAR="HELLO"

### This pattern SHOULD match a string that begins with ONLY "lowercase letters", zero or more times...
PATTERN_1="^[a-z]*"

### This prints "TRUE"
if [[ $MY_VAR =~ $PATTERN_1 ]]
 then
    echo "TRUE"
else
    echo "FALSE"
fi

Any idea why I'm getting "TRUE" when the input is ALL uppercase letters..?

Thanks Again,
Matt

elixir_sinari · October 16, 2012, 10:17am

I think shell patterns are anchored by default.
Try with:

PATTERN_1="[[:lower:]]*"

and

if [[ $MY_VAR == $PATTERN_1 ]]
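
For comparison, a rough sketch of how that glob test behaves next to a regex that insists the whole first word is lowercase. The function names and the regex are assumptions for illustration, not something posted in this thread:

```shell
#!/bin/bash

# Glob test as suggested above: with ==, the pattern must match the whole
# string, and '*' in a glob means "any characters", not "repeat the class".
starts_lower_glob() {
    local pattern='[[:lower:]]*'
    [[ $1 == $pattern ]]
}

# Hypothetical regex alternative: one or more lowercase letters at the
# start, followed by whitespace or the end of the string.
first_word_lower() {
    local re='^[[:lower:]]+([[:space:]]|$)'
    [[ $1 =~ $re ]]
}

# check is a made-up helper that prints both results for one input.
check() {
    local input=$1 glob=no regex=no
    starts_lower_glob "$input" && glob=yes
    first_word_lower "$input" && regex=yes
    echo "$input: glob=$glob regex=$regex"
}

check "HELLO"         # HELLO: glob=no regex=no
check "hELLO"         # hELLO: glob=yes regex=no
check "hello world"   # hello world: glob=yes regex=yes
```

The glob only checks that the first character is lowercase, so "hELLO" passes it. The regex variant rejects that input because every character of the first word has to be lowercase.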

mrm5102 · October 16, 2012, 11:09am

Hey elixir_sinari, thanks for the reply...

I think the reason I couldn't get that "[:lower:]" character class to work was because I didn't enclose it in another set of square
brackets... Seems to work to a degree..

I'm just still baffled why the pattern "[a-z]*" matches the string "HELLO" when they are ALL uppercase....

Anyway, thanks for the suggestion...

Thanks Again,
Matt