Match string against character class in bash

urello · March 7, 2014, 5:01am

Hello,
I want to check whether a string contains only numeric characters. The following code doesn't work for me:

#!/usr/local/bin/bash
if [[ $1 =~ [:digit:] ]]; then
echo "true"
else
echo "False"
fi

[root@freegtw /data/termit]# ./yyy '346'
False
[root@freegtw /data/termit]# ./yyy 'aaa'
False

I'm looking for a solution that uses character classes, not regex. Thanks in advance.

balajesuri · March 7, 2014, 5:13am

REGEX="^[[:digit:]]*$"
if [[ $1 =~ $REGEX ]]
then
    echo yes
else
    echo no
fi

OR for portability:

if printf '%s\n' "$1" | grep -q "^[0-9]*$"
then
    echo yes
else
    echo no
fi

SriniShoo · March 7, 2014, 5:41am

grep -q "^[0-9][0-9]*$"

awk -v var="${1}" 'BEGIN {if(var * 1 == var) {print "numeric"} else {print "non-numeric"}}'

---------- Post updated at 06:41 AM ---------- Previous update was at 06:14 AM ----------

Hi Balaji, I guess this will fail if ${1} is empty (''): since * also matches zero digits, the empty string is reported as a match.

urello · March 7, 2014, 7:07am

Thanks a lot. It works for me

balajesuri · March 7, 2014, 8:21am

True, there are no validation checks. It was just meant for the OP to get started.

alister · March 7, 2014, 5:24pm

The simplest solution is to check for the presence of a character which does not match the class in question. In other words, negate the class: [^[:digit:]].
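A minimal sketch of that idea, using a shell pattern in a case statement rather than a regex. The function name is_all_digits is made up for illustration, and the explicit empty-string branch is my own addition (a negated class alone cannot reject an empty string, since there is no character to object to). Note that in shell patterns the portable negation is written [!...] rather than [^...].

```shell
# is_all_digits is a hypothetical helper name, not something from this thread.
# It succeeds only if $1 is non-empty and contains no character outside
# the [:digit:] class.
is_all_digits() {
    case $1 in
        '' | *[![:digit:]]*) return 1 ;;  # empty, or at least one non-digit
        *) return 0 ;;
    esac
}

if is_all_digits "$1"; then
    echo "true"
else
    echo "False"
fi
```

Saved as the script from the first post, this should report true for '346' and False for 'aaa'.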

Regards,
Alister

---------- Post updated at 05:24 PM ---------- Previous update was at 11:56 AM ----------

The range expression [0-9] is only defined in the C/POSIX locale. If the solution only needs to function in that locale, it's still a good idea to set it explicitly in the command's environment, e.g. LC_COLLATE=C grep ... Alternatively, you can leave the locale unspecified and explicitly enumerate each digit, for a solution that is portable across locales: [0123456789]. If the digits do not need to be so rigidly defined, then it's simplest to use the character class, [[:digit:]].
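A rough sketch of the first two options, for anyone who wants to compare them. Both function names are invented for this example. I have used LC_ALL=C instead of LC_COLLATE=C because LC_ALL, when it is already set in the caller's environment, overrides LC_COLLATE. Both helpers assume the string contains no embedded newline, since grep works line by line.

```shell
# Hypothetical helper names; each takes the candidate string as $1.

# Option 1: pin the locale so that [0-9] has its C/POSIX meaning.
is_digits_c_locale() {
    printf '%s\n' "$1" | LC_ALL=C grep -q '^[0-9][0-9]*$'
}

# Option 2: leave the locale alone and enumerate the digits explicitly.
is_digits_enumerated() {
    printf '%s\n' "$1" | grep -q '^[0123456789][0123456789]*$'
}

for s in 346 aaa ''; do
    if is_digits_c_locale "$s" && is_digits_enumerated "$s"; then
        printf '"%s": yes\n' "$s"
    else
        printf '"%s": no\n' "$s"
    fi
done
```

Repeating the bracket expression before the * is what rejects the empty string.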

The awk one-liner suggested above is, in my opinion, a terrible solution because it depends on a great deal of subtle behavior and because it mistakenly assumes that -v can assign arbitrary text. Even an expert AWK hacker probably cannot say with certainty how it will behave across implementations.

There are always some ambiguities in the standards and there are always some disparities between implementations. Your awk one-liner, unfortunately, resides in those grey areas.

One thing that the standard is clear on is that the right side of a command line assignment, value in name=value, is parsed as a string token.

POSIX states that a -v option argument, name=value in -v name=value, must take the form of an assignment operand, but says nothing about its behavior, aside from when it takes effect (before even a BEGIN section). It seems reasonable to assume that implementors will treat them as string tokens as well.

Parsing AWK string tokens involves escape sequence processing.

In short, there is no way to naively pass arbitrary text into awk using command line assignments (with or without the -v option).

For more details, refer to the OPTIONS and OPERANDS sections near the beginning of the POSIX AWK man page.
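To make the escape processing visible, here is a small sketch that measures the length of the same text after it reaches awk by two routes. The helper names are invented for the example, and the ENVIRON route is my own choice of contrast, not something proposed earlier in the thread; it is shown only because environment values are not parsed as string tokens.

```shell
s='123\t'   # five characters: 1, 2, 3, backslash, t

# Hypothetical helper: hand the text to awk with -v. The value is parsed
# as a string token, so backslash-t collapses into a single tab.
len_via_v() {
    awk -v var="$1" 'BEGIN { print length(var) }'
}

# Hypothetical helper: hand the text to awk through the environment.
# ENVIRON values are taken as they are, with no escape processing.
len_via_environ() {
    var="$1" awk 'BEGIN { print length(ENVIRON["var"]) }'
}

len_via_v "$s"        # prints 4
len_via_environ "$s"  # prints 5
```

The one-character difference is the backslash that the -v route consumed.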

The following script feeds three strings to your awk code. None of those strings is numeric -- each one contains a backslash and a letter -- yet your code will return "numeric" in most cases.

In the following, original-awk is nawk.

isnumeric.sh:

for x in '123\f' '123\t' '123\n'; do
	printf '\nTesting %s ...\n' "$x"
	for awk in gawk mawk original-awk; do
		printf '%s: ' "$awk"
		"$awk" -v var="$x" 'BEGIN {if (var * 1 == var) {print "numeric"} else {print "non-numeric"}}'
	done
done

Produces:

$ sh isnumeric.sh

Testing 123\f ...
gawk: numeric
mawk: numeric
original-awk: non-numeric

Testing 123\t ...
gawk: numeric
mawk: numeric
original-awk: numeric

Testing 123\n ...
gawk: numeric
mawk: non-numeric
original-awk: numeric

The above should make it clear that your awk suggestion cannot handle arbitrary text. Note that not only do the implementations disagree, but that they do so inconsistently.

The results are also locale dependent, because converting text to a number involves stripping leading/trailing blanks, and membership in the blank class is locale dependent.

In the C/POSIX locale, of \f, \t, and \n, only \t is a member of [[:blank:]]. The correct result should be: 123\f => non-numeric, 123\t => numeric, 123\n => non-numeric. In my testing, gawk was worst with 1 of 3 correct. mawk and nawk tied with 2 of 3 correct.

If you wanted to use AWK for this, I would recommend reading the text on stdin instead of from the command line. I would also recommend using a regular expression match operation instead of multiple implicit type conversions.
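One way that recommendation might look, as a sketch only. The function name is_numeric_awk is made up, and "numeric" is taken here to mean what the original question asked for: a non-empty string of decimal digits and nothing else. The digits are enumerated rather than written as [[:digit:]] because, as far as I know, some older mawk releases do not understand POSIX character classes.

```shell
# is_numeric_awk is a hypothetical helper name.
# The text travels on stdin, so awk never parses it as a string token.
is_numeric_awk() {
    printf '%s\n' "$1" | awk '
        # Flag any record that is not one or more decimal digits.
        $0 !~ /^[0123456789]+$/ { bad = 1 }
        # Succeed only if there was exactly one record and it was clean;
        # more than one record means the text had an embedded newline.
        END { exit !(NR == 1 && !bad) }
    '
}

for x in 346 aaa '' '123\f' '123\t' '123\n'; do
    if is_numeric_awk "$x"; then
        printf '%s: numeric\n' "$x"
    else
        printf '%s: non-numeric\n' "$x"
    fi
done
```

Only 346 should come back as numeric. The three backslash strings are rejected because the backslash and letter arrive in awk untouched.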

Unrelated tangent: For a reason that I cannot fathom, Ubuntu 12.04 LTS installs nawk as /usr/bin/original-awk while /usr/bin/nawk is left as a symlink to /usr/bin/gawk (via /etc/alternatives/nawk). Before installing gawk, nawk pointed to /usr/bin/mawk (again, via /etc/alternatives/nawk). If that's normal, I'm at a loss for words. I hope, for the sake of Ubuntu userland sanity, that this is just an aberration confined to this particular install.

Regards,
Alister