Bash regex

kerloi · April 15, 2011, 10:06am

Hello everybody,
I'm clearly not an expert in bash scripting, as I've written maybe fewer than 10 scripts in my life. I'm trying to strip an XML string by removing every tag in it. I'm using bash substitution to do so, but apparently I've misunderstood what counts as a regex in bash ...

As an example, my input is:
VAR='<value key="Qt4ProjectManager.Qt4BuildConfiguration.BuildDirectory" type="QString">/home/share/path/to/build/directory</value>'

I use the command:
echo ${VAR#<[^>]*>}

I thought it was supposed to remove the shortest match of a substring starting with < and ending with >. But the output is the exact input string ...

The regexes I know are the ones I use with flex, so maybe they are not the same.

PS: The desired output for the example is '/home/share/path/to/build/directory'

panyam · April 15, 2011, 10:18am

 
echo "$VAR" | sed 's!<value.*>\(.*\)</value>.*!\1!'

kurumi · April 15, 2011, 10:21am

Ruby(1.9+)

$ echo $VAR
<value key="Qt4ProjectManager.Qt4BuildConfiguration.BuildDirectory" type="QString">/home/share/path/to/build/directory</value>
$ echo $VAR|ruby -e 'puts gets[/<value.*?>(.*)<\/value>/,1]'
/home/share/path/to/build/directory

alister · April 15, 2011, 11:07am

Bash parameter substitution and pathname expansion (file globbing) do not use regular expressions.

The portable subset of pattern matching features used in parameter expansion and pathname expansion isn't very powerful. It's documented @ Shell Command Language

Bash (and ksh) support more useful functionality, so either reference the relevant section of your man page or visit Pattern Matching - Bash Reference Manual

Quick tips:
The portable way to negate a bracketed list of characters is with a "!"; a leading "^" works in bash and many other shells, but POSIX leaves its behaviour unspecified.

In regex grammar, a * means that the preceding character or subexpression can match any number of times, including none. In the shell's pattern matching grammar, * is not a quantifier/repeater; it is a wildcard that itself represents any number of any characters (none included).

. is not special. It stands for a dot.

? is a wildcard that matches any single character (it does not mean that the previous character is optional).
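
To make these tips concrete, here is a minimal sketch that tests a few strings against shell patterns using case, which applies the same pattern matching rules as parameter expansion. The matches function is a hypothetical helper written for this illustration; it is not a shell builtin.

```shell
# matches STRING PATTERN: succeed if the whole STRING matches the shell PATTERN.
# (Hypothetical helper, for illustration only.)
matches() {
    case $1 in
        $2) return 0 ;;   # unquoted $2 is treated as a pattern
        *)  return 1 ;;
    esac
}

# '.' is literal: it matches a dot and nothing else.
matches 'a.b' 'a.b' && echo "a.b matches a.b"
matches 'axb' 'a.b' || echo "axb does not match a.b"

# '?' stands for exactly one character; it does not make 'a' optional.
matches 'ab' 'a?' && echo "ab matches a?"
matches 'a'  'a?' || echo "a does not match a?"

# '*' by itself stands for any run of characters, including none.
matches 'a'    'a*' && echo "a matches a*"
matches 'axyz' 'a*' && echo "axyz matches a*"

# [!>] is one character that is not '>'.
matches '<b>' '<[!>]>' && echo "<b> matches <[!>]>"
matches '<>'  '<[!>]>' || echo "<> does not match <[!>]>"
```

Each line prints its message only when the match behaves as described in the tips above.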

So, what does your original pattern actually accomplish?
${VAR#<[^>]*>} tries to match, from the beginning of VAR's value, a '<' followed by exactly one character that is not a '>', followed by as few characters as possible (since # is not greedy), up to the first occurrence of a '>'. This pattern requires at least one character between '<' and the first '>'. From a regular expressionist's point of view, the intent seems to be to allow the space between '<' and '>' to be empty. If so, the proper pattern is ${VAR#<*>} .

All that said, I don't know why your result is the unchanged value of $VAR.
Given your sample data, both your pattern and my suggested pattern return /home/share/path/to/build/directory</value> .
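
As a quick check of the behaviour described above, the following sketch applies the patterns to the sample value from the first post. The variable names short, long and strict are chosen only for this illustration, and the bracket negation is written with "!" rather than "^".

```shell
VAR='<value key="Qt4ProjectManager.Qt4BuildConfiguration.BuildDirectory" type="QString">/home/share/path/to/build/directory</value>'

# '#' removes the shortest prefix matching the pattern.
short=${VAR#<*>}

# '##' removes the longest prefix; here the whole value matches <*>.
long=${VAR##<*>}

# Same as the first, but requires at least one non-'>' character after '<'.
strict=${VAR#<[!>]*>}

printf 'short:  %s\n' "$short"
printf 'long:   [%s]\n' "$long"
printf 'strict: %s\n' "$strict"
```

With this input, short and strict both hold /home/share/path/to/build/directory</value> and long is empty, because the value both starts with '<' and ends with '>'.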

Perhaps you can printf %s "$VAR" | od -c -tx1 to take a look at VAR's exact contents (it should print the character over its hexadecimal byte value). Perhaps there's an "invisible" character at the beginning?
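
A minimal sketch of why a stray leading character would matter, assuming (purely for illustration) a value that starts with a space. BAD and stripped are hypothetical names, and nothing in this thread confirms that this is what happened to the original VAR.

```shell
# Hypothetical value with a leading space, easy to miss in unquoted echo output.
BAD=' <tag>text'

# Dump the exact bytes; the leading space shows up as the first character.
printf '%s' "$BAD" | od -c -tx1

# The '#' pattern is anchored at the start of the value, so a value that
# does not begin with '<' is returned unchanged.
stripped=${BAD#<*>}

if [ "$stripped" = "$BAD" ]; then
    echo "unchanged: the value does not start with '<'"
fi
```

Note that an unquoted echo $BAD would drop the leading space through word splitting, which is why the od dump is the more reliable check.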

Regards,
Alister

---------- Post updated at 11:07 AM ---------- Previous update was at 10:59 AM ----------

Missed that crucial bit. The following should work on most sane, posix-like shells:

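# Remove the shortest leading match of <*> (the opening tag).
temp=${VAR#<*>}
# Remove the shortest trailing match of <* (the closing tag).
echo "${temp%<*}"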

Regards,
Alister

kerloi · April 18, 2011, 3:18am

Thanks all for your replies, and for your explanations, alister. But it is still not working with the bash-only syntax:

${VAR#<*>}

My bash version is: GNU bash, version 3.2.25(1)-release (x86_64-redhat-linux-gnu). I know that bash is currently on version 4.X so ...

But panyam's and kurumi's solutions work great.
Thanks a lot to all of you.