Find all matching words in text according to pattern

Grunspanix · June 25, 2013, 6:55am

Hello dear Unix shell professionals,
I am desperately trying to get a seemingly simple piece of logic to work. I need to extract words from a line of text and save them in an array. The text can look something like this:

aaaaaaa${important}xxxxxxxx${important2}ooooooo${importantstring3}...

I am constrained in several ways, though:

Can't use perl
Stuck on an ancient GNU bash, version 3.00.16(1)-release (powerpc-ibm-aix5.1)
grep -o is not installed

My attempt was this:

line='aaaaaaa${important}xxxxxxxx${important2}ooooooo${importantstring3}';
if [[ $line =~ '(\${[^{]*})' ]]; 
    then
        echo "matching[1]: ${BASH_REMATCH[1]}";
        echo "matching[2]: ${BASH_REMATCH[2]}";
        echo "matching[3]: ${BASH_REMATCH[3]}";
    fi;

Output:

matching[1]: ${important}
matching[2]:
matching[3]:

So it prints the first match correctly, but it ignores all the remaining matches. Please, can anyone help me with this? I have been stuck on it for 2 days now :(. If it works with "awk", that would be fine too, but I can't figure out the syntax. Beware that I use an old shell.
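A side note on the output above: the indices of BASH_REMATCH refer to the parenthesized groups of a single match, not to successive matches. Index 0 is the whole match and index N is the N-th group. The pattern in the attempt has only one group, so indices 2 and 3 stay empty. A minimal sketch, assuming a bash with the =~ operator; show_groups and the key=value input are hypothetical names made up for illustration:

```shell
#!/usr/bin/env bash
# show_groups is a hypothetical helper, not part of the original thread.
# It matches a string against a pattern that has two groups and prints
# what ends up in BASH_REMATCH.
show_groups() {
    local s=$1
    local re='([a-z]+)=([a-z]+)'   # two parenthesized groups
    if [[ $s =~ $re ]]; then
        echo "whole match: ${BASH_REMATCH[0]}"
        echo "group 1: ${BASH_REMATCH[1]}"
        echo "group 2: ${BASH_REMATCH[2]}"
    fi
}

show_groups 'key=value'
# prints:
# whole match: key=value
# group 1: key
# group 2: value
```

Getting every occurrence therefore takes a loop of some kind, since one =~ test only ever reports the first (leftmost) match.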

radoulov · June 25, 2013, 7:38am

Try this:

line='aaaaaaa${important}xxxxxxxx${important2}ooooooo${importantstring3}'
IFS=\$ read -a _a <<< "$line" 
_regex='(\{[^}]+})'
for _e in "${_a[@]}"; do
  [[ $_e =~ $_regex ]] &&
    _n+=( "\$${BASH_REMATCH[0]}" )
done
# your matches are in the _n array

For example:

$ line='aaaaaaa${important}xxxxxxxx${important2}ooooooo${importantstring3}'
$ IFS=\$ read -a _a <<< "$line"
$ _regex='(\{[^}]+})'
$ for _e in "${_a[@]}"; do
>   [[ $_e =~ $_regex ]] &&
>     _n+=( "\$${BASH_REMATCH[0]}" )
> done
$ # your matches are in the _n array:
$ declare -p _n
declare -a _n='([0]="\${important}" [1]="\${important2}" [2]="\${importantstring3}")'

Grunspanix · June 25, 2013, 9:25am

Wow! Awesome solution! Many thanks!!!!!

I had to convert parts of it to make it compatible with my old shell, as I got a syntax error, but all in all it works perfectly! I even tried to trick it with stray "$" characters or stray braces "{", but it still only outputs the correct ones!

line='aaaa$}aaa${important}xxxxxxxx${important2}oo{o$}oo$oo${importantstring3}'
IFS=\$ read -a words <<< "$line" 
regex='(\{[^}]+})'
for e in "${words[@]}"; do
    if [[ $e =~ $regex ]]; then    
        echo "\$${BASH_REMATCH[0]}";
    fi;
done

Thanks again, you made a very happy user

---------- Post updated at 08:25 AM ---------- Previous update was at 07:05 AM ----------

Though I am satisfied with the solution, as I assume it will not produce errors on my real data, I have found a case that tricks it. If I use this line:

line='aaaa$aa{yyy}aaaaaa${important}xxxx'

It will print ${yyy} as a match. That is because it only uses "$" as the separator, so it indirectly allows arbitrary characters between the "$" and the "{". I still wonder whether there is a regex that covers this (sorry, I am not the best at expressions and think in pseudo code, but somehow it bugs me):

First one would need to determine that these 2 characters must always come first:
[\$][\{]

Then comes a term where everything is allowed, except these:
[everything allowed except \$,\{]

The previous term is read until the closing brace comes:
[\}]

This is my naive thinking, but it seems the thought process is easier than the actual implementation.
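The three steps above translate fairly directly into one extended regular expression. Below is a minimal sketch, assuming a bash with =~ and with the regex kept in an unquoted variable; it has not been tested on bash 3.00, and extract_vars is a hypothetical name. Since one =~ test only reports the leftmost match, the loop cuts the string off after each match and tries again on the remainder:

```shell
#!/usr/bin/env bash
# extract_vars is a hypothetical helper: print every ${...} token in $1,
# one per line.
extract_vars() {
    local rest=$1 m
    # [$][{]    step 1: a literal "$" followed by a literal "{"
    # [^${}]+   step 2: one or more characters other than "$", "{" and "}"
    # [}]       step 3: the closing "}"
    local re='[$][{][^${}]+[}]'
    while [[ $rest =~ $re ]]; do
        m=${BASH_REMATCH[0]}
        echo "$m"
        # Drop everything up to and including this match, then search again.
        # The quotes around $m make the pattern match it literally.
        rest=${rest#*"$m"}
    done
}

extract_vars 'aaaa$aa{yyy}aaaaaa${important}xxxx'
# prints:
# ${important}
```

Writing each special character as a one-character bracket expression such as [$] avoids having to think about backslash escaping, which is close to the pseudo code above. Note that the second step also excludes "}", otherwise the term would run past the first closing brace.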

radoulov · June 25, 2013, 9:29am

Something like this:

IFS=\$ read -a words <<< "$line" 
regex='^(\{[^}]+})'
for e in "${words[@]}"; do
    if [[ $e =~ $regex ]]; then    
        echo "\$${BASH_REMATCH[0]}";
    fi;
done

You said that you can't use Perl, but for comparison:

% perl -le'print join $/, shift =~ /\${.*?}/g' 'aaaa$}aaa${important}xxxxxxxx${important2}oo{o$}oo$oo${importantstring3}'
${important}
${important2}
${importantstring3}
% perl -le'print join $/, shift =~ /\${.*?}/g' 'aaaa$aa{yyy}aaaaaa${important}xxxx'
${important}
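Since awk was mentioned as acceptable, roughly the same extraction can be sketched with awk's match() function and its RSTART and RLENGTH variables, which are part of POSIX awk. This is only a sketch: it has not been tried on the AIX 5.1 awk, and extract_vars_awk is a hypothetical wrapper name. Unlike the Perl one-liner it requires at least one character between the braces and rejects "$" and "{" inside them:

```shell
#!/usr/bin/env bash
# extract_vars_awk is a hypothetical wrapper: print every ${...} token
# found in $1, one per line, using awk.
extract_vars_awk() {
    printf '%s\n' "$1" | awk '{
        rest = $0
        # match() sets RSTART and RLENGTH for the leftmost match in rest
        while (match(rest, /[$][{][^${}]+[}]/)) {
            print substr(rest, RSTART, RLENGTH)
            # continue searching after the end of this match
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

extract_vars_awk 'aaaa$}aaa${important}xxxxxxxx${important2}oo{o$}oo$oo${importantstring3}'
# prints:
# ${important}
# ${important2}
# ${importantstring3}
```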

Grunspanix · June 25, 2013, 1:36pm

Damn, thanks again!
This works perfectly, although at first I wasn't sure why. But now I see: the anchor character "^" requires the expression in '(...)' to appear at the very beginning. I was confused initially because the grymoire docs describe the anchor as matching "at the beginning of a line", and I wasn't sure what the "line" was in this case. Was it the original "$line" or the split parts of the line? Evidently each split part is treated as its own "line", and that's why it works. I got there eventually.
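To make the effect of the anchor concrete: splitting 'aaaa$aa{yyy}aaaaaa${important}xxxx' on "$" yields the pieces 'aaaa', 'aa{yyy}aaaaaa' and '{important}xxxx', and the regex is tested against each piece separately. A small sketch, assuming a bash with =~; brace_match is a hypothetical helper and the patterns are written with bracket expressions instead of backslashes:

```shell
#!/usr/bin/env bash
# brace_match is a hypothetical helper: print the part of $1 matched by
# the regex in $2, or "no match".
brace_match() {
    if [[ $1 =~ $2 ]]; then
        echo "${BASH_REMATCH[0]}"
    else
        echo "no match"
    fi
}

# Unanchored: the braces may appear anywhere in the piece.
brace_match 'aa{yyy}aaaaaa'   '[{][^}]+[}]'     # prints {yyy}
# Anchored: the "{" must be the first character of the piece,
# i.e. it must have followed the "$" directly in the original line.
brace_match 'aa{yyy}aaaaaa'   '^[{][^}]+[}]'    # prints no match
brace_match '{important}xxxx' '^[{][^}]+[}]'    # prints {important}
```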

Regarding Perl: yeah, there was a choice between Perl and bash scripts, and then the thought came "use something which is always available and more down-to-earth", so the decision fell to plain shell scripts.

While it is an interesting learning experience, I have previously used some Perl and it was way more comfortable. I am not sure the pure shell scripting decision was right after all, especially seeing that Perl is installed on most Unix machines anyway... sigh, but what can you do.

radoulov · June 25, 2013, 3:14pm

Correct, perhaps "the beginning of the string" would be more appropriate.

That's OK, actually. I almost always use only pure shell scripting too, but Perl makes the string manipulation really, really easy.
Moreover, Perl is often available even where bash is not (an old HP-UX springs to mind :)).