Non-greedy pattern matching in shell script

Hi all,

Is Perl included by default in Ubuntu? I'm trying to write a program using as few languages as possible, and the few Perl one-liners I'm using for non-greedy matching count as another language, which we'd like to avoid.

Basically, I'm using a Perl one-liner to grab XML between tags, where $2 is the name of the tag and $3 is the nth tag with that name:

perl -pe "s/(.*?<$2>){$3}(.*?)<\/$2>.*/\2/"

To escape forward slashes in XML:

content=$(echo "$4" | perl -pe "s/<\//<\\\\\//g")

And to grab an XML tag based on both its tag and content, where $2 is the name of the tag, $3 is the nth tag with that name, and $content is an XML string escaped as above:

perl -pe "s/(.*?<$2>){$3}(.*$content.*?)<\/$2>.*/\2/"

I can't use sed because it doesn't have non-greedy matching, I can't use grep because it doesn't have non-greedy matching without Perl-like extensions, and to my knowledge Bash cannot do something this complicated on its own.

Does anyone know of another way I can do this, so that it isn't "another language" we have to maintain?

Thanks,
Zel2008

Try replacing your <tag> and </tag> with two single characters like ~ and @.

You can then use [^~]* in place of a non-greedy match. Just in case these two special characters appear in the input, replace them with two unique strings first and swap them back when done:

sed -r -e "s,~,UNIQUE_STR1,g" \
    -e "s,@,UNIQUE_STR2,g" \
    -e "s,<${2}>,~,g" \
    -e "s,</${2}>,@,g" \
    -e "s/([^~]*~){$3}([^@]*)@.*/\2/" \
    -e "s,UNIQUE_STR1,~,g" \
    -e "s,UNIQUE_STR2,@,g" "${1}"

This assumes the whole document is on one line, which is likely to cause issues with sed when your XML gets large, so it's not ideal, but it's a good example of the concept.
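For anyone who wants to try it on a sample string, here is a minimal sketch of the same sed commands wrapped in a function that reads standard input. The function name nth_tag_sed and the sample input are made up for illustration, and it assumes GNU sed (for -r), single-line input, and tags without attributes or nesting of the same tag.

```shell
#!/bin/sh
# nth_tag_sed TAG N  (hypothetical helper, not from the original post)
# Prints the content of the Nth <TAG>...</TAG> pair found on stdin.
# Assumes GNU sed (-r), one line of input, no attributes, no nested TAG.
nth_tag_sed() {
    tag=$1 n=$2
    sed -r -e "s,~,UNIQUE_STR1,g" \
        -e "s,@,UNIQUE_STR2,g" \
        -e "s,<${tag}>,~,g" \
        -e "s,</${tag}>,@,g" \
        -e "s/([^~]*~){${n}}([^@]*)@.*/\2/" \
        -e "s,UNIQUE_STR1,~,g" \
        -e "s,UNIQUE_STR2,@,g"
}

# The sample contains a literal @ and ~ to show the placeholders round-trip.
printf '%s\n' '<r><a>x@y</a><b>skip</b><a>second ~ one</a></r>' | nth_tag_sed a 2
# prints: second ~ one
```

Because ~ can only be an opening tag and @ can only be a closing tag by the time the main substitution runs, [^~]* and [^@]* stop at the nearest tag, which is what the non-greedy .*? did in the Perl version.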

Another approach is to use the awk Record Separator (RS), replacing both the start and end tags with the same single character:

sed -e "s,~,UNIQUE_STR,g" \
    -e "s,<${2}>,~,g" \
    -e "s,</${2}>,~,g" "${1}" | \
awk "NR==${3}*2" RS=\~ | \
sed -e "s,UNIQUE_STR,~,g"

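Now awk can simply select record number N*2 to get the required data: the records alternate between text outside the tags (odd-numbered records) and tag content (even-numbered records).

Again, we replace UNIQUE_STR with ~ for the final result.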
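A minimal sketch of this pipeline as a function reading standard input, in case it helps with testing. The name nth_tag_awk and the sample input are hypothetical, and it assumes the tag has no attributes and is not nested inside itself. Passing N with -v is my own variation so the awk program can stay in single quotes.

```shell
#!/bin/sh
# nth_tag_awk TAG N  (hypothetical helper, not from the original post)
# Prints the content of the Nth <TAG>...</TAG> pair found on stdin.
# Assumes no attributes on TAG and no TAG nested inside another TAG.
nth_tag_awk() {
    tag=$1 n=$2
    sed -e "s,~,UNIQUE_STR,g" \
        -e "s,<${tag}>,~,g" \
        -e "s,</${tag}>,~,g" |
    awk -v n="$n" 'NR == n * 2' RS='~' |   # even records hold tag content
    sed -e "s,UNIQUE_STR,~,g"
}

# Unlike the single-line sed version, the content may span several lines.
printf '<r>\n<a>one</a>\n<a>two\nlines</a>\n</r>\n' | nth_tag_awk a 2
# prints "two" and "lines" on separate lines
```

Since awk splits on ~ rather than on newlines, the document does not need to be squeezed onto one line first.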

Thanks Chubler,

I'll try this out and see how it works. Is sed included by default in Ubuntu? We have a major requirement that things not be too difficult to maintain, and we don't want to risk needing to reinstall sed to make things work.

Thanks,
Zel2008

Yes, sed is available by default on Ubuntu. It's part of POSIX, so it should be pretty widely available. However, the -r option, though available on Ubuntu, is a GNU extension rather than POSIX, so it is not as portable.

Solution 2 is pretty portable and should work on most systems, though Solaris may need nawk instead of awk.
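On the -r point: if portability matters, the first solution can probably be written with basic regular expressions instead, so no -r is needed. This is only a sketch under the same assumptions as before (single-line input, no attributes, no nested tags), and the function name nth_tag_posix is made up for illustration. In basic regular expressions the grouping and repeat-count characters are written with backslashes.

```shell
#!/bin/sh
# nth_tag_posix TAG N  (hypothetical helper, not from the original post)
# Same idea as the sed -r solution, but using basic regular expressions:
# \( \) for groups and \{N\} for the repeat count, so no -r is required.
nth_tag_posix() {
    tag=$1 n=$2
    sed -e "s,~,UNIQUE_STR1,g" \
        -e "s,@,UNIQUE_STR2,g" \
        -e "s,<${tag}>,~,g" \
        -e "s,</${tag}>,@,g" \
        -e "s/\([^~]*~\)\{${n}\}\([^@]*\)@.*/\2/" \
        -e "s,UNIQUE_STR1,~,g" \
        -e "s,UNIQUE_STR2,@,g"
}

printf '%s\n' '<r><a>first</a><a>second</a></r>' | nth_tag_posix a 2
# prints: second
```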