Non-greedy pattern matching in shell script

Zel2008 · July 28, 2014, 4:11pm

Hi all,

Is Perl included by default in Ubuntu? I'm trying to write a program using as few languages as possible, and since I'm using a few Perl one-liners to do non-greedy matching, it's considered another language, and this is a bad thing.

Basically, I'm using a Perl one-liner to grab XML between tags, where $2 is the name of the tag and $3 is the nth tag with that name:

perl -pe "s/(.*?<$2>){$3}(.*?)<\/$2>.*/\2/"

To escape forward slashes in XML:

content=$(echo "$4" | perl -pe "s/<\//<\\\\\//")

And to grab an XML tag based on both its tag and content, where $2 is the name of the tag, $3 is the nth tag with that name, and $content is an XML string escaped as above:

perl -pe "s/(.*?<$2>){$3}(.*$content.*?)<\/$2>.*/\2/"

I can't use sed because it doesn't have non-greedy matching, I can't use grep because it doesn't have non-greedy matching without Perl-like extensions, and to my knowledge Bash cannot do something this complicated on its own.

Does anyone know of another way I can do this, so it's not "another language" we have to maintain?

Thanks,
Zel2008

Chubler_XL · July 28, 2014, 4:50pm

Try replacing your <tag> and </tag> with two single characters like ~ and @

You can then use [^~]* to match everything up to the next tag. Just in case these two special characters appear in the input, replace them with two unique strings first and swap them back when done:

sed -r -e "s,~,UNIQUE_STR1,g" \
    -e "s,@,UNIQUE_STR2,g" \
    -e "s,<${2}>,~,g" \
    -e "s,</${2}>,@,g" \
    -e "s/([^~]*~){$3}([^@]*)@.*/\2/" \
    -e "s,UNIQUE_STR1,~,g" \
    -e "s,UNIQUE_STR2,@,g" "${1}"

This assumes the whole document is on one line, which is likely to cause issues with sed when your XML gets large, so it's not ideal, but it's a good example of the concept.
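As a quick check of the idea, here is a minimal sketch that runs the same substitutions on a made-up one-line document. The sample XML and the variable names tag and n (standing in for $2 and $3) are assumptions for illustration, and it reads from standard input rather than a file:

```shell
# Hypothetical sample document, all on one line as this approach assumes.
xml='<root><item>first</item><item>second ~ value</item><item>third</item></root>'
tag=item   # stands in for $2
n=2        # stands in for $3

# Same substitutions as above, fed from stdin. -r needs GNU sed.
result=$(printf '%s\n' "$xml" | sed -r \
    -e "s,~,UNIQUE_STR1,g" \
    -e "s,@,UNIQUE_STR2,g" \
    -e "s,<${tag}>,~,g" \
    -e "s,</${tag}>,@,g" \
    -e "s/([^~]*~){${n}}([^@]*)@.*/\2/" \
    -e "s,UNIQUE_STR1,~,g" \
    -e "s,UNIQUE_STR2,@,g")

printf '%s\n' "$result"
```

The literal ~ inside the second item survives because it is hidden as UNIQUE_STR1 before the tags are swapped and restored afterwards.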

Another approach is to use the awk Record Separator (RS) by replacing the start and end tags with a single character:

sed -e "s,~,UNIQUE_STR,g" \
    -e "s,<${2}>,~,g" \
    -e "s,</${2}>,~,g" "${1}" | \
awk "NR==${3}*2" RS=\~ | \
sed -e "s,UNIQUE_STR,~,g"

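Now awk can simply select record N*2 to get the required data: the odd-numbered records hold the text between tags, and the even-numbered ones hold the tag contents.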

Again we replace the UNIQUE_STR with ~ for the final result.
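To see the record numbering in action, here is a small sketch on a made-up multi-line document. The sample XML and the names tag and n (standing in for $2 and $3) are assumptions for illustration:

```shell
# Hypothetical sample document spread over several lines.
xml='<root>
<item>first</item>
<item>second ~ value</item>
<item>third</item>
</root>'
tag=item   # stands in for $2
n=2        # stands in for $3

# Records split on ~ are: 1 = text before the first tag, 2 = first item,
# 3 = text between items, 4 = second item, and so on.
result=$(printf '%s\n' "$xml" |
    sed -e "s,~,UNIQUE_STR,g" \
        -e "s,<${tag}>,~,g" \
        -e "s,</${tag}>,~,g" |
    awk "NR==${n}*2" RS='~' |
    sed -e "s,UNIQUE_STR,~,g")

printf '%s\n' "$result"
```

Because awk splits on ~ rather than on newlines, the document does not have to sit on a single line here.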

Zel2008 · July 29, 2014, 9:31am

Thanks Chubler,

I'll try this out and see how it works, thanks. Is sed included by default in Ubuntu? We have a major requirement that things not be too difficult to maintain, and we don't want to risk needing to reinstall sed to make things work.

Thanks,
Zel2008

Chubler_XL · July 29, 2014, 10:53am

Yes, sed is available by default on Ubuntu; it's specified by POSIX, so it should be pretty widely available. However, the -r option, though available on Ubuntu, isn't POSIX, so it is not as portable.
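If -r is a concern, the grouping-and-repeat expression from the first solution can be written in basic regular expression syntax instead, where groups are \( \) and intervals are \{ \}. A minimal sketch, assuming the tags were already swapped for ~ and @ as before; the sample line and the variable n are made up:

```shell
# Hypothetical line after <tag> became ~ and </tag> became @.
line='<root>~first@~second@~third@</root>'
n=2   # stands in for $3

# Basic regular expression form of the same substitution: no -r needed.
result=$(printf '%s\n' "$line" |
    sed -e "s/\([^~]*~\)\{${n}\}\([^@]*\)@.*/\2/")

printf '%s\n' "$result"
```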

Solution 2 (the awk approach) is pretty portable and should work on most systems, though Solaris may need nawk instead of awk.
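Since the original question doubted that the shell could do this on its own, it may be worth noting that parameter expansion already removes the shortest match with # and %. The following is only a sketch under that idea, not something tested against real XML: the function name nth_tag and its arguments are made up, it takes the document as a string, and it does not handle attributes or nested tags of the same name.

```shell
# nth_tag DOC TAG N: print the content of the Nth <TAG>...</TAG> in DOC.
# Hypothetical helper for illustration only.
nth_tag() {
    nt_doc=$1 nt_tag=$2 nt_n=$3
    nt_i=0
    while [ "$nt_i" -lt "$nt_n" ]; do
        case $nt_doc in
            *"<$nt_tag>"*)
                # '#' removes the shortest prefix, i.e. up to the next opening tag
                nt_doc=${nt_doc#*"<$nt_tag>"} ;;
            *)
                return 1 ;;   # fewer than N opening tags
        esac
        nt_i=$((nt_i + 1))
    done
    # '%%' removes the longest suffix, i.e. from the first closing tag onwards
    nt_content=${nt_doc%%"</$nt_tag>"*}
    printf '%s\n' "$nt_content"
}

nth_tag '<a><b>x</b><b>y</b></a>' b 2
```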