sed: filtering lines by address range fails when the range is a single line

The following is part of a larger project, and sed is (for now) a given. I am working on a recursive Korn shell function to "peel off" XML tags from a larger text. For context, here is the complete function (not working yet):

function pGetXML
{
typeset chTag="$1"
typeset chOpt="$1"
typeset chLine=""

if [ "${chOpt#*/}" = "${chOpt}" ] ; then
     chOpt=""
else
     chOpt="${chOpt#*/}"
     chTag="${chTag%/*}"
fi

print -u2 - "inside pGetXML...."
print -u2 - "chTag=${chTag}"
print -u2 - "chOpt=${chOpt}"
print -u2 - "Args=$*\n"

if [ -n "$chTag" ] ; then
     shift
     sed -n '/<'"$chTag"'[^>]*'"$chOpt"'[^>]*>/,/<\/'"$chTag"'[^>]*>/p' |\
     pGetXML "$@"
else
     while read chLine ; do
          pStripTags "$chLine"
     done
fi

return 0
}

The function will be called like

pGetXML "arg1/type=opt1" "arg2/type=opt2" "Value"...

and is intended to "peel off" layers of XML tags from a file organized like this:

<arg1 type=opt1>
     <arg2 type=opt2>
          <Value>blabla</Value>
     </arg2>
     <othertag>
          <Value>foo bar</Value>
     </othertag>
</arg1>

The function should first print everything from "<arg1>" to "</arg1>". (The "option" is needed because there may be other tags with the same name that I am not interested in, such as "<arg1 type=else>".) The second instance filters from that output only the lines "<arg2>...</arg2>", and the third pass only the lines "<Value>...</Value>". The function "pStripTags" simply strips off the tags, leaving the text inside.
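
As a side note on the "tag/option" argument format: the two parts are separated with standard parameter expansions, which can be tried on their own. The following is only a minimal sketch; "split_arg" is a hypothetical helper made up for this illustration and is not part of the function.

```shell
#!/bin/sh
# split_arg is a hypothetical helper; it mirrors the splitting done at the top of pGetXML.
split_arg() {
    tag="$1"
    opt="$1"
    if [ "${opt#*/}" = "$opt" ]; then
        # no "/" in the argument: tag only, no option
        opt=""
    else
        opt="${opt#*/}"     # drop the shortest prefix ending in "/": the option remains
        tag="${tag%/*}"     # drop the shortest suffix starting with "/": the tag remains
    fi
    printf 'tag=%s opt=%s\n' "$tag" "$opt"
}

r1=$(split_arg "arg1/type=opt1")
r2=$(split_arg "Value")
printf '%s\n%s\n' "$r1" "$r2"
```

The first call yields tag "arg1" with option "type=opt1"; the second yields tag "Value" with an empty option.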

Well, this is what was intended, and it mostly works, but in the last step sed does not behave as expected when the opening and closing tags of the range are on the same line. At this stage I am down to this portion of the text (verified):

     <arg2 type=opt2>
          <Value>blabla</Value>
     </arg2>

and the sed command (verified with "set -xv") is this:

sed -n '/<Value[^>]*[^>]*>/,/<\/Value[^>]*>/p'

I would have expected it to print only line 2, but instead it prints lines 2 and 3.
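
For anyone who wants to reproduce this in isolation, here is a minimal sketch. The input is simply the three lines above fed in through printf; the variable name "out" is only for this illustration.

```shell
#!/bin/sh
# Feed the three verified lines to the same range expression.
out=$(printf '%s\n' \
    '     <arg2 type=opt2>' \
    '          <Value>blabla</Value>' \
    '     </arg2>' |
    sed -n '/<Value[^>]*[^>]*>/,/<\/Value[^>]*>/p')
printf '%s\n' "$out"
```

This prints the "<Value>" line and the "</arg2>" line, not just the "<Value>" line.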

The objective is to create a sed script that will fit into the recursive function. Any pointers will be welcome.

bakunin

from the man page:

Not sure how to circumvent this... Will the </Value> tag always be on the same line?

Thanks for the man page quote. Either I am blind or the AIX man page doesn't mention this detail, but at least this is an explanation.

No, and this is what led me to try my solution in the first place. As you can see from the example text, all but the innermost tags are on separate lines.

I will post the reworked script as soon as I have it ready. Thanks.

bakunin

Hi bakunin, you may replace your sed script with this:

sed -n '
:strt
/<'"$chTag"'[^>]*'"$chOpt"'[^>]*>/{
/<\/'"$chTag"'[^>]*>/{
p
d
}
N
b strt
}'

With an address range, sed first looks for a line matching the first address. It does not test the second address against that same line; the second address is checked only on subsequent lines. Hence the problem.
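
The difference between the two forms can be seen side by side in a small sketch. The sample input and the names "sample", "range_out" and "loop_out" are made up for this illustration, and the tag is hard-coded as "Value" instead of using the shell variables.

```shell
#!/bin/sh
# Sample input: one element on a single line, then one spread over three lines.
sample() {
    printf '%s\n' \
        '<arg2 type=opt2>' \
        '<Value>one</Value>' \
        '</arg2>' \
        '<Value>' \
        'two' \
        '</Value>' \
        '<other/>'
}

# Range form: the end address is first tested on the line AFTER the start line,
# so the range opened by the one-line element runs on until the next </Value>.
range_out=$(sample | sed -n '/<Value[^>]*>/,/<\/Value[^>]*>/p')

# Loop form: test the closing tag on the pattern space itself;
# if it is not there yet, append the next line with N and test again.
loop_out=$(sample | sed -n '
:strt
/<Value[^>]*>/{
/<\/Value[^>]*>/{
p
d
}
N
b strt
}')

printf 'range form:\n%s\n\nloop form:\n%s\n' "$range_out" "$loop_out"
```

The range form lets the unrelated "</arg2>" line through, while the loop form prints only the two "Value" elements.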

Well, I came up with this:

sed -rn '/<Value[^>]*[^>]*>/{h;
                 /<\/Value[^>]*>/!b nxt;g;p;b end
         : nxt    {n; /<\/Value[^>]*>/!{H;b nxt}
                     /<\/Value[^>]*>/H;x;p;b end
                  }
         : end}
        '

which prints one-liners as well as multi-line content between <Value> tags... Give it a shot and report back.

Many thanks for your helpful suggestions.

I modified the function a bit and noticed that I don't need the last step, "pStripTags", if the sed script strips the tags itself. Here is the revised function. I have added "tee -a <tracefile>" commands to trace the individual steps of the recursion; they serve only debugging purposes and can safely be removed for production:

# ------------------------------------------------------------------------------
# pGetXML                        extract certain values from a layered XML code
# ------------------------------------------------------------------------------
# Author.....: bakunin, with help of various unix.com members
# last update: 2012 08 23    by: bakunin
# ------------------------------------------------------------------------------
# Revision Log:
#
# ------------------------------------------------------------------------------
# Usage:
#     pGetXML tag1[/option1] [tag2[/option2] ..]
#
#
#     Example:
#          cat file | pGetXML foo/opt1 bar/opt2
#          will search for a range of "<foo ...opt1..> ... </foo>" and in the
#          resulting stream search for a range of "<bar ..opt2..> ... </bar>".
#          The result will be reformatted to a single line and the enclosing
#          tags will be removed. This text:
#
#          <foo type=opt2>
#               <sometag>
#          </foo>
#          <foo type=opt1>
#               <bar>
#                    somevalue
#               </bar>
#               <bar type=opt2>searched_for</bar>
#          </foo>
#
#          will result only in "searched_for", because in the first foo-tag the
#          option doesn't match, and the same goes for the first bar-tag.
#
# Prerequisites:
# - none
# ------------------------------------------------------------------------------
# Documentation:
# Extracts values from an XML file of nested tags presented at <stdin>.
# The given list of tags is searched recursively. Only the tag name has to
# be given, so
#
#             pGetXML foo
#
# will return the content of "<foo> .. </foo>". It is possible to refine tags
# by using "options", which will be searched for in the tag definition (see
#  below).
#
# Output goes to <stdout>.
#
#     Parameters: tag1[/opt1] [tag2[/opt2] ..tagN[/optN]] 
#     returns: void
# ------------------------------------------------------------------------------
# known bugs:
#
#     none
# ------------------------------------------------------------------------------
# ..........................(C) 2012 bakunin ..................................
# ------------------------------------------------------------------------------

function pGetXML
{
typeset chTag="$1"
typeset chOpt="$1"
typeset chLine=""

if [ "${chOpt#*/}" = "${chOpt}" ] ; then
     chOpt=""
else
     chOpt="${chOpt#*/}"
     chTag="${chTag%/*}"
fi

# DEBUG start
#      print -u2 - "inside pGetXML...."
#      print -u2 - "chTag=${chTag}"
#      print -u2 - "chOpt=${chOpt}"
#      print -u2 - "Args=$*\n"
# DEBUG end

if [ -n "$chTag" ] ; then
     shift
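     sed -n '/<'"$chTag"'[^>]*'"$chOpt"'[^>]*>/ {
               :next
               # collect lines until the closing tag is in the pattern space
               /<\/'"$chTag"'[^>]*>/! {
                    N
                    b next
               }
               # strip tags here, so closing tags of non-matching
               # elements never produce (empty) output lines
               s/\n//g
               s/^.*<'"$chTag"'[^>]*'"$chOpt"'[^>]*>//
               s/<\/'"$chTag"'[^>]*>.*$//p
             }' |\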
     tee -a xxx.$(date +'%H%M%N').out |\
     pGetXML "$@"
else
     tee -a xxx.last.out |\
     while read chLine ; do
          print - "$chLine"
     done
fi

return 0
}
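
To check the idea without the recursion and the trace files, a single stage can be tried on the example text from the header comment. This is only a sketch: "peel" is a hypothetical stand-alone helper modelled on the sed program in the function, with the tag stripping kept inside the opening-tag block so that unrelated closing tags are ignored.

```shell
#!/bin/sh
# peel is a hypothetical helper: one stage of the pipeline, no recursion.
# $1 = tag name, $2 = option text that must occur inside the opening tag (may be empty)
peel() {
    sed -n '/<'"$1"'[^>]*'"$2"'[^>]*>/ {
        :next
        /<\/'"$1"'[^>]*>/! {
            N
            b next
        }
        s/\n//g
        s/^.*<'"$1"'[^>]*'"$2"'[^>]*>//
        s/<\/'"$1"'[^>]*>.*$//p
    }'
}

# The example from the header comment, peeled in two stages.
result=$(printf '%s\n' \
    '<foo type=opt2>' \
    '     <sometag>' \
    '</foo>' \
    '<foo type=opt1>' \
    '     <bar>' \
    '          somevalue' \
    '     </bar>' \
    '     <bar type=opt2>searched_for</bar>' \
    '</foo>' |
    peel foo opt1 | peel bar opt2)
printf '%s\n' "$result"
```

The two chained calls correspond to "pGetXML foo/opt1 bar/opt2" and leave only "searched_for".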

bakunin