How can I extract XML block around matching search string?

kchinnam · February 12, 2016, 3:25pm

I want to extract the XML block surrounding a search string.
Example: print the "<application> .. </application>" block that contains the string "myapp1-ear".
Input XML:

<?xml version="1.0" encoding="UTF-8"?>
<deployment-request>
  <requestor>
    <first-name>kchinnam</first-name>
    <last-name>Group</last-name>
    <email-address>kchinnam@some.com</email-address>
  </requestor>
  <notify-list>
    <email-address>kchinnam@some.com</email-address>
  </notify-list>
  <application>
    <application-name>myapp1-ear</application-name>
    <ear-file-name>myapp1-ear.ear</ear-file-name>
    <edition/>
    <shared-library-name/>
  </application>
  <application>
    <application-name>myapp2-ear</application-name>
    <ear-file-name>myapp2-ear.ear</ear-file-name>
    <edition/>
    <shared-library-name/>
    <CookieSettings>
      <path>/</path>
    </CookieSettings>
    <options/>
  </application>
</deployment-request>

Expected Output XML:

  <application>
    <application-name>myapp1-ear</application-name>
    <ear-file-name>myapp1-ear.ear</ear-file-name>
    <edition/>
    <shared-library-name/>
  </application>

Can I do something like
strear=myapp1-ear; sed -n '/$strear/ /<application>/, /<\/application>/' <xmlfile.xml>

---------- Post updated at 03:25 PM ---------- Previous update was at 02:05 PM ----------

I tried Perl-compatible regular expressions with pcregrep, but it is not working.

 
pcregrep -M '\{(<application>.*myapp1-ear.*<\/application>)\}' xmlfile.xml

Don_Cragun · February 12, 2016, 4:09pm

You haven't said what operating system or shell you're using, but for things like this I usually use awk. This seems to do what you want:

#!/bin/ksh
strear='myapp1-ear'

awk -v app_name="$strear" '
/<application>/	{
	cnt = copy = 0
}
$0 ~ "<application-name>" app_name "</application-name>" {
	copy = 1
}
{	line[++cnt] = $0
}
/<\/application>/ {
	if(copy) {
		copy = 0
		for(i = 1; i <= cnt; i++)
			print line[i]
	}
}' xmlfile.xml

If you're running this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

kchinnam · February 12, 2016, 4:45pm

Don,
Your solution is working. Thanks a lot for your effort.
Here are my bash and OS versions, so I've got better ammo here :-).

GNU bash, version 3.2.51(1)-release (x86_64-suse-linux-gnu)

I am wondering whether the solution can be simplified if we break the text into groups using the delimiter:

<application>..</application>

Then simply select the group that contains $strear.
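One way to sketch that grouping idea is to let awk treat </application> as the record separator, so that each record is one candidate block. This is only a sketch: a multi-character RS goes beyond what POSIX awk guarantees (gawk, mawk, and busybox awk accept it), and the file sample.xml with its abbreviated contents is an assumption made for the example.

```shell
#!/bin/sh
# Sketch only: needs an awk that accepts a multi-character RS (gawk, mawk,
# busybox awk); POSIX leaves that behavior unspecified.
strear='myapp1-ear'
tmp=$(mktemp -d)

# Abbreviated, hypothetical copy of the sample input.
cat > "$tmp/sample.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<deployment-request>
  <notify-list>
    <email-address>kchinnam@some.com</email-address>
  </notify-list>
  <application>
    <application-name>myapp1-ear</application-name>
    <ear-file-name>myapp1-ear.ear</ear-file-name>
  </application>
  <application>
    <application-name>myapp2-ear</application-name>
    <ear-file-name>myapp2-ear.ear</ear-file-name>
  </application>
</deployment-request>
EOF

result=$(awk -v name="$strear" -v RS='</application>' '
{
    start = index($0, "<application>")
    if (start == 0) next                  # record holds no opening tag
    # Back up over the indentation in front of the opening tag.
    while (start > 1 && substr($0, start - 1, 1) ~ /[ \t]/) start--
    block = substr($0, start)
    if (index(block, "<application-name>" name "</application-name>"))
        printf "%s%s\n", block, RS        # RS was consumed, so add it back
}' "$tmp/sample.xml")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because index() compares literal text, characters in $strear that are special in regular expressions need no escaping here.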

Don_Cragun · February 12, 2016, 9:32pm

If the awk script I suggested was too complicated for you, you could try this simple ed script:

#!/bin/ksh
strear='myapp1-ear'

ed -s xmlfile.xml <<EOF
g/<application-name>$strear<\/application-name>/?<application>?,/<\/application>/p
EOF

This will work with any shell that recognizes basic Bourne shell syntax (so you can use bash instead of ksh if you want to).

If I knew more details about your XML file tags, the BREs in the above script could probably be significantly simplified. With the limited information provided, these verbose BREs should accurately perform the requested operation as long as the opening <application> and closing </application> tags are on lines by themselves as shown in your sample data.

If you still don't like this, feel free to use your better ammo.

Aia · February 12, 2016, 9:57pm

Please try this Perl version; the search string (myapp1-ear in the first command below) is the parameter you might want to change. kchinnam.xml is the modified file I tested against.

cat kchinnam.xml

<application>
        <application-name>myapp1-ear</application-name>
        <ear-file-name>myapp1-ear.ear</ear-file-name>
        <edition></edition>
        <shared-library-name></shared-library-name>
</application>
<nothinghere>
    <test>test-in-case-of-other-blocks-inserted</test>
</nothinghere>
<application>
        <application-name>myapp2-ear</application-name>
        <ear-file-name>myapp2-ear.ear</ear-file-name>
        <edition></edition>
        <shared-library-name></shared-library-name>
                <CookieSettings>
                           <path>/</path>
                  </CookieSettings>
        </options>
</application>

perl -ne 'BEGIN{$/="</application>\n";} print m|(<application>.*myapp1-ear.*$/)|ms' kchinnam.xml

<application>
        <application-name>myapp1-ear</application-name>
        <ear-file-name>myapp1-ear.ear</ear-file-name>
        <edition></edition>
        <shared-library-name></shared-library-name>
</application>

perl -ne 'BEGIN{$/="</application>\n";} print m|(<application>.*myapp2-ear.*$/)|ms' kchinnam.xml

<application>
        <application-name>myapp2-ear</application-name>
        <ear-file-name>myapp2-ear.ear</ear-file-name>
        <edition></edition>
        <shared-library-name></shared-library-name>
                <CookieSettings>
                           <path>/</path>
                  </CookieSettings>
        </options>
</application>

kchinnam · February 12, 2016, 10:40pm

Don, the ed solution worked great. I have never used ed, so I need to understand how it works. The syntax looks very close to sed. I wish I could use a single line like sed for this.

Aia, your solution worked once I removed the leading spaces before the <application> tag. I tried the command below to allow for spaces, but it is not working:

perl -ne 'BEGIN{$/="\s.*</application>\n";} print m|(\s.*<application>.*myapp1-ear.*/)|ms' xmlfile.xml

Can we tell it to ignore spaces before and after the </application> tag?

Aia · February 12, 2016, 10:50pm

Is your posted data not a true representation of the real file?

kchinnam · February 12, 2016, 11:41pm

Aia, sorry for the confusion. I have corrected the input in my initial post.
Ideally I want to select

all lines of the XML up to </notify-list>, plus the <application> XML block matching the search string, say myapp1-ear.

looney · February 13, 2016, 1:14am

Hi,
Also try

awk '/<application-name>myapp1-ear/ {print "\t<application>";c=1; print;next} c {print} /<\/application>/{ c=0}' xmlfile.xml

Aia · February 13, 2016, 1:25am

I see you have corrected the input XML in post #1.
This is how your XML looks with the tabs made visible:

cat -T kchinnam.xml

<?xml version="1.0" encoding="UTF-8"?>
<deployment-request>
        <requestor>
                <first-name>kchinnam</first-name>
                <last-name>Group</last-name>
                <email-address>kchinnam@some.com</email-address>
        </requestor>
        <notify-list>
                <email-address>kchinnam@some.com</email-address>
        </notify-list>
^I^I<application>
^I^I^I^I<application-name>myapp1-ear</application-name>
^I^I^I^I<ear-file-name>myapp1-ear.ear</ear-file-name>
^I^I^I^I<edition></edition>
^I^I^I^I<shared-library-name></shared-library-name>
^I^I</application>
^I^I<application>
^I^I^I^I<application-name>myapp2-ear</application-name>
^I^I^I^I<ear-file-name>myapp2-ear.ear</ear-file-name>
^I^I^I^I<edition></edition>
^I^I^I^I<shared-library-name></shared-library-name>
^I^I^I^I^I^I<CookieSettings>
^I^I^I^I^I^I^I^I   <path>/</path>
^I^I^I^I^I^I  </CookieSettings>
^I^I^I^I</options>
^I^I</application>

Each ^I is a tab. The indentation is a mix of ordinary spaces and tabs.
Your latest request "appears" to be to select from XML each line until </notify-list> and <application> and stop. That does not make much sense, since it would yield:

<?xml version="1.0" encoding="UTF-8"?>
<deployment-request>
        <requestor>
                <first-name>kchinnam</first-name>
                <last-name>Group</last-name>
                <email-address>kchinnam@some.com</email-address>
        </requestor>
        <notify-list>
                <email-address>kchinnam@some.com</email-address>
        </notify-list>
^I^I<application>

I am going to guess you want this:

perl -ne 'BEGIN{$/="</application>\n"} @block = m|(<application>.*myapp1-ear.*)^\s+?($/)|ms; if(@block){$block[0] =~ s/^\s+/\t/gms; print @block}' kchinnam.xml

<application>
        <application-name>myapp1-ear</application-name>
        <ear-file-name>myapp1-ear.ear</ear-file-name>
        <edition></edition>
        <shared-library-name></shared-library-name>
</application>

or maybe this:

perl -ne 'BEGIN{$/="</application>\n"} @block = m|(<application>.*myapp1-ear.*)^\s+?($/)|ms; if(@block){$block[0] =~ s/^\s+/" "x4/egms; print @block}' kchinnam.xml

<application>
    <application-name>myapp1-ear</application-name>
    <ear-file-name>myapp1-ear.ear</ear-file-name>
    <edition></edition>
    <shared-library-name></shared-library-name>
</application>

Don_Cragun · February 13, 2016, 2:42am

Hi,
sed was based on ed; ed came first. ed can do forwards and backwards searches; sed can't do backwards searches. The syntax for the ed g command is:

g/BRE/command

It tells ed to identify every line in the file that matches the basic regular expression BRE and for each line found, execute command on that line. And command in this case is:

?BRE1?,/BRE2/p

where p is the print command, which takes zero, one, or two addresses to specify a range of lines to be printed. (No addresses prints the current line; one address prints the addressed line; and two addresses, separated by a comma, print the lines from the 1st address up to and including the 2nd address.) The address specified by ?BRE1? searches backwards from the current line for a line matching the basic regular expression BRE1 and (as with sed) /BRE2/ searches forwards from the current line for a line matching the basic regular expression BRE2.
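For anyone who would still rather use sed, here is a rough equivalent, offered as a sketch rather than a drop-in replacement for the ed script. Since sed cannot search backwards, it buffers each block in the hold space and only decides whether to print once the closing tag arrives. The file sample.xml and its abbreviated contents are assumptions made for the example, and $strear is spliced into a regular expression, so it must not contain characters that are special in a BRE.

```shell
#!/bin/sh
strear='myapp1-ear'
tmp=$(mktemp -d)

# Abbreviated, hypothetical copy of the sample input.
cat > "$tmp/sample.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<deployment-request>
  <notify-list>
    <email-address>kchinnam@some.com</email-address>
  </notify-list>
  <application>
    <application-name>myapp1-ear</application-name>
    <ear-file-name>myapp1-ear.ear</ear-file-name>
  </application>
  <application>
    <application-name>myapp2-ear</application-name>
    <ear-file-name>myapp2-ear.ear</ear-file-name>
  </application>
</deployment-request>
EOF

# Inside each <application> .. </application> range:
#   h   starts a fresh hold-space buffer on the opening tag
#   H   appends every following line of the block to that buffer
#   g   loads the whole buffered block when the closing tag is reached,
#       and p prints it only if it names the wanted application
result=$(sed -n '
/<application>/,/<\/application>/{
    /<application>/{
        h
        d
    }
    H
    /<\/application>/{
        g
        /<application-name>'"$strear"'<\/application-name>/p
    }
}' "$tmp/sample.xml")

printf '%s\n' "$result"
rm -rf "$tmp"
```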

With your new sample input, the ed script I suggested should still print the lines you want. And, if you like to write less portable 1-liners instead of code that will work with any POSIX-conforming shell, you can translate this to:

strear='myapp1-ear';ed -s xmlfile.xml <<< "g/<application-name>$strear<\/application-name>/?<application>?,/<\/application>/p"

The above works with 1993 or later versions of ksh and with bash, but is a syntax error in many other POSIX-compliant shells.

And, if you want to strip two <tab> characters from the front of each of those lines, you could use:

#!/bin/ksh
strear='myapp1-ear'

ed -s xmlfile.xml <<EOF
g/<application-name>$strear<\/application-name>/?<application>?,/<\/application>/s/^..//
?<application>?,.p
EOF

You could turn that into a 2-liner, but I much prefer readable and maintainable code to the minimal line approach.

kchinnam · February 13, 2016, 10:41pm

Don, I want to keep \t tab characters.
I want my output to have initial generic XML block + search node + last closing element.

<?xml version="1.0" encoding="UTF-8"?>
<deployment-request>
  <requestor>
    <first-name>kchinnam</first-name>
    <last-name>Group</last-name>
    <email-address>kchinnam@some.com</email-address>
  </requestor>
  <notify-list>
    <email-address>kchinnam@some.com</email-address>
  </notify-list>
  <application>
    <application-name>myapp1-ear</application-name>
    <ear-file-name>myapp1-ear.ear</ear-file-name>
    <edition/>
    <shared-library-name/>
  </application>
</deployment-request>

so I started doing something like this, but it is not working.
# This is to get initial generic XML block of text.
ed -s xmlfile.xml <<EOF
g/<notify-list>/?<deployment-request>?,<\/deployment-request>/p
q
EOF

# once above one works, I would like to append that with matched block of XML.
ed -s xmlfile.xml <<EOF
g/<notify-list>/?<deployment-request>?,<\/deployment-request>/p
g/<application-name>$strear<\/application-name>/?<application>?,/<\/application>/p
q
EOF

Don_Cragun · February 14, 2016, 1:50am

You're making it much more difficult than it needs to be. The ed commands needed to print the header and the trailer are identical to the sed commands you need to do the same thing. And, there is no need for three invocations of ed to get the output you want. Try:

#!/bin/ksh
strear='myapp1-ear'

ed -s xmlfile.xml <<EOF
1,/<\/notify-list>/p
g/<application-name>$strear<\/application-name>/?<application>?,/<\/application>/p
$ p
q
EOF

Note that the <space> before the p on the next to the last line in the ed script is not an accident and must not be removed. (If you don't understand why, ask.)
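A likely explanation for that space, added here as an editorial sketch: the here-document delimiter EOF is unquoted, so the shell performs parameter expansion on the ed script before ed ever reads it. Written as $p, the last command would be replaced by the value of a shell variable named p, whereas a $ followed by a space is left alone. The example below uses cat in place of ed so the expanded text can be inspected; the variable p and its value are hypothetical.

```shell
#!/bin/sh
p='oops'       # hypothetical shell variable that happens to be named p

# cat stands in for ed so the expanded here-document can be inspected.
expanded=$(cat <<EOF
$p
$ p
\$p
EOF
)

printf '%s\n' "$expanded"
# Line 1: the variable was expanded, so ed would never see a $ address.
# Line 2: the dollar sign followed by a space survives unchanged.
# Line 3: a backslash-escaped dollar sign also survives, as $p.
```

Quoting the delimiter (<<'EOF') would also suppress the expansion, but then $strear would no longer be expanded inside the script either.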

But, the output you say you want in post #12 does not match the spacing in your latest update to your input now shown in post #1. The code above preserves the blanks (spaces in the first few lines and tabs in the last few lines) found in your input file in the output it produces:

<?xml version="1.0" encoding="UTF-8"?>
<deployment-request>
        <requestor>
                <first-name>kchinnam</first-name>
                <last-name>Group</last-name>
                <email-address>kchinnam@some.com</email-address>
        </requestor>
        <notify-list>
                <email-address>kchinnam@some.com</email-address>
        </notify-list>
		<application>
				<application-name>myapp1-ear</application-name>
				<ear-file-name>myapp1-ear.ear</ear-file-name>
				<edition></edition>
				<shared-library-name></shared-library-name>
		</application>
</deployment-request>
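If ed is not available, roughly the same output can be sketched with awk. This is a swapped-in alternative, not the ed script above. It assumes the header ends at </notify-list> and the trailer is the last line of the file, and sample.xml is an abbreviated stand-in for the real input.

```shell
#!/bin/sh
strear='myapp1-ear'
tmp=$(mktemp -d)

# Abbreviated, hypothetical copy of the sample input.
cat > "$tmp/sample.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<deployment-request>
  <notify-list>
    <email-address>kchinnam@some.com</email-address>
  </notify-list>
  <application>
    <application-name>myapp1-ear</application-name>
    <ear-file-name>myapp1-ear.ear</ear-file-name>
  </application>
  <application>
    <application-name>myapp2-ear</application-name>
    <ear-file-name>myapp2-ear.ear</ear-file-name>
  </application>
</deployment-request>
EOF

result=$(awk -v name="$strear" '
{ last = $0 }                              # remember the most recent line
!header_done {                             # header: through </notify-list>
    print
    if (index($0, "</notify-list>")) header_done = 1
    next
}
/<application>/ { cnt = 0; keep = 0 }      # new block: reset the buffer
{ line[++cnt] = $0 }                       # buffer the current line
index($0, "<application-name>" name "</application-name>") { keep = 1 }
/<\/application>/ {                        # end of block: print it if wanted
    if (keep) for (i = 1; i <= cnt; i++) print line[i]
    keep = 0
}
END { print last }                         # trailer: last line of the file
' "$tmp/sample.xml")

printf '%s\n' "$result"
rm -rf "$tmp"
```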

kchinnam · February 14, 2016, 4:23pm

Don, I need to assign the XML output to a variable. Assigning a function's output to a variable causes the output to lose its newlines:

#!/bin/bash
_getXMLblock()
{
strear='myapp1-ear'

ed -s xmlfile.xml <<EOF
1,/<\/notify-list>/p
g/<application-name>$strear<\/application-name>/?<application>?,/<\/application>/p
$ p
q
EOF
}

strXML=$(_getXMLblock)
echo $strXML

# how can I retain line breaks in this method?

<?xml version="1.0" encoding="UTF-8"?> <deployment-request> <requestor> <first-name>kchinnam</first-name> <last-name>Group</last-name> <email-address>kchinnam@some.com</email-address> </requestor> <notify-list> <email-address>kchinnam@some.com</email-address> </notify-list> <application> <application-name>myapp1-ear</application-name> <ear-file-name>myapp1-ear.ear</ear-file-name> <edition/> <shared-library-name/> </application> </deployment-request>

This method keeps line breaks. But how can I use multiple statements?

#!/bin/bash
strear='myapp1-ear';

foundXML=$(ed -s xmlfile.xml <<< "1,/<\/notify-list>/p; g/<application-name>$strear<\/application-name>/?<application>?,/<\/application>/p; $ p")

echo $foundXML

output

?
foundXML[]

---------- Post updated at 04:23 PM ---------- Previous update was at 04:01 PM ----------

I could not edit my previous post for some reason.. can someone fix that?
I am able to preserve new lines with this:

echo "$strXML"

Don_Cragun · February 14, 2016, 4:28pm

kchinnam:

Don, I need to assign the XML output to a variable. Assigning a function's output to a variable causes the output to lose its newlines:
#!/bin/bash
_getXMLblock()
{
strear='myapp1-ear'

ed -s xmlfile.xml <<EOF
1,/<\/notify-list>/p
g/<application-name>$strear<\/application-name>/?<application>?,/<\/application>/p
$ p
q
EOF
}

strXML=$(_getXMLblock)
echo $strXML
# how can I retain line breaks in this method?

The function isn't throwing away line breaks. The shell is throwing away the line breaks in your echo statement because you didn't quote the expansion of the variable that was assigned the output from the function. Change:

echo $strXML

to:

echo "$strXML"

or, more safely:

printf '%s\n' "$strXML"
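The difference can be seen in a small self-contained sketch. The three-line value assigned to strXML here is a made-up stand-in for the function's output.

```shell
#!/bin/sh
# Hypothetical three-line value standing in for the output of the function.
strXML=$(printf '<application>\n  <edition/>\n</application>\n')

# Unquoted: the shell splits the value into words, and echo joins the
# words with single spaces, so newlines and indentation disappear.
unquoted=$(echo $strXML)

# Quoted: the value is passed as one argument and printed unchanged.
quoted=$(printf '%s\n' "$strXML")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```

The variable held the line breaks all along; only the unquoted expansion discards them.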

<?xml version="1.0" encoding="UTF-8"?> <deployment-request> <requestor> <first-name>kchinnam</first-name> <last-name>Group</last-name> <email-address>kchinnam@some.com</email-address> </requestor> <notify-list> <email-address>kchinnam@some.com</email-address> </notify-list> <application> <application-name>myapp1-ear</application-name> <ear-file-name>myapp1-ear.ear</ear-file-name> <edition/> <shared-library-name/> </application> </deployment-request>

This method keeps line breaks. But how can I use multiple statements?

#!/bin/bash
strear='myapp1-ear';

foundXML=$(ed -s xmlfile.xml <<< "1,/<\/notify-list>/p; g/<application-name>$strear<\/application-name>/?<application>?,/<\/application>/p; $ p")

echo $foundXML

output

?
foundXML[]

No, this will not preserve line breaks any more than the earlier attempt. You still have an unquoted expansion of a variable containing line breaks.

And, as you have found, you can't combine ed statements on a single line using a semicolon as a command separator.

But:

#!/bin/bash
strear='myapp1-ear';

foundXML=$(ed -s xmlfile.xml <<< "1,/<\/notify-list>/p
g/<application-name>$strear<\/application-name>/?<application>?,/<\/application>/p
$ p")

printf '%s\n' "$foundXML"

should work. The first form is more portable and will work on any shell that performs POSIX-standard command substitutions and variable expansions.

The last form only works with recent versions of bash, ksh, and a few other shells supporting the <<< string redirection operator extension to the standards.

kchinnam · February 14, 2016, 4:30pm

Don, you are right. The solution below is working:

foundXML=$(ed -s xmlfile.xml <<< "1,/<\/notify-list>/p
g/<application-name>$strear<\/application-name>/?<application>?,/<\/application>/p
$ p")

I will go with the function. Searching Google for ed-related information turns up very little.
Looks like you are the last remaining expert. I love what it can do.

Don_Cragun · February 14, 2016, 4:33pm