XML text enclosed in angle brackets

unme · January 4, 2015, 5:55am

Could you please give your input on the issue below:

source.xml

<?xml version="1.0" encoding="UTF-16"?>
<P1 >
<C1 type="i"><2></C1>
<V1 type="string"><6.2></V1>
<D1 type="string">
	<D2><1.0></D2>
	<D2><2.0></D2>
</D1>
......................
......................
many more records.....
</P1>

The problem with the XML above is that the text is enclosed between < and >, so I am unable to read the XML. Could you please guide me on how to remove the < and > around the text?

Scrutinizer · January 4, 2015, 6:24am

What output are you looking for?

derekludwig · January 4, 2015, 7:00am

The issue will be determining what is a valid XML tag and what is data that appears between "<" and ">". Is it always numeric? Are there negative numbers? Character strings? With or without spaces?
But making a guess, should the results be:

<?xml version="1.0" encoding="UTF-16"?>
<P1 >
<C1 type="i">2</C1>
<V1 type="string">6.2</V1>
<D1 type="string">
    <D2>1.0</D2>
    <D2>2.0</D2>
</D1>
......................
......................
many more records.....
</P1>

This was done with:

perl -pe 's{<(\d+(?:\.\d+)?)>}{$1}g;'

or with:

sed -e 's/<\([1-9][0-9]*\)>/\1/g' -e 's/<\([1-9][0-9]*\.[0-9]*\)>/\1/g'
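For anyone who would rather script this, here is a minimal Python sketch of the same numeric-only idea. It is a port of the Perl pattern above, not something tested against the real file; the function name strip_numeric_brackets is made up for illustration, and it assumes the text has already been decoded into a Python string.

```python
import re

# An integer or decimal number wrapped in angle brackets, e.g. <2> or <6.2>.
BRACKETED_NUMBER = re.compile(r"<(\d+(?:\.\d+)?)>")


def strip_numeric_brackets(text: str) -> str:
    """Drop the angle brackets around purely numeric values (hypothetical helper)."""
    return BRACKETED_NUMBER.sub(r"\1", text)


if __name__ == "__main__":
    sample = '<C1 type="i"><2></C1>\n<V1 type="string"><6.2></V1>\n'
    print(strip_numeric_brackets(sample), end="")
```

Real tags such as <C1 type="i"> are left alone because they do not consist solely of digits, which is also why this approach stops working once the bracketed text contains letters or spaces.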

unme · January 4, 2015, 7:13am

Thanks a lot for your input. The text will also contain characters and spaces.
Input:

<?xml version="1.0" encoding="UTF-16"?>
<P1 >
<C1 type="i"><abc txt></C1>
<V1 type="string"><6.2 txt></V1>
<D1 type="string">
    <D2>1.0</D2>
    <D2>2.0</D2>
</D1>
......................
......................
many more records.....
</P1>

desired output:

<?xml version="1.0" encoding="UTF-16"?>
<P1 >
<C1 type="i">abc txt</C1>
<V1 type="string">6.2 txt</V1>
<D1 type="string">
    <D2>1.0</D2>
    <D2>2.0</D2>
</D1>
......................
......................
many more records.....
</P1>

Scrutinizer · January 4, 2015, 7:45am

With this particular format you could try:

sed 's/<\([^>]*\)>\(<[^>]*>\)$/\1\2/' file
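A rough Python rendering of that sed command, in case it helps to see what it does line by line. This is only a sketch: the names unwrap_line and unwrap_text are invented here, and it assumes, like the sed version, that each line ends with exactly one bracketed value followed directly by a closing tag.

```python
import re

# Two adjacent <...> groups at the very end of a line: the first is treated
# as bracketed text, the second as the closing tag that follows it.
TRAILING_PAIR = re.compile(r"<([^>]*)>(<[^>]*>)$")


def unwrap_line(line: str) -> str:
    """Remove the brackets from the next-to-last <...> group on one line."""
    return TRAILING_PAIR.sub(r"\1\2", line, count=1)


def unwrap_text(text: str) -> str:
    """Apply unwrap_line to every line, as sed would."""
    return "\n".join(unwrap_line(line) for line in text.split("\n"))


if __name__ == "__main__":
    print(unwrap_text('<C1 type="i"><abc txt></C1>\n    <D2>1.0</D2>'))
```

Lines that already look like <D2>1.0</D2> are untouched because the closing tag is not preceded by another <...> group.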

derekludwig · January 4, 2015, 10:19am

With respect to Scrutinizer, if the XML tags are nested on the same line, are empty, have multiples on a line, or span multiple lines, as in:

<?xml version="1.0" encoding="UTF-16"?>
<P1 >
<C1 type="i"><Z1 ><abc txt></Z1></C1>
<V1 type="string"><6.2 txt></V1>
<D1 type="string">
    <D2><1.0></D2>
    <D2><2.0></D2>
    <Y2><one 1.0></Y2><Y2><two 2.0></Y2><Y2><three 3.0></Y2><Y2><four 4.0></Y2>
    <W3 alpha="beta"></W3>
    <X4>
    <  foo 42 bar  >
    </X4>
</D1>
......................
......................
many more records.....
</P1>

Then the sed one-liner won't work:

<?xml version="1.0" encoding="UTF-16"?>
<P1 >
<C1 type="i"><Z1 ><abc txt>/Z1</C1>
<V1 type="string">6.2 txt</V1>
<D1 type="string">
    <D2>1.0</D2>
    <D2>2.0</D2>
    <Y2><one 1.0></Y2><Y2><two 2.0></Y2><Y2><three 3.0></Y2><Y2>four 4.0</Y2>
    W3 alpha="beta"</W3>
    <X4>
    <  foo 42 bar  >
    </X4>
</D1>
......................
......................
many more records.....
</P1>

A partial perlish solution:

perl -0777 -pe 'print; print "------\n"; s{<([^/>\s]+)([^>]*)>(?:\s*<([^>]*)\s*>\s*)?</\1\s*>}{<$1$2>$3</$1>}gms;'

which generates:

<?xml version="1.0" encoding="UTF-16"?>
<P1 >
<C1 type="i"><Z1 >abc txt</Z1></C1>
<V1 type="string">6.2 txt</V1>
<D1 type="string">
    <D2>1.0</D2>
    <D2>2.0</D2>
    <Y2>one 1.0</Y2><Y2>two 2.0</Y2><Y2>three 3.0</Y2><Y2>four 4.0</Y2>
    <W3 alpha="beta"></W3>
    <X4>  foo 42 bar  </X4>
</D1>
......................
......................
many more records.....
</P1>

Mind you, what works will depend entirely on your input data. If the sed one-liner works, use it.

unme · January 5, 2015, 2:22am

perl -0777 -pe 's{<([^/>\s]+)([^>]*)>(?:\s*<([^>]*)\s*>\s*)?</\1\s*>}{<$1$2>$3</$1>}gms;'

The above code works perfectly. Thanks a lot to all of you for your input.

derekludwig · January 5, 2015, 4:24am

Just noticed a small error in the regex:

perl -0777 -pe 'print; print "------\n"; s{<([^/>\s]+)([^>]*)>(?:\s*<([^>]*)\s*>\s*)?</\1\s*>}{<$1$2>$3</$1>}gms;'

The \s* just before the > that closes the bracketed text is not needed, as any whitespace would already have been consumed by ([^>]*). Also, the two print statements should not have been included. The corrected code is:

perl -0777 -pe 's{<([^/>\s]+)([^>]*)>(?:\s*<([^>]*)>\s*)?</\1\s*>}{<$1$2>$3</$1>}gms;'

Apologies for the inconvenience.
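For reference, the corrected pattern carries over to Python almost unchanged. The sketch below is a port under assumptions rather than the command used in this thread: unwrap_bracketed_text is a hypothetical name, the whole document is assumed to be in memory as one decoded string (the counterpart of -0777), and the Perl /m and /s flags are dropped because the pattern contains no ^, $ or dot.

```python
import re

# <tag attrs>, optionally followed by one <...> group holding the text
# (with surrounding whitespace, possibly newlines), then the matching </tag>.
WRAPPED = re.compile(r"<([^/>\s]+)([^>]*)>(?:\s*<([^>]*)>\s*)?</\1\s*>")


def _rebuild(match: re.Match) -> str:
    tag, attrs, inner = match.group(1), match.group(2), match.group(3)
    # inner is None for an empty element such as <W3 alpha="beta"></W3>.
    return f"<{tag}{attrs}>{inner or ''}</{tag}>"


def unwrap_bracketed_text(xml_text: str) -> str:
    """Strip the angle brackets around text sitting directly inside an element."""
    return WRAPPED.sub(_rebuild, xml_text)


if __name__ == "__main__":
    sample = '<C1 type="i"><Z1 ><abc txt></Z1></C1>\n<W3 alpha="beta"></W3>\n'
    print(unwrap_bracketed_text(sample), end="")
```

As with the Perl version, this only handles one bracketed value directly inside an element, so whether it is enough still depends on the input data.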