xmlstarlet parse non en_US characters

unclecameron · November 30, 2010, 7:54pm

I'm parsing around 600K XML files, each with roughly 1500 lines of text, using a bash script. Some of the lines include Chinese, Russian, or other non-English text. The script uses

 cat $i | xmlstarlet sel -t -m "//section1/section2/section3/section4/section5" -v "@VALUE" -n > somefile

which works, but I get parse errors like

-:2350: parser error : invalid character in attribute value
     <NODE NAME="something" VALUE="^Tï¿<96>ï¾¤ï¿©ï¾¹ï¿°ï¿<8f>ï¿²/dPï¾�*ï¿<84>ï¿<88>mu" />
                                  ^
-:2350: parser error : attributes construct error
     <NODE NAME="something" VALUE="^Tï¿<96>ï¾¤ï¿©ï¾¹ï¿°ï¿<8f>ï¿²/dPï¾�*ï¿<84>ï¿<88>mu" />
                                  ^
-:2350: parser error : Couldn't find end of Start Tag NODE line 2350
     <NODE NAME="something" VALUE="^Tï¿<96>ï¾¤ï¿©ï¾¹ï¿°ï¿<8f>ï¿²/dPï¾�*ï¿<84>ï¿<88>mu" />
                                  ^
-:2350: parser error : PCDATA invalid Char value 20
     <NODE NAME="something" VALUE="^Tï¿<96>ï¾¤ï¿©ï¾¹ï¿°ï¿<8f>ï¿²/dPï¾�*ï¿<84>ï¿<88>mu" />

I have installed all locales. Is there a way to bulk change all the encoding to UTF-8 or something on all the files, or install something, or am I going about it the wrong way?

ctsgnb · November 30, 2010, 8:00pm

Don't know what you are trying to achieve.
maybe give a try to

strings $i |

instead of the

cat $i |

By the way, are you sure xmlstarlet is reliable and up to date?

fpmurphy · November 30, 2010, 8:40pm

xmlstarlet relies on libxml2, which uses UTF-8 internally. For more information, see LIBXML2 - Encodings support.

unclecameron · November 30, 2010, 8:59pm

I tried xml2, but it only converts XML to a flat file format. Other than that I don't know what else to use for XML parsing in bash. I've written a couple of basic parsers for similar tasks, but I've found their error handling is bad. I think if I could get xmlstarlet to read these files as extended ASCII it might work, but I don't know how to do that.

strings didn't seem to help either.

---------- Post updated at 05:59 PM ---------- Previous update was at 05:47 PM ----------

It seems I can't get libxml2 to handle extended ASCII without a lot of pain. I'm wondering if there's a way to convert each file as I read it in from a list, which I do with:

cat "${@:-somelist.txt}" |
while read i
do
    strings $i | xmlstarlet sel -t -m "//sec1/sec2/sec3/sec4/sec5" -v "@VALUE" -n > somefile
    value1=`sed -n "1p" somefile`
...

I know my looping probably isn't the most elegant, but it works, apart from the encoding. Is there some command that can convert the data before xmlstarlet reads it?

btw, I'm using Debian Squeeze, which uses xmlstarlet 1.0.2-1

Chubler_XL · November 30, 2010, 9:24pm

iconv might be helpful here; you can probably extract the document's charset from the XML declaration.

iconv -f ${SRC_CHARSET:-UTF-8} -t UTF-8 $i | xmlstarlet sel -t -m "//sec1/sec2/sec3/sec4/sec5" -v "@VALUE" -n  | iconv -f UTF-8 -t ${SRC_CHARSET:-UTF-8} > somefile
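
A minimal sketch of pulling the charset out of the XML declaration so it can be fed to iconv as SRC_CHARSET. It assumes the declaration sits on the first line of the file and uses the usual encoding="..." form; the function name xml_declared_encoding is made up for illustration, and only head and sed are used.

```shell
# Hypothetical helper: print the encoding named in the XML declaration on
# the first line of the file, or nothing if no encoding attribute is found.
xml_declared_encoding() {
    head -n 1 "$1" |
        sed -n "s/.*<?xml[^>]*encoding=[\"']\([^\"']*\)[\"'].*/\1/p"
}

# Possible use with the iconv pipeline above (not run here):
#   SRC_CHARSET=$(xml_declared_encoding "$i")
#   iconv -f "${SRC_CHARSET:-UTF-8}" -t UTF-8 "$i" | xmlstarlet sel ...
```

If the declaration already says utf-8, iconv has nothing to convert, but it will still stop with an error at the first byte sequence that is not valid UTF-8, which at least shows whether the file matches its own declaration.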

fpmurphy · December 1, 2010, 11:03am

Is there an encoding declaration at the top of your XML files? If so, what is it?

If no encoding declaration is present in the XML document, the assumed encoding of an XML document depends on the presence of a Byte-Order-Mark (BOM). A BOM is a Unicode special marker placed at the top of the file to indicate its encoding. A BOM is optional for UTF-8.

First bytes      Encoding assumed

EF BB BF         UTF-8
FE FF            UTF-16 (big-endian)
FF FE            UTF-16 (little-endian)
00 00 FE FF      UTF-32 (big-endian)
FF FE 00 00      UTF-32 (little-endian)
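
To check which of these applies to a given file, something along these lines should work. It is only a sketch: the function name bom_guess is made up, it assumes head -c and od are available, and it looks at nothing beyond the first four bytes.

```shell
# Hypothetical helper: report which BOM (if any) a file starts with.
bom_guess() {
    # First four bytes as lowercase hex pairs separated by single spaces.
    bytes=$(head -c 4 "$1" | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')
    case "$bytes" in
        # The UTF-32 patterns come first because FF FE 00 00 also starts with FF FE.
        "00 00 fe ff") echo "UTF-32 (big-endian)" ;;
        "ff fe 00 00") echo "UTF-32 (little-endian)" ;;
        "ef bb bf"*)   echo "UTF-8" ;;
        "fe ff"*)      echo "UTF-16 (big-endian)" ;;
        "ff fe"*)      echo "UTF-16 (little-endian)" ;;
        *)             echo "no BOM" ;;
    esac
}
```

Call it as bom_guess "$i" inside the loop; a file that starts directly with <?xml reports "no BOM".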

unclecameron · December 1, 2010, 11:53am

yes, <?xml version="1.0" encoding="utf-8"?>

Am I correct in assuming that utf-8 won't work for extended ASCII characters like Cyrillic, Chinese, etc.? It seems that even though the XML encoding declaration says utf-8, the file still has extended ASCII characters in it. I converted to UTF-8 using iconv as Chubler_XL suggested, but I still get parse errors, for example:

-:2854: parser error : PCDATA invalid Char value 1
     <NODE NAME="OEToolbarPos" VALUE="^A" />

This makes xmlstarlet stop parsing the rest of the file. Is there a way to make it ignore or handle errors?

fpmurphy · December 1, 2010, 2:20pm

UTF-8 can be used to represent Chinese and Cyrillic characters.

I suspect that you have what is known as a mixed encoding XML document. These are usually problematic to parse. Can you provide a pointer to an example of one of your XML documents?
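
Another reading of the errors quoted above: the "invalid Char value" numbers appear to be decimal, and 20 and 1 are the control characters shown as ^T and ^A in the offending lines. XML 1.0 allows only tab, line feed and carriage return below 0x20, so the parser rejects those bytes whatever encoding the file declares, and iconv passes them through because they are valid UTF-8. A minimal sketch, assuming it is acceptable to simply drop such bytes before parsing (the function name strip_xml_controls is made up):

```shell
# Hypothetical filter: remove control bytes that XML 1.0 does not allow.
strip_xml_controls() {
    # Deletes bytes 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F (octal ranges below).
    # Tab (011), LF (012) and CR (015) are kept. LC_ALL=C makes tr work on
    # single bytes, so multi-byte UTF-8 sequences pass through unchanged.
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Possible use in front of the parser (not run here):
#   strip_xml_controls < "$i" | xmlstarlet sel ...
```

This changes the attribute values that contained those bytes, so it only makes sense when the affected nodes are not the ones being extracted.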

unclecameron · December 2, 2010, 12:16pm

There are around 11K lines in each file. Really I'm only interested in one section of the XML doc, and the parser errors out elsewhere. If I could get xmlstarlet to ignore the rest, maybe by starting to parse only when it gets to this section:

<SECTION1>
<SECTION2 ID="1000103">
  <SECTION3>
   <SECTION4 NAME="desc1" TEXT="blah_blah">
    <SECTION5 NAME="desc2" VALUE="blah_blah" TEXT="blah_blah">
    <SECTION5 NAME="desc3" VALUE="blah_blah" TEXT="blah_blah" />
   </SECTION5>
  </SECTION4>
 </SECTION3>
 </SECTION2>
</SECTION1>

where I'm looking for the VALUE of desc2, desc3 and others in this section. I guess I could write awk/sed to match only this section and then pipe it to xmlstarlet, but I'm not that good at awk yet.

Chubler_XL · December 2, 2010, 3:54pm

To extract your section with sed:

sed -n '/<SECTION1>/,/<\/SECTION1>/p' mydoc.xml

unclecameron · December 2, 2010, 5:55pm

Thanks Chubler_XL, I'm working on modifying that. There are several sections I need to descend into to get to the right data, so I'm trying section1/section2/section3/section4, but I haven't gotten it working yet. I'll post if I do. I was trying something like:

sed -n '/<SECTION1>/,/<\/SECTION1>/p' | sed -n '/<SECTION2>/,/<\/SECTION2>/p' | sed -n '/<SECTION3>/,/<\/SECTION3>/p'

but it's not working, will dig into it.

Chubler_XL · December 2, 2010, 7:07pm

Put all the sections in one sed; otherwise the lines are already gone by the time your 2nd (and subsequent) seds get to them:

sed -n '/<SECTION1>/,/<\/SECTION1>/p;/<SECTION2>/,/<\/SECTION2>/p;/<SECTION3>/,/<\/SECTION3>/p'

unclecameron · December 2, 2010, 8:52pm

For some reason that outputs everything in the whole file... do I need some kind of nested loop?

Chubler_XL · December 2, 2010, 9:03pm

Did you forget the -n on sed?

Is the first line of the file a <SECTIONn> marker, and if not did that print?

unclecameron · December 2, 2010, 10:41pm

the top of the file looks like:

<?xml version="1.0" encoding="utf-8"?>
<SECTION1>
 <SOMENAME>
  <NODE NAME="CONFIRMED" VALUE="1" TYPE="DWORD" />
  ...

I'm using

cat file2.xml | sed -n '/<SECTION1>/,/<\/SECTION1>/p;/<SECTION2>/,/<\/SECTION2>/p;/<SECTION3>/,/<\/SECTION3>/p;/<SECTION4>/,/<\/SECTION4>/p;/<SECTION5>/,/<\/SECTION5>/p'

which seems to print the whole file for some reason

unclecameron · December 21, 2010, 1:36pm

okay, I figured it out, I chop the whole rest of the file out using:

sed -n '/<GROUP NAME="SYSTEMINFO"/,/<\/GROUP>/p'

and then run this data through xmlstarlet, and it works
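
For anyone following along, a small sketch of that last step wrapped in a function. The name extract_systeminfo is made up, and it assumes the GROUP opening and closing tags each sit on their own line. The combined sed earlier in the thread most likely printed everything because its first range, <SECTION1> to </SECTION1>, already wraps the whole document.

```shell
# Hypothetical wrapper around the sed range used above.
extract_systeminfo() {
    # Print only the lines from the opening SYSTEMINFO GROUP tag through the
    # first following line that contains </GROUP>.
    sed -n '/<GROUP NAME="SYSTEMINFO"/,/<\/GROUP>/p' "$@"
}

# Possible use (not run here):
#   extract_systeminfo "$i" | xmlstarlet sel ...
```

The range stops at the first </GROUP> it sees, so a GROUP nested inside SYSTEMINFO would cut the output short. The extracted fragment also has no XML declaration, so the parser treats it as UTF-8 by default.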