I'm parsing around 600K XML files, each with roughly 1,500 lines of text, some of which include Chinese, Russian, and other languages, with a bash script that uses xmlstarlet.
I have installed all locales. Is there a way to bulk-convert all the files to UTF-8 or something, or something I can install, or am I going about it the wrong way?
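For what it's worth, iconv can do the bulk conversion. A minimal sketch, assuming the files actually are ISO-8859-1 (swap -f for whatever encoding your files really use; the .utf8 suffix is just a scratch name):

```shell
# Convert every .xml file under the current directory to UTF-8 in place.
# Assumption: the source encoding is ISO-8859-1 -- change -f to match your data.
find . -name '*.xml' -print0 |
while IFS= read -r -d '' f
do
    iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
done
```

The `&& mv` means a file is only replaced if iconv succeeded, so a conversion failure leaves the original untouched.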
I tried xml2, but it only converts XML to a flat-file format, and I don't know what else to use for XML parsing from bash. I've written a couple of basic parsers for similar tasks, but I've found they have bad error handling. I think it would work if I could get xmlstarlet to read these files in an extended-ASCII encoding, but I don't know how to do that.
The strings command didn't seem to help either.
It seems I can't get libxml2 to handle extended ASCII without much pain. I'm wondering if there's a way to convert the encoding when I read each file in from a list, which I do like this:
cat "${@:-somelist.txt}" |
while IFS= read -r i
do
strings "$i" | xmlstarlet sel -t -m "//sec1/sec2/sec3/sec4/sec5" -v "@VALUE" -n > somefile
value1=$(sed -n "1p" somefile)
...
I also know my looping probably isn't the most elegant, but it works, well, except for the encoding. Is there some command I can use to convert the text before it gets read by xmlstarlet, or something?
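If the files are in a known single-byte encoding, iconv can convert the stream right before xmlstarlet reads it. A sketch assuming the input is ISO-8859-1 (the XPath and file names are the placeholders from the loop above):

```shell
i="somefile.xml"                     # placeholder: one file from the list
# Convert to UTF-8 on the fly so xmlstarlet only ever sees valid UTF-8.
# ISO-8859-1 is an assumption -- substitute your files' real encoding.
iconv -f ISO-8859-1 -t UTF-8 "$i" |
xmlstarlet sel -t -m "//sec1/sec2/sec3/sec4/sec5" -v "@VALUE" -n > somefile
```

Inside your existing loop you would just put the iconv stage in front of the xmlstarlet call.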
BTW, I'm using Debian Squeeze, which ships xmlstarlet 1.0.2-1.
Is there an encoding declaration at the top of your XML files? If so, what is it?
If no encoding declaration is present in the XML document, the assumed encoding of an XML document depends on the presence of a Byte-Order-Mark (BOM). A BOM is a Unicode special marker placed at the top of the file to indicate its encoding. A BOM is optional for UTF-8.
First bytes    Encoding assumed
EF BB BF       UTF-8
FE FF          UTF-16 (big-endian)
FF FE          UTF-16 (little-endian)
00 00 FE FF    UTF-32 (big-endian)
FF FE 00 00    UTF-32 (little-endian)
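You can check for a BOM yourself by dumping the first few bytes in hex, e.g. with od (somefile.xml here is a placeholder for one of your files):

```shell
# Show the first 4 bytes in hex; "ef bb bf" at the front means a UTF-8 BOM.
head -c 4 somefile.xml | od -An -tx1
```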
Am I correct in assuming that UTF-8 won't work for extended-ASCII characters like Cyrillic, Chinese, etc.? It seems that even though the XML encoding declaration says UTF-8, the file still has extended-ASCII characters in it. I converted to UTF-8 using iconv (Chubler_XL), but I still get parse errors, example
UTF-8 can be used to represent Chinese and Cyrillic characters; it can encode any Unicode character.
I suspect that you have what is known as a mixed encoding XML document. These are usually problematic to parse. Can you provide a pointer to an example of one of your XML documents?
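One quick way to find the problem files: run each one through iconv as UTF-8-to-UTF-8, which exits non-zero on any byte sequence that isn't valid UTF-8. A sketch, reading the same somelist.txt from your earlier post:

```shell
# List files whose bytes are not valid UTF-8 -- the likely mixed/legacy ones.
while IFS= read -r f
do
    iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1 || echo "not UTF-8: $f"
done < somelist.txt
```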
There are around 11K lines in each file. Really I'm only interested in one section of the XML doc; it errors out elsewhere, so if I could get xmlstarlet to ignore the rest, maybe start parsing only when it gets to this section:
That's where I'm looking for the values of desc2, desc3, and others. I guess I could write awk/sed to match this section only and then pipe it to xmlstarlet, but I'm not that good at awk yet.
Thanks Chubler_XL, I'm working on modifying that. There are several sections I need to descend into to get to the right data, so I'm trying section1/section2/section3/section4, but I haven't gotten it working yet; will post if I do. I was trying something like:
sed -n '/<SECTION1>/,/<\/SECTION1>/p' "$i" | sed -n '/<SECTION2>/,/<\/SECTION2>/p' | sed -n '/<SECTION3>/,/<\/SECTION3>/p'
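If the chained seds get unwieldy, a single awk pass can track all three levels at once. A sketch using the same placeholder tag names and a placeholder file, printing only the lines inside the innermost section:

```shell
# Track whether we are inside each (hypothetical) section tag and print only
# lines that fall inside all three; everything else in the file is skipped.
awk '
    /<SECTION1>/  {s1=1}  /<\/SECTION1>/ {s1=0}
    /<SECTION2>/  {s2=1}  /<\/SECTION2>/ {s2=0}
    /<SECTION3>/  {s3=1}
    s1 && s2 && s3 {print}
    /<\/SECTION3>/ {s3=0}
' somefile.xml
```

The extracted fragment can then be piped into xmlstarlet so parse errors elsewhere in the document never come into play. Note this line-based approach assumes the tags sit on their own lines, as range-matching with sed does too.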