Hello,
Does anyone know of a way to convert an .xml file (ONIX) to something more workable, like a .csv (or even .xls) file? A command-line tool would be ideal, but is not absolutely necessary. I would be dealing with .xml files of 125 MB+.
You didn't supply sample input and desired output, so I couldn't attempt a relevant demonstration.
Possible utilities:
XML2(1) General Commands Manual XML2(1)
NAME
xml2 - convert xml documents in a flat format
2xml - convert flat format into xml
html2 - convert html documents in a flat format
2html - convert flat format into html
csv2 - convert csv files in a flat format
2csv - convert flat format into csv
xml2 convert xml documents in a flat format (man)
Path : /usr/bin/xml2
Version : - ( /usr/bin/xml2, 2012-04-16 )
Type : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Repo : Debian 8.7 (jessie)
Versions appear to be available via brew, fink, port for a system like:
OS, ker|rel, machine: Apple/BSD, Darwin 9.8.0, Power Macintosh
Distribution : Mac OS X 10.5.8 (leopard, workstation)
Show the input you have and show the output you want. "Generic" conversion isn't really possible given XML is a tree structure, not a flat structure, but your particular data file may have regular data representable as such.
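To illustrate the tree-versus-flat point, here is a minimal Python sketch (my own illustration, not one of the tools mentioned in this thread). It walks a small XML tree and prints one path=value line per text-bearing element. The SAMPLE snippet and the flatten helper are assumptions made up for the example; only the tag names come from ONIX short-tag usage.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature record, just enough to show repeated elements.
SAMPLE = """<product>
  <a001>9781328740472</a001>
  <title><b203>Peepers</b203></title>
  <price><j151>7.99</j151></price>
  <price><j151>10.99</j151></price>
</product>"""

def flatten(elem, prefix=""):
    """Yield (path, text) pairs for every element that carries text."""
    path = f"{prefix}/{elem.tag}"
    text = (elem.text or "").strip()
    if text:
        yield path, text
    for child in elem:
        yield from flatten(child, path)

flat = list(flatten(ET.fromstring(SAMPLE)))
for path, text in flat:
    print(f"{path}={text}")
```

Note that the same path can appear more than once (two j151 prices here), which is exactly why a one-row-per-record table needs a rule for which occurrence to keep.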
Thanks for the replies. I have copied .xml code for a single item below. I am trying to extract three items (field indices a001, b203, and j151), so the desired output would be:
9781328740472 Peepers 7.99
Thanks again!
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd" >
<ONIXmessage release="2.1">
<header><m174>Houghton Mifflin</m174><m175>Catherine Toolan 978-465-7755</m175><m283>eloquence@firebrandtech.com</m283><m182>20170201</m182><m183>Title information from Houghton Mifflin</m183><m184>eng</m184><m185>01</m185><m186>USD</m186><m187>in</m187><m193>General Trade</m193></header>
<product>
<a001>9781328740472</a001>
<a002>02</a002>
<a197>HMH</a197>
<productidentifier>
<b221>02</b221>
<b244>1328740471</b244>
</productidentifier>
<productidentifier>
<b221>03</b221>
<b244>9781328740472</b244>
</productidentifier>
<productidentifier>
<b221>15</b221>
<b244>9781328740472</b244>
</productidentifier>
<b246>11</b246>
<b012>BC</b012>
<b333>B102</b333>
<b014>Trade Paperback</b014>
<n338/>
<title>
<b202>01</b202>
<b203>Peepers</b203>
</title>
<workidentifier>
<b201>15</b201>
<b244>9780152602970</b244>
</workidentifier>
<contributor>
<b034>1</b034>
<b035>A01</b035>
<b036>Eve Bunting</b036>
<b037>Bunting, Eve</b037>
<b039>Eve</b039>
<b040>Bunting</b040>
<b044><![CDATA[<DIV><P>EVE BUNTING has written*over two hundred*books for children, including the Caldecott Medal-winning <I>Smoky Night,</I> illustrated by David Diaz, <I>The Wall</I>,<I> Fly Away Home</I>, and <I>Train to Somewhere</I>. She lives in Southern California.</P></DIV>]]></b044>
</contributor>
<contributor>
<b034>2</b034>
<b035>A12</b035>
<b036>James E. Ransome</b036>
<b037>Ransome, James E.</b037>
<b039>James E.</b039>
<b040>Ransome</b040>
<b044><![CDATA[<DIV><P>James Ransome has illustrated more than 35 books for children, including many award winners. He lives in Rhinebeck, New York, with his wife, children's book author*Lesa Cline Ransome, and their four children. Visit his website at <A href="http://www.jamesransome.com/">www.jamesransome.com</A>.</DIV>]]></b044>
</contributor>
<b049>Eve Bunting, illustrated by James Ransome</b049>
<n386/>
<language>
<b253>01</b253>
<b252>eng</b252>
</language>
<b061>32</b061>
<b062><![CDATA[full-color illustrations]]></b062>
<b064>JUV029000</b064>
<subject>
<b067>10</b067>
<b069>JUV013000</b069>
</subject>
<subject>
<b067>20</b067>
<b070>fall;autumn;New England;brothers;leaves;color tour;leaf peepers;graveyard;trees;pumpkins;halloween;tour;bus;river;picture book</b070>
</subject>
<subject>
<b067>22</b067>
<b069>EV065</b069>
</subject>
<subject>
<b067>22</b067>
<b069>HL070</b069>
</subject>
<audience>
<b204>01</b204>
<b206>02</b206>
</audience>
<audiencerange>
<b074>11</b074>
<b075>03</b075>
<b076>P</b076>
<b075>04</b075>
<b076>3</b076>
</audiencerange>
<audiencerange>
<b074>17</b074>
<b075>03</b075>
<b076>4</b076>
<b075>04</b075>
<b076>7</b076>
</audiencerange>
<othertext>
<d102>01</d102>
<d103>02</d103>
<d104><![CDATA[<div>It's fall again, and time for Jim and Andy to help their dad run Fred's Fall Color Tours. The tourists they shuttle around are "Leaf Peepers"--and, boy, do those Peepers love to ooh and aah about the dumbest things. Leaves, trees, pumpkins. <i> Bo-o-ring.</i><br><i> </i>But this yerar, even as they poke fun at the Peepers, Jim and Andy can't help but notice how the leaves floating in the river look like a brilliantly colored island, and how the spiky tree branches seem to sweep the clouds across the night sky.<br> Maybe the Peepers aren't so silly after all.<br></div>]]></d104>
</othertext>
<othertext>
<d102>02</d102>
<d103>02</d103>
<d104><![CDATA[<DIV>It's fall again, and time for Jim and Andy to help their dad run Fred's Fall Color Tours. The tourists they shuttle around are "Leaf Peepers"--and, boy, do those Peepers love to ooh and aah about the dumbest things. Leaves, trees, pumpkins. <I> Bo-o-ring.</I><BR /> But this yerar, even as they poke fun at the Peepers, Jim and Andy can't help but notice how the leaves floating in the river look like a brilliantly colored island, and how the spiky tree branches seem to sweep the clouds across the night sky.<BR /> Maybe the Peepers aren't so silly after all.</DIV>]]></d104>
</othertext>
<othertext>
<d102>13</d102>
<d103>02</d103>
<d104><![CDATA[<div><b>EVE BUNTING</b> is the author of many acclaimed books for young readers, including the Caldecott Medal-winning <i>Smoky Night. </i>Her numerous honors include the prestigious Kerlan Award for her body of work. Ms. Bunting lives in Southern California.<br><br><b>JAMES RANSOME</b> has illustrated many books for children. He received the Coretta Scott King Illustrator Award for <i>The Creation</i> and a Coretta Scott King Illustrator Honor for <i>Uncle Jed's Barbershop. </i>He lives in Poughkeepsie, New York. <br></div>]]></d104>
</othertext>
<mediafile>
<f114>04</f114>
<f115>03</f115>
<f116>01</f116>
<f117>http://cloud.firebrandtech.com/api/v2/hostedcover/eb4f776c-004b-4ac5-97bd-a6de017b03a9</f117>
</mediafile>
<imprint>
<b241>01</b241>
<b242>HMH Books for Young Readers</b242>
<b243>66201921</b243>
<b079>HMH Books for Young Readers</b079>
</imprint>
<publisher>
<b291>01</b291>
<b241>01</b241>
<b242>HMH Books for Young Readers</b242>
<b243>66201921</b243>
<b081>Houghton Mifflin Harcourt</b081>
</publisher>
<b394>02</b394>
<b003>20170905</b003>
<b087>2001</b087>
<salesrights>
<b089>01</b089>
<b090>AD AE AF AG AI AL AM AO AQ AR AS AT AU AW AZ BA BB BD BE BF BG BH BI BJ BL BM BN BO BR BS BT BV BW BY BZ CA CC CD CF CG CH CI CK CL CM CN CO CR CU CV CX CY CZ DE DJ DK DM DO DZ EC EE EG EH ER ES ET FI FJ FK FM FO FR GA GB GD GE GF GG GH GI GL GM GN GP GQ GR GS GT GU GW GY HK HM HN HR HT HU ID IE IL IM IN IO IQ IR IS IT JE JM JO JP KE KG KH KI KM KN KP KR KW KY KZ LA LB LC LI LK LR LS LT LU LV LY MA MC MD ME MF MG MH MK ML MM MN MO MP MQ MR MS MT MU MV MW MX MY MZ NA NC NE NF NG NI NL NO NP NR NU NZ OM PA PE PF PG PH PK PL PM PN PR PT PW PY QA RE RO RS RU RW SA SB SC SD SE SG SH SI SJ SK SL SM SN SO SR SS ST SV SY SZ TC TD TF TG TH TJ TK TM TN TO TR TT TV TW TZ UA UG UM US UY UZ VA VC VE VG VI VN VU WF WS YE YT ZA ZM ZW</b090>
</salesrights>
<measure>
<c093>01</c093>
<c094>11</c094>
<c095>in</c095>
</measure>
<measure>
<c093>01</c093>
<c094>279.4</c094>
<c095>mm</c095>
</measure>
<measure>
<c093>02</c093>
<c094>8.5</c094>
<c095>in</c095>
</measure>
<measure>
<c093>02</c093>
<c094>215.9</c094>
<c095>mm</c095>
</measure>
<measure>
<c093>08</c093>
<c094>1</c094>
<c095>lb</c095>
</measure>
<measure>
<c093>08</c093>
<c094>16</c094>
<c095>oz</c095>
</measure>
<measure>
<c093>08</c093>
<c094>453.59</c094>
<c095>gr</c095>
</measure>
<relatedproduct>
<h208>23</h208>
<productidentifier>
<b221>15</b221>
<b244>9780062086303</b244>
</productidentifier>
</relatedproduct>
<relatedproduct>
<h208>23</h208>
<productidentifier>
<b221>15</b221>
<b244>9780544339200</b244>
</productidentifier>
</relatedproduct>
<relatedproduct>
<h208>22</h208>
<productidentifier>
<b221>15</b221>
<b244>9780544808997</b244>
</productidentifier>
</relatedproduct>
<relatedproduct>
<h208>22</h208>
<productidentifier>
<b221>15</b221>
<b244>9780544555471</b244>
</productidentifier>
</relatedproduct>
<relatedproduct>
<h208>23</h208>
<productidentifier>
<b221>15</b221>
<b244>9781442476561</b244>
</productidentifier>
</relatedproduct>
<relatedproduct>
<h208>22</h208>
<productidentifier>
<b221>15</b221>
<b244>9780544227330</b244>
</productidentifier>
</relatedproduct>
<relatedproduct>
<h208>22</h208>
<productidentifier>
<b221>15</b221>
<b244>9780152602970</b244>
</productidentifier>
</relatedproduct>
<relatedproduct>
<h208>22</h208>
<productidentifier>
<b221>15</b221>
<b244>9780395742129</b244>
</productidentifier>
</relatedproduct>
<relatedproduct>
<h208>22</h208>
<productidentifier>
<b221>15</b221>
<b244>9780395764787</b244>
</productidentifier>
</relatedproduct>
<supplydetail>
<j137>Houghton Mifflin Company</j137>
<j141>NP</j141>
<j396>10</j396>
<j142>20170809</j142>
<j143>20170905</j143>
<j145>50</j145>
<price>
<j148>01</j148>
<discountcoded>
<j363>02</j363>
<j364>88 - Trade &amp; Ref Child PA</j364>
</discountcoded>
<j151>7.99</j151>
<j152>USD</j152>
<j161>20160726</j161>
</price>
<price>
<j148>01</j148>
<discountcoded>
<j363>02</j363>
<j364>88 - Trade &amp; Ref Child PA</j364>
</discountcoded>
<j151>10.99</j151>
<j152>CAD</j152>
<j161>20161216</j161>
</price>
</supplydetail>
<k167>15000</k167>
</product>
Hello,
I want to thank you so much for taking the time to do this. After replacing 1-data1 with the xml filename, I receive the following:
-----
Sampled lines from data file :
./z: line 19: specimen: command not found
-----
Expected output (augmented):
cat: expected-output.txt: No such file or directory
-----
Results, warning message expected:
./z: line 26: $FILE: ambiguous redirect
-----
Verify results if possible:
Results cannot be verified.
-----
Details for xml2:
./z: line 42: dixf: command not found
I am using XQuartz 2.7.9 on a MacBook Pro running El Capitan. Thanks again!
I'm going to be the Devil's Advocate here and suggest something entirely different. If this is a one-time-only conversion you have to do, or if it's something you won't have to do on a regular basis, I'd honestly import the XML into a spreadsheet like MS Excel or OpenOffice/LibreOffice Calc, and then look at tidying it up and exporting it out as a CSV from there.
Of course, if you anticipate doing this many times a day on an ongoing basis, some kind of script would be desirable. But if it's not something you will spend much time on, a spreadsheet may well save you more time than writing a script.
As you didn't specify any restrictions on either the input (e.g. pattern repetitions) or the output structure (field ordering, multiple lines), this easy approach might be of some interest:
You need to have xml2 installed on your system. As I wrote, it is available for installation on at least the version of macOS that I have, albeit from third parties.
If the solution from RudiC works for you, then use it -- it is simpler than xml2.
Very close... Here are the first four records of output when I run the command on the entire data file:
9781328740472 Peepers 7.99
10.99
9780544503205 Curious George Fire Dog Rescue (CGTV reader) 3.99
5.99
9780544574786 Mistakes Were Made (but Not by Me) 15.95
22.50
9781328683786 Tools of Titans 28.00
40.00
I'm not sure where that extra field is coming from. Desired output:
9781328740472 Peepers 7.99
9780544503205 Curious George Fire Dog Rescue (CGTV reader) 3.99
9780544574786 Mistakes Were Made (but Not by Me) 15.95
9781328683786 Tools of Titans 28.00
That's because your file has two j151 entries in each record, and, as said, each drags along a <newline> char.
Looks like you want to suppress the second entry? Still unclear.
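If suppressing the second entry is indeed what is wanted, one possible way to do it is sketched below in Python. This is not the command posted earlier in the thread, just a minimal alternative under some assumptions: the input is well-formed XML, each record is a product element, and the first j151 in a record is the price to keep (in the sample that is the USD one). The helper names first_text and extract_rows and the inline SAMPLE are made up for the illustration. It reads the file as a stream, so a 125 MB+ file should not need to be held in memory as one tree.

```python
import csv
import io
import sys
import xml.etree.ElementTree as ET

def first_text(product, tag):
    """Return the text of the first descendant named `tag`, or '' if absent."""
    node = product.find(f".//{tag}")
    return (node.text or "").strip() if node is not None else ""

def extract_rows(source):
    """Yield (a001, b203, first j151) for each <product>, streaming the input."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "product":
            yield (first_text(elem, "a001"),
                   first_text(elem, "b203"),
                   first_text(elem, "j151"))
            elem.clear()  # release the finished record's children

# Hypothetical cut-down record with two prices, as in the posted sample.
SAMPLE = b"""<ONIXmessage release="2.1">
<product><a001>9781328740472</a001><title><b203>Peepers</b203></title>
<supplydetail><price><j151>7.99</j151><j152>USD</j152></price>
<price><j151>10.99</j151><j152>CAD</j152></price></supplydetail></product>
</ONIXmessage>"""

rows = list(extract_rows(io.BytesIO(SAMPLE)))
csv.writer(sys.stdout).writerows(rows)  # one CSV line per product
```

For a real file, extract_rows can be given a file name instead of the in-memory sample (for example a hypothetical onix.xml). The csv module takes care of quoting titles that contain commas. If the USD price is not always listed first, the selection would have to look at j152 rather than rely on position.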