Converting XML to CSV

Hello,

For i while i have been using XMLStarlet to convert several XML files to CSV files. So far this always went fine.

Today however i got a new XML format however but i cannot find out how to get the data i need.

Below is part of the code where it shows the different format. What i expect is the following:

idpurchaseorder;idsupplier;purchaseorderid;idproduct;productcode;price;amount;amountreceived
<searchresults>
  <success>true</success>
  <rate-limit-remaining>498</rate-limit-remaining>
  <data>
    <idpurchaseorder>373251</idpurchaseorder>
    <idsupplier>42919</idsupplier>
    <idwarehouse>4863</idwarehouse>
    <idtemplate>3994</idtemplate>
    <purchaseorderid>PO2019-1008</purchaseorderid>
    <supplier_orderid>09-09-2019</supplier_orderid>
    <supplier_name></supplier_name>
    <status>purchased</status>
    <remarks></remarks>
    <delivery_date>2019-09-14</delivery_date>
    <language>nl</language>
    <purchased_by_iduser>6407</purchased_by_iduser>
    <purchased_at>2019-09-10 09:31:37</purchased_at>
    <completed_by_iduser></completed_by_iduser>
    <completed_at></completed_at>
    <created_by_iduser>6414</created_by_iduser>
    <created>2019-09-09 12:41:44</created>
    <updated>2019-09-10 09:31:37</updated>
    <products>
      <idpurchaseorder_product>5275880</idpurchaseorder_product>
      <idproduct>10307107</idproduct>
      <idvatgroup>9520</idvatgroup>
      <productcode>P00019664</productcode>
      <productcode_supplier>0389450893456</productcode_supplier>
      <name>Unold 78856</name>
      <price>99</price>
      <amount>1</amount>
      <amountreceived>0</amountreceived>
      <delivery_date></delivery_date>
      <weight>0</weight>
    </products>
  </data>
  <data>
    <idpurchaseorder>373062</idpurchaseorder>
    <idsupplier>42561</idsupplier>
    <idwarehouse>4863</idwarehouse>
    <idtemplate>3994</idtemplate>
    <purchaseorderid>PO2019-1007</purchaseorderid>
    <supplier_orderid>09-09-2019</supplier_orderid>
    <supplier_name></supplier_name>
    <status>purchased</status>
    <remarks></remarks>
    <delivery_date>2019-09-13</delivery_date>
    <language>en</language>
    <purchased_by_iduser>6414</purchased_by_iduser>
    <purchased_at>2019-09-09 10:46:22</purchased_at>
    <completed_by_iduser></completed_by_iduser>
    <completed_at></completed_at>
    <created_by_iduser>6414</created_by_iduser>
    <created>2019-09-09 10:46:18</created>
    <updated>2019-09-09 10:46:22</updated>
    <products>
      <idpurchaseorder_product>5272744</idpurchaseorder_product>
      <idproduct>10304708</idproduct>
      <idvatgroup>9520</idvatgroup>
      <productcode>P00172393</productcode>
      <productcode_supplier>34556WDR09</productcode_supplier>
      <name>Product 2</name>
      <price>999</price>
      <amount>1</amount>
      <amountreceived>0</amountreceived>
      <delivery_date></delivery_date>
      <weight>0</weight>
    </products>
    <products>
      <idpurchaseorder_product>5272745</idpurchaseorder_product>
      <idproduct>10304135</idproduct>
      <idvatgroup>9520</idvatgroup>
      <productcode>P00020201</productcode>
      <productcode_supplier>345345074</productcode_supplier>
      <name>Productname</name>
      <price>99</price>
      <amount>1</amount>
      <amountreceived>0</amountreceived>
      <delivery_date></delivery_date>
      <weight>0</weight>
    </products>
  </data>
 </searchresults>
 

What i tried are the following with the results of (part of all the needed columns to keep it easier to read) the above piece.

xmlstarlet sel -t -m //searchresults/data -v "concat(idpurchaseorder,';',products/idproduct)" -n purchaseorders.xml

this above gave me:

373251;10307107
373062;10304708 <-- here should be a second line with the second column different id

then i tried the following line:

xmlstarlet sel -t -m //searchresults/data -v idpurchaseorder -o " ; " -v products/idproduct -n purchaseorders.xml

and this gave me the output below

373251 ; 10307107
373062 ; 10304708
10304135 <-- here is the idpurchaseorder missing from the first column but it does have both products listed
 

What i expect would be the following result from the testdata (i can add the other needed data myself if needed) above:

373251;10307107
373062;10304708
373062;10304135
 

Does anyone know the correct way of getting the data requested? If it is a other way then XMLStarlet it is fine although that is the easiest so far for me.

2 Likes

Hi SDohmen,

some additional bits of Knowledge about XPath will lead you forward towards your solution. Especially try to get information about how to figure out "parent nodes" of your target nodes.

One possible algorithm is to not iterate over the different orders(data), but iterate over the different products(products) and get a parent node(idpurchaseorder) from there.

Regards,
stomp.

P. S.: @Mod or @SDohmen: please fix the typo in the xml above: <name>Productname/name>

P. P. S.: I did not know the solution before the question was asked, and - according to Neos request - I will not take the fun and chance to grow from the poster to figuring it out for him-/herself. (The solution is about the same in size as the current shown attempts).

P. P. P. S: +1 Point for SDohmen using a suitable tool for the task. :wink:

2 Likes

Thank you for the comment. I edited the text to have the xml correctly displayed.

I think i know what you mean by selecting the subnodes first but i have no clue how to get the rest filled in.

When i change the line to the one below:

xmlstarlet sel -t -m "/searchresults/data" -v "products/idproduct" -o " ; " -v idpurchaseorder -n purchaseorders.xml

i am getting the text as above except switched around so that was not what you mean i guess.

Do you have any more clues what to look for?

i did search around some more and found the link below:

https://stackoverflow.com/questions/53550425/extracting-xml-child-node-values-based-on-a-parent

This looks a bit like it but i am not sure if that is the correct direction.

What did I do to find the solution?

  • Read some Tutorials about XPath
  • Reached out for xmlstarlet documentation
  • Used google a lot on the subject, learned from many examples(stackoverflow delivers many of those)
  • experimented with what I found, began sometimes with most simple things to understand small bits and rose higher to the full complexity of whole task
1 Like

A little help for what you've nearly got yourself:

xmlstarlet sel -t -m //searchresults/data/products -v "concat(';',idproduct)" -n purchaseorders.xml

This will give you the ordered products, like this:

;10307107
;10304708
;10304135
 

Now you just have to fill in the XPath-Expression for the Parent node - the ordernumber - into the concat statement...

I'm reading a bit about XPath now. It's really interesting, what is possible with that:

XPath Syntax

XPath Axes

1 Like

SDohmen, I don't generally thank people for posting questions.

However you've posted good example data and a clear and concise question with a reasonable attempt at solving it yourself. Also the tool your using is the right one for the job, so many XML parsing questions posted here want to use sed / grep / awk which will usually fail given a different XML layout.

Interesting to note that if you start from the parent level, which would have been my first inclination, you only get the first product under that parent:

xmlstarlet sel -t -m //searchresults/data -v "concat(';',products/idproduct)" -n purchaseorders.xml
;10307107
;10304708
1 Like

Sorry for the late reply. It took a while and help from some friends who have more knowledge also about this subject. He told me the same thing and gave me a short example of what is possible. With that example i was able to product the line below:

xmlstarlet sel -T -t -m "/searchresults/data/products" -v ../idpurchaseorder -o ";" -v ../idsupplier -o ";" -v ../purchaseorderid -o ";" -v ../purchased_at -o ";" -v idproduct -o ";" -v productcode -o ";" -v price -o ";" -v amount -o ";" -v amountreceived -n purchaseorders.xml > purchaseorders.csv

I was not aware that you could "backtrack" the tree and get the main info added to the product info. It most likely isn't the cleanest solution but at least it works great.

Thank you for this. I always try my best to get as much as possible done myself. Only if i am really stuck i will post the question and even that is most likely days after i got stuck. I find it important to try and test myself although sometimes that is not possible. For each problem it is of course searching for the best solution and even though i dont fully understand xmlstarlet it gave me the best results for all the scripts i have used it on. I seen several attempts with sed etc but they where so confusing or gave me weird outputs (Its like opening a door with a hammer instead of the key ;)) that i wanted to use a program that is specifically designed for this and that was/is xmlstarlet.

I do have to add though that i cheated a bit as i got json data before the xml and i transformed it to the xml to actually get the proper data for it. Since i do lack the knowledge to get all json data (although i did try it with jq first), i decided to look for the transformation (Decoding JSON to a PHP Array and then converting it to XML. . GitHub) and then get all data with xmlstarlet. This last part was almost correct then except for the backtracking that i did not know.

1 Like

Hi SDohmen,

nice to hear that you worked your way through and thanks for giving feedback.

Some comments:

That's why I pointed out XPath for you twice and presented you links for a good and short tutorial. Read the great manuals out there! Read it and you will be lesser dependant on others.

I would suggest to avoid that. With this step, the complexity of the task had risen for no good reason. I understand why you did it(just to get the task done - did that myself in the past), but I recommend to better invest the time to get the data out of the original format. That's most likely possible with jq - even if it needs the work to get familiar with jq. Otherwise you carry the increased complexity, increased error possibility, increased resource footprint with that task from now on.

An alternative to jq may be a small php script, that generates the csv

(...and sometimes a quick and dirty hack is just fine too)

That's the thing with jq :

jq -r <data.json ' .searchresults.data[] |
 . as $parent |
 .products[]  |
 ( $parent.idpurchaseorder+";"+.idproduct)'

Note: Those Pipes are part of the jq-filter - not shell pipes - carefully look at the quotes!

It's more like the approach that didn't work with xml. :wink:

I did find those links you gave me but they where still a bit confusing and i would say limited in what i needed. Too be honest i did not have that much time either as i spend already almost a week on this with xmlstarlet alone.

As you mention this was a cheat but transforming the json to xml but this was also because i am severly limited in time and knowledge of jq. I actually did try a couple days and then i gave up because of the time constraints. Seeing your code however makes it look quite easy and even less fault sensitive. When i have some time i will actually try it out and let you know here about the (hopefully) success of the jq way.

I just tested jq but something seems not to work. I edited your piece of code to work with the json string:

jq -r <purchaseorders.json ' .data[] | 
 . as $parent |
 .products[]  |
 ( $parent.idpurchaseorder+";"+.idproduct)'

but this gave me an error:

jq: error (at <stdin>:0): number (376648) and string (";") cannot be added

Your json is different from mine. As the error message states numbers and strings cannot be added. You have to convert the number to a string before concatenating both:

... (number_var|tostring) + string_var ...
1 Like

Do you mean like:

jq -r <data.json ' .data[] |
 . as $parent |
 .products[]  |
 ( ($parent.idpurchaseorder|tostring)+";"+(.idproduct|tostring))'

I tested the above and this works indeed. When i have some time i will make the full line and see if that works also.

Again thank you for the help and pushing me towards the better solution instead of the workaround with xmlstarlet :slight_smile:

Moderator comments were removed during original forum migration.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.