Convert XML to CSV using awk or shell script

Rashmitha · January 9, 2015, 7:17am

Hello,

I am working on a part of code where I need a awk or shell script to convert the given XML file to CSV or TXT file.

There are multiple xml files and of different structure, so a single script is required for converting data.

I did find a lot of solutions in the forum but none of the awk scripts mentioned here work for me as my XML contains Singletags <tag/> too and they are not parsed properly.

Thanks in advance for your help.

Rashmitha

Akshay_Hegde · January 9, 2015, 10:23am

Hi! Welcome to Forums,

Please show a representative sample of input, desired output, attempts at a solution and specify what OS and versions are being used.

Thank you.

Corona688 · January 9, 2015, 10:52am

I have a slightly more capable than average awk tinkertoy parser, which should be able to handle single tags properly, but it will need to be modified for your purposes. It needs to be told which tags to keep and how.

So, please post your input data and the output you want.

Rashmitha · January 12, 2015, 12:18am

Thanks for the reply.

Here is the Sample data:

<Orders>
<Partner>TTTT</Partner>
<Order>
<OrderType>test</OrderType>
<OrderNumber>1000000000</OrderNumber>
<OrderSource/>
<OrderDate>11/14/2014 12:00:00 AM</OrderDate>
..
..
..
..
<BillingAddress>
<Line1>XXXX</Line1>
<Line2/>
<City>stsss</City>
<State>gg</State>
<PostalCode>101010</PostalCode>
<CountryCode>aaaaa</CountryCode>
<Name>mmmmmm</Name>
</BillingAddress>
<ShippingAddress>
<NumberOfItems>3</NumberOfItems>
<Name>mmmmm</Name>
<Line1>abcd</Line1>
<Line2/>
<City>xyz</City>
<State>sjsdjhi</State>
<PostalCode>101010</PostalCode>
<CountryCode>kkkkkk</CountryCode>
<Method>test</Method>
<tag>False</tag>
</ShippingAddress>
<ShippingAddress>
<NumberOfItems>1</NumberOfItems>
<Name>mmmmm</Name>
<Line1>abcd</Line1>
<Line2/>
<City>xyz</City>
<State>sjsdjhi</State>
<PostalCode>101010</PostalCode>
<CountryCode>kkkkkk</CountryCode>
<Method>test</Method>
<tag>False</tag>
</ShippingAddress>
<Item>
<Name>Item1</Name>
<Quantity>3</Quantity>
<UnitPrice>15.99</UnitPrice>
<Eligible>False</Eligible>
<OrderStatus>test</OrderStatus>
<SKU>5-100000</SKU>
</Item
<Item>
<Name>Item2</Name>
<Quantity>1</Quantity>
<UnitPrice>10.49</UnitPrice>
<Eligible>True</Eligible>
<OrderStatus>test</OrderStatus>
<SKU>5-100001</SKU>
</Item>
</Order>
<Order>
..
</Order>
</Orders>

In the Output I need all the tags as Columns. The output file can be CSV or txt file. With the current script, single close tags are not parsed.

Regards,
Rashmitha

fredharry · January 12, 2015, 2:49am

Try thic if it works.

BEGIN {     FS=",";     OFS = "\n"} NR == 1 {for (i = 1; i <=NF; i++)             tag=$i          print "<abc version=\"3.4.1\" urlbase=\"http://mozilla.com/\" maintainer=\"somebody@mozilla.com\" exporter=\"somebody.else@mozilla.com\">"} NR != 1 {print "   <bug>"          for (i = 1; i <= NF; i++)             print "      <" tag ">" $i "</" tag ">"          print "   </bug>"} END {print "</abc>"}

Corona688 · January 12, 2015, 11:24am

Okay, you have shown half of what you want.

Now we need the other half -- the output you want.

What you are asking for, "each tag as a column", makes no sense since your data includes many nested, non-tabular things. There'd be a huge amount of duplication at the very least.

sea · January 12, 2015, 11:49am

"This wont work!", was my first thought.
A technicaly halftruth/incorecctness, but helpfull to create a better image, saying would be:

XML is like a Database, whereas CSV/TXT is a Tablesheet at best.

It is NO problem to make all data from a table into a database,but moving a database into a table, even when there are only 5 entries, is already rather complex. (unless the DB was used as a simple table )
You need to decide which values/cathegories you want to drop in favor of more important data, OR which values you want to store in another file - if you want to preserve ALL data.

hth

Corona688 · January 12, 2015, 11:54am

Exactly. We could spend hours doing what you want, and get it thrown back in our faces with "thank you, but what I actually wanted it to look like is Y". We don't know Y. You think we should know what Y is implicitly, but there's actually lots of choices.

Rashmitha · January 12, 2015, 11:00pm

Here is the sample output:

Partner,OrderType,OrderNumber,OrderSource,OrderDate,Line1,Line2,City,State,PostalCode,CountryCode,Name,NumberOfItems,Name2,Line13,Line24,City5,State6,PostalCode7,CountryCode8,Method,tag,Name9,Quantity,UnitPrice,Eligible,OrderStatus,SKU
TTTT,test,1000000000,,11/14/2014 12:00:00 AM,XXXX,,stsss,gg,101010,aaaaa,mmmmmm,3,mmmmm,abcd,,xyz,sjsdjhi,101010,kkkkkk,test,False,Item1,3,15.99,False,test,5-100000
TTTT,test,1000000000,,11/14/2014 12:00:00 AM,XXXX,,stsss,gg,101010,aaaaa,mmmmmm,1,mmmmm,abcd,,xyz,sjsdjhi,101010,kkkkkk,test,False,Item2,1,10.49,True,test,5-100001

The header is not required it is just for your reference. When I meant tags I meant all data Tags. For ex. <OrderSource/> is a single close tag which has no data for now but it can have values. So single close tags should have a null value in the csv.

pravin27 · January 13, 2015, 6:14am

perl

use XML::XPath;
use XML::XPath::XMLParser;

my $xpath=XML::XPath->new(filename  =>  "/path/tofile/order.xml");

my $nodelist=$xpath->findnodes("//Orders");
foreach my $node ($nodelist->get_nodelist) {
  ($line=$node->string_value)=~s/\n/,/g;
  print $line,"\n";
}

Rashmitha · January 13, 2015, 6:43am

Thanks for the reply; but I do not know much about perl. In the current project we are supposed to use either Bash or awk script.

pravin27 · January 13, 2015, 8:18am

here you go ....

awk '{ if ( $0 == "<Orders>") { printf "\n";next}  match($0, /\>.*\</); if (RLENGTH > 0) { printf substr($0,RSTART+1,RLENGTH-3)","; next }
 match($0, /\/>/) ; if (RLENGTH > 0) { printf "," ;next } } END {print "" } ' order.xml

Corona688 · January 13, 2015, 10:25am

The perl fails on my system with this:

Can't locate XML/XPath.pm in @INC (you may need to install the XML::XPath module) (@INC contains: /etc/perl /usr/local/lib64/perl5/5.18.2/x86_64-linux /usr/local/lib64/perl5/5.18.2 /usr/lib64/perl5/vendor_perl/5.18.2/x86_64-linux /usr/lib64/perl5/vendor_perl/5.18.2 /usr/local/lib64/perl5 /usr/lib64/perl5/vendor_perl /usr/lib64/perl5/5.18.2/x86_64-linux /usr/lib64/perl5/5.18.2 .) at ./xmlpath.pl line 3.
BEGIN failed--compilation aborted at ./xmlpath.pl line 3.

So please, try again without using nonstandard modules.

Corona688 · January 13, 2015, 10:40am

So, you want a line for each item.

Thank you.

$ cat xml.awk

BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# These should be special variables for match() but aren't.
function rbefore(STR)   { return(substr(STR, N, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS);
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
}

$ cat order.awk

{
        sub(/\/$/, "", $1);
        sub(/^[ \r\n\t]*/, "", $2);
        sub(/[\ \r\n\t]*$/, "", $2);
}

# We are inside <order>, and not at a close-tag
(TAGS ~ /%ORDERS($|%)/) && !/^\// {
        if(!($1 in O)) { O[++L]=$1 ; O[$1]=L }
        D[$1]=$2
}

/\/Item/ {
        P=""
        for(N=1; N<=L; N++) {
                printf("%s%s", P, D[O[N]]); P=OFS;
        }

        print ""
}

$ awk -f xml.awk -f order.awk OFS="," ORS="\n" order.xml

TTTT,,test,1000000000,,11/14/2014 12:00:00 AM,,abcd,,xyz,sjsdjhi,101010,kkkkkk,Item1,,1,test,False,,3,15.99,False,test,5-100000
TTTT,,test,1000000000,,11/14/2014 12:00:00 AM,,abcd,,xyz,sjsdjhi,101010,kkkkkk,Item2,,1,test,False,,1,10.49,True,test,5-100001

$

It's not small, but XML is not trivial and this doesn't depend on external modules to do the work.

drl · January 13, 2015, 12:08pm

Hi.

corona688:

The perl fails on my system with this:
Can't locate XML/XPath.pm in @INC (you may need to install the XML::XPath module) (@INC contains:(..., etc)
So please, try again without using nonstandard modules.

It seems standard on Debian:

libxml-xpath-perl:
  Installed: (none)
  Candidate: 1.13-7
  Version table:
     1.13-7 0
        500 http://ftp.us.debian.org/debian/ jessie/main amd64 Packages

and on Fedora:

Name        : perl-XML-XPath
Arch        : noarch
Version     : 1.13
Release     : 22.fc19
Size        : 82 k
Repo        : fedora/19/x86_64
Summary     : XPath parser and evaluator for Perl
URL         : http://search.cpan.org/dist/XML-XPath/
License     : GPL+ or Artistic
Description : This module aims to comply exactly to the XPath specification at
            : http://www.w3.org/TR/xpath and yet allow extensions to be added in
            : the form of functions. Modules such as XSLT and XPointer may need
            : to do this as they support functionality beyond XPath.

One does need to use the package manager to install it.

Perhaps I misunderstood what you mean by non-standard ... cheers, drl

Corona688 · January 13, 2015, 12:10pm

Something that doesn't require root access to deal with. Something that you can expect to work dependably. You might as well have asked him to install a third-party utility.

drl · January 13, 2015, 12:29pm

Hi.

I keep a number of perl modules local to my home, not requiring the package manager. I generally don't put anything into /bin, /user/bin, etc., unless it is in the repositories. For many other things that may need to be accessible (say for loading, linking), I use my library directory:

$ ls ~/lib
perl/  python/	readme.txt  tcl/

and:

/home/drl/lib/perl/		|-- RegExp
|-- Acme			|-- Rmyegexp
|-- Algorithms			|-- Sample.pm
|-- App				|-- Sort
|-- Array			|-- String
|-- Data			|-- T2t
|-- Date			|-- Tabs.pl
|-- DateTime			|-- Term
|-- Excel			|-- Text
|-- File			|-- Version
|-- Getopt			|-- backup
|-- HTML			|-- cva.pm
|-- IO				|-- lib.pm
|-- Lingua			|-- nposw.pl
|-- Math			|-- old-T2t
|-- Module			|-- parse_csv.pl
|-- MyWrap.pl			|-- pqs.pl
|-- Pod				|-- prutil.pl
|-- RCS				`-- yinfo.pl
|-- Random			28 directories, 10 files

This seems somewhat off-topic here, so we could continue this elsewhere if useful ... cheers, drl