Parse XML file into CSV with shell?

Hi,

It's been a few years since college when I did stuff like this all the time. Can someone help me figure out how to best tackle this problem? I need to parse a file full of entries that look like this:

<eq action="A" sectyType="0" symbol="PGR" exch="CA" curr="VEF" sess="NORM" dfltInd="1" issuerName="PROAGROI-7 B" issuShortDesc="VEB100" sectySubType="" sedol="2705132" isin="VEV000901000" cusip="" localCode="VEV000901000" localId="5" Csymbol="PGR" Cexch="CA" Ccurr="VEF" Csess="NORM" Psymbol="PGR" Pexch="CA" Pcurr="VEF" Psess="NORM" Ssymbol="PGR" Sexch="CA" Scurr="VEF" Ssess="NORM" exclPFInd="0" ranking="" longIssuerName="PROAGRO, C.A." issuLongDesc="VEB100" sicCode="" exchSym="" streetSym="" mostLiquid="0" />

And I want the data in a csv file with the following columns:

issuerName (symbol-exch) | symbol | exch | curr | Csymbol | Cexch | Ccurr

I only want the data that's in each of these fields, so I want PGR, not symbol="PGR".

I can use sed to strip away everything but the data I need -- which I've done -- but the data remains in its original order, not the one I'm looking for (note: the issuerName field is in brackets for visual purposes):

PGR CA VEF [PROAGROI-7 B] PGR CA VEF

What's the best way to re-order the above line according to my CSV needs? Or is there a different approach I should be taking entirely?

Instead of doing it in shell (sed/awk), it is better to use an XML parser. Maybe you can write a simple script in Perl or any other scripting language that supports XML parsing.
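As a rough illustration of the "small script with an XML parser" route, here is a minimal Python sketch using the standard library's xml.etree.ElementTree and csv modules. It assumes the <eq> elements are wrapped in a single root element (the original file's outer structure is not shown in the question), and the names COLUMNS, eq_rows, to_csv and SAMPLE are made up for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Column order requested in the question.
COLUMNS = ["issuerName", "symbol", "exch", "curr", "Csymbol", "Cexch", "Ccurr"]

# Shortened sample record; assumes a wrapping <root> element (an assumption).
SAMPLE = (
    '<root>'
    '<eq symbol="PGR" exch="CA" curr="VEF" issuerName="PROAGROI-7 B" '
    'Csymbol="PGR" Cexch="CA" Ccurr="VEF" />'
    '</root>'
)


def eq_rows(xml_text):
    """Yield one list of attribute values per <eq> element, in COLUMNS order."""
    root = ET.fromstring(xml_text)
    for eq in root.iter("eq"):
        # A missing attribute becomes an empty field rather than an error.
        yield [eq.get(name, "") for name in COLUMNS]


def to_csv(xml_text):
    """Return CSV text with one row per <eq> element found in xml_text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(eq_rows(xml_text))
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(SAMPLE), end="")
```

Because the parser hands back attributes by name, the column order is just the order of the names in COLUMNS, and the csv module takes care of quoting if a value ever contains a comma.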

Here is an example of how to do it using xsltproc. Suppose your XML document (file.xml) contains two records, i.e.

<?xml version = "1.0"?>
<root>
<eq action="A" sectyType="0" symbol="PGR" exch="CA" curr="VEF" sess="NORM" dfltInd="1" issuerName="PROAGROI-7 B" issuShortDesc="VEB100" sectySubType="" sedol="2705132" isin="VEV000901000" cusip="" localCode="VEV000901000" localId="5" Csymbol="PGR" Cexch="CA" Ccurr="VEF" Csess="NORM" Psymbol="PGR" Pexch="CA" Pcurr="VEF" Psess="NORM" Ssymbol="PGR" Sexch="CA" Scurr="VEF" Ssess="NORM" exclPFInd="0" ranking="" longIssuerName="PROAGRO, C.A." issuLongDesc="VEB100" sicCode="" exchSym="" streetSym="" mostLiquid="0" />
<eq action="A" sectyType="0" symbol="PGR" exch="BB" curr="VEF" sess="NORM" dfltInd="1" issuerName="PROAGROI-8 B" issuShortDesc="VEB100" sectySubType="" sedol="2705132" isin="VEV000901000" cusip="" localCode="VEV000901000" localId="5" Csymbol="PGR" Cexch="CA" Ccurr="VEF" Csess="NORM" Psymbol="PGR" Pexch="CA" Pcurr="VEF" Psess="NORM" Ssymbol="PGR" Sexch="CA" Scurr="VEF" Ssess="NORM" exclPFInd="0" ranking="" longIssuerName="PROAGRO, C.A." issuLongDesc="VEB100" sicCode="" exchSym="" streetSym="" mostLiquid="0" />
</root>

and you have an XSL stylesheet called file.xsl (deliberately simplified) which contains

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<xsl:template match="/">
  <xsl:apply-templates select="/root/eq"/>
</xsl:template>

<!-- write out comma separated file -->
<xsl:template match="/root/eq">
   <xsl:value-of select="@issuerName"/>
   <xsl:value-of select="','"/>
   <xsl:value-of select="@symbol"/>
   <xsl:value-of select="','"/>
   <xsl:value-of select="@exch"/>
   <xsl:value-of select="','"/>
   <xsl:value-of select="@curr"/>
   <xsl:value-of select="','"/>
   <xsl:value-of select="@Csymbol"/>
   <xsl:value-of select="','"/>
   <xsl:value-of select="@Cexch"/>
   <xsl:value-of select="','"/>
   <xsl:value-of select="@Ccurr"/>
   <xsl:text>
</xsl:text>
</xsl:template>

</xsl:stylesheet>

Using xsltproc to transform the document produces the required output

$ xsltproc file.xsl file.xml
PROAGROI-7 B,PGR,CA,VEF,PGR,CA,VEF
PROAGROI-8 B,PGR,BB,VEF,PGR,CA,VEF

Thanks! I'll need a bit of time to work with this, but I prefer using the right tool for the job and this looks like it will help me with a few next steps I was planning anyways.

hi,

Basically, using XML parsing tools in Perl is easy.

Anyway, the boring code below can also address the requirement.

You may try it:

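use strict;
use warnings;

open my $fh, '<', 'a.txt' or die "Cannot open a.txt: $!";
my @arr = <$fh>;
close $fh;
foreach (@arr) {
	my %hash;    # attributes of the current line only
	# Peel off one name="value" pair at a time; $' is the text after the match.
	while (m/ (.*?=".*?")/) {
		my $str = $1;
		$_ = $';
		$hash{$1} = $2 if ($str =~ m/(.*)="(.*)"/);
	}
	# Missing attributes are printed as empty fields.
	print join("|", map { $_ // "" } @hash{qw(issuerName symbol exch curr Csymbol Cexch Ccurr)}), "\n";
}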

Hi,

I have tried the above code on the following XML:
<Account id='xxxxxxxxxxxxxx' name='xxxx' creator='abcd' createDate='110908'
lastModifier='abcd' resource='DataMart' accountId='F100206'
userid='F100206' situation='active' discoveredSituation='CONFIRMED' accountExists='true'>
<MemberObjectGroups>
<ObjectRef type='ObjectGroup' id='#ID#Top' name='Top'/>
</MemberObjectGroups>
</Account>

open FH,"<a.txt";
my @arr=<FH>;
close FH;
foreach(@arr){
	while(m/ (.*?=".*?")/){
		my $str=$1;
		$_=$';
		$hash{$1}=$2 if ($str=~m/(.*)="(.*)"/);
	}
	print $hash{accountId},"|",$hash{createDate},"|",$hash{userid},"|",$hash{creator},"|",$hash{accountExists},"|",$hash{resource},"|",$hash{lastModifier},"\n";
}

I got empty output
||||||
||||||
||||||
||||||
||||||
||||||
||||||
||||||
||||||
||||||
||||||

Can you please explain what this piece of code does and where I am going wrong?

while(m/ (.*?=".*?")/){
	my $str=$1;
	$_=$';
	$hash{$1}=$2 if ($str=~m/(.*)="(.*)"/);
}

Thanks in advance for your help
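For what it's worth, the loop looks for a space followed by name="value" (with double quotes), keeps the text after the match in $_ via $', and then splits the matched piece into a name and a value that are stored in %hash. The Account sample quotes its attribute values with single quotes, so the pattern never matches, %hash stays empty, and each input line prints only the separators. Below is a minimal Python sketch of the same idea that accepts either quote style and scans the whole tag at once, so attributes spread over several lines are also found. The names ATTR, attributes, SAMPLE and FIELDS are made up for this example, and the regular expression is a simplification, not a full XML parser.

```python
import re

# Matches name='value' or name="value". The backreference \2 requires the
# closing quote to be the same character as the opening one.
ATTR = re.compile(r"""([\w:.-]+)\s*=\s*(["'])(.*?)\2""", re.DOTALL)

# Opening tag copied from the sample above.
SAMPLE = """<Account id='xxxxxxxxxxxxxx' name='xxxx' creator='abcd' createDate='110908'
lastModifier='abcd' resource='DataMart' accountId='F100206'
userid='F100206' situation='active' discoveredSituation='CONFIRMED' accountExists='true'>"""

# Fields printed by the Perl snippet, in the same order.
FIELDS = ["accountId", "createDate", "userid", "creator",
          "accountExists", "resource", "lastModifier"]


def attributes(tag_text):
    """Return a dict mapping attribute name to value for every pair in tag_text."""
    return {name: value for name, _quote, value in ATTR.findall(tag_text)}


if __name__ == "__main__":
    attrs = attributes(SAMPLE)
    print("|".join(attrs.get(field, "") for field in FIELDS))
```

A real XML parser remains the safer choice, since a regular expression like this knows nothing about entities, comments or CDATA sections.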

You can use Perl and XML::Simple.

use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

# Assumes "file" holds a single <eq .../> element, so XMLin returns
# a hash reference keyed by attribute name.
my $config = XMLin("file");
# print Dumper($config);   # uncomment to inspect the parsed structure
my $issuername = $config->{issuerName};
my $symbol = $config->{symbol};
my $exch = $config->{exch};
my $curr = $config->{curr};
my $csymbol = $config->{Csymbol};
my $cexch = $config->{Cexch};
my $ccurr = $config->{Ccurr};
my @line = ($issuername,$symbol,$exch,$curr,$csymbol,$cexch,$ccurr);
print join(",",@line), "\n";

output

# ./test.pl
PROAGROI-7 B,PGR,CA,VEF,PGR,CA,VEF

or the "hard way"

while (<>){
 if ( /<eq/ .. /\/>/ ){
     @list = split /\"\s/ ,$_;
     foreach my $k (@list){
       print "$k\n";
       # get your values;
     }
 }
}

If the original poster is close enough that only the reordering remains, and if he could change his output separator from " " to some other character, say "@", it would be a simple matter of

nawk -F@ -v OFS=, '{print $4, $1, $2, $3, $5, $6, $7}'
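The same reordering, sketched in Python for comparison. This assumes the sed step already emits the seven fields separated by "@" in the order symbol, exch, curr, issuerName, Csymbol, Cexch, Ccurr; the function name reorder and its parameters are made up for this example.

```python
def reorder(line, sep="@", order=(3, 0, 1, 2, 4, 5, 6)):
    """Split line on sep and return the fields in the given order, comma-joined."""
    fields = line.rstrip("\n").split(sep)
    return ",".join(fields[i] for i in order)


if __name__ == "__main__":
    print(reorder("PGR@CA@VEF@PROAGROI-7 B@PGR@CA@VEF"))
```

The indices in order are zero-based, so 3 picks the fourth field (issuerName) first, matching $4 in the awk version.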

What about the xsltproc post? That seemed easy, complete and correct.