Converting a specific XML file to CSV

Nicol · November 25, 2010, 10:32am

Hi,

I would like to convert the following XML file:

<?xml version="1.0" encoding="UTF-8" ?>
<files xmlns="http://www.lotus.com/dxl/console">
 <filedata notesversion="6" odsversion="43" logged="yes" backup="no" id="C12577E6:004B0DA3" iid="C12577E6:004B0DA8" link="1" dboptions="0,524288,0,0">
  <replica id="C12577E6:004B0DA3" flags="64" count="1">
   <cutoff interval="90">20100622T102416,65+02</cutoff>
  </replica>
  <path>/base/base01/appli/appli-01/natachat/postetravail.nsf</path>
  <name>postetravail.nsf</name>
  <title>Poste de Travail</title>
  <template></template>
  <inheritedtemplate>m_natachat_postetrav_17-3-0</inheritedtemplate>
  <category>toto</category>
  <size current="8650752" max="0" usage="0"/>
  <quota limit="0" warning="0" />
  <created>20101125T143946,91+01</created>
  <lastfixup>20101125T145142,38+01</lastfixup>
  <unread marks="yes" replicate="never"/>
 </filedata>
 <filedata

and so on.

I need every value enclosed in double quotes inside an XML tag, as in:

<size current="8650752" max="0" usage="0"/>

and the complete text enclosed between an opening and a closing tag, as in:

<name>postetravail.nsf</name>

I tried to mark all the necessary data in bold.

All these fields should be separated by #,

with an end of line after each </filedata> block,

and of course without the XML version header.

I started with this kind of shell command:

sed -n '/^<filedata notesversion/h;{/}/!={H;g;s/\n/;/;s/\t//gp;d}'

but without success.

Can somebody help me?

I will keep searching.

Thanks in advance

panyam · November 25, 2010, 10:47am

Nicol · November 25, 2010, 10:51am

Hi,

All the data in bold, separated by #,

like:

20101125T143946,91+01#20101125T145142,38+01#yes#never#

and so on

regards
Christian

cabrao · November 25, 2010, 12:59pm

I'm not sure I understood what you need, but if it's all the data between double quotes, try this:

$ awk -vRS=\=\" -F\" 'NR>4{print $1}' ORS="#" file
6#43#yes#no#C12577E6:004B0DA3#C12577E6:004B0DA8#1#0,524288,0,0#C12577E6:004B0DA3#64#1#90#8650752#0#0#0#0#yes#never#
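
A note on portability: the command above relies on a multi-character RS, which gawk and mawk treat as a regular expression but which POSIX leaves unspecified, so other awks may only use its first character. As a rough sketch (not the poster's method), the same attribute values can be collected with match() in a loop. The file name sample.xml, the function name attr_values and the trimmed sample data are assumptions made for illustration only.

```shell
# Hypothetical, trimmed sample based on the file in the question.
cat > sample.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8" ?>
<files xmlns="http://www.lotus.com/dxl/console">
 <filedata notesversion="6" odsversion="43" logged="yes">
  <size current="8650752" max="0" usage="0"/>
  <unread marks="yes" replicate="never"/>
 </filedata>
EOF

# Print every attribute value followed by '#', skipping the two header tags.
attr_values() {
  awk '
    /<[?]xml/ || /<files[ >]/ { next }     # ignore the XML declaration and the <files> tag
    {
      line = $0
      # match() locates the next ="value" pair; RSTART/RLENGTH delimit it
      while (match(line, /="[^"]*"/)) {
        printf "%s#", substr(line, RSTART + 2, RLENGTH - 3)
        line = substr(line, RSTART + RLENGTH)   # continue after this pair
      }
    }
    END { print "" }                        # finish the single output line
  ' "$1"
}

attr_values sample.xml
```

With the trimmed sample this prints 6#43#yes#8650752#0#0#yes#never# on one line. It only uses plain POSIX awk features, so it should not depend on which awk is installed.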

Nicol · November 26, 2010, 3:50am

Thanks,

this is what I need, plus the data between XML tags, like:

<title>Poste de Travail</title>

result:

Poste de Travail

I will test your post.

regards
Christian

ctsgnb · November 26, 2010, 3:54am

sed 's/<[^>]*>//g' input >output

# echo "<title>Poste de Travail</title>"
<title>Poste de Travail</title>
# echo "<title>Poste de Travail</title>" | sed 's/<[^>]*>//g'
Poste de Travail

But that code will remove every tag of the form <...>, whatever it contains.
If you want to remove only the "title" tags,
then

sed 's/<[^>]*title>//g' input >output

Nicol · November 26, 2010, 4:26am

Thanks,

it doesn't work exactly as I expect. Let me summarize:

the input file is:

<?xml version="1.0" encoding="UTF-8" ?>

<files xmlns="http://www.lotus.com/dxl/console">

 <filedata notesversion="8" odsversion="51" logged="yes" >

  <cutoff interval="90">20100811T010253,56+02</cutoff>

  <path>/base/base01/mail/mail-20/valerie_deshuissard.nsf</path>

  </filedata>

 <filedata notesversion="8" odsversion="51" logged="yes" >

    <cutoff interval="90">20100811T010231,02+02</cutoff>

    <path>/base/base01/mail/mail-20/laurent_abello.nsf</path>

   </filedata>

and I expect exactly this output:

first line:
8#51#yes#90#20100811T010253,56+02#/base/base01/mail/mail-20/valerie_deshuissard.nsf#

second line:
8#51#yes#90#20100811T010231,02+02#/base/base01/mail/mail-20/laurent_abello.nsf#

etc.

So
I don't want to keep the header:

<?xml version="1.0" encoding="UTF-8" ?>

<files xmlns="http://www.lotus.com/dxl/console">

and
I want to keep on one line all the data extracted between <filedata
and

</filedata>

I know this is quite complicated, but I'm too much of a beginner at shell scripting to succeed.

Thanks again for your help
regards
Christian

Klashxx · November 26, 2010, 5:59am

Try:

# awk '/<\/filedata>/{f=0;printf "\n"}/^<filedata /{f=1}f{for (i=2;i<=NF-1;i++){gsub(/\".*|<.*/,"",$i);printf("%s#",$i)}}' FS='(=")|(>)' input
8#51#yes#90#20100811T010253,56+02#/base/base01/mail/mail-20/valerie_deshuissard.nsf#
8#51#yes#90#20100811T010231,02+02#/base/base01/mail/mail-20/laurent_abello.nsf#

Nicol · November 26, 2010, 6:09am

Thanks for your work,

I forgot to specify the Unix version:

I work on AIX 5.3

and the awk command gives no result, or more exactly 2 blank lines.

Are there differences with HP-UX or Linux?

regards
christian

Klashxx · November 26, 2010, 6:17am

I get the correct result under HP-UX and Linux RH; try gawk if available.

Nicol · November 26, 2010, 6:29am

I don't have gawk on the system.

Could somebody test on AIX?

What can the problem be?

It seems to write blanks in place of the extracted data;
we get 2 lines, so the splitting on "filedata" seems to work.

Any ideas?

regards
Christian

gc_sw · November 26, 2010, 7:23am

Maybe you should try nawk instead. Please check your man pages.

Nicol · November 26, 2010, 10:09am

nawk doesn't work.

Maybe you could explain the different phases of the data extraction in a little more detail, so that I can search further.

Thanks in advance
Christian

ctsgnb · November 26, 2010, 10:29am

nawk, mawk, gawk, awk: none of these?

Nicol · November 26, 2010, 10:48am

I have awk and nawk, but each time I get 2 blank lines.

It seems that the data is not written.

I will keep searching.

Thanks
Christian

panyam · November 26, 2010, 11:03am

Is your input file well-formed? I mean, does it have any special characters or tabs at the start of each line?

If the file comes from Windows, run dos2unix first and then use awk.
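
One thing worth checking along these lines: in the sample posted earlier the <filedata lines are indented, while the pattern /^<filedata / in the awk one-liner only matches at the very start of a line. If the real file is indented the same way (an assumption; only the real file can confirm it), the flag f is never set and the only output is the newline printed for each </filedata>, which would give exactly two blank lines. Below is a sketch of the same approach that tolerates leading blanks and tabs. The file name indented.xml and the function name filedata_to_csv are made up for the example.

```shell
# Sample copied from the thread, with its indentation kept.
cat > indented.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8" ?>
<files xmlns="http://www.lotus.com/dxl/console">
 <filedata notesversion="8" odsversion="51" logged="yes" >
  <cutoff interval="90">20100811T010253,56+02</cutoff>
  <path>/base/base01/mail/mail-20/valerie_deshuissard.nsf</path>
  </filedata>
 <filedata notesversion="8" odsversion="51" logged="yes" >
    <cutoff interval="90">20100811T010231,02+02</cutoff>
    <path>/base/base01/mail/mail-20/laurent_abello.nsf</path>
   </filedata>
EOF

filedata_to_csv() {
  # Fields are split on =" or on > ; the last field of each line is the empty
  # string after the final >, hence the loop stops at NF - 1.
  awk -F'(=")|(>)' '
    /<\/filedata>/         { f = 0; printf "\n" }   # end of record: close the output line
    /^[ \t]*<filedata[ >]/ { f = 1 }                # start of record, indentation allowed
    f {
      for (i = 2; i <= NF - 1; i++) {
        v = $i
        sub(/[<"].*/, "", v)     # cut at the closing quote or at the closing tag
        printf "%s#", v
      }
    }' "$1"
}

filedata_to_csv indented.xml
```

If the output is still two blank lines with this variant, the indentation is not the cause and something else in the file (or in the awk) has to be looked at.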

Nicol · November 26, 2010, 11:37am

Good idea.

The file is extracted on the AIX machine,
and I checked: the result is the same.

Thanks again, I will continue next Monday.

regards
Christian

ctsgnb · November 26, 2010, 2:05pm

egrep -ve '^[\s]*$|^[ \t]*<[^>=]*>$' toto | tail +3 | sed 's/<[^"]*="//;s/["] [^"]*="/#/g;s/<[^#>]*>//g;s/[">]//g;s/$/#/;s/  *//g' | xargs -n3 echo | sed 's/ //g'

egrep -ve '^[\s]*$|^[ \t]*<[^>=]*>$' toto | tail +3 | sed 'N;N;s/\n/#/g;s/[^<"]*="//g;s/<[^">]*>/#/g;s/[ <>]//g;s/["]/#/g;s/##*/#/g'

bash-3.00# cat toto
<?xml version="1.0" encoding="UTF-8" ?>

<files xmlns="http://www.lotus.com/dxl/console">

 <filedata notesversion="8" odsversion="51" logged="yes" >

  <cutoff interval="90">20100811T010253,56+02</cutoff>

  <path>/base/base01/mail/mail-20/valerie_deshuissard.nsf</path>

  </filedata>

 <filedata notesversion="8" odsversion="51" logged="yes" >

    <cutoff interval="90">20100811T010231,02+02</cutoff>

    <path>/base/base01/mail/mail-20/laurent_abello.nsf</path>

   </filedata>
bash-3.00# egrep -ve '^[\s]*$|^[ \t]*<[^>=]*>$' toto | tail +3 | sed 's/<[^"]*="//;s/["] [^"]*="/#/g;s/<[^#>]*>//g;s/[">]//g;s/$/#/;s/  *//g' | xargs -n3 echo | sed 's/ //g'
8#51#yes#9020100811T010253,56+02#/base/base01/mail/mail-20/valerie_deshuissard.nsf#
8#51#yes#9020100811T010231,02+02#/base/base01/mail/mail-20/laurent_abello.nsf#
bash-3.00#

bash-3.00# egrep -ve '^[\s]*$|^[ \t]*<[^>=]*>$' toto | tail +3 | sed 'N;N;s/\n/#/g;s/[^<"]*="//g;s/<[^">]*>/#/g;s/[ <>]//g;s/["]/#/g;s/##*/#/g'
8#51#yes#90#20100811T010253,56+02#/base/base01/mail/mail-20/valerie_deshuissard.nsf#
8#51#yes#90#20100811T010231,02+02#/base/base01/mail/mail-20/laurent_abello.nsf#
bash-3.00#

Nicol · November 29, 2010, 4:00am

Hi,

Well, it works with the file "toto":

egrep -ve '^[\s]*$|^[ \t]*<[^>=]*>$' toto | tail +3 | sed 'N;N;s/\n/#/g;s/[^<"]*="//g;s/<[^">]*>/#/g;s/[ <>]//g;s/["]/#/g;s/##*/#/g'

My problem is that the paragraphs can have different lengths.

If my paragraph is:

<filedata notesversion="8" odsversion="51" logged="yes" backup="no" id="C125742C:0038C006" iid="7630E56A:ADB4562F" link="1" dboptions="8192,4849664,17276934,0">
<replica id="41256605:0048070F" flags="72" count="1">
<cutoff interval="90">20100811T010253,56+02</cutoff>
</replica>
<path>/base/base01/mail/mail-20/valerie_deshuissard.nsf</path>
<name>valerie_deshuissard.nsf</name>
<title>Valerie DESHUISSARD</title>
<template></template>
<inheritedtemplate>M0170DIT</inheritedtemplate>
<category>M5;w230;W230;F250;PDPI2</category>
<size current="129325927" max="0" usage="49429504"/>
<quota limit="0" warning="0"/>
<created>20080415T121951,74+02</created>
<lastcompact>20101119T182500,73+01</lastcompact>
<unread marks="yes" replicate="never"/>
<daos enabled="readwrite" objects="107" bytes="78994279" lastsync="20101126T151637,20+01"/>
</filedata>

the output is spread over several lines instead of only one, and in a different order, like this:

90#20100811T010253,56+02#/base/base01/mail/mail-20/valerie_deshuissard.nsf#valerie_deshuissard.nsf#
#ValerieDESHUISSARD#M0170DIT#
#M5;w230;W230;F250;PDPI2#129325927#0#49429504#/#0#0#/
#20080415T121951,74+02#20101119T182500,73+01#yes#never#/

thanks
Christian

ctsgnb · November 29, 2010, 3:30pm

egrep -ve '<filedata|</*replica|<daos' in | sed 's/^<cutoff interval="//;s:/>:>/:;s/="/>/g;s/"/</g;s/<[^>]*>/#/g' | grep -v '^#*$' | xargs -n3 echo | sed 's/[#]*[ ]*#/#/g' >output

# cat in
<filedata notesversion="8" odsversion="51" logged="yes" backup="no" id="C125742C:0038C006" iid="7630E56A:ADB4562F" link="1" dboptions="8192,4849664,17276934,0">
<replica id="41256605:0048070F" flags="72" count="1">
<cutoff interval="90">20100811T010253,56+02</cutoff>
</replica>
<path>/base/base01/mail/mail-20/valerie_deshuissard.nsf</path>
<name>valerie_deshuissard.nsf</name>
<title>Valerie DESHUISSARD</title>
<template></template>
<inheritedtemplate>M0170DIT</inheritedtemplate>
<category>M5;w230;W230;F250;PDPI2</category>
<size current="129325927" max="0" usage="49429504"/>
<quota limit="0" warning="0"/>
<created>20080415T121951,74+02</created>
<lastcompact>20101119T182500,73+01</lastcompact>
<unread marks="yes" replicate="never"/>
<daos enabled="readwrite" objects="107" bytes="78994279" lastsync="20101126T151637,20+01"/>
</filedata>
# egrep -ve '<filedata|</*replica|<daos' in | sed 's/^<cutoff interval="//;s:/>:>/:;s/="/>/g;s/"/</g;s/<[^>]*>/#/g' | grep -v '^#*$' | xargs -n3 echo | sed 's/[#]*[ ]*#/#/g'
90#20100811T010253,56+02#/base/base01/mail/mail-20/valerie_deshuissard.nsf#valerie_deshuissard.nsf#
#Valerie DESHUISSARD#M0170DIT#
#M5;w230;W230;F250;PDPI2#129325927#0#49429504#/#0#0#/
#20080415T121951,74+02#20101119T182500,73+01#yes#never#/
#

The output may appear to be wrapped over more than 4 lines, but it is not:

# egrep -ve '<filedata|</*replica|<daos' in | sed 's/^<cutoff interval="//;s:/>:>/:;s/="/>/g;s/"/</g;s/<[^>]*>/#/g' | grep -v '^#*$' | xargs -n3 echo | sed 's/[#]*[ ]*#/#/g' >output
# wc -l output
       4 output
# cat output
90#20100811T010253,56+02#/base/base01/mail/mail-20/valerie_deshuissard.nsf#valerie_deshuissard.nsf#
#Valerie DESHUISSARD#M0170DIT#
#M5;w230;W230;F250;PDPI2#129325927#0#49429504#/#0#0#/
#20080415T121951,74+02#20101119T182500,73+01#yes#never#/

If it doesn't fit your need, please provide a representative sample of your input as well as the expected output.
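
For records of varying length, a likely reason the pipelines above break is xargs -n3: it groups a fixed number of whitespace-separated words per line, so a longer record, or a title containing a space such as Valerie DESHUISSARD, ends up spread over several lines. One possible direction, offered as a sketch and not as a tested solution for the real file, is to let awk keep printing on the same line until it sees </filedata>. It assumes one element per line, as in the samples, and prints an empty field for an empty element such as <template></template> so that the columns stay aligned (a design choice, not something requested in the thread). The names in.xml and xml_records_to_csv and the trimmed sample are made up for the example.

```shell
# Hypothetical sample: one long record and one short record, trimmed from the thread.
cat > in.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8" ?>
<files xmlns="http://www.lotus.com/dxl/console">
<filedata notesversion="8" odsversion="51" logged="yes">
<replica id="41256605:0048070F" flags="72" count="1">
<cutoff interval="90">20100811T010253,56+02</cutoff>
</replica>
<path>/base/base01/mail/mail-20/valerie_deshuissard.nsf</path>
<title>Valerie DESHUISSARD</title>
<template></template>
<size current="129325927" max="0" usage="49429504"/>
</filedata>
<filedata notesversion="8" odsversion="51" logged="yes">
<path>/base/base01/mail/mail-20/laurent_abello.nsf</path>
</filedata>
EOF

xml_records_to_csv() {
  awk '
    /<filedata[ >]/ { inrec = 1 }          # a record starts (indentation does not matter)
    inrec {
      line = $0
      # 1. attribute values of the tag, from left to right
      while (match(line, /="[^"]*"/)) {
        printf "%s#", substr(line, RSTART + 2, RLENGTH - 3)
        line = substr(line, RSTART + RLENGTH)
      }
      # 2. element text, when the element is closed on the same line
      if (match(line, />[^<]*<\//))
        printf "%s#", substr(line, RSTART + 1, RLENGTH - 3)
    }
    /<\/filedata>/ { if (inrec) print ""; inrec = 0 }   # record ends: finish the line
  ' "$1"
}

xml_records_to_csv in.xml
```

Each <filedata> block gives exactly one output line, however many elements it contains, and the header lines are skipped because they sit outside any record.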