Need Help in extracting data from XML File

Hi All
My input file is an XML and it has some tags and data rows at end.

Starting of data rows is <rs:data> and ending of data rows is </rs:data>.

Within sample data rows (2 rows) shown below, I want to extract data value after equal to sign (until space or "/" sign).

So if XML data rows look like this

 
<z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' /> 
<z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
 

So output should be 2 rows with data as shown below:

'John Doe','XYZ Hospital','Heart Health'
'Jane Doe','XYZ Hospital','Maternity'

Complete XML Input file looks like this:

-----------------------------------------------------------------------------------------

<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
xmlns:dt='uuid:ABC'
xmlns:rs='urn:schemas-microsoft-com:rowset'
xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
<s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
<s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
<s:datatype dt:type='string' dt:maxLength='100' />
</s:AttributeType>
<s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
<s:datatype dt:type='string' dt:maxLength='100' />
</s:AttributeType>
<s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
<s:datatype dt:type='string' dt:maxLength='100' />
</s:AttributeType>
</s:ElementType>
</s:Schema>
<rs:data>
<z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' /> 
<z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>
</xml>
 

-----------------------------------------------------------------------------------------

Appreciate your help in advance.

$ sed -n "/^<z:/s/[^']*\('[^']*'\)[^']*\('[^']*'\)[^']*\('[^']*'\).*/\1,\2,\3/p" file
'John Doe','XYZ Hospital','Heart Health'
'Jane Doe','XYZ Hospital','Maternity'
1 Like

Hello vx04,

Following may help you in same.

awk -vs1="'" -F"' " '{if($0 ~ /<rs:data>/){B=1;getline}} {if($0 ~ /<\/rs:data>/){B=0}}  B{for(i=1;i<=NF;i++){gsub(/.*\=||[/><]/,X,$i);A=A?A s1","$i:$i};gsub(/\, $/,X,A);gsub(/\,$/,X,A);print A;A=""}'  Input_file

Output will be as follows.

'John Doe','XYZ Hospital','Heart Health'
'Jane Doe','XYZ Hospital','Maternity'

EDIT: Adding a non oneliner form for same.

awk -vs1="'" -F"' " '
                        {if($0 ~ /<rs:data>/)
                                                {B=1;getline}
                        }
                        {if($0 ~ /<\/rs:data>/)
                                                {B=0}
                        }
                        B{for(i=1;i<=NF;i++)     {
                                                 gsub(/.*\=||[/><]/,X,$i);
                                                 A=A?A s1","$i:$i};
                                                 gsub(/\, $/,X,A);
                                                 gsub(/\,$/,X,A);
                                                 print A;A=""
                        }
                    '  Input_file

Thanks,
R. Singh

1 Like

Gawk

 gawk '/^<z/{s="";  while (match($0, /\047([^\047]*)\047(.*)/, x)) { s = sprintf("%s%s\047%s\047",s,(s?OFS:""),x[1] ); $0 = x[2] } print s }' OFS=, infile

---------- Post updated at 09:38 PM ---------- Previous update was at 09:23 PM ----------

awk -F"'"  '/^<z/{s=""; for (i=2; i<=NF; i+=2) s = sprintf("%s%s\047%s\047",s,(s?OFS:""),$i ); print s }' OFS=, infile
1 Like

This takes care of

  • no data outside the <rs:data> ... </rs:data> region
  • several tagged data fields can share the same line
sed -n "/<rs:data>/,/<\/rs:data>/{s/^.*<rs:data>//;s/<\/rs:data>.*$//;s/^<z[^=']*=//;s/ [^=']*=/,/g;s/[^']*$//;p}" file4
'John Doe','XYZ Hospital','Heart Health'
'Jane Doe','XYZ Hospital','Maternity'
1 Like

Hi Guys
Thanks for taking time to reply. I wanted to mention that we are using AIX 6.1 and none of the answers seemed to work.

Here is the error I get on each

Input File

==> cat Hospital.xml
<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
     xmlns:dt='uuid:ABC'
     xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
   </s:ElementType>
</s:Schema>
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />
   <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' />
</rs:data>
</xml>

I am using AIX 6.1 version.

=======

1st Solution by anbu23 does not bring any results.

==> cat scr1.ksh
sed -n "/^<z:/s/[^']*\('[^']*'\)[^']*\('[^']*'\)[^']*\('[^']*'\).*/\1,\2,\3/p" Hospital.xml
==> ./scr1.ksh > scr1_out.txt

-rw-rw-rw-    1 user1 group1            0 Dec 23 10:16 scr1_out.txt

=========================================================================

2nd Solution by RavinderSingh13 gives regular expression editor error

==> cat scr2.ksh
awk -vs1="'" -F"' " '{if($0 ~ /<rs:data>/){B=1;getline}} {if($0 ~ /<\/rs:data>/){B=0}}  B{for(i=1;i<=NF;i++){gsub(/.*\=||[/><]/,X,$i);A=A?A s1","$i:$i};gsub(/\, $/,X,A);gsub(/\,$/,X,A);print A;A=""}' Hospital.xml

==> ./scr2.ksh

awk: 0602-521 There is a regular expression error.
        [] imbalance
 The source line number is 1.
 The error context is
                {if($0 ~ /<rs:data>/){B=1;getline}} {if($0 ~ /<\/rs:data>/){B=0}}  >>>  B{for(i=1;i<=NF;i++){gsub(/.*\=||[/> <<< <]/,X,$i);A=A?A s1","$i:$i};gsub(/\, $/,X,A);gsub(/\,$/,X,A);print A;A=""}

=========================================================================

Alternate solution by RavinderSingh13 also gives regular expression error

==> cat scr3.ksh
awk -vs1="'" -F"' " '
                        {if($0 ~ /<rs:data>/)
                                                {B=1;getline}
                        }
                        {if($0 ~ /<\/rs:data>/)
                                                {B=0}
                        }
                        B{for(i=1;i<=NF;i++)     {
                                                 gsub(/.*\=||[/><]/,X,$i);
                                                 A=A?A s1","$i:$i};
                                                 gsub(/\, $/,X,A);
                                                 gsub(/\,$/,X,A);
                                                 print A;A=""
                        }
                    '  Hospital.xml
==> ./scr3.ksh
awk: 0602-521 There is a regular expression error.
        [] imbalance
 The source line number is 9.
 The error context is
                                                                 >>>  gsub(/.*\=||[/> <<< <]/,X,$i);

=========================================================================

3rd Solution by Akhsay Hegde does not give any error but no output

==> cat scr4.ksh
awk -F"'"  '/^<z/{s=""; for (i=2; i<=NF; i+=2) s = sprintf("%s%s\047%s\047",s,(s?OFS:""),$i ); print s }' OFS=, Hospital.xml
==> ./scr4.ksh

No output

=========================================================================

4th Solution by RudiC gives error that function can not be parsed.

==> cat scr5.ksh
sed -n "/<rs:data>/,/<\/rs:data>/{s/^.*<rs:data>//;s/<\/rs:data>.*$//;s/^<z[^=']*=//;s/ [^=']*=/,/g;s/[^']*$//;p}" Hospital.xml
==> ./scr5.ksh
sed: Function /<rs:data>/,/<\/rs:data>/{s/^.*<rs:data>//;s/<\/rs:data>.*$//;s/^<z[^=']*=//;s/ [^=']*=/,/g;s/[^']*$//;p} 
     cannot be parsed.

=========================================================================

Try:

sed -n "/rs:data/,/\\rs:data/{//d; s/[^'=]*=//g; s/''/','/g; s/[^']*$//;p;}" Hospital.xml

or:

awk '$0~s{p=1-p; next} p{print FS $2, $4, $6 FS}' FS=\' OFS="','" s=rs:data Hospital.xml
1 Like

Yes, this one works. Thanks a lot, appreciate it.

:slight_smile:

==> cat scr6.ksh
sed -n "/rs:data/,/\\rs:data/{//d; s/[^'=]*=//g; s/''/','/g; s/[^']*$//;p;}" Hospital.xml


==> ./scr6.ksh
'John Doe','XYZ Hospital','Heart Health'
'Jane Doe','XYZ Hospital','Maternity'