Extract value from XML

I have a file like the one below:

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"><soap:Body><ns2:executeMDXResponse xmlns:ns2="http://webservices.quartetfs.com"><aggregates><axes><axis><name>ROWS</name><positions><position><members><member><depth>0</depth><dimensionName>AsOfDate</dimensionName><displayName>AllMember</displayName><levelName>ALL</levelName><path><items><item>AllMember</item></items></path></member></members></position><position><members><member><depth>1</depth><dimensionName>AsOfDate</dimensionName><displayName>04-01-2012</displayName><levelName>AsOfDate</levelName><path><items><item>AllMember</item><item>04-01-2012</item></items></path></member></members></position><position><members><member><depth>1</depth><dimensionName>AsOfDate</dimensionName><displayName>20-12-2011</displayName><levelName>AsOfDate</levelName><path><items><item>AllMember</item><item>20-12-2011</item></items></path></member></members></position><position><members><member><depth>1</depth><dimensionName>AsOfDate</dimensionName><displayName>12-12-2011</displayName><levelName>AsOfDate</levelName><path><items><item>AllMember</item><item>12-12-2011</item></items></path></member></members></position><position><members><member><depth>1</depth><dimensionName>AsOfDate</dimensionName><displayName>09-12-2011</displayName><levelName>AsOfDate</levelName><path><items><item>AllMember</item><item>09-12-2011</item></items></path></member></members></position></positions></axis></axes><cells><cell><formattedValue>3840769</formattedValue><ordinal>0</ordinal><value xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="xs:long">3840769</value></cell><cell><formattedValue>444930</formattedValue><ordinal>1</ordinal><value xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="xs:long">444930</value></cell><cell><formattedValue>1136654</formattedValue><ordinal>2</ordinal><value 
xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="xs:long">1136654</value></cell><cell><formattedValue>1081680</formattedValue><ordinal>3</ordinal><value xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="xs:long">1081680</value></cell><cell><formattedValue>1177505</formattedValue><ordinal>4</ordinal><value xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="xs:long">1177505</value></cell></cells><slicerAxis><name>SlicerAxis</name><positions><position><members><member><depth>0</depth><dimensionName>Measures</dimensionName><displayName>contributors.COUNT</displayName><levelName>Measures</levelName><path><items><item>contributors.COUNT</item></items></path></member></members></position></positions></slicerAxis></aggregates></ns2:executeMDXResponse></soap:Body></soap:Envelope>

It is not indented and everything is on one line, so if I search with grep or sed for a particular tag and the value between its tags, the whole file is returned.
Can anyone tell me how I can search it?

I need to extract the dates between the displayName tags.

$ awk -F"</?displayName>" '{for(i=1;++i<=NF;) if(length($i)==10) print $i}' yourfile.xml
04-01-2012
20-12-2011
12-12-2011
09-12-2011
awk '/displayName/ && $2~/^[0-9][0-9]-/{print $2}' FS="[><]" RS='><' xmlFile
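To see why $2 holds the value in the command above, it can help to print the records awk actually sees. This is only a minimal sketch: it assumes an awk that treats a multi-character RS as a regular expression (gawk and mawk do; POSIX leaves it unspecified), the sample string is a made-up fragment in the same shape as the file, and -v / -F are used instead of the trailing FS= RS= assignments purely for readability.

```shell
# Hypothetical fragment, all on one line like the original file.
sample='<member><depth>1</depth><displayName>04-01-2012</displayName></member>'

# RS="><" ends a record at every "><" boundary between two tags.
# -F"[><]" then splits each record on any remaining "<" or ">".
records=$(printf '%s' "$sample" |
  awk -v RS='><' -F'[><]' '{ printf "%d: $1=[%s] $2=[%s]\n", NR, $1, $2 }')

printf '%s\n' "$records"
```

Each record ends up being one opening tag plus its text, so $1 is the tag name and $2 is the value between the tags, which is what the /displayName/ and $2 ~ /^[0-9][0-9]-/ tests rely on.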

Hello there,
how could I get the value between the formattedValue tags using the logic below,
which is based on the ordinal tag?

Let's say, in my first XML posted at the top:

<formattedValue>1177505</formattedValue> 
  <ordinal>4</ordinal> 
  <value xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="xs:long">1177505</value> 

The ordinal value is 4, so it should look up the 4th dated displayName tag
(ordinal value 0 is skipped)

and the output will be

09-12-2011  1177505

as each ordinal tag relates to a displayName tag sequentially,

so the final output should be

04-01-2012    444930     as ordinal 1 and its formattedValue
20-12-2011    1136654    as ordinal 2 and its formattedValue
12-12-2011    1081680    as ordinal 3 and its formattedValue
09-12-2011    1177505    as ordinal 4 and its formattedValue

Try this:

awk '
/displayName/ && $2~/^[0-9][0-9]-/{dt[++cnt]=$2}
/^formattedValue>/{fv=$2; getline; print dt[$2],fv,$2 }
' FS="[><]" RS='><' xmlFile 

It assumes that the ordinal tag is the next tag right after the formattedValue tag. If that is not always the case, you could try this slightly more general approach:

awk '
/displayName/ && $2~/^[0-9][0-9]-/{dt[c1++]=$2}
/^formattedValue>/{fv[c2++]=$2}
/^ordinal>/{o[c3++]=$2}
END{
  for(i=0; i<c1; i++) 
     print dt[o[i]],fv[o[i]+1]
}' FS="[><]" RS='><' xmlFile
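If neither the tag order inside a cell nor the index arithmetic should matter, another option is to key each formattedValue by the ordinal found in the same cell and look it up per date. The following is only a sketch under assumptions: the sample is a shortened, hypothetical cut of the posted file, the awk must accept a multi-character RS (gawk, mawk), and the n-th date is taken to belong to ordinal n, as described in the question.

```shell
# Shortened, hypothetical version of the posted XML (two dates, three cells).
sample='<axes><axis><positions><position><displayName>AllMember</displayName></position><position><displayName>04-01-2012</displayName></position><position><displayName>20-12-2011</displayName></position></positions></axis></axes><cells><cell><formattedValue>3840769</formattedValue><ordinal>0</ordinal></cell><cell><formattedValue>444930</formattedValue><ordinal>1</ordinal></cell><cell><formattedValue>1136654</formattedValue><ordinal>2</ordinal></cell></cells>'

pairs=$(printf '%s' "$sample" | awk -v RS='><' -F'[><]' '
  # Dates are numbered from 1 in document order.
  $1 == "displayName" && $2 ~ /^[0-9][0-9]-/ { dt[++n] = $2 }
  # Remember the pieces of the current cell, in whatever order they appear.
  $1 == "formattedValue" { fv  = $2 }
  $1 == "ordinal"        { ord = $2 }
  # At the closing cell tag, store the value under its ordinal.
  $1 == "/cell"          { val[ord] = fv }
  END { for (k = 1; k <= n; k++) print dt[k], val[k] }
')

printf '%s\n' "$pairs"
```

Ordinal 0 (the AllMember total) is simply never looked up, because the dates start at index 1.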

This is amazing, mi. This is exactly what I was looking for.
Thanks again. Can you please point out where exactly I'm going wrong when I add a little formatting to the awk below?

awk '
/displayName/ && $2~/^[0-9][0-9]-/{dt[c1++]=$2}
/^formattedValue>/{fv[c2++]=$2}
/^ordinal>/{o[c3++]=$2}
END{
  for(i=0; i<c1; i++){
    cnt=split(dt[o[i]],a,"-")
    for (j=cnt,j<=1;j--){ date=a[j] }
    print date,fv[o[i]+1]
  }date=""} '  FS="[><]" RS='><' file.txt

I want the output to be

20111209 1177505   

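Well, you do have a bunch of syntax errors in this line:

for (j=cnt,j<=1;j--){ date=a[j] }

The commas, semicolons and logic in the for statement are messed up: the three parts of the header must be separated by semicolons, a loop counting down from cnt needs the test j>=1, and date=a[j] overwrites date on every pass instead of appending to it.
Here is a corrected version: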

awk '
/displayName/ && $2~/^[0-9][0-9]-/{dt[c1++]=$2}
/^formattedValue>/{fv[c2++]=$2}
/^ordinal>/{o[c3++]=$2}
END{
  for(i=0; i<c1; i++) {
    split(dt[o[i]],a,"-")
    print a[3] a[2] a[1],fv[o[i]+1]
  }
}' FS="[><]" RS='><' xmlFile

You know, mi, I tried the same thing on the first one, which goes like this, and it works fine.

awk '/displayName/ && $2~/^[0-9][0-9]-/{dt[++cnt]=$2}
> /^formattedValue>/{fv=$2; getline;
> if ( dt[$2] == "" ){
>  print "==================================="
> print "TotalValue", fv,$2
>  print "===================================" }
> else
> { cnt=split( dt[$2], a, "-")
> { name=a[3]a[2]a[1] }
> print name,fv,$2
> }name=""}' FS="[><]" RS='><' output.txt
===================================
TotalValue 1721994 0
===================================
20120105 1141169 1
20120104 580825 2

But for the second one (which is a better way of representing it), I was thinking of splitting the date into an array and then joining it back with a for loop over that array, rather than joining the pieces by their hardcoded indices. That is where I went crazy, but thanks anyway.
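For reference, the loop-based join described here could look roughly like the sketch below. It is a standalone illustration on a single hardcoded date taken from the thread, not a drop-in replacement for the END block; the variable names are only for the example.

```shell
reversed=$(awk 'BEGIN {
  n = split("09-12-2011", a, "-")   # a[1]="09", a[2]="12", a[3]="2011"
  date = ""                         # reset before building each date
  for (j = n; j >= 1; j--)          # semicolons in the header, count down while j >= 1
    date = date a[j]                # append the piece rather than overwrite
  print date
}')

printf '%s\n' "$reversed"
```

This prints 20111209. Inside the END block the same loop would run once per date, with date reset to "" before the inner loop rather than after the outer one.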

Have you ever used the xmllint command and tried XPath with it? It should be able to extract anything from XML.
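As a rough idea of what that could look like, here is a sketch that assumes an xmllint build with the --xpath option (older libxml2 releases do not have it). The file content is a shortened, hypothetical cut of the posted XML. The elements from aggregates downwards carry no namespace prefix in the posted file, so plain element names are assumed to be enough in the XPath.

```shell
# Shortened, hypothetical version of the posted XML, written to a temp file.
xmlfile=$(mktemp)
cat > "$xmlfile" <<'EOF'
<aggregates><axes><axis><positions><position><displayName>AllMember</displayName></position><position><displayName>04-01-2012</displayName></position></positions></axis></axes><cells><cell><formattedValue>3840769</formattedValue><ordinal>0</ordinal></cell><cell><formattedValue>444930</formattedValue><ordinal>1</ordinal></cell></cells></aggregates>
EOF

if command -v xmllint >/dev/null 2>&1; then
  # formattedValue of the cell whose ordinal is 1
  value=$(xmllint --xpath 'string(//cell[ordinal=1]/formattedValue)' "$xmlfile")
  # displayName of the matching position: ordinal 1 is the 2nd position
  name=$(xmllint --xpath 'string((//position//displayName)[2])' "$xmlfile")
  printf '%s %s\n' "$name" "$value"
fi

rm -f "$xmlfile"
```

Wrapping the expression in string() returns a single value, which avoids depending on how a given xmllint version separates multiple matching nodes.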

This is not as good a solution as the awk ones, but it works:

( cat XMLFILE|tr '>' '\012'|egrep  "</formattedValue$"|cut -d "<" -f1 >FILE1 ;cat XMLFILE|tr '>' '\012'|egrep  "</displayName$"|cut -d "<" -f1 >FILE2;paste FILE1 FILE2|grep -- - ;rm FILE1 FILE2 )

FYI: I tried this on an RHE4 machine

$ cat display|tr '>' '\012'|egrep  "</displayName$"|cut -d "<" -f1 >displayName
$ cat display|tr '>' '\012'|egrep  "</formattedValue$"|cut -d "<" -f1 >formattedValue
$ paste displayName formattedValue|grep -- - ;rm displayName formattedValue
04-01-2012      444930
20-12-2011      1136654
12-12-2011      1081680
09-12-2011      1177505

---------- Post updated 01-08-12 at 12:23 AM ---------- Previous update was 01-07-12 at 11:25 PM ----------

@chakrapani: Can you please post the command?
I tried but was unable to get the required output.