Extract and parse XML data (statistic value) to csv

Hi All,

I need to parse some statistic data from the "measInfo" -eg. 25250000 (as highlighted) and return the result into line by line, and erasing all other unnecessary info/tag.
Thought of starting with grep "measInfoID="25250000" but this only returns 1 line. How do I get all the output below this measInfoID? and return each of the value, line by line as per my desired output? I am assuming sed is needed to erase some of the data, and perhaps awk to loop?

Any help would be appreciated. Thanks all

Long xml data

.
.
.
.
<measInfo measInfoId="15150000">
<granPeriod duration="PT3600S" endTime="2011-12-19T11:00:00+11:00"/>
<repPeriod duration="PT3600S"/>
<measTypes>15153111 15153112 15153119 15153120 15153121 15153123 15153124 15153127 15153128 15154169 15154778 15150512 15151757 15151758 15151759 15159900 </measTypes>
<measValue measObjLdn="Label=Site-O:MD0035-O-A-2, ID=59135">
<measResults>0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MA0340-O-A-2, ID=56575">
<measResults>0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MD8001-O-A-3, ID=59646">
<measResults>0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </measResults>
</measValue> 
</measInfo>
<measInfo measInfoId="25250000">
<granPeriod duration="PT3600S" endTime="2011-12-19T11:00:00+11:00"/>
<repPeriod duration="PT3600S"/>
<measTypes>25254177 25254178 25254179 25254806 25254807 25254808 25254809 25254810 25254811 25254812 25254860 25254861 25254862 25254863 25254864 </measTypes>
<measValue measObjLdn="Label=Site-O:MD0035-O-A-2, ID=59135">
<measResults>0 0 0 27300 100194 141378 2282 0 0 379 5849362 0 0 2497 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MA0340-O-A-2, ID=56575">
<measResults>0 0 0 2099 11649 11091 28 0 0 74 249108 0 0 119 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MD8001-O-A-3, ID=59646">
<measResults>0 0 0 0 549 0 0 0 0 0 1967 0 0 0 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MA0056-O-A-2, ID=59155">
<measResults>0 0 0 0 1571 37 0 0 0 41 24453 0 0 0 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MA0056-O-A-1, ID=59154">
<measResults>0 0 0 1349 4921 878 0 0 0 48 24651 0 0 0 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MA0146-O-A-3, ID=57106">
<measResults>0 0 0 0 7018 106949 0 0 0 10 3928360 0 0 0 0 </measResults>
</measValue>
.
.
. (a lot more data).
.
<measValue measObjLdn="Label=Site-O:MA0120-O-B-3, ID=12561">
<measResults>0 0 0 8021 31504 1743 53 0 0 12 3939629 0 0 0 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MA8105-O-A-3, ID=58896">
<measResults>0 0 0 0 2807 195 0 0 0 0 50977 0 0 0 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MA0289-O-A-3, ID=57616">
<measResults>0 0 0 0 15665 10976 0 0 0 4 692551 0 0 831 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MA0146-O-A-1, ID=57104">
<measResults>0 0 0 0 1884 237 0 0 0 1 13943 0 0 0 0 </measResults>
</measValue>
</measInfo>
<measInfo measInfoId="25350000">
<granPeriod duration="PT3600S" endTime="2011-12-19T11:00:00+11:00"/>
<repPeriod duration="PT3600S"/>
<measTypes>25353111 25353112 25353119 25353120 25353121 25353123 25353124 25353127 25353128 25354169 25354778 25350512 25351757 25351758 25351759 25359900 </measTypes>
<measValue measObjLdn="Label=Site-O:MD0035-O-A-2, ID=59135">
<measResults>0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MA0340-O-A-2, ID=56575">
<measResults>0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </measResults>
</measValue>
<measValue measObjLdn="Label=Site-O:MD8001-O-A-3, ID=59646">
<measResults>0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 </measResults>
</measValue> 
</measInfo>
.
.
.
.
.

Desired output
And the desired output should be in csv format (not sure if "," is needed...I just want the easily further processed with awk ' using variable $1...$n)

Site-O:MD0035-O-A-2, ID=59135, 0, 0, 0, 27300, 100194, 141378, 2282, 0, 0, 379, 5849362, 0, 0, 2497
Site-O:MA0340-O-A-2, ID=56575, 0, 0, 0, 2099, 11649, 11091, 28, 0, 0, 74, 249108, 0, 0, 119, 0
Site-O:MD8001-O-A-3, ID=59646, 0, 0, 0, 0, 549, 0, 0, 0, 0, 0, 1967, 0, 0, 0, 0
Site-O:MA0056-O-A-2, ID=59155, 0, 0, 0, 0, 1571, 37, 0, 0, 0, 41, 24453, 0, 0, 0, 0
Site-O:MA0056-O-A-1, ID=59154, 0, 0, 0, 1349, 4921, 878, 0, 0, 0, 48, 24651, 0, 0, 0, 0
Site-O:MA0146-O-A-3, ID=57106, 0, 0, 0, 0, 7018, 106949, 0, 0, 0, 10, 3928360, 0, 0, 0, 0 
.
.
. (a lot more data).
.
Site-O:MA0120-O-B-3, ID=12561, 0, 0, 0, 8021, 31504, 1743, 53, 0, 0, 12, 3939629, 0, 0, 0, 0
Site-O:MA8105-O-A-3, ID=58896, 0, 0, 0, 0, 2807, 195, 0, 0, 0, 0, 50977, 0, 0, 0, 0
Site-O:MA0289-O-A-3, ID=57616, 0, 0, 0, 0, 15665, 10976, 0, 0, 0, 4, 692551, 0, 0, 831, 0
Site-O:MA0146-O-A-1, ID=57104, 0, 0, 0, 0, 1884, 237, 0, 0, 0, 1, 13943, 0, 0, 0, 0

Try this out:

mID=25250000
sed -n '/<measInfo measInfoId="'$mID'">/,/<\/measInfo>/  {/^<measValue / {s/.*Label=\([^"]*\).*/\1/ ;x; n;  s/^<measResults>\([0-9 ]*\).*/\1/ ;H; x; s/\n//; p  } }' stats.xml  

Let me explain this mess :slight_smile: :

mID=25250000  #use a variable
sed -n '   
/<measInfo measInfoId="'$mID'">/,/<\/measInfo/{  #consider only the section inbetween measInfo tags
    /^<measValue / {  #on lines that start with measValue tag
        s/.*Label=\([^"]*\).*/\1/ ;  #get the stuff behind 'Label='
        x;    #and put it into hold buffer
        n;    #read next line
        s/^<measResults>\([0-9 ]*\).*/\1/ ;  #extract just the numbers 
        H;    #and append them to hold buffer
        x;     #retrieve hold buffer
        s/\n//;  #get rid of an extra newline
        p      #and print out
    } 
}' stats.xml 

This assumes that the <measResults> data is always on the next line after <measValue>.
If you put this code into your script, make sure to keep the comments :slight_smile:

Have you checked xmllint and using Xpath ?

Thanks mirni.
That code is very complicated lol. I can never understand sed as its syntax is too confusing. I tried it but it returned no result. Something must have gone wrong.

root@localhost:~/xmlproj> sed -n '/<measInfo measInfoId="25250000">/,/<\/measInfo>/  {/^<measValue / {s/.*Label=\([^"]*\).*/\1/ ;x; n;  s/^<measResults>\([0-9 ]*\).*/\1/ ;H; x; s/\n//; p  } }' sampleCellbasedscript.txt
root@localhost:~/xmlproj> sed -n '/<measInfo measInfoId="25250000">/,/<\/measInfo>/  {/^<measValue / {s/.*Label=\([^"]*\).*/\1/ ;x; n;  s/^<measResults>\([0-9 ]*\).*/\1/ ;H; x; s/\n//; p  } }' sampleCellbasedscript.txt
root@localhost:~/xmlproj>

Hi chapakrani,
Whats xmlint and Xpath? I tried searching online for xml to csv parser but I could not find any useful one.

Hmm... it works for me:

$ sed -n '/<measInfo measInfoId="25250000">/,/<\/measInfo>/  {/^<measValue / {s/.*Label=\([^"]*\).*/\1/^<measResults>\([0-9 ]*\).*/\1/ ;H; x; s/\n//; p  } }' stats.xml 
Site-O:MD0035-O-A-2, ID=591350 0 0 27300 100194 141378 2282 0 0 379 5849362 0 0 2497 0 
Site-O:MA0340-O-A-2, ID=565750 0 0 2099 11649 11091 28 0 0 74 249108 0 0 119 0 
Site-O:MD8001-O-A-3, ID=596460 0 0 0 549 0 0 0 0 0 1967 0 0 0 0 
Site-O:MA0056-O-A-2, ID=591550 0 0 0 1571 37 0 0 0 41 24453 0 0 0 0 
Site-O:MA0056-O-A-1, ID=591540 0 0 1349 4921 878 0 0 0 48 24651 0 0 0 0 
Site-O:MA0146-O-A-3, ID=571060 0 0 0 7018 106949 0 0 0 10 3928360 0 0 0 0 
Site-O:MA0120-O-B-3, ID=125610 0 0 8021 31504 1743 53 0 0 12 3939629 0 0 0 0 
Site-O:MA8105-O-A-3, ID=588960 0 0 0 2807 195 0 0 0 0 50977 0 0 0 0 
Site-O:MA0289-O-A-3, ID=576160 0 0 0 15665 10976 0 0 0 4 692551 0 0 831 0 
Site-O:MA0146-O-A-1, ID=571040 0 0 0 1884 237 0 0 0 1 13943 0 0 0 0 

Where stats.xml is the copied'n'pasted stuff from your first post.
Does your file sampleCellbasedscript.txt contain exactly what you posted? Are there by any chance any whitespace characters at the beggining of the line with measValue tag?

Sed is infamous for its obscurity, but that is just on the first sight. Once you understand how it works, it is no mystery.

Try this:

sed -n '/<measInfo measInfoId="25250000">/,/<\/measInfo>/ p' inputFile

It should print the section between measInfo tags.

Another way, using awk:

awk -v ID="2520000" '/<measInfo / {  P=match($0, "\""ID"\""); } P; /<\/measInfo/ { P=0 }' datafile.xml

Thank you so much, Mirni.
Yes, there are 4 white spaces before the meaValue and 9 before meaResult. I have tried to ammend the code to following, and it now gives close to my desidered result. However, how do I add extra "," in between the last string returned by sedding <measResults> ?

root@localhost:~/xmlproj> sed -n '/<measInfo measInfoId="25250000">/,/<\/measInfo>/ {/^<measValue / {s/.*Label=\([^"]*\)./\1/ ;x; n; s/^<measResults>\([0-9 ]*\)./\1/ ;H; x; s/\n//; p } }' sampleCellbasedscript.txt
(initial code - no result due to white spaces)

root@localhost:~/xmlproj> sed -n '/<measInfo measInfoId="25250000">/,/<\/measInfo>/ {/<measValue / {s/.*Label=\([^"]*\)./\1/ ;x; n; s/^<measResults>\([0-9 ]*\)./\1/ ;H; x; s/\n//; p } }' sampleCellbasedscript.txt
Site-O:MD0035-O-A-2, ID=59135 <measResults>0 0 0 27300 100194 141378 2282 0 0 379 5849362 0 0 2497 0 </measResults>
Site-O:MA0340-O-A-2, ID=56575 <measResults>0 0 0 2099 11649 11091 28 0 0 74 249108 0 0 119 0 </measResults>
Site-O:MD8001-O-A-3, ID=59646 <measResults>0 0 0 0 549 0 0 0 0 0 1967 0 0 0 0 </measResults>
Site-O:MA0056-O-A-2, ID=59155 <measResults>0 0 0 0 1571 37 0 0 0 41 24453 0 0 0 0 </measResults>
Site-O:MA0056-O-A-1, ID=59154 <measResults>0 0 0 1349 4921 878 0 0 0 48 24651 0 0 0 0 </measResults>
Site-O:MA0146-O-A-3, ID=57106 <measResults>0 0 0 0 7018 106949 0 0 0 10 3928360 0 0 0 0 </measResults>
Site-O:MA0120-O-B-3, ID=12561 <measResults>0 0 0 8021 31504 1743 53 0 0 12 3939629 0 0 0 0 </measResults>
Site-O:MA8105-O-A-3, ID=58896 <measResults>0 0 0 0 2807 195 0 0 0 0 50977 0 0 0 0 </measResults>
Site-O:MA0289-O-A-3, ID=57616 <measResults>0 0 0 0 15665 10976 0 0 0 4 692551 0 0 831 0 </measResults>
Site-O:MA0146-O-A-1, ID=57104 <measResults>0 0 0 0 1884 237 0 0 0 1 13943 0 0 0 0 </measResults>

root@localhost:~/xmlproj> sed -n '/<measInfo measInfoId="25250000">/,/<\/measInfo>/ {/<measValue / {s/.*Label=\([^"]*\)./\1/ ;x; n; s/.<measResults>\([0-9 ]*\).*/\1/ ;H; x; s/\n//; p } }' sampleCellbasedscript.txt
Site-O:MD0035-O-A-2, ID=591350 0 0 27300 100194 141378 2282 0 0 379 5849362 0 0 2497 0
Site-O:MA0340-O-A-2, ID=565750 0 0 2099 11649 11091 28 0 0 74 249108 0 0 119 0
Site-O:MD8001-O-A-3, ID=596460 0 0 0 549 0 0 0 0 0 1967 0 0 0 0
Site-O:MA0056-O-A-2, ID=591550 0 0 0 1571 37 0 0 0 41 24453 0 0 0 0
Site-O:MA0056-O-A-1, ID=591540 0 0 1349 4921 878 0 0 0 48 24651 0 0 0 0
Site-O:MA0146-O-A-3, ID=571060 0 0 0 7018 106949 0 0 0 10 3928360 0 0 0 0
Site-O:MA0120-O-B-3, ID=125610 0 0 8021 31504 1743 53 0 0 12 3939629 0 0 0 0
Site-O:MA8105-O-A-3, ID=588960 0 0 0 2807 195 0 0 0 0 50977 0 0 0 0
Site-O:MA0289-O-A-3, ID=576160 0 0 0 15665 10976 0 0 0 4 692551 0 0 831 0
Site-O:MA0146-O-A-1, ID=571040 0 0 0 1884 237 0 0 0 1 13943 0 0 0 0
root@localhost:~/xmlproj>

//add an "," in between the result after ID=XXXXX
root@localhost:~/xmlproj> sed -n '/<measInfo measInfoId="25250000">/,/<\/measInfo>/ {/<measValue / {s/.*Label=\([^"]*\)./\1,/ ;x; n; s/.<measResults>\([0-9 ]*\).*/\1/ ;H; x; s/\n//; p } }' sampleCellbasedscript.txt
Site-O:MD0035-O-A-2, ID=59135,0 0 0 27300 100194 141378 2282 0 0 379 5849362 0 0 2497 0
Site-O:MA0340-O-A-2, ID=56575,0 0 0 2099 11649 11091 28 0 0 74 249108 0 0 119 0
Site-O:MD8001-O-A-3, ID=59646,0 0 0 0 549 0 0 0 0 0 1967 0 0 0 0
Site-O:MA0056-O-A-2, ID=59155,0 0 0 0 1571 37 0 0 0 41 24453 0 0 0 0
Site-O:MA0056-O-A-1, ID=59154,0 0 0 1349 4921 878 0 0 0 48 24651 0 0 0 0
Site-O:MA0146-O-A-3, ID=57106,0 0 0 0 7018 106949 0 0 0 10 3928360 0 0 0 0
Site-O:MA0120-O-B-3, ID=12561,0 0 0 8021 31504 1743 53 0 0 12 3939629 0 0 0 0
Site-O:MA8105-O-A-3, ID=58896,0 0 0 0 2807 195 0 0 0 0 50977 0 0 0 0
Site-O:MA0289-O-A-3, ID=57616,0 0 0 0 15665 10976 0 0 0 4 692551 0 0 831 0
Site-O:MA0146-O-A-1, ID=57104,0 0 0 0 1884 237 0 0 0 1 13943 0 0 0 0

Question:
1) How to remove the space before ID
2) How to print , after each result just like the desired output?
3) If there are decimal point in the input file, then this code fails to output the float number. I think the ([0-9 ]*\).*/\1/ only returns any number between 0 to 9, and float number will fail. How do I resolve this?

I also tried piping the result and run a 2nd sed to print , yet could not erase the existing , (hence causing double ,,)
root@localhost:~/xmlproj> sed -n '/<measInfo measInfoId="25250000">/,/<\/measInfo>/ {/<measValue / {s/.*Label=\([^"]*\)./\1,/ ;x; n; s/.<measResults>\([0-9 ]*\).*/\1/ ;H; x; s/\n//; p } }' sampleCellbasedscript.txt | sed -e 's/ /,/g'
Site-O:MD0035-O-A-2,,ID=59135,0,0,0,27300,100194,141378,2282,0,0,379,5849362,0,0,2497,0,
Site-O:MA0340-O-A-2,,ID=56575,0,0,0,2099,11649,11091,28,0,0,74,249108,0,0,119,0,
Site-O:MD8001-O-A-3,,ID=59646,0,0,0,0,549,0,0,0,0,0,1967,0,0,0,0,
Site-O:MA0056-O-A-2,,ID=59155,0,0,0,0,1571,37,0,0,0,41,24453,0,0,0,0,
Site-O:MA0056-O-A-1,,ID=59154,0,0,0,1349,4921,878,0,0,0,48,24651,0,0,0,0,
Site-O:MA0146-O-A-3,,ID=57106,0,0,0,0,7018,106949,0,0,0,10,3928360,0,0,0,0,
Site-O:MA0120-O-B-3,,ID=12561,0,0,0,8021,31504,1743,53,0,0,12,3939629,0,0,0,0,
Site-O:MA8105-O-A-3,,ID=58896,0,0,0,0,2807,195,0,0,0,0,50977,0,0,0,0,
Site-O:MA0289-O-A-3,,ID=57616,0,0,0,0,15665,10976,0,0,0,4,692551,0,0,831,0,
Site-O:MA0146-O-A-1,,ID=57104,0,0,0,0,1884,237,0,0,0,1,13943,0,0,0,0,

Desired Output
Site-O:MD0035-O-A-2,ID=59135,0,0,0,27300,100194,141378,2282,0,0,379,5849362,0,0,2497,0,
Site-O:MA0340-O-A-2,ID=56575,0,0,0,2099,11649,11091,28,0,0,74,249108,0,0,119,0,
Site-O:MD8001-O-A-3,ID=59646,0,0,0,0,549,0,0,0,0,0,1967,0,0,0,0,
Site-O:MA0056-O-A-2,ID=59155,0,0,0,0,1571,37,0,0,0,41,24453,0,0,0,0,
Site-O:MA0056-O-A-1,ID=59154,0,0,0,1349,4921,878,0,0,0,48,24651,0,0,0,0,
Site-O:MA0146-O-A-3,ID=57106,0,0,0,0,7018,106949,0,0,0,10,3928360,0,0,0,0,
Site-O:MA0120-O-B-3,ID=12561,0,0,0,8021,31504,1743,53,0,0,12,3939629,0,0,0,0,
Site-O:MA8105-O-A-3,ID=58896,0,0,0,0,2807,195,0,0,0,0,50977,0,0,0,0,
Site-O:MA0289-O-A-3,ID=57616,0,0,0,0,15665,10976,0,0,0,4,692551,0,0,831,0,
Site-O:MA0146-O-A-1,ID=57104,0,0,0,0,1884,237,0,0,0,1,13943,0,0,0,0,

Any help would be appreciated. thanks guys.

---------- Post updated at 04:06 PM ---------- Previous update was at 04:04 PM ----------

Thanks corona688.
I tried the awk code but it didnt work. Hmm

Please please please, use code tags when posting input/output or any code.
Here. Added a few things to resolve 1., 2., 3.:

sed -n '/<measInfo measInfoId="25250000">/,/<\/measInfo>/ {/<measValue / {s/ //g; s/.*Label=\([^"]*\).*/\1/ ;x; n; s/^<measResults>\([0-9. ]*\).*/\1/ ;H; x; s/\n//; s/ /,/g; p } }'
1 Like

Thanks mirni.
It works! and great help indeed.