Extract values from xml file script

Ophiuchus · July 1, 2018, 2:33am

Hi, please help with this. I want to extract values from an XML file and print them in a specific format.

<ProjectName> --> appears only once
<StructList> --> is the top node
<Struct> node --> there can be more than one
NameID, STX, STY, PRX, PRY --> each appears only once within each <Struct> node
<PR_Ranges> node --> appears only once, but it can contain more than one <RangesInfo>
I want to extract the children (OD, ODF, ODRangeStart and ODRangeStop) of each <RangesInfo>

I want to print the values for each <Struct> node in a single line with this format

ProjectName|NameID|STX-STY|PRX-PRY|OD-ODF|ODRangeStart|ODRangeStop

My input XML, my current awk code, and its current (wrong) output are below

echo "<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ProjectInfo>
<ProjectName>HY-LKL</ProjectName>
<StructList>
    <Struct>
    <StructData>    
    <NameID>ROPSL</NameID>    
            <STR_VAL>
            <STX>210</STX>
            <STY>21</STY>
            </STR_VAL>
            <PRO_VAL>
            <PRX>62</PRX>
            <PRY>822</PRY>
            </PRO_VAL>
            <PR_Ranges>
                <RangesInfo>
                <ValueRange>
                    <OD>22</OD>
                    <ODF>3199</ODF>
                </ValueRange>
                </RangesInfo>
                <RangesInfo>
                <ValueRange>
                    <OD>22</OD>
                    <ODF>023</ODF>
                    <ODRange>
                    <ODRangeStart>00</ODRangeStart>
                    <ODRangeStop>99</ODRangeStop>
                    </ODRange>
                </ValueRange>
                </RangesInfo>
            </PR_Ranges>      
    </StructData>
    </Struct>  
    <Struct>
    <StructData>
    <NameID>MACLS</NameID>      
            <STR_VAL>
            <STX>210</STX>
            <STY>01</STY>
            </STR_VAL>
            <PRO_VAL>
            <PRX>62</PRX>
            <PRY>816</PRY>
            </PRO_VAL>
            <PR_Ranges>
                <RangesInfo>
                <ValueRange>
                    <OD>22</OD>
                    <ODF>010</ODF>
                    <ODRange>
                    <ODRangeStart>00</ODRangeStart>
                    <ODRangeStop>99</ODRangeStop>
                    </ODRange>
                </ValueRange>
                </RangesInfo>
            </PR_Ranges>              
    </StructData>
    </Struct>
</StructList>   
</ProjectInfo>" | 

awk -F"<|>" '
BEGIN{print "ProjectName|NameID|STX-STY|PRX-PRY|OD-ODF|ODRangeStart|ODRangeStop"}
/ProjectName/{printf "%s",$3}
/NameID/{id=$3}
/STX/{stx=$3}
/STY/{sty=$3}
/PRX/{prx=$3}
/PRY/{pry=$3}
/OD/ {od=$3}
/ODF/{odf=$3}
/ODRangeStart/{rngStart=$3}
/ODRangeStop/ {rngStop=$3
printf "|%s|%s-%s|%s-%s|%s-%s|%s|%s\n",id,stx,sty,prx,pry,od,odf,rngStart,rngStop
stx=sty=prx=pry=od=odf=rngStart=rngStop=""
}
'

My current output (not desired output)

ProjectName|NameID|STX-STY|PRX-PRY|OD-ODF|ODRangeStart|ODRangeStop
HY-LKL|ROPSL|210-21|62-822|99-023|00|99
|MACLS|210-01|62-816|99-010|00|99

My desired output

ProjectName|NameID|STX-STY|PRX-PRY|OD-ODF|ODRangeStart|ODRangeStop
HY-LKL|ROPSL|210-21|62-822|22-3199||
||||22-023|00|99
|MACLS|210-01|62-816|22-010|00|99

Thanks in advance for any help.

joker · July 1, 2018, 3:11pm

Hi,

I suggest using an XML tool to parse an XML file. Look at the thread here for some tools:

XPath is the name of the search syntax you can use to find values:

Some Examples:

Get the Projectname

xmllint --xpath "//ProjectName/text()" file.xml

Get all NameIDs

xmllint --xpath "//NameID/text()" file.xml

Get STX for a section NameID "MACLS"

xmllint --xpath "//*[NameID[text()='MACLS']]/STR_VAL/STX/text()" file.xml

That is a lot of xmllint calls. I myself would use a scripting language that has an XML library. But Bash with xmllint should be possible too, albeit not as fast.

Chubler_XL · July 1, 2018, 6:11pm

This appears to work OK for your sample input:

awk -F"<|>" '
BEGIN{print "ProjectName|NameID|STX-STY|PRX-PRY|OD-ODF|ODRangeStart|ODRangeStop"}
/ProjectName/{printf "%s",$3}
/NameID/{id=$3}
/<STX>/{stx=$3}
/<STY>/{sty=$3}
/<PRX>/{prx=$3}
/<PRY>/{pry=$3}
/<OD>/ {od=$3}
/ODF>/{odf=$3}
/ODRangeStart/{rngStart=$3}
/ODRangeStop/ {rngStop=$3}
/<.ValueRange/ {
printf "|%s|%s|%s|%s|%s|%s\n",id,stx?stx"-"sty:"",prx?prx"-"pry:"",od?od"-"odf:"",rngStart,rngStop
id=stx=sty=prx=pry=od=odf=rngStart=rngStop=""
}
'
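The main change from the code in post #1 is that the patterns are anchored on the full tag. A minimal sketch of why that matters, assuming input with one element per line (the function name show_matches is only for illustration): a bare /OD/ also matches the ODF, ODRangeStart and ODRangeStop lines, so od keeps getting overwritten, while /<OD>/ matches only the OD element.

```shell
# Count how many lines each pattern matches on a few sample elements.
show_matches() {
    awk -F'<|>' '
    /OD/   { loose++ }     # unanchored: also hits ODF, ODRangeStart, ODRangeStop
    /<OD>/ { strict++ }    # anchored on the full opening tag
    END    { print "loose=" loose, "strict=" strict }
    '
}

printf '%s\n' '<OD>22</OD>' '<ODF>023</ODF>' \
    '<ODRangeStart>00</ODRangeStart>' '<ODRangeStop>99</ODRangeStop>' | show_matches
```

On these four lines this prints loose=4 strict=1.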

Ophiuchus · July 1, 2018, 11:05pm

Hi stomp, thanks for your answer and suggestion. I'll keep this XML tool in mind, but for now I think I'm close to getting the desired output with awk.

Hi Chubler_XL,

Thanks. It works, but with a real input XML it prints somewhat different output. This was my fault:
to make the sample input shorter, I left out some nodes.

Below I present a more representative sample file.

The nodes STR_VAL and PRO_VAL have the same structure as before. The issue is that there is a parent node called <GROUP_Ranges> that contains
the children <XB_Ranges>, <PR_Ranges> and <KJ_Ranges>. Each of these children has the same sub-children named OD, ODF, ODRange, etc.
The desired output remains the same: I only want to extract the sub-children of <PR_Ranges>. Because my first sample file was less representative, your
current solution also prints the sub-children of <XB_Ranges> and <KJ_Ranges>. In addition, the values of STX, STY, PRX, and PRY are not
printed when the input file looks like this second sample file.

Maybe you can help me fix this: how can I print the same output as before, but using only the values from PR_Ranges?

The input file 2 is:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<ProjectInfo>
<ProjectName>ABDFC</ProjectName>
<StructList>
    <Struct>
    <StructData>    
    <NameID>ROPSL</NameID>  
            <GROUP_Ranges>
              <XB_Ranges>
                <RangesInfo>
                  <ValueRange>
                    <OD>534</OD>
                    <ODF>91</ODF>
                    <ODRange>
                      <ODRangeStart>00</ODRangeStart>
                      <ODRangeStop>99</ODRangeStop>
                    </ODRange>
                  </ValueRange>
                </RangesInfo>
              </XB_Ranges>
              <PR_Ranges>
                <RangesInfo>
                  <ValueRange>
                    <OD>534</OD>
                    <ODF>91</ODF>
                    <ODRange>
                      <ODRangeStart>56</ODRangeStart>
                      <ODRangeStop>879</ODRangeStop>
                    </ODRange>
                  </ValueRange>
                </RangesInfo>
                <RangesInfo>
                <ValueRange>
                    <OD>92</OD>
                    <ODF>21</ODF>
                    <ODRange>
                    <ODRangeStart>100</ODRangeStart>
                    <ODRangeStop>299</ODRangeStop>
                    </ODRange>
                </ValueRange>
                </RangesInfo>				
              </PR_Ranges>
              <KJ_Ranges>
                <ValueRange>
                  <OD>534</OD>
                  <ODF>91</ODF>
                  <ODRange>
                    <ODRangeStart>440</ODRangeStart>
                    <ODRangeStop>449</ODRangeStop>
                  </ODRange>
                </ValueRange>
              </KJ_Ranges>
            </GROUP_Ranges>
            <STR_VAL>
              <STX>283</STX>
              <STY>84</STY>
            </STR_VAL>
            <PRO_VAL>
              <PRX>534</PRX>
              <PRY>91</PRY>
            </PRO_VAL>	     
    </StructData>
    </Struct>  
</StructList>   
</ProjectInfo>

The output for this input file 2 would be like this:

ProjectName|NameID|STX-STY|PRX-PRY|OD-ODF|ODRangeStart|ODRangeStop
ABDFC|ROPSL|283-84|534-91|534-91|56|879
||||92-21|100|299

Thanks

Chubler_XL · July 1, 2018, 11:43pm

A little confused with this. The <STR_VAL> block for the first output line appears after the <ODRange> block for the 2nd output line.

How do you match up STX-STY values in the XML with their appropriate output lines?

Ophiuchus · July 2, 2018, 12:25am

Actually, <STR_VAL> and <PRO_VAL> come after <ValueRange>/<ODRange>. In the first sample, STR_VAL appears before them for the reason I explained: that sample was not very representative.

All the values related to each <NameID>, each STX-STY and each PRX-PRY are inside each <Struct> node. I don't know if this answers your doubts.

Don_Cragun · July 2, 2018, 1:38am

No. You have not answered our doubts. You gave us sample input in post #1 and you showed us the output you wanted from that input. And, you were given code that produced that output from that input and then you changed your requirements.

In post 4 you gave us new sample input and you said "The output for this input file 2 would be like this:" and you showed us some output. But with that wording, I don't know if you are saying that that is the output you get from some code that has been suggested (but not what you want), that it is the output you get from some other code that you're using (but not what you want), or if it is the output you want from that new input.

Furthermore, you haven't clearly specified whether the original input you provided in post #1 was valid input that did not include data that was needed to trigger special cases that were missing from your original algorithm (and the code you want should still provide the output you said you want from that input in post #1) or if the input you provided in post #1 was not valid input and everything you said about the output you wanted to be produced from that input should be ignored.

Ophiuchus · July 2, 2018, 2:02am

Hi Don,

1) The output I show for sample 2 is my desired output for sample file 2, written manually by me and not produced by any code (I don't have any other code).
2) Yes, the input file in post #1 was valid input that did not include the data needed to trigger special cases. I noticed that the sample file needed more data to cover those special cases when I tested Chubler_XL's code on sample #2.
3) The code provided by Chubler_XL works for the sample in post #1. The output structure I asked for in post #1 is the same as the output structure for sample #2,
but sample #2 includes some other nodes (<XB_Ranges> and <KJ_Ranges>) whose sub-children have the same names. That makes the script mix up the values in the output. When I tested Chubler_XL's solution on sample #2, I realized the output was not the desired one. At that point I saw that those other nodes needed to be added so that the script could take them into account.
4) In addition, in the real file the nodes <STR_VAL> and <PRO_VAL> come after <PR_Ranges>. Since the real XML is rather large, with a lot of other nodes, I focused only on the nodes I want to
extract, and the sample I created for post #1 was not accurate in the order of nodes.

I hope this is clearer.

Thanks for the help and sorry for any inconvenience.

Chubler_XL · July 2, 2018, 4:28pm

How about this:

awk -F"<|>" '
BEGIN{print "ProjectName|NameID|STX-STY|PRX-PRY|OD-ODF|ODRangeStart|ODRangeStop"}
/ProjectName/{printf "%s",$3}
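# At the end of each <StructData> block, print one line per collected range
/<.StructData>/{
    for(i=1; i<=rngln; i++) {
       printf "|%s|%s|%s|%s|%s|%s\n",\
          id,stx?stx"-"sty:"",\
          prx?prx"-"pry:"",\
          od[i]?od[i]"-"odf[i]:"",\
          rngStart[i],rngStop[i]
       delete od[i]
       delete odf[i]
       delete rngStart[i]
       delete rngStop[i]
       # Blank these so they only appear on the first line of the block
       id=stx=sty=prx=pry=""
    }
    rngln=0
}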
/NameID/{id=$3}
/<STX>/{stx=$3}
/<STY>/{sty=$3}
/<PRX>/{prx=$3}
/<PRY>/{pry=$3}
/<PR_Ranges>/ {PRActive=1}
!PRActive { next }
/<.PR_Ranges>/ {PRActive=0}
/<OD>/ {od[++rngln]=$3}
/ODF>/{odf[rngln]=$3}
/ODRangeStart/{rngStart[rngln]=$3}
/ODRangeStop/{rngStop[rngln]=$3}
' infile2
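The part that restricts the output to <PR_Ranges> is the PRActive flag. Here is a stripped-down sketch of just that idea, assuming one element per line as in the samples (the function name pr_only and the tiny input are made up for illustration): the flag is set on the opening tag, cleared on the closing tag, and values are only picked up while it is set.

```shell
# Print only the OD values found between <PR_Ranges> and </PR_Ranges>.
pr_only() {
    awk -F'<|>' '
    /<PR_Ranges>/    { active = 1 }   # entering the section of interest
    /<\/PR_Ranges>/  { active = 0 }   # leaving it again
    active && /<OD>/ { print $3 }     # $3 is the text between the tags
    '
}

printf '%s\n' '<XB_Ranges>' '<OD>534</OD>' '</XB_Ranges>' \
    '<PR_Ranges>' '<OD>92</OD>' '</PR_Ranges>' \
    '<KJ_Ranges>' '<OD>7</OD>' '</KJ_Ranges>' | pr_only
```

Only 92 is printed here; the OD values under XB_Ranges and KJ_Ranges are skipped because the flag is not set when they are read.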

Ophiuchus · July 2, 2018, 8:55pm

What can I say? It works exactly as I was hoping.

I also tested moving the "delete array[i]" statements outside the for loop as "delete array", and that works too.

Many thanks for your help again.

Don_Cragun · July 2, 2018, 10:16pm

The standards specify that every conforming version of awk must support delete array[subscript]. Some implementations of awk also support delete array, but that is not required by the standards. On systems that do not support delete array, you can use split("", array) to get the same effect.
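A small sketch of the split("", array) approach, in case it helps. The function names clear_demo and count and the sample values are only for illustration; count loops over the keys instead of relying on length(array), which not every awk accepts for arrays.

```shell
# Fill an array, empty it with split("", array), and show the sizes.
clear_demo() {
    awk '
    # Count elements by iterating over the keys.
    function count(a,   k, n) { n = 0; for (k in a) n++; return n }
    BEGIN {
        split("22 3199 00 99", vals, " ")
        print "before: " count(vals)
        split("", vals)            # empties vals without using "delete vals"
        print "after: " count(vals)
    }'
}

clear_demo
```

This prints before: 4 and then after: 0.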

So, if you're trying to write code that will work on every implementation, avoid using delete array. If you're just writing code to be used on your current system,