Help in parsing xml file (sed/nawk)

shekhar2010us · August 11, 2011, 7:32am

I have a large xml file as shown below:

<input>
    <blah>
    <blah>
        <atr="blah blah" value = "">
    <blah>
        <blah>
</input>

..2nd chunk...

..3rd chunk...

...4th chunk...

All lines from <input> through </input> form one 'order'. This 'order' is repeated many times, and every 'order' starts with <input> and ends with </input>.

I need every complete 'order' that contains the string value="", i.e. all lines from <input> through </input> for each order that has an empty value.

value="" occurs many times in the xml, and I need all the matching 'orders' written to a separate file.

Restrictions:
1) One 'order' may contain value="" more than once; in that case the order should appear only once in the output file.

I am using Solaris.
Thanks for helping.

itkamaraj · August 11, 2011, 7:42am

 
$ nawk -F"\"" ' /atr/ {print $4}' test.xml | sort | uniq
abc
adfasdfas
dafasfas
dafasfasf

test data :

 
$ cat test.xml
<input>
<blah>
<blah>
<atr="blah blah" value = "adfasdfas">
<blah>
<blah>
</input>
<input>
<blah>
<blah>
<atr="blah blah" value = "abc">
<blah>
<blah>
</input>
<input>
<blah>
<blah>
<atr="blah blah" value = "dafasfasf">
<blah>
<blah>
</input>
<input>
<blah>
<blah>
<atr="blah blah" value = "abc">
<blah>
<blah>
</input>
<input>
<blah>
<blah>
<atr="blah blah" value = "dafasfas">
<blah>
<blah>
</input>
<input>
<blah>
<blah>
<atr="blah blah" value = "abc">
<blah>
<blah>
</input>

shekhar2010us · August 11, 2011, 8:34am

Thanks itkamaraj,
but that's not what I needed.

I will explain with your test data (with a few changes).

<input>
<blah>
<blah>
<atr="blah blah" value = "">
<blah>
<blah>
</input>
<input>
<blah>
<blah>
<atr="blah blah" value = "">
<blah>
<blah>
</input>
<input>
<blah>
<blah>
<atr="blah blah" value = "dafasfasf">
<blah>
<blah>
</input>
<input>
<blah>
<blah>
<atr="blah blah" value = "abc">
<blah>
<blah>
</input>
<input>
<blah>
<blah>
<atr="blah blah" value = "">
<blah>
<blah>
</input>

Output should be: All lines between <input> and </input> where value=""

<input>
<blah>
<blah>
<atr="blah blah" value = "">
<blah>
<blah>
</input>

<input>
<blah>
<blah>
<atr="blah blah" value = "">
<blah>
<blah>
</input>

<input>
<blah>
<blah>
<atr="blah blah" value = "">
<blah>
<blah>
</input>

bartus11 · August 11, 2011, 8:52am

Try:

perl -ln0e 'while (/<input>.*?<\/input>/sg){$x=$&;print "$x\n" if $x=~/atr=\"blah blah\" value = \"\"/}' file.xml

itkamaraj · August 11, 2011, 8:52am

 
$ nawk 'BEGIN{RS=""; FS="\</input\>"} {for(i=1;i<=NF;i++){ if ($i~/\"\"/) print $i"</input>"}}' test
<input>
<blah>
<blah>
<atr="blah blah" value = "">
<blah>
<blah>
</input>
<input>
<blah>
<blah>
<atr="blah blah" value = "">
<blah>
<blah>
</input>
<input>
<blah>
<blah>
<atr="blah blah" value = "">
<blah>
<blah>
</input>

shekhar2010us · August 11, 2011, 10:45am

Thanks itkamaraj.

That was awesome. I also checked the case where there are two value="" in the same order, and it worked fine.

Can you please explain a bit how it works, as I need to make some changes?

Thanks a lot.

itkamaraj · August 11, 2011, 10:57am

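By default awk uses newline as the record separator and whitespace as the field separator. The code overrides both: RS="" switches to paragraph mode, where records are separated by blank lines, and FS is set to </input>.

Since the test file has no blank lines, the whole file is read as one record, and each field holds one order, from <input> up to (but not including) </input>.

The for loop goes through the fields and tests each one with $i~/\"\"/ (does the field contain two consecutive double quotes, "" ?).

If yes, it prints the field with </input> appended, because the separator itself is not part of the field. An order with several value="" lines is still a single field, so it is printed only once.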
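
The same selection can also be written line by line, without relying on RS and FS at all. The following is only a minimal sketch of that variant, not the command posted above: the file name orders_demo.xml and the variable names buf and hit are made up for illustration, and plain awk is assumed to be a POSIX awk (on Solaris that would be nawk or /usr/xpg4/bin/awk).

```shell
# Hypothetical sample file; the tags mirror the test data in this thread.
cat > orders_demo.xml <<'EOF'
<input>
<atr="blah blah" value = "">
</input>
<input>
<atr="blah blah" value = "abc">
</input>
<input>
<atr="blah blah" value = "">
<atr="blah blah" value = "">
</input>
EOF

# Buffer each order line by line and decide at </input> whether to print it.
matched=$(awk '
    /<input>/      { buf = ""; hit = 0 }           # a new order starts: reset the buffer
                   { buf = buf $0 "\n" }           # keep every line of the current order
    /value *= *""/ { hit = 1 }                     # this order has an empty value
    /<\/input>/    { if (hit) printf "%s", buf }   # order complete: print it once if flagged
' orders_demo.xml)

printf '%s\n' "$matched"
```

Because the flag is checked only once, at the closing tag, an order with two empty values is still printed a single time.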

read more about awk here

shekhar2010us · August 11, 2011, 10:58am

For example, suppose I need to print today's date before every order in the output file.
I tried:

nawk 'BEGIN{RS=""; FS="\</inputProvision\>"} {print "hello"; for(i=1;i<=NF;i++){ if ($i~/\"\"/) print $i"</inputProvision>"}}' test

but it prints hello before and after each order.

Please explain the awk code so that I can change it accordingly.

Thanks a lot.

itkamaraj · August 11, 2011, 11:02am

Please use code tags when posting any code.

You can pass a shell variable into awk with -v and print it:

nawk -v mydate=`date` 'BEGIN{RS=""; FS="\</inputProvision\>"} {print mydate ; for(i=1;i<=NF;i++){ if ($i~/\"\"/) print $i"</inputProvision>"}}' test

shekhar2010us · August 11, 2011, 11:18am

nawk -v mydate=`date` 'BEGIN{RS=""; FS="\</inputProvision\>"} {print mydate ; for(i=1;i<=NF;i++){ if ($i~/\"\"/) print $i"</inputProvision>"}}' test

It's giving me this error:

nawk: can't open file 11
source line number 1

whereas if I use mydate=hello inside nawk, it works.

dude2cool · August 11, 2011, 11:23am

Try this: put double quotes around mydate=`date`, so that the spaces in the date output do not split it into separate arguments.

nawk -v "mydate=`date`" 'BEGIN{RS=""; FS="\</inputProvision\>"} {print mydate ; for(i=1;i<=NF;i++){ if ($i~/\"\"/) print $i"</inputProvision>"}}' test
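
A small sketch of why the quotes matter. The date string below is a made-up stand-in for the real output of date, and the variable names are only for illustration. Without quotes, the shell splits the value at each space, so awk would get mydate=Thu as the assignment, the next word as the program text, and the following words as file names, which would explain the "can't open file 11" message.

```shell
# A fixed, hypothetical string stands in for the output of `date`; it contains spaces.
stamp="Thu Aug 11 11:18:00 UTC 2011"

# Unquoted: the shell splits the value into words before awk ever sees it.
set -- -v mydate=$stamp
unquoted_args=$#

# Quoted: the whole assignment stays one argument.
set -- -v "mydate=$stamp"
quoted_args=$#

echo "unquoted: $unquoted_args arguments, quoted: $quoted_args arguments"

# With the quotes in place, awk receives the complete value.
shown=$(awk -v "mydate=$stamp" 'BEGIN { print mydate }')
echo "$shown"
```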

itkamaraj · August 11, 2011, 11:24am

try this...

 myvar=$(date); nawk -v mydate="$myvar" 'BEGIN{RS=""; FS="\</inputProvision\>"} {print mydate ; for(i=1;i<=NF;i++){ if ($i~/\"\"/) print $i"</inputProvision>"}}' test

shekhar2010us · August 11, 2011, 5:12pm

Thanks itkamaraj and dude2cool for the help.
@itkamaraj, thanks for the awk tutorial link. It's really good.

I want to do one more thing with this awk. It would be great if you could help me.

Previous code:

nawk -v "mydate=`date`" 'BEGIN{RS=""; FS="\</inputProvision\>"} {print mydate ; for(i=1;i<=NF;i++){ if ($i~/\"\"/) print $i"</inputProvision>"}}' test

$i prints everything before </inputProvision> where value="". This is what I want.
I also want to print a tag along with it, which is part of $i.

For that I am doing:

nawk -v "mydate=`date`" "var2=`grep XYZ $i`" 'BEGIN{RS=""; FS="\</inputProvision\>"} {for(i=1;i<=NF;i++) { if ($i~/value=\"\"/) print var2 mydate $i "</inputProvision>"}}' test

Basically, in $i there is a field called XYZ in the format:
<ATTR name="XYZ" value="123456789"/>
I need its value to be printed along with $i.

Can we use awk inside awk?

Thanks.

fpmurphy · August 11, 2011, 6:53pm

To keep the forums high quality for all users, please take the time to format your posts correctly.

First of all, use Code Tags when you post any code or data samples so others can easily read your code. You can easily do this by highlighting your code and then clicking on the # in the editing menu. (You can also type code tags [code] and [/code] by hand.)

Second, avoid adding color or different fonts and font size to your posts. Selective use of color to highlight a single word or phrase can be useful at times, but using color, in general, makes the forums harder to read, especially bright colors like red.

Third, be careful when you cut-and-paste, edit any odd characters and make sure all links are working properly.

Thank You.

The UNIX and Linux Forums

shekhar2010us · August 12, 2011, 9:18am

Sorry for the mess. I am not very used to the forum rules as I haven't used the forum much, but I will try to follow them.

@itkamaraj:
Thanks for your help.
I just need to add one more component to the awk, as I explained above.

This is the previous code:

  nawk -v "mydate=`date`" 'BEGIN{RS=""; FS="\</inputProvision\>"} {for(i=1;i<=NF;i++) { if ($i~/value=\"\"/) print mydate "\n" $i "</inputProvision>" "\n"}}' test

Here $i contains several lines of xml tags. What I need is to fetch a few values from $i and print them along with $i.

Sample $i:

 
blah blah
<ATTR name="AB" value="123"/>
<ATTR name="CD" value="456"/>
<ATTR name="EF" value="789"/>
blah blah

When value="" matches, the awk statement prints $i. I also need

AB=123
EF=789

printed along with $i.

I used:

 nawk -v "mydate=`date`" "var2=`grep AB $i`" "var3=`grep EF $i`" 'BEGIN{RS=""; FS="\</inputProvision\>"} {for(i=1;i<=NF;i++) { if ($i~/value=\"\"/) print var2 "\n" var3 mydate $i "</inputProvision>"}}' test

But this seems to go into an infinite loop.

Thanks for your help..
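
For reference, a possible way to get the AB and EF values inside awk itself. The backquoted grep AB $i runs in the shell before nawk starts, where $i is presumably empty, so grep most likely waits on standard input rather than looping. The sketch below is an assumption-laden illustration, not a tested fix for the real file: provision_demo.xml, buf, hit, ab, ef and q are made-up names, the line-by-line buffering replaces the RS/FS approach, and awk is assumed to be a POSIX awk (nawk on Solaris).

```shell
# Hypothetical sample; only the first order contains value="".
cat > provision_demo.xml <<'EOF'
<inputProvision>
<ATTR name="AB" value="123"/>
<ATTR name="CD" value=""/>
<ATTR name="EF" value="789"/>
</inputProvision>
<inputProvision>
<ATTR name="AB" value="111"/>
<ATTR name="CD" value="456"/>
<ATTR name="EF" value="222"/>
</inputProvision>
EOF

report=$(awk -v "mydate=$(date)" '
    /<inputProvision>/ { buf = ""; hit = 0; ab = ""; ef = "" }   # new order: reset state
                       { buf = buf $0 "\n" }                     # collect the order text
    # Splitting an ATTR line on double quotes leaves the value in the 4th piece.
    /name="AB"/        { split($0, q, "\""); ab = q[4] }
    /name="EF"/        { split($0, q, "\""); ef = q[4] }
    /value=""/         { hit = 1 }                               # empty value seen
    /<\/inputProvision>/ {
        if (hit) printf "%s\nAB=%s\nEF=%s\n%s", mydate, ab, ef, buf
    }
' provision_demo.xml)

printf '%s\n' "$report"
```

Everything is extracted while awk reads the order, so no second command has to run per order.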