Parsing XML using the command line

pauldx · July 17, 2015, 6:27pm

Hi Experts,

How do I parse an XML file with the contents below

      <saw:user name="mbussey@xyz.com" />
      <saw:user name="kimmy.chan@pqr.com" />
      <saw:user name="chudgins@gmail.com" />

and retrieve the output below?

	  mbussey@xyz.com
	  kimmy.chan@pqr.com
	  chudgins@gmail.com

Also, I have 5 .xml files, each with a different set of name values to parse. Is there any way to parse and merge them on the command line and write the output to one file?

Thanks, I appreciate your help.

paul

cjcox · July 17, 2015, 6:36pm

True parsing is difficult without some sort of XML-aware tool (which normally isn't part of a base Linux/Unix install).

For very specific cases it might work to glean out what you want, though (it all depends on the input).

So... if we assume the input looks like what you posted, with each saw:user element on its own line, then you could filter the data with:

sed -n 's/.*saw:user name="\([^"]*\).*/\1/p'

But it assumes pretty-printed XML, with one element per line.
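
A quick way to check the sed filter above is to feed it the three sample lines from the question. This is only a sketch: the function name `extract_users` is made up for illustration, and it relies on the same assumption that each saw:user element sits on its own line.

```shell
#!/bin/sh
# extract_users (hypothetical helper name): print the value of the name
# attribute of each saw:user element read from standard input.
# Assumes at most one saw:user element per input line.
extract_users() {
    sed -n 's/.*saw:user name="\([^"]*\).*/\1/p'
}

# Feed the three sample lines from the question through the filter.
extract_users <<'EOF'
      <saw:user name="mbussey@xyz.com" />
      <saw:user name="kimmy.chan@pqr.com" />
      <saw:user name="chudgins@gmail.com" />
EOF
```

The -n option suppresses sed's default output and the trailing p prints only lines where the substitution matched, so lines without a saw:user element are dropped.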

pauldx · July 17, 2015, 6:42pm

Thanks for the prompt response. I am sorry, but how do I pass the input file name?

cjcox · July 17, 2015, 6:46pm

just redirect

sed -n 's/.*saw:user name="\([^"]*\).*/\1/p' <myfile.xml

As for multiple things, again with the same assumptions about the XML file as before:

sed -n -e 's/.*saw:user name="\([^"]*\).*/\1/p' -e 's/.*something:else name="\([^"]*\).*/\1/p' <myfile.xml

Just keep adding -e with the substitution and capture/replace patterns you need.
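
To make the multi-pattern form concrete, here is a minimal sketch that pulls two different attributes in one pass. The second pattern (saw:destination category) is borrowed from the full file posted later in this thread purely as an illustration, the function name is hypothetical, and the same one-element-per-line assumption applies.

```shell
#!/bin/sh
# extract_names_and_categories (hypothetical helper name): print the name of
# every saw:user element and the category of every saw:destination element.
# Each -e adds one substitution; -n plus the p flag prints only the lines
# where a substitution succeeded. Assumes one element per input line.
extract_names_and_categories() {
    sed -n -e 's/.*saw:user name="\([^"]*\).*/\1/p' \
           -e 's/.*saw:destination category="\([^"]*\).*/\1/p'
}

# Small sample: one destination, one user, one line that matches nothing.
extract_names_and_categories <<'EOF'
    <saw:destination category="dashboard" />
    <saw:user name="mbussey@xyz.com" />
    <saw:otherwise />
EOF
```

With this sample the values come out in input order, and the line that matches neither pattern is skipped.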

pauldx · July 17, 2015, 6:59pm

Sorry, but that is not helping much. Maybe the XML is badly formatted. This is the entire XML:

<?xml version="1.0" encoding="UTF-8"?>
<saw:ibot xmlns:saw="com.siebel.analytics.web/report/v1" version="1" priority="normal" jobID="36">
  <saw:schedule timeZoneId="(GMT-05:00) Eastern Time (US & Canada)" disabled="false">
    <saw:start repeatMinuteInterval="60" endTime="23:59:00" startImmediately="true" />
    <saw:recurrence runOnce="false">
      <saw:weekly weekInterval="1" mon="true" tue="true" wed="true" thu="true" fri="true" />
    </saw:recurrence>
  </saw:schedule>
  <saw:dataVisibility type="recipient" runAs="cgm" />
  <saw:choose>
    <saw:when condition="true">
      <saw:deliveryContent>
        <saw:headline>
          <saw:caption>
            <saw:text>Availability Parity</saw:text>
          </saw:caption>
        </saw:headline>
        <saw:conditionalReport />
      </saw:deliveryContent>
      <saw:postActions />
    </saw:when>
    <saw:otherwise />
  </saw:choose>
  <saw:deliveryDestinations>
    <saw:destination category="dashboard" />
    <saw:destination category="activeDeliveryProfile" />
  </saw:deliveryDestinations>
  <saw:recipients subscribers="true" customize="false" specificRecipients="false">
    <saw:subscribers>
      <saw:user name="mbussey@xyz.com" />
      <saw:user name="kimmy.chan@pqr.com" />
      <saw:user name="chudgins@gmail.com" />
    </saw:subscribers>
  </saw:recipients>
  <saw:conditionQuery>
    <saw:reportRefNode path="/shared/Quote/Product/Alerts/Daily Availability Parity Alert - Next 14 Days - Content" />
  </saw:conditionQuery>
</saw:ibot>

cjcox · July 17, 2015, 7:05pm

Should work.

I put your XML into a file and this is what I got:

sed -n 's/.*saw:user name="\([^"]*\).*/\1/p' <myfile.xml
mbussey@xyz.com
kimmy.chan@pqr.com
chudgins@gmail.com

(I'm just using the simple case of one pattern)

And modified to add the email code tags:

sed -n 's/.*saw:user name="\([^"]*\).*/[email]\1[\/email]/p' <myfile.xml
[email]mbussey@xyz.com[/email]
[email]kimmy.chan@pqr.com[/email]
[email]chudgins@gmail.com[/email]

Note: the Unix forums interpret BBCode-like tags, so in a rendered post the email tags are normally not displayed literally.

Aia · July 17, 2015, 7:32pm

perl -nle '/saw:user name="(.*?)"/ and print qq{[email]$1\[/email]}' myfile.xml

pauldx · July 17, 2015, 7:48pm

I am very sorry. You are absolutely correct. I was trying to see what was going wrong and realized that my tool showed different XML than the actual XML in the file on the filesystem.

So I tried to run the sed command on this XML and it is not working right. Maybe we need an alternative sed command:

<?xml version="1.0" encoding="utf-8"?>
<saw:ibot xmlns:saw="com.siebel.analytics.web/report/v1" version="1" priority="normal" jobID="36"><saw:schedule timeZoneId="(GMT-05:00) Eastern Time (US & Canada)" disabled="false"><saw:start repeatMinuteInterval="60" endTime="23:59:00" startImmediately="true"/><saw:recurrence runOnce="false"><saw:weekly weekInterval="1" mon="true" tue="true" wed="true" thu="true" fri="true"/></saw:recurrence></saw:schedule><saw:dataVisibility type="recipient" runAs="cgm"/><saw:choose><saw:when condition="true"><saw:deliveryContent><saw:headline><saw:caption><saw:text>Availability Parity Alert for Next 14 Days </saw:text></saw:caption></saw:headline><saw:conditionalReport/></saw:deliveryContent><saw:postActions/></saw:when><saw:otherwise/></saw:choose><saw:deliveryDestinations><saw:destination category="dashboard"/><saw:destination category="activeDeliveryProfile"/></saw:deliveryDestinations><saw:recipients subscribers="true" customize="false" specificRecipients="false"><saw:subscribers><saw:user name="mbussey@xyz.com" /><saw:user name="kimmy.chan@pqr.com" /><saw:user name="chudgins@gmail.com" /></saw:subscribers></saw:recipients><saw:conditionQuery><saw:reportRefNode path="/shared/Quote/Product/Alerts/Daily Availability Parity Alert - Next 14 Days - Content"/></saw:conditionQuery></saw:ibot>

Aia · July 17, 2015, 8:10pm

Would that do it?

 perl -nle '@mail = $_=~/saw:user name="(.*?)"/g and map {print "[email]$_\[/email]"} @mail' infile
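
For anyone wondering why the sed filter stops working on the single-line version of the file: the leading .* is greedy and the substitution runs once per line, so only the last saw:user on the line survives. One grep/sed way around that, sketched here under the assumption that your grep supports the widespread (but non-POSIX) -o option, is to print each match on its own line first and then strip the wrapper. The function name `extract_users_any_layout` is hypothetical.

```shell
#!/bin/sh
# extract_users_any_layout (hypothetical helper name): print the name
# attribute of every saw:user element on standard input, no matter how many
# elements share one line.
extract_users_any_layout() {
    # grep -o emits each match on its own line; sed then removes the
    # leading 'saw:user name="' and the trailing '"' from each match.
    grep -o 'saw:user name="[^"]*"' |
        sed -e 's/^saw:user name="//' -e 's/"$//'
}

# Two elements on a single line, as in the file on the filesystem.
printf '%s\n' '<saw:subscribers><saw:user name="mbussey@xyz.com" /><saw:user name="kimmy.chan@pqr.com" /></saw:subscribers>' |
    extract_users_any_layout
```

This is still pattern matching rather than real XML parsing, so it shares the limits of the sed and Perl one-liners in this thread.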

pauldx · July 17, 2015, 10:11pm

Excellent. Indeed it does. Thank you very much, but the output is coming out like this:

[email]chudgins@gmail.com[/email]

So I tried this :

perl -nle '@mail = $_=~/saw:user name="(.*?)"/g and map {print "$_"} @mail' infile

This works great and removes the [email][/email] tags.

Now a question: I have input files 1, 2 and 3, all under dir1.
So I want the perl command to run across all files under dir1 and write the results to a redirected output file. Can we achieve this?

Aia · July 17, 2015, 10:33pm

I thought you wanted the [email][/email] added.

In that case the map is not needed.

perl -ne '$"=qq{\n}; @mail = $_=~/saw:user name="(.*?)"/g and print "@mail\n"' dir1/inputfile{1,2,3} > output.result
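
If the files sit in several subdirectories instead of matching one brace pattern, a find-based sketch can hand them all to a single pipeline, so no loop is needed. This version uses grep and sed rather than Perl; it assumes grep supports the widespread (non-POSIX) -o option, and the function name `collect_users` is invented for illustration.

```shell
#!/bin/sh
# collect_users DIR (hypothetical helper name): print the name attribute of
# every saw:user element found in any regular file under DIR.
collect_users() {
    # find passes all files to cat in batches; the extraction pipeline
    # itself runs only once over the combined stream.
    find "$1" -type f -exec cat {} + |
        grep -o 'saw:user name="[^"]*"' |
        sed -e 's/^saw:user name="//' -e 's/"$//'
}

# Usage sketch (dir1 and output.result are the names used in this thread):
#   collect_users dir1 > output.result
```

The order of the results follows the order in which find visits the files, which is not guaranteed to be alphabetical.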

pauldx · July 17, 2015, 10:38pm

Excellent, much better now. Thank you.

I was trying a loop like the one below, and it does work, but yours is much better.
Now my last and final question: I want a unique list of emails in the target output file. Can we do that with your previous command?

for FILENAME in $(find . -type f)
do
    perl -nle '@mail = $_=~/saw:user name="(.*?)"/g and map {print "$_"} @mail' "$FILENAME" >> /home/orabi/final.txt
done

Aia · July 17, 2015, 10:52pm

pauldx, look a little closer at my last post, #11. I explained that you do not need to use map, and I also gave you a clue as to how Perl can read any files under dir1 without a for loop.

If you still have trouble understanding, please post the result of find . -type f . Make sure you use the proper tags when you post, or the moderator, who has edited your posts several times, will get mad.

pauldx · July 17, 2015, 11:02pm

My apologies. I am new to the forum.
Sorry, I mean I want to use your code below. But I need to understand how to save the output file using the perl command below while getting a unique email list. I don't want to use | sort | uniq, and I am sure we can achieve this in Perl.

perl -ne '$"=qq{\n}; @mail = $_=~/saw:user name="(.*?)"/g and print "@mail\n"' dir1/inputfile{1,2,3} > output.result

Aia · July 18, 2015, 12:23am

What you are asking now changes the command radically.

I do not like how messy it is becoming as a one-liner. Here's a proper Perl script.

#!/usr/bin/perl

use strict;
use warnings;

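# Hash keys are unique, so each address is stored only once.
my %emails;
while (my $line = <>) {
    # A line may hold several saw:user elements, so collect every match.
    $emails{$_}++ for $line =~ m/saw:user name="(.*?)"/g;
}

# Print each distinct address once, in sorted order.
for my $email (sort keys %emails) {
    print "$email\n";
}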

Save as extract_emails.pl
Run as

perl extract_emails.pl dir1/inputfile{1,2,3} > output.result

pauldx · July 18, 2015, 1:01am

I did it this way in one command and it's working great...

perl -ne '$"=qq{\n}; @mail = $_=~/saw:user name="(.*?)"/g and print "@mail\n"' demand360+alerts/* rate360+alerts/* | sort | uniq > /home/orabi/abc.txt

Aia · July 18, 2015, 1:22am

I guess you changed your mind.
The script in my post #15 does what you asked for previously: it keeps only the unique addresses and sorts them, all in Perl.

Don_Cragun · July 18, 2015, 1:37am

And if you are going to use sort | uniq (after explicitly saying in post #14 of this thread that you didn't want to do that), you could at least replace those two commands in your pipeline with just sort -u .
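
To illustrate that last suggestion: sort -u sorts and drops duplicate lines in one step, so for a plain list like this it yields the same result as sort | uniq. A small sketch with made-up duplicate input (the addresses are the ones from the thread; the variable names are arbitrary):

```shell
#!/bin/sh
# Sample list containing one duplicate address.
list='mbussey@xyz.com
chudgins@gmail.com
mbussey@xyz.com
kimmy.chan@pqr.com'

# Two commands: sort groups equal lines together, uniq collapses each group.
two_step=$(printf '%s\n' "$list" | sort | uniq)

# One command: sort -u sorts and removes the duplicate lines itself.
one_step=$(printf '%s\n' "$list" | sort -u)

# Show the de-duplicated, sorted list.
printf '%s\n' "$one_step"
```

Both variables end up holding the same three addresses, each listed once.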