Parsing XML using the command line

Hi Experts,

How do I parse an XML file with the contents below

      <saw:user name="mbussey@xyz.com" />
      <saw:user name="kimmy.chan@pqr.com" />
      <saw:user name="chudgins@gmail.com" />

and retrieve the output below?

	  mbussey@xyz.com
	  kimmy.chan@pqr.com
	  chudgins@gmail.com

Also, I have 5 .xml files, each with a different set of name values to parse. Is there any way to parse and merge them on the command line and write the output to one file?

Thanks, I appreciate your help.

  • paul

It's difficult to do true XML parsing without a dedicated tool (which normally isn't part of a standard Linux/Unix install).

For very specific cases it might work to glean out what you want, though (it all depends on the input).

So, if we assume the file looks like what you posted, with each saw:user element on its own line, you could filter the data with:

sed -n 's/.*saw:user name="\([^"]*\).*/\1/p'

But it assumes pretty-printed XML, with one element per line.
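A minimal sketch of why that assumption matters. The helper name extract_users and the sample addresses are made up for illustration; the sed expression is the one from the post above.

```shell
#!/bin/sh
# Hypothetical helper wrapping the sed command shown above.
# With no arguments it reads standard input.
extract_users() {
    sed -n 's/.*saw:user name="\([^"]*\).*/\1/p' "$@"
}

# One element per line: every address is printed.
printf '%s\n' '<saw:user name="a@x.com" />' '<saw:user name="b@y.com" />' | extract_users

# Same elements on a single line: only the last address is printed,
# because the leading .* is greedy and swallows the earlier elements.
printf '%s\n' '<saw:user name="a@x.com" /><saw:user name="b@y.com" />' | extract_users
```

So the command is fine for pretty-printed files, but it silently drops matches when several elements share a line.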

Thanks for the prompt response. I am sorry, but how do I pass the input file name?

Just redirect:

sed -n 's/.*saw:user name="\([^"]*\).*/\1/p' <myfile.xml

As for extracting multiple things, again with the same assumption about the XML file as before:

sed -n -e 's/.*saw:user name="\([^"]*\).*/\1/p' -e 's/.*something:else name="\([^"]*\).*/\1/p' <myfile.xml

Just keep adding -e with the substitution and capture/replace patterns you need.
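Here is a small runnable sketch of that idea, which also covers the "several files, one output" part of the question. The file names, the saw:group element and the sample values are all made up for illustration. sed accepts several file names as arguments and reads them in order.

```shell
#!/bin/sh
# Scratch directory with two made-up sample files (names are illustrative).
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
printf '%s\n' '<saw:user name="a@x.com" />' > "$tmpdir/one.xml"
printf '%s\n' '<saw:group name="admins" />' '<saw:user name="b@y.com" />' > "$tmpdir/two.xml"

# One -e per element type. The files are read in the order given,
# so the results from both end up merged in a single output file.
sed -n \
    -e 's/.*saw:user name="\([^"]*\).*/\1/p' \
    -e 's/.*saw:group name="\([^"]*\).*/\1/p' \
    "$tmpdir/one.xml" "$tmpdir/two.xml" > "$tmpdir/merged.txt"

cat "$tmpdir/merged.txt"
```

The same one-element-per-line assumption applies to every file passed in.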

Sorry, but that is not helping much. Maybe the XML is badly formatted. This is the entire XML:

<?xml version="1.0" encoding="UTF-8"?>
<saw:ibot xmlns:saw="com.siebel.analytics.web/report/v1" version="1" priority="normal" jobID="36">
  <saw:schedule timeZoneId="(GMT-05:00) Eastern Time (US & Canada)" disabled="false">
    <saw:start repeatMinuteInterval="60" endTime="23:59:00" startImmediately="true" />
    <saw:recurrence runOnce="false">
      <saw:weekly weekInterval="1" mon="true" tue="true" wed="true" thu="true" fri="true" />
    </saw:recurrence>
  </saw:schedule>
  <saw:dataVisibility type="recipient" runAs="cgm" />
  <saw:choose>
    <saw:when condition="true">
      <saw:deliveryContent>
        <saw:headline>
          <saw:caption>
            <saw:text>Availability Parity</saw:text>
          </saw:caption>
        </saw:headline>
        <saw:conditionalReport />
      </saw:deliveryContent>
      <saw:postActions />
    </saw:when>
    <saw:otherwise />
  </saw:choose>
  <saw:deliveryDestinations>
    <saw:destination category="dashboard" />
    <saw:destination category="activeDeliveryProfile" />
  </saw:deliveryDestinations>
  <saw:recipients subscribers="true" customize="false" specificRecipients="false">
    <saw:subscribers>
    <saw:user name="mbussey@xyz.com" />
    <saw:user name="kimmy.chan@pqr.com" />
    <saw:user name="chudgins@gmail.com" />
</saw:recipients>
  <saw:conditionQuery>
    <saw:reportRefNode path="/shared/Quote/Product/Alerts/Daily Availability Parity Alert - Next 14 Days - Content" />
  </saw:conditionQuery>
</saw:ibot>

Should work.

I put your XML into a file and this is what I got:

sed -n 's/.*saw:user name="\([^"]*\).*/\1/p' <myfile.xml
mbussey@xyz.com
kimmy.chan@pqr.com
chudgins@gmail.com

(I'm just using the simple case of one pattern)

And modified to add the email code tags:

sed -n 's/.*saw:user name="\([^"]*\).*/[email]\1\[\/email]/p' <myfile.xml
mbussey@xyz.com
kimmy.chan@pqr.com
chudgins@gmail.com

Note: on the Unix forums, BBCode-like tags are interpreted, so the email tags do not show up in the output above.

perl -nle '/saw:user name="(.*?)"/ and print qq{[email]$1\[/email]}' myfile.xml

I am very sorry. You are absolutely correct. I was trying to see what was going wrong and realized that my tool showed me different XML from the actual XML in the file on the filesystem.

So I tried to run the sed command on the XML below and it does not work correctly. Maybe we need an alternative sed command:

<?xml version="1.0" encoding="utf-8"?>
<saw:ibot xmlns:saw="com.siebel.analytics.web/report/v1" version="1" priority="normal" jobID="36"><saw:schedule timeZoneId="(GMT-05:00) Eastern Time (US & Canada)" disabled="false"><saw:start repeatMinuteInterval="60" endTime="23:59:00" startImmediately="true"/><saw:recurrence runOnce="false"><saw:weekly weekInterval="1" mon="true" tue="true" wed="true" thu="true" fri="true"/></saw:recurrence></saw:schedule><saw:dataVisibility type="recipient" runAs="cgm"/><saw:choose><saw:when condition="true"><saw:deliveryContent><saw:headline><saw:caption><saw:text>Availability Parity Alert for Next 14 Days </saw:text></saw:caption></saw:headline><saw:conditionalReport/></saw:deliveryContent><saw:postActions/></saw:when><saw:otherwise/></saw:choose><saw:deliveryDestinations><saw:destination category="dashboard"/><saw:destination category="activeDeliveryProfile"/></saw:deliveryDestinations><saw:recipients subscribers="true" customize="false" specificRecipients="false"><saw:subscribers><saw:user name="mbussey@xyz.com" /><saw:user name="kimmy.chan@pqr.com" /><saw:user name="chudgins@gmail.com" /></saw:subscribers></saw:recipients><saw:conditionQuery><saw:reportRefNode path="/shared/Quote/Product/Alerts/Daily Availability Parity Alert - Next 14 Days - Content"/></saw:conditionQuery></saw:ibot>

Would that do it?

 perl -nle '@mail = $_=~/saw:user name="(.*?)"/g and map {print "[email]$_\[/email]"} @mail' infile
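For comparison, a sketch of a grep-based alternative that also copes with several elements on one line. The helper name extract_users and the sample addresses are made up; the -o and -h options are supported by GNU and BSD grep but are not required by POSIX, so treat their availability as an assumption.

```shell
#!/bin/sh
# Hypothetical helper: list every saw:user name in the given files
# (or standard input), even when several elements share one line.
extract_users() {
    # -o prints each match on its own line; -h suppresses file name prefixes.
    # The sed step strips everything up to the opening quote and the closing quote.
    grep -oh 'saw:user name="[^"]*"' "$@" | sed 's/^[^"]*"//; s/"$//'
}

printf '%s\n' '<saw:subscribers><saw:user name="a@x.com" /><saw:user name="b@y.com" /></saw:subscribers>' | extract_users
```

Like the sed and Perl versions, this is pattern matching on text, not real XML parsing.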

Excellent. Indeed it does. Thank you very much, but the output is coming out like this:

[email]chudgins@gmail.com[/email]

So I tried this:

perl -nle '@mail = $_=~/saw:user name="(.*?)"/g and map {print "$_"} @mail' infile

This works great and removes the [email] tags.

Now another question: I have input files 1, 2 and 3, all under dir1.
I want the perl command to run across all files under dir1 and give me the results in a redirected output file. Can we achieve this?

I thought you wanted the [email][/email] added.

In that case the map is not needed.

perl -ne '$"=qq{\n}; @mail = $_=~/saw:user name="(.*?)"/g and print "@mail\n"' dir1/inputfile{1,2,3} > output.result

Excellent, much better now. Thank you.

I was trying it in a loop like the one below, and it does work, but yours is much better.
Now my last and final question: I want a unique list of emails in the target output file. Is there any way to do that with your previous command?

for FILENAME in $(find . -type f)
do
    perl -nle '@mail = $_=~/saw:user name="(.*?)"/g and map {print "$_"} @mail' "$FILENAME" >> /home/orabi/final.txt
done

pauldx, look a little closer at my last post, #11. I explained that you do not need to use map, and I also gave you a clue as to how Perl can read any files under dir1 without the need for a for loop.

If you still have trouble understanding, please post the result of find . -type f . Make sure you use the proper tags when you post, or the moderator, who has edited your posts several times, will get mad.
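As a side note on avoiding the loop, here is a sketch that lets find hand the file names directly to the extraction command. This is not the Perl approach from post #11. It swaps in grep and sed, and the directory layout and sample addresses are made up for illustration. The -o and -h options are assumed to be available (GNU and BSD grep have them).

```shell
#!/bin/sh
# Made-up directory layout for illustration only.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
mkdir -p "$tmpdir/dir1/sub dir"
printf '%s\n' '<saw:user name="a@x.com" /><saw:user name="b@y.com" />' > "$tmpdir/dir1/inputfile1"
printf '%s\n' '<saw:user name="c@z.com" />' > "$tmpdir/dir1/sub dir/inputfile2"

# find passes the file names straight to grep, so names containing
# spaces survive and no shell loop is needed.
find "$tmpdir/dir1" -type f -exec grep -oh 'saw:user name="[^"]*"' {} + |
    sed 's/^[^"]*"//; s/"$//' > "$tmpdir/final.txt"

cat "$tmpdir/final.txt"
```

The order of the results follows the order in which find visits the files, which is not guaranteed to be alphabetical.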

My apologies. I am new to the forum.
Sorry, I mean that I want to use your code below. But I need to understand how I can save the output file using the perl command below while getting a unique email list. I don't want to use | sort | uniq, and I am sure we can achieve this in Perl.

perl -ne '$"=qq{\n}; @mail = $_=~/saw:user name="(.*?)"/g and print "@mail\n"' dir1/inputfile{1,2,3} > output.result

What you are asking, now, changes the command radically.

I do not like how messy this is becoming as a one-liner. Here's a proper Perl script.

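#!/usr/bin/perl

use strict;
use warnings;

# Count every address seen; the hash keys give the unique set.
my %emails;
while (my $line = <>) {
    # A line may hold several saw:user elements, hence the /g match.
    $emails{$_}++ for $line =~ m/saw:user name="(.*?)"/g;
}

# Print the unique addresses in sorted order.
for my $email (sort keys %emails) {
    print "$email\n";
}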

Save as extract_emails.pl
Run as

perl extract_emails.pl dir1/inputfile{1,2,3} > output.result

I did it this way in one command and it's working great:

perl -ne '$"=qq{\n}; @mail = $_=~/saw:user name="(.*?)"/g and print "@mail\n"' demand360+alerts/* rate360+alerts/* | sort | uniq > /home/orabi/abc.txt

I guess you changed your mind. ;)
The script in my post #15 does what you asked for previously. It keeps only the unique addresses and sorts them, all in Perl.

And if you are going to use sort | uniq (after explicitly saying in post #14 of this thread that you didn't want to do that), you could at least replace those two commands in your pipeline with just sort -u .
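A quick sketch showing that the two forms give the same result. The sample addresses are made up.

```shell
#!/bin/sh
# Made-up sample list with a duplicate entry.
emails='b@y.com
a@x.com
b@y.com'

# Both pipelines print each address once, in sorted order.
printf '%s\n' "$emails" | sort | uniq
printf '%s\n' "$emails" | sort -u
```

sort -u does the job in a single process, which is why it is the usual choice when you only need the sorted unique lines.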