Problem getting Nth match in sed

Zel2008 · July 21, 2014, 11:00am

Hi all,

I'm trying to create a sed command to get the Nth instance of an XML tag in a string, but thus far I can only ever seem to get the last one.

Given an XML string:

<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>

I tried to do this on the command line to get each GrayLevel (yes, I know it's messy, these were just the examples I was trying):

echo "<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>" | sed -e "s/.*<GrayLevel[^>]*>\(.*\)<\/GrayLevel>.*/\1/1"
echo "<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>" | sed -e "s/.*<GrayLevel[^>]*>\(.*\)<\/GrayLevel>.*/\1/2"

I expected to get the following from these examples:

a
b

But instead I got:

b
<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>

Can anyone explain to me what I'm doing wrong, please? I'm guessing it's something with the match parameter, but I can't figure out what.

Thanks,
Zel2008

Scrutinizer · July 21, 2014, 11:59am

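The problem is that the leading .* is greedy: it consumes everything up to the last <GrayLevel>, so the pattern matches the whole line exactly once and the capture is always the last occurrence. With the 2 flag there is no second match to replace, so sed prints the line unchanged. That makes this difficult with sed.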
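To see the greedy match at work, here is a small sketch that brackets what each part of the original pattern captured. The helper name show_greedy and the extra capture group around the leading .* are additions for illustration only, not part of the original command.

```shell
xml='<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>'

# \1 = what the leading .* consumed, \2 = what the inner group captured.
# show_greedy is a hypothetical helper name used only for this demonstration.
show_greedy() {
  sed 's/\(.*\)<GrayLevel[^>]*>\(.*\)<\/GrayLevel>.*/[\1] [\2]/'
}

printf '%s\n' "$xml" | show_greedy
# prints: [<Wrap><GrayLevel>a</GrayLevel>] [b]
```

The first GrayLevel element ends up inside the leading .*, so only one match exists on the line and an occurrence flag of 2 has nothing to select.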

Perhaps you could try awk instead. For example:

awk -v n=1 '$1=="GrayLevel"{if(++c==n) print $2}' RS=\< FS=\>

awk -v n=2 '$1=="GrayLevel"{if(++c==n) print $2}' RS=\< FS=\>
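For readers new to awk, the one-liners above can be spelled out roughly as follows. This is only an annotated sketch of the same idea; the wrapper function nth_graylevel and the variable xml are hypothetical names introduced here.

```shell
xml='<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>'

# nth_graylevel N: print the content of the Nth GrayLevel element read from stdin.
nth_graylevel() {
  awk -v n="$1" '
    # RS="<" starts a new record at every "<", so each record looks like "GrayLevel>a".
    # FS=">" then splits that record into $1 = tag name and $2 = the text after the tag.
    # c counts the opening GrayLevel tags seen so far; print $2 when the count reaches n.
    $1 == "GrayLevel" { if (++c == n) print $2 }
  ' RS='<' FS='>'
}

printf '%s\n' "$xml" | nth_graylevel 1   # prints: a
printf '%s\n' "$xml" | nth_graylevel 2   # prints: b
```

Closing tags become records whose first field is /GrayLevel, so they are never counted. A tag with attributes would not compare equal to "GrayLevel" in this form and would need a looser test.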

Zel2008 · July 21, 2014, 12:59pm

Thanks Scrutinizer,

I managed to find a solution in Perl:

perl -pe 's/(.*?<GrayLevel>){1}(.*?)<\/GrayLevel>.*/$2/'

That being said, I know absolutely nothing about awk. Would you mind explaining your awk solution in more detail, so I can learn what it does?

Thanks,
Zel2008

disedorgue · July 21, 2014, 5:56pm

Hi,
A sed solution (the two commands differ only in the occurrence flag, 1 versus 2):

$ echo "<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>" | sed -e 's/<GrayLevel>\([^<]\+\)<\/GrayLevel>/xxx\1xxx/1;s/.*xxx\(.*\)xxx.*/\1/'
a
$ echo "<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>" | sed -e 's/<GrayLevel>\([^<]\+\)<\/GrayLevel>/xxx\1xxx/2;s/.*xxx\(.*\)xxx.*/\1/'
b

Regards.

Scrutinizer · July 21, 2014, 6:30pm

Nice work-around, disedorgue. One remark: \+ is a GNU extension. You could use the standard * instead, which works just as well in this case: ... \([^<]*\) ...
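Putting the two posts together, a portable variant might look like the sketch below. The function name nth_graylevel_sed is made up for illustration, and the xxx marker is assumed not to occur anywhere in the input.

```shell
xml='<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>'

# nth_graylevel_sed N: hypothetical wrapper around the two-step sed approach.
# Step 1 wraps the content of the Nth element in xxx markers (numeric s flag).
# Step 2 keeps only the text between the markers.
# Assumes the input never contains the string xxx.
nth_graylevel_sed() {
  sed -e "s/<GrayLevel>\([^<]*\)<\/GrayLevel>/xxx\1xxx/$1" \
      -e 's/.*xxx\(.*\)xxx.*/\1/'
}

printf '%s\n' "$xml" | nth_graylevel_sed 1   # prints: a
printf '%s\n' "$xml" | nth_graylevel_sed 2   # prints: b
```

If N is larger than the number of elements on the line, neither substitution applies and the line comes back unchanged.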

RavinderSingh13 · August 6, 2014, 5:43am

Hello,

The following may also help.

echo "<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>" | awk -F"[<>]" '{for(i=1;i<=NF;i++){if($i ~ /^[a-z]$/) {print $i}}}'

Output will be as follows.

a
b

Thanks,
R. Singh

Scrutinizer · August 6, 2014, 6:13am

@Ravinder: That only happens to work because the content is a single lowercase letter, and I am sure that holds for the example only. Besides, it does not take the tag name into consideration, and it would list tag names as well as content whenever a tag name is one lowercase letter wide.
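One way to keep the same field-splitting idea while addressing both points is to test the tag name instead of the shape of the content. A rough sketch, where all_graylevels is a hypothetical helper name and tags are assumed to carry no attributes:

```shell
# all_graylevels: print the content of every GrayLevel element on each input line.
# With FS set to "<" or ">", a tag name lands in one field and its content in the next.
all_graylevels() {
  awk -F'[<>]' '{
    for (i = 1; i < NF; i++)
      if ($i == "GrayLevel")   # opening tag only; closing tags are "/GrayLevel"
        print $(i + 1)
  }'
}

echo "<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>" | all_graylevels
# prints:
# a
# b
```

This also handles content longer than one character, such as <GrayLevel>128</GrayLevel>.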

RavinderSingh13 · August 6, 2014, 6:17am

Yes, Scrutinizer, it works for this example only. Thank you for pointing that out.