Problem getting Nth match in sed

Hi all,

I'm trying to create a sed command to get the Nth instance of an XML tag in a string, but thus far I can only ever seem to get the last one.

Given an XML string:

<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>

I tried to do this on the command line to get each GrayLevel (yes, I know it's messy, these were just the examples I was trying):

echo "<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>" | sed -e "s/.*<GrayLevel[^>]*>\(.*\)<\/GrayLevel>.*/\1/1"
echo "<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>" | sed -e "s/.*<GrayLevel[^>]*>\(.*\)<\/GrayLevel>.*/\1/2"

I expected to get the following from these examples:

a
b

But instead I got:

b
<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>

Can anyone explain to me what I'm doing wrong, please? I'm guessing it's something with the match parameter, but I can't figure out what.

Thanks,
Zel2008

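The problem is that .* is greedy and matches anything, so the pattern consumes the whole line and there can be only one match per line. The leading .* pushes that match to the last occurrence, which is why the first command prints b. The second command asks for a second match, which does not exist, so the line is printed unchanged. That makes this difficult with sed.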
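
To see the greediness in isolation, here is a minimal sketch on a made-up string (aXbXc is only an illustration, not the original data). It compares a greedy .* with a bracket expression that cannot run past the delimiter:

```shell
line='aXbXc'

# Greedy: .* runs to the LAST X, so everything up to it is removed.
greedy_first=$(printf '%s\n' "$line" | sed 's/.*X//')

# Asking for the 2nd match of the greedy pattern: there is none,
# so sed leaves the line untouched.
greedy_second=$(printf '%s\n' "$line" | sed 's/.*X//2')

# [^X]* stops at the next X, so the line holds two separate matches
# ("aX" and "bX") and the numeric flag can select the second one.
bounded_second=$(printf '%s\n' "$line" | sed 's/[^X]*X//2')

printf '%s\n' "$greedy_first" "$greedy_second" "$bounded_second"
# prints:
# c
# aXbXc
# aXc
```

With [^X]* each match stops at the next X, so the numeric flag has something to count.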

Perhaps you could try awk instead. For example:

awk -v n=1 '$1=="GrayLevel"{if(++c==n) print $2}' RS=\< FS=\>
awk -v n=2 '$1=="GrayLevel"{if(++c==n) print $2}' RS=\< FS=\>

Thanks Scrutinizer,

I managed to find a solution in Perl:

perl -pe 's/(.*?<GrayLevel>){1}(.*?)<\/GrayLevel>.*/$2/'

That being said, I know absolutely nothing about awk. Would you mind explaining your awk solution in more detail, so I can learn what it does?

Thanks,
Zel2008
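
For what it's worth, the awk one-liner above can be read as follows. This is only a commented sketch of the same idea: the function name nth_tag and the variables tag and n are chosen here for illustration, and it assumes the tag carries no attributes.

```shell
# Hypothetical helper: print the content of the Nth <tag> element read from stdin.
# $1 = tag name, $2 = occurrence number (1-based)
nth_tag() {
    awk -v tag="$1" -v n="$2" '
        BEGIN {
            RS = "<"   # every "<" starts a new record, e.g. "GrayLevel>a"
            FS = ">"   # split each record at ">": $1 = "GrayLevel", $2 = "a"
        }
        # Count only records whose first field is the tag name;
        # print the content of the Nth one and stop.
        $1 == tag && ++c == n { print $2; exit }
    '
}

xml='<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>'
printf '%s\n' "$xml" | nth_tag GrayLevel 1   # prints: a
printf '%s\n' "$xml" | nth_tag GrayLevel 2   # prints: b
```

Closing tags such as /GrayLevel> form their own records, but their first field starts with a slash, so they never match the tag name and are not counted.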

Hi,
A sed solution (the two commands differ only in the occurrence flag, 1 versus 2):

$ echo "<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>" | sed -e 's/<GrayLevel>\([^<]\+\)<\/GrayLevel>/xxx\1xxx/1;s/.*xxx\(.*\)xxx.*/\1/'
a
$ echo "<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>" | sed -e 's/<GrayLevel>\([^<]\+\)<\/GrayLevel>/xxx\1xxx/2;s/.*xxx\(.*\)xxx.*/\1/'
b

Regards.

Nice workaround, disedorgue. One remark: \+ is a GNU extension. Instead you could use the standard *, which would work just as well in this case. So ... \([^<]*\) ...
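
Put together, a portable variant of the workaround might look like this. It is only a sketch: the function name nth_graylevel is hypothetical, the xxx marker is assumed not to occur in the input, and the -n option with the p flag is an addition here so that nothing is printed when there is no Nth occurrence.

```shell
# Hypothetical helper: print the content of the Nth <GrayLevel> element.
# $1 = occurrence number (1-based); reads one XML line from stdin.
nth_graylevel() {
    # Step 1: wrap the content of the Nth element in xxx markers.
    #         [^<]* cannot run past a "<", so each element is a separate match.
    # Step 2: keep only what sits between the markers, and print only
    #         if that substitution actually happened.
    sed -n \
        -e "s/<GrayLevel>\([^<]*\)<\/GrayLevel>/xxx\1xxx/$1" \
        -e 's/.*xxx\(.*\)xxx.*/\1/p'
}

xml='<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>'
printf '%s\n' "$xml" | nth_graylevel 1   # prints: a
printf '%s\n' "$xml" | nth_graylevel 2   # prints: b
```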

Hello,

The following may also help.

echo "<Wrap><GrayLevel>a</GrayLevel><GrayLevel>b</GrayLevel></Wrap>" | awk -F"[<>]" '{for(i=1;i<=NF;i++){if($i ~ /^[a-z]$/) {print $i}}}'

Output will be as follows.

a
b

Thanks,
R. Singh

@Ravinder: That only happens to work when the content is a single lowercase letter, which I am sure holds for the example only. Besides, it does not take the name of the tag into consideration, so it would list both tag names and content whenever they are one lowercase letter wide.

Yes, Scrutinizer, it works for this example only. Thank you for pointing that out.