sed with pattern using variable

Dear Community;

I have a long xml file (100k+ lines) with patterns like below:

<OfferDefinition Id="123">
        <Type>Timer</Type>
        <Description>Test Text1</Description>
        <MajorPriority>95</MajorPriority>
        <SelectableInPolicy>0</SelectableInPolicy>
    </OfferDefinition>
    <OfferDefinition Id="456">
        <Type>Timer</Type>
        <Description>Test Text2</Description>
        <EnableAtProvisioning>0</EnableAtProvisioning>
        <EndOfProvisioning>0</EndOfProvisioning>
        <SelectableInPolicy>0</SelectableInPolicy>
    </OfferDefinition>

I need to take the Id value from each block and prepend it to the text of that block's Description.

For example, the Id value is 456 in the line below:

<OfferDefinition Id="456">

New Pattern:

<OfferDefinition Id="123">
        <Type>Timer</Type>
        <Description>123_Test Text1</Description>
        <MajorPriority>95</MajorPriority>
        <SelectableInPolicy>0</SelectableInPolicy>
    </OfferDefinition>
    <OfferDefinition Id="456">
        <Type>Timer</Type>
        <Description>456_Test Text2</Description>
        <EnableAtProvisioning>0</EnableAtProvisioning>
        <EndOfProvisioning>0</EndOfProvisioning>
        <SelectableInPolicy>0</SelectableInPolicy>
    </OfferDefinition>

I have tried the commands below, but they do not work:

 var=`sed -n -e '/OfferDefinition Id="/ s/.*\=" *//; s/">//p' file.txt`; 
 sed -n '/<OfferDefinition Id/,/OfferDefinition>/ H;/OfferDefinition>/ {g;s/<Description>/a "$var"/p;x;}' file.txt

Instead of using a variable outside sed, I was trying to get the "id" within the same sed command and append it in the line "Description", but so far - no luck!

Thanks for any help/suggestions on this.

this adds Id value in the Description tag:

awk '
/<\/OfferDefinition>/ {id=""}
/<OfferDefinition Id=/ {id=$0; sub(".*Id=\"", "", id); sub("\".*", "", id);}
/<Description>.*<\/Description>/ && length(id) {sub("<\/", " " id "</")}
{print $0}
' file.txt

Try also

awk -F\" '/OfferDefinition>/ {ID = $2} /Description>/ {sub (/>/, "&" ID "_")} 1' file

With sed

sed -E '/Offer/{h;s/.*"(.*)">/\1/;x;};/Description/G;s/(.*>)(.*>)\n(.*)/\1\3_\2/' infile

Also doable with sed, continuing on your attempt

sed -e '/<OfferDefinition Id="/ {h; s/.*=" *//; s/ *">.*//; x; n;}' -e '/<Description>/ {H; x; s/\(.*\)\n\(.*<Description>\)/\2\1_/;}' file.txt

More readable when split over several lines

sed '
  /<OfferDefinition Id="/ {h; s/.*=" *//; s/" *>.*//; x; n;}
  /<Description>/ {H; x; s/\(.*\)\n\(.*<Description>\)/\2\1_/;}
' file.txt

Of course digging out the saved value from the hold space is a bit of a hack (also in the previous post).


Standard disclaimer: to "understand" XML, a program (or programming language) needs to work context-sensitively. For this you need a (recursive) parser. Because regexp machines (like sed or awk) aren't parsers, whatever you create with them will always retain some uncertainty. In other words, it will always be possible to trick them into doing something they shouldn't by crafting the input accordingly.

Having said this: there is nothing wrong with a "best-effort" solution as long as you are aware that it is exactly this.

Your sed script was already quite close, here is how it goes:

First, you need to set rules what happens with which type of lines:

1) In a line of the form <OfferDefinition Id=...> we need to extract the value ID and store it somewhere.

2) In a line of the form </OfferDefinition> the block within which the ID makes sense ends and we have to drop the stored value there.

3) In a line of the form <Description>....</Description> we need to insert the stored value if there is one.

Notice that i assume the lines to be "well-behaved". This tag:

<Description>
....
</Description>

would be perfectly valid inside the definition but would confuse the regexp as it is. You would have to work on this if you want to cover that too. Likewise for some other quirks - this is what i was talking about above.
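One way to cover the split tag, as a rough sketch only: join such an element onto one line first, so the single-line rules can handle it afterwards. The awk filter below assumes that dropping the leading whitespace of the continuation lines is acceptable; the sample input and the variable names (input, joined, buf, line) are made up for illustration.

```shell
# Hypothetical sample: a Description spread over three lines.
input='<Description>
    Test Text1
</Description>'

joined=$(printf '%s\n' "$input" | awk '
# Opening tag without its closing tag on the same line:
/<Description>/ && !/<\/Description>/ {
    buf = $0
    # Read ahead until the closing tag, gluing the lines together.
    while ((getline line) > 0) {
        sub(/^[ \t]+/, "", line)      # drop the indentation of continuation lines
        buf = buf line
        if (line ~ /<\/Description>/) break
    }
    $0 = buf
}
{ print }')

printf '%s\n' "$joined"
```

The output of this filter could then be piped into the sed script instead of reading the file directly.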

Now let us implement the three rules. Notice that the explanations are NOT part of the script. Also notice (see the last line) that G joins the hold space content to the pattern space with a line break, which we have to remove. This is one of the more tricky things when you work with multiline patterns:

sed '/<OfferDefinition Id=.*>/ {                # rule 1-lines
          p                                     # print, so that the unaltered line is in the output
          s/.*Id="//                            # remove everything up to Id="
          s/">.*//                              # remove the trailing part, isolating the value
          h                                     # move that to the hold space
          d                                     # and delete from pattern space
     }
     /<\/OfferDefinition>/ {                    # rule 2-lines
          p                                     # print unaltered line
          d                                     # delete pattern space
          x                                     # exchange hold/pattern (= clear hold)
          d                                     # and delete pattern again
     }
     /<Description>.*<\/Description>/ {         # rule 3-lines
          s/[   ]*$//                           # clear trailing whitespace
          G                                     # append hold space content to pattern space
          s/\(<Description>\)\(.*\)\(<\/Description>\)\(.*\)/\1\4_\2\3/
                                                # rearrange contents:
                                                # from: <Des>content</Desc>val
                                                # to:   <Des>val_content</Desc>
          s/\n//                                # remove extra line breaks
     }' /path/to/input

I hope this helps.

bakunin


Good idea, clear the hold buffer, so a <Description> outside the </OfferDefinition> block will not be altered.
But in my tests

/<\/OfferDefinition>/ {p; d; x; d;}

and

/<\/OfferDefinition>/ {p; d; h;}

failed(?), but

/<\/OfferDefinition>/ {x; s/.*//; x;}

worked.
Ah, of course, the d command jumps to the next input cycle, so the following commands are not run.

Actually this was not the case in my test and the script worked as it was shown here. For reference, i used Linux (Kernel 4.10.42) and GNU-sed 4.2.2, shell is Kornshell 93 u+.

bakunin

man sed
...
d

Delete pattern space. Start next cycle.

You must test with an input file that has a further <Description>xyz</Description> after (outside) the <OfferDefinition Id=...> ... </OfferDefinition> block.
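A small test along those lines, as a sketch: the sample input and the helper function annotate are made up here, and the rules are a simplified version of the ones discussed above, not the exact script. The only thing that differs between the two runs is what happens on the closing-tag line.

```shell
# Hypothetical sample: one block, then a Description outside of it.
input='<OfferDefinition Id="123">
    <Description>inside</Description>
</OfferDefinition>
<Description>outside</Description>'

# annotate CMDS: CMDS are the sed commands run on the </OfferDefinition> line.
annotate() {
  sed -e '/<OfferDefinition Id="/ {
      h
      s/.*Id="//
      s/".*//
      x
    }' -e "/<\\/OfferDefinition>/ {
      $1
    }" -e '/<Description>/ {
      G
      s/\(<Description>\)\(.*\)\n\(..*\)/\1\3_\2/
      s/\n//
    }'
}

# d starts the next cycle at once, so x is never reached and the Id stays held.
with_d=$(printf '%s\n' "$input" | annotate 'p; d; x; d')

# Swap, empty, swap back: the hold space is really cleared.
with_clear=$(printf '%s\n' "$input" | annotate 'x; s/.*//; x')

printf 'with d:\n%s\n\nwith clear:\n%s\n' "$with_d" "$with_clear"
```

With the d variant the outside Description still gets the 123_ prefix; with the x; s/.*//; x variant it is left alone.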

While the other solutions using awk gave slight errors (when printing the Id in the Description they replaced the existing text), which could be tweaked, your code worked perfectly. Many thanks!


Thanks Rudi, you are always helpful. Worked with a little bit of tweak in separators:

 awk -F\" '/OfferDefinition>|="/ {ID = $2"_"} /Description>/ {sub (/>/, "&" ID )} 1' file 

However, can you explain why you have used '1' at the end? I tried removing it to understand but the code doesn't work without it...

BR//

awk works in

pattern {action}

pairs. If the pattern evaluates to TRUE, the respective action is executed. 1 is always TRUE, and when the action is missing, the default action, print, is used.
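A quick way to see this for yourself (a minimal sketch; the two-line input and the variable names are made up):

```shell
input='alpha
beta'

# Pattern only: 1 is true for every line, the default action prints it.
all=$(printf '%s\n' "$input" | awk '1')

# Action only: each line is modified, but nothing ever prints it.
none=$(printf '%s\n' "$input" | awk '{ sub(/a/, "A") }')

# Action, then 1: modify first, then print the modified line.
both=$(printf '%s\n' "$input" | awk '{ sub(/a/, "A") } 1')

printf 'all:\n%s\nnone:\n%s\nboth:\n%s\n' "$all" "$none" "$both"
```

This is why removing the 1 made the one-liner print nothing: the sub() still ran, but there was no pattern left to trigger the default print.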
