sed with pattern using variable

mystition · January 11, 2018, 8:14am

Dear Community;

I have a long xml file (100k+ lines) with patterns like below:

<OfferDefinition Id="123">
        <Type>Timer</Type>
        <Description>Test Text1</Description>
        <MajorPriority>95</MajorPriority>
        <SelectableInPolicy>0</SelectableInPolicy>
    </OfferDefinition>
    <OfferDefinition Id="456">
        <Type>Timer</Type>
        <Description>Test Text2</Description>
        <EnableAtProvisioning>0</EnableAtProvisioning>
        <EndOfProvisioning>0</EndOfProvisioning>
        <SelectableInPolicy>0</SelectableInPolicy>
    </OfferDefinition>

I need to take the Id value of each block and prepend it to the text in that block's Description

Id Value is 456 in below line

<OfferDefinition Id="456">

New Pattern:

<OfferDefinition Id="123">
        <Type>Timer</Type>
        <Description>123_Test Text1</Description>
        <MajorPriority>95</MajorPriority>
        <SelectableInPolicy>0</SelectableInPolicy>
    </OfferDefinition>
    <OfferDefinition Id="456">
        <Type>Timer</Type>
        <Description>456_Test Text2</Description>
        <EnableAtProvisioning>0</EnableAtProvisioning>
        <EndOfProvisioning>0</EndOfProvisioning>
        <SelectableInPolicy>0</SelectableInPolicy>
    </OfferDefinition>

I have tried below command but it does not work:

 var=`sed -n -e '/OfferDefinition Id="/ s/.*\=" *//; s/">//p' file.txt`; 
 sed -n '/<OfferDefinition Id/,/OfferDefinition>/ H;/OfferDefinition>/ {g;s/<Description>/a "$var"/p;x;}' file.txt

Instead of using a variable outside sed, I was trying to get the "id" within the same sed command and append it in the line "Description", but so far - no luck!

Thanks for any help/suggestions on this.

rdrtx1 · January 11, 2018, 8:30am

this adds Id value in the Description tag:

awk '
/<\/OfferDefinition>/ {id=""}
/<OfferDefinition Id=/ {id=$0; sub(".*Id=\"", "", id); sub("\".*", "", id);}
/<Description>.*<\/Description>/ && length(id) {sub("<\/", " " id "</")}
{print $0}
' file.txt
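
If the Id should go in front of the existing text with an underscore, as in the requested output, the same idea can probably be adapted. A minimal sketch, assuming the opening <Description> tag is on the same line as the start of the text; the sample input and the function name prefix_id are made up for illustration:

```shell
# Hypothetical sample input, modelled on the question
sample='<OfferDefinition Id="123">
    <Type>Timer</Type>
    <Description>Test Text1</Description>
</OfferDefinition>
<Description>outside any block</Description>'

# prefix_id is an illustrative name, not something from the thread
prefix_id() {
  awk '
    # block ends: forget the Id
    /<\/OfferDefinition>/ { id = "" }
    # block starts: remember the Id
    /<OfferDefinition Id="/ {
        id = $0
        sub(/.*Id="/, "", id)
        sub(/".*/, "", id)
    }
    # inside a block: put Id and underscore right after the opening tag
    /<Description>/ && id != "" { sub(/<Description>/, "&" id "_") }
    { print }
  '
}

printf '%s\n' "$sample" | prefix_id
```

The last <Description> in the sample is outside any block and should come out unchanged, because id is cleared at the closing tag.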

RudiC · January 11, 2018, 8:44am

Try also

awk -F\" '/OfferDefinition>/ {ID = $2} /Description>/ {sub (/>/, "&" ID "_")} 1' file

ctac · January 11, 2018, 12:14pm

With sed

sed -E '/Offer/{h;s/.*"(.*)">/\1/;x;};/Description/G;s/(.*>)(.*>)\n(.*)/\1\3_\2/' infile

MadeInGermany · January 11, 2018, 12:37pm

Also doable with sed, continuing on your attempt

sed -e '/<OfferDefinition Id="/ {h; s/.*=" *//; s/ *">.*//; x; n;}' -e '/<Description>/ {H; x; s/\(.*\)\n\(.*<Description>\)/\2\1_/;}' file.txt

Better readable in two lines

sed '
  /<OfferDefinition Id="/ {h; s/.*=" *//; s/" *>.*//; x; n;}
  /<Description>/ {H; x; s/\(.*\)\n\(.*<Description>\)/\2\1_/;}
' file.txt

Of course digging out the saved value from the hold space is a bit of a hack (also in the previous post).

bakunin · January 11, 2018, 1:03pm

mystition:

I have tried below command but it does not work:
 var=`sed -n -e '/OfferDefinition Id="/ s/.*\=" *//; s/">//p' file.txt`; 
 sed -n '/<OfferDefinition Id/,/OfferDefinition>/ H;/OfferDefinition>/ {g;s/<Description>/a "$var"/p;x;}' file.txt
Instead of using a variable outside sed, I was trying to get the "id" within the same sed command and append it in the line "Description", but so far - no luck!

Standard disclaimer: to "understand" XML a program(ming language) needs to work context-sensitively. For this you need a (recursive) parser. Because regexp machines (like sed or awk) aren't parsers, whatever you can create with these will always retain some sort of uncertainty - in other words, it will always be possible to trick them into doing something they shouldn't by crafting the input accordingly.

Having said this: there is nothing wrong with a "best-effort" solution as long as you are aware that it is exactly this.

Your sed script was already quite close, here is how it goes:

First, you need to set rules what happens with which type of lines:

1) In a line of the form <OfferDefinition Id=...> we need to extract the value ID and store it somewhere.

2) In a line of the form </OfferDefinition> the block within which the ID makes sense ends and we have to drop the stored value there.

3) In a line of the form <Description>....</Description> we need to insert the stored value if there is one.

Notice that i assume the lines to be "well-behaved". This tag:

<Description>
....
</Description>

would be perfectly valid XML but would confuse the regexp as it is. You would have to work on this if you want to cover that too. Likewise for some other quirks - this is what i was talking about above.
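
One possible way to cover the multi-line form is to normalise it first and only then apply a one-line rule. A rough sketch, assuming whitespace around the line breaks inside the tag is not significant; the sample input and the function name join_description are invented for this example:

```shell
# Hypothetical input with a Description spread over three lines
sample='<OfferDefinition Id="123">
    <Description>
        Test Text1
    </Description>
</OfferDefinition>'

# join_description is an illustrative name
join_description() {
  sed '
    /<Description>/ {
        :join
        # keep appending input lines until the closing tag is in the pattern space
        /<\/Description>/!{
            N
            b join
        }
        # drop each embedded line break together with the indentation after it
        s/\n[[:space:]]*//g
    }
  '
}

printf '%s\n' "$sample" | join_description
```

The output has the whole <Description>...</Description> element on one line, so it can be piped into a script that expects the well-behaved form.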

Now let us implement the three rules. Notice that the explanations are NOT part of the script. Also notice (the last line) that appending the hold space with G puts a line break into the pattern space, which we have to remove. This is one of the trickier things when you work with multiline patterns:

sed '/<OfferDefinition Id=.*>/ {                # rule 1-lines
          p                                     # print, so that the unaltered line is in the output
          s/.*Id="//                            # remove everything up to Id="
          s/">.*//                              # remove the trailing part, isolating the value
          h                                     # move that to the hold space
          d                                     # and delete from pattern space
     }
     /<\/OfferDefinition>/ {                    # rule 2-lines
          p                                     # print unaltered line
          d                                     # delete pattern space
          x                                     # exchange hold/pattern (= clear hold)
          d                                     # and delete pattern again
     }
     /<Description>.*<\/Description>/ {         # rule 3-lines
          s/[   ]*$//                           # clear trailing whitespace
          G                                     # append hold space content to pattern space
          s/\(<Description>\)\(.*\)\(<\/Description>\)\(.*\)/\1\4_\2\3/
                                                # rearrange contents:
                                                # from: <Des>content</Desc>val
                                                # to:   <Des>val_content</Desc>
          s/\n//                                # remove extra line breaks
     }' /path/to/input

I hope this helps.

bakunin

MadeInGermany · January 11, 2018, 1:38pm

Good idea, clear the hold buffer, so a <Description> outside the </OfferDefinition> block will not be altered.
But in my tests

/<\/OfferDefinition>/ {p; d; x; d;}

and

/<\/OfferDefinition>/ {p; d; h;}

failed(?), but

/<\/OfferDefinition>/ {x; s/.*//; x;}

worked.
Ah, of course, the d command jumps to the next input cycle, so the following commands are not run.

bakunin · January 11, 2018, 6:11pm

Actually this was not the case in my test and the script worked as it was shown here. For reference, i used Linux (Kernel 4.10.42) and GNU-sed 4.2.2, shell is Kornshell 93 u+.

bakunin

MadeInGermany · January 12, 2018, 1:43am

man sed
...
d

Delete pattern space. Start next cycle.

You must test with an input file that has a further <Description>xyz</Description> after (outside) the <OfferDefinition Id=...> ... </OfferDefinition> block.
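
Such a test could look roughly like the following. This is only a sketch: it reworks the three rules so that no d is used (the closing-tag rule empties the hold space with x; s/.*//; x as shown above), and the sample input and the function name add_id are made up for the test:

```shell
# Hypothetical test input: the last <Description> is outside any block
sample='<OfferDefinition Id="123">
    <Description>Test Text1</Description>
</OfferDefinition>
<Description>xyz</Description>'

# add_id is an illustrative name
add_id() {
  sed '
    /<OfferDefinition Id="/ {
        # rule 1: copy the line to the hold space and reduce the copy to the bare Id
        h
        x
        s/.*Id="//
        s/".*//
        x
    }
    /<\/OfferDefinition>/ {
        # rule 2: empty the hold space without leaving the current cycle
        x
        s/.*//
        x
    }
    /<Description>.*<\/Description>/ {
        # rule 3: append the held Id, then move it behind the opening tag
        G
        # skip the rearrangement when the hold space was empty
        /\n$/!s/\(<Description>\)\(.*\)\n\(.*\)/\1\3_\2/
        s/\n$//
    }
  '
}

printf '%s\n' "$sample" | add_id
```

With this input the second line should get the 123_ prefix while the final <Description>xyz</Description> stays untouched. Every line is printed by the normal end-of-cycle output, so no p is needed either.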

mystition · January 14, 2018, 6:47am

madeingermany:

Also doable with sed, continuing on your attempt
sed -e '/<OfferDefinition Id="/ {h; s/.*=" *//; s/ *">.*//; x; n;}' -e '/<Description>/ {H; x; s/\(.*\)\n\(.*<Description>\)/\2\1_/;}' file.txt
Better readable in two lines
sed '
  /<OfferDefinition Id="/ {h; s/.*=" *//; s/" *>.*//; x; n;}
  /<Description>/ {H; x; s/\(.*\)\n\(.*<Description>\)/\2\1_/;}
' file.txt
Of course digging out the saved value from the hold space is a bit of a hack (also in the previous post).

While the other codes using awk were giving slight errors (while printing the 'id' in 'description', they were replacing the existing text) which could be tweaked, your code worked perfectly. Many Thanks!

---------- Post updated at 05:17 PM ---------- Previous update was at 05:03 PM ----------

Thanks Rudi, you are always helpful. Worked with a little bit of tweak in separators:

 awk -F\" '/OfferDefinition>|="/ {ID = $2"_"} /Description>/ {sub (/>/, "&" ID )} 1' file

However, can you explain why you have used '1' at the end? I tried removing it to understand but the code doesn't work without it...

BR//

RudiC · January 14, 2018, 9:07am

awk works in

pattern {action}

pairs. If the pattern evaluates to TRUE, the respective action is executed. 1 is always TRUE, and for a missing action the default, print, is taken.
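
A small illustration of that, using a made-up one-line input (the variable name line and the literal ID_ are just for the demo):

```shell
line='<Description>x</Description>'

# Without the trailing 1: sub() changes the line, but no rule prints it,
# so there is no output at all
printf '%s\n' "$line" | awk '/Description>/ { sub(/>/, "&ID_") }'

# With the trailing 1: the always-true pattern has no action,
# so the default action (print) runs for every line
printf '%s\n' "$line" | awk '/Description>/ { sub(/>/, "&ID_") } 1'

# Writing the print out explicitly is equivalent
printf '%s\n' "$line" | awk '/Description>/ { sub(/>/, "&ID_") } { print }'
```

The first command prints nothing; the other two print the modified line. That is why the code seems to "not work" once the 1 is removed: the substitution still happens, it is just never printed.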