Help using SED to comment XML elements

I'm trying to write a script to help automate some VERY tedious manual tasks.

I have groups of fairly large XML files (~3mb+) that I need to edit.

I need to look through the files and parse the XML looking for a certain flag contained in a field. If I find this flag (an integer value) I need to insert XML comments around the entire element (in the <!-- --> style) so that another XML parser will skip over them. After doing that, I later need to remove all the comments from the file (which I think I have).

I found this thread:

Which explains how to insert the comments using SED based on finding the element tag in the file. This is helpful, but I only need to comment elements that contain the "flag" (an int value). Unfortunately, the elements have various names, and aren't in any sort of order.

I was thinking about using PHP (what I'm most familiar with) or maybe Ruby to parse the XML and find the matching flags, which I'm comfortable with. My problem is how to invoke SED once I find an element that needs commenting.

This might be something easy, but at the moment I'm having a hard time figuring out which direction to go in. Does anyone have any guidance they'd share with me? Does it sound like I'm heading in the right direction, or am I totally off? Am I overlooking some obvious answer?

I'd much appreciate any help.

  • Jeremy

If you post your input and the expected output, you can expect more responses.

If you know PHP, all the better. There are XML parsers for PHP you can use; just google for "PHP XML parser". Or, if you want to do it by hand, use the normal fopen(), fclose(), and fgets() to read the files, and the suite of preg_* functions (or str* functions) for string manipulation. See the PHP documentation site for examples.
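To make the by-hand route concrete, here is a rough sketch, written in Python rather than PHP; the read-a-line, match-a-pattern loop maps directly onto fgets() and preg_match(). It assumes every record opens with a bare `<name>` line and closes with a matching `</name>` line, with one child element per line. The function name and its `flag_tag` / `flag_value` parameters are made up for illustration, so adjust them to the real data.

```python
import re


def comment_flagged_blocks(lines, flag_tag, flag_value):
    """Wrap every block whose flag element equals flag_value in <!-- -->.

    Assumption: a block starts on a line holding only <name> and ends on a
    line holding only </name>; child elements sit on their own lines.
    """
    open_re = re.compile(r"^\s*<([^\s/!?>]+)>\s*$")
    flag_re = re.compile(
        r"<{0}>\s*{1}\s*</{0}>".format(re.escape(flag_tag), re.escape(flag_value))
    )
    out, block, name = [], [], None
    for line in lines:
        match = open_re.match(line)
        if match:
            # A new opener while a block is pending means the pending opener
            # was only a wrapper (e.g. the document root): pass it through.
            out.extend(block)
            name, block = match.group(1), [line]
            continue
        if name is None:
            out.append(line)
            continue
        block.append(line)
        if line.strip() == "</%s>" % name:
            if any(flag_re.search(item) for item in block):
                block[0] = "<!--" + block[0]
                block[-1] = block[-1].rstrip("\n") + "-->\n"
            out.extend(block)
            block, name = [], None
    out.extend(block)  # an unterminated block is passed through unchanged
    return out
```

Feeding it the result of `readlines()` and writing the returned list back out with `writelines()` would be the whole script. It will break on input that does not follow the one-element-per-line layout, which is the usual weakness of line-based XML editing.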

My apologies if my original example wasn't very clear. Here's some sample data pulled from the XML files:

<analysisMessages.js>
        <Source_Eng_Old />
        <Source_Eng_New>Time Graph Base Properties - Analyze activities relative to time.</Source_Eng_New>
        <Source_Trans_Old />
        <Target_Trans_New>Time Graph Base Properties - Analyze activities relative to time.</Target_Trans_New>
        <NumOfKeys>1</NumOfKeys>
        <Translation_Type>1</Translation_Type>
        <ResourceId>timeGraph���common_props_title</ResourceId>
</analysisMessages.js>
<ganttChartMessages.js>
        <Source_Eng_Old />
        <Source_Eng_New>Show</Source_Eng_New>
        <Source_Trans_Old />
        <Target_Trans_New>Show</Target_Trans_New>
        <NumOfKeys>1</NumOfKeys>
        <Translation_Type>1</Translation_Type>
        <ResourceId>drawGanttChartButtonCaption</ResourceId>
</ganttChartMessages.js>
<peMessages.js>
        <Source_Eng_Old />
        <Source_Eng_New>Rerun the selected job</Source_Eng_New>
        <Source_Trans_Old />
        <Target_Trans_New>Rerun the selected job</Target_Trans_New>
        <NumOfKeys>1</NumOfKeys>
        <Translation_Type>9</Translation_Type>
        <ResourceId>jobGrid���tooltip���RERUN</ResourceId>
</peMessages.js>

I'm looking specifically at the <Translation_Type> attribute, and deciding whether to comment the entire element or not based upon what that integer is.

In the sample above, I'd want the last element to be commented out in this format:

<!--<peMessages.js>
        <Source_Eng_Old />
        <Source_Eng_New>Rerun the selected job</Source_Eng_New>
        <Source_Trans_Old />
        <Target_Trans_New>Rerun the selected job</Target_Trans_New>
        <NumOfKeys>1</NumOfKeys>
        <Translation_Type>9</Translation_Type>
        <ResourceId>jobGrid���tooltip���RERUN</ResourceId>
 </peMessages.js>-->
   

I'm familiar with PHP, but I've only done a little XML parsing with it. My impression was that while I could parse the XML and easily apply logic to the Translation_Type values, I wouldn't be able to insert those comments before and after the element. I'll certainly go back through my notes, but as I remember it, the parser exposed the elements through a multi-dimensional array type of data structure rather than letting me edit the raw file line by line (that part is hidden by the PHP class, although you can obviously add, drop, or change the XML attributes).

Sorry, but sed and awk are not appropriate tools for massaging XML data (documents), except for very simple files. Instead you need to transform the document using an XSLT stylesheet processor.

BTW Translation_Type is an element and not an attribute. Assuming that your document is valid and well-formed, here is a stylesheet which will do the transformation that you want.

The only change I made was to add a top-level node called "root" for well-formedness. You should change "root" to the name of your top-level element.

Here is the stylesheet:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <xsl:template match="node()">
        <xsl:if test="./Translation_Type != '9'" >
            <xsl:copy-of select="."/>
        </xsl:if>
        <xsl:if test="./Translation_Type = '9'" >
           <xsl:text disable-output-escaping="yes">&lt;!-- </xsl:text>
           <xsl:copy-of select="." />
           <xsl:text disable-output-escaping="yes"> --&gt;</xsl:text>
        </xsl:if>
        <xsl:text>
</xsl:text>
    </xsl:template>

    <xsl:template match="root">
       <xsl:element name="{ name() }" >
           <xsl:text>
</xsl:text>
           <xsl:apply-templates select="*"/>
       </xsl:element>
    </xsl:template>

</xsl:stylesheet>

Note you may have to change the encoding to suit your data set.

Assuming your document is called test.xml and your stylesheet is called test.xsl, invoking xsltproc test.xsl test.xml gives the following output:

<?xml version="1.0"?>
<root>
<analysisMessages.js>
        <Source_Eng_Old/>
        <Source_Eng_New>Time Graph Base Properties - Analyze activities relative to time.</Source_Eng_New>
        <Source_Trans_Old/>
        <Target_Trans_New>Time Graph Base Properties - Analyze activities relative to time.</Target_Trans_New>
        <NumOfKeys>1</NumOfKeys>
        <Translation_Type>1</Translation_Type>
        <ResourceId>timeGraph???common_props_title</ResourceId>
</analysisMessages.js>
<ganttChartMessages.js>
        <Source_Eng_Old/>
        <Source_Eng_New>Show</Source_Eng_New>
        <Source_Trans_Old/>
        <Target_Trans_New>Show</Target_Trans_New>
        <NumOfKeys>1</NumOfKeys>
        <Translation_Type>1</Translation_Type>
        <ResourceId>drawGanttChartButtonCaption</ResourceId>
</ganttChartMessages.js>
<!-- <peMessages.js>
        <Source_Eng_Old/>
        <Source_Eng_New>Rerun the selected job</Source_Eng_New>
        <Source_Trans_Old/>
        <Target_Trans_New>Rerun the selected job</Target_Trans_New>
        <NumOfKeys>1</NumOfKeys>
        <Translation_Type>9</Translation_Type>
        <ResourceId>jobGrid???tooltip???RERUN</ResourceId>
</peMessages.js> -->
</root>

Note there are some question marks in the output elements. That is because I did not bother setting up the correct locale and code-set on my system to suit your sample data.
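
If you would rather stay in a scripting language than use XSLT, the same tree-based idea can be sketched with an ordinary XML parser. The example below is only an illustration, in Python with the standard-library ElementTree module; the function names are made up, and it assumes the records are direct children of the top-level element and that no record contains "--" (which is not allowed inside an XML comment). The second function covers the later un-commenting step and assumes the only comments in the file are the ones added this way.

```python
import re
import xml.etree.ElementTree as ET


def comment_out_flagged(xml_text, flag_tag="Translation_Type", flag_value="9"):
    """Replace each flagged child of the top-level element with a comment."""
    root = ET.fromstring(xml_text)
    for index, child in enumerate(list(root)):
        if (child.findtext(flag_tag) or "").strip() != flag_value:
            continue
        # Detach the trailing whitespace so it stays outside the comment.
        tail, child.tail = child.tail, None
        comment = ET.Comment(ET.tostring(child, encoding="unicode"))
        comment.tail = tail
        root[index] = comment
    return ET.tostring(root, encoding="unicode")


def strip_comment_markers(xml_text):
    """Remove <!-- and --> but keep whatever they enclosed."""
    return re.sub(r"<!--(.*?)-->", r"\1", xml_text, flags=re.DOTALL)
```

Note that ET.fromstring() drops the XML declaration and any comments already present in the input, so this is best run on files that have not been commented yet.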