Extract selective block from XML file

Hi,

There's an xml file produced from a front-end tool as shown below:

 
<INPUT DATABASE ="ORACLE" DBNAME ="UNIX" NAME ="FACT_TABLE" OWNERNAME ="DIPS">
<INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" NAME ="STORE_KEY" PICTURETEXT ="" PORTTYPE ="INPUT" PRECISION ="15" SCALE ="0"/>
<INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" EXPRESSION ="VAR_STORE_KEY" EXPRESSIONTYPE ="GENERAL" NAME ="VAR_PREV_STORE_KEY" PICTURETEXT ="" PORTTYPE ="LOCAL VARIABLE" PRECISION ="15" SCALE ="0"/>
</INPUT>
<INPUT DATABASE ="ORACLE" DBNAME ="UNIX" NAME ="DIM_TABLE" OWNERNAME ="RAGS">
<INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" NAME ="DEPT_KEY" PICTURETEXT ="" PORTTYPE ="INPUT" PRECISION ="15" SCALE ="0"/>
<INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" EXPRESSION ="VAR_DEPT_KEY" EXPRESSIONTYPE ="GENERAL" NAME ="VAR_PREV_DEPT_KEY" PICTURETEXT ="" PORTTYPE ="LOCAL VARIABLE" PRECISION ="15" SCALE ="0"/>
</INPUT>
 
.... so on

I want to extract only one of these blocks based on the NAME for e.g. If the input is "FACT_TABLE" then only the 1st block OR lines within this block should be selected that is...

 
<INPUT DATABASE ="ORACLE" DBNAME ="UNIX" NAME ="FACT_TABLE" OWNERNAME ="DIPS">
<INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" NAME ="STORE_KEY" PICTURETEXT ="" PORTTYPE ="INPUT" PRECISION ="15" SCALE ="0"/>
<INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" EXPRESSION ="VAR_STORE_KEY" EXPRESSIONTYPE ="GENERAL" NAME ="VAR_PREV_STORE_KEY" PICTURETEXT ="" PORTTYPE ="LOCAL VARIABLE" PRECISION ="15" SCALE ="0"/>
</INPUT>

OR

<INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" NAME ="STORE_KEY" PICTURETEXT ="" PORTTYPE ="INPUT" PRECISION ="15" SCALE ="0"/>
<INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" EXPRESSION ="VAR_STORE_KEY" EXPRESSIONTYPE ="GENERAL" NAME ="VAR_PREV_STORE_KEY" PICTURETEXT ="" PORTTYPE ="LOCAL VARIABLE" PRECISION ="15" SCALE ="0"/>

Note: I cannot use any XML parser tool as I don't permission to download it in my office. Also the code would be a part of Korn shell script. Please suggest.

-dips

Like this?

awk '/NAME ="FACT_TABLE"/,/<\/INPUT>/' infile
v='"FACT_TABLE"'; sed -n "/NAME =$v/,/<\/INPUT>/p" infile

Because I like PERL:

use strict;
use warnings;
use Getopt::Long;
use File::Basename;

my $NAME = basename $0, '.pl';
my $name = '';

Getopt::Long::Configure(qw{ bundling no_ignore_case });

undef $/;

$\ = "\n";
$, = "\n";

unless (GetOptions('name|n=s' => \$name)) {
    print STDERR "USAGE: $NAME [--name=<NAME> | -n <name> ] files...";
    exit 1;
}

while (<>) {
    my @M = m{<INPUT[^>]*NAME\s*=\s*"$name"[^>]*>\s*(.*?)\s*</INPUT>}gs;
    print @M if 0 < @M;
}

Use:

./scriptname --name=DIM_TABLE inputfile
<INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" NAME ="DEPT_KEY" PICTURETEXT ="" PORTTYPE ="INPUT" PRECISION ="15" SCALE ="0"/>
<INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" EXPRESSION ="VAR_DEPT_KEY" EXPRESSIONTYPE ="GENERAL" NAME ="VAR_PREV_DEPT_KEY" PICTURETEXT ="" PORTTYPE ="LOCAL VARIABLE" PRECISION ="15" SCALE ="0"/>

Hi Scrutinizer,
I tried both of your commands and they work partially i.e. it selects other blocks with different tags and omits the same tag block

The whole file would look like NOTE: I have shown all the tags but shrunk the no. of lines (In the earlier post I showed only an excerpt from the XML file, so may be you didn't get the idea how the whole file would look......I thought it should suffice but I was mistaken!!)

 
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE POWERHOUSE SYSTEM >
<DATAMART NAME="DEVELOPMENT" VERSION="1" CODEPAGE="Latin1" DATABASETYPE="Oracle">
<FOLDER NAME="CUSTOMER_USAGE" GROUP="" OWNER="Administrator" SHARED="NOTSHARED" DESCRIPTION="" PERMISSIONS="rwx---r--" UUID="acb98070232-agaiojhawv9780">
<INPUT DATABASE ="ORACLE" DBNAME ="UNIX" NAME ="FACT_TABLE" OWNERNAME ="DIPS">
        <INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" NAME ="STORE_KEY" PICTURETEXT ="" PORTTYPE ="INPUT" PRECISION ="15" SCALE ="0"/>
        <INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" EXPRESSION ="VAR_STORE_KEY" EXPRESSIONTYPE ="GENERAL" NAME ="VAR_PREV_STORE_KEY" PICTURETEXT ="" PORTTYPE ="LOCAL VARIABLE" PRECISION ="15" SCALE ="0"/>
</INPUT>
<INPUT DATABASE ="ORACLE" DBNAME ="UNIX" NAME ="DIM_TABLE" OWNERNAME ="RAGS">
        <INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" NAME ="DEPT_KEY" PICTURETEXT ="" PORTTYPE ="INPUT" PRECISION ="15" SCALE ="0"/>
        <INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" EXPRESSION ="VAR_DEPT_KEY" EXPRESSIONTYPE ="GENERAL" NAME ="VAR_PREV_DEPT_KEY" PICTURETEXT ="" PORTTYPE ="LOCAL VARIABLE" PRECISION ="15" SCALE ="0"/>
</INPUT>
<OUTPUT DATABASE ="Oracle" DESCRIPTION ="TABLE STRUCTURE FOR ERROR TABLE" NAME ="ERROR_TABLE" OBJECTVERSION ="1" TABLEOPTIONS ="" VERSIONNUMBER ="1">
        <OUTPUTFIELD DATATYPE ="date" DESCRIPTION ="" FIELDNUMBER ="1" KEYTYPE ="NOT A KEY" NAME ="TIME_KEY" NULLABLE ="NULL" PICTURETEXT ="" PRECISION ="19" SCALE ="0"/>
        <OUTPUTFIELD DATATYPE ="number(p,s)" DESCRIPTION ="" FIELDNUMBER ="2" KEYTYPE ="NOT A KEY" NAME ="ERROR_KEY" NULLABLE ="NULL" PICTURETEXT ="" PRECISION ="15" SCALE ="0"/>
</OUTPUT>
<TRANSFER DESCRIPTION ="Calculates customer's spending in a store. This is a FACT_TABLE" NAME ="CAL_CUST_SPEND" OBJECTVERSION ="1" REUSABLE ="NO" TYPE ="Expression" VERSIONNUMBER ="9">
            <TRANSFERMFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" NAME ="CUSTOMER_KEY" PICTURETEXT ="" PORTTYPE ="INPUT" PRECISION ="15" SCALE ="0"/>
            <TRANSFERMFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" EXPRESSION ="STORE_KEY" EXPRESSIONTYPE ="GENERAL" NAME ="STORE_KEY" PICTURETEXT ="" PORTTYPE ="LOCAL VARIABLE" PRECISION ="15" SCALE ="0"/>
</TRANSFER>
<TRANSFER DESCRIPTION ="Calculates store's inventory" NAME ="LKP_FACT_TABLE" OBJECTVERSION ="1" REUSABLE ="NO" TYPE ="Expression" VERSIONNUMBER ="9">
            <TRANSFERMFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" NAME ="STORE_KEY" PICTURETEXT ="" PORTTYPE ="INPUT" PRECISION ="15" SCALE ="0"/>
            <TRANSFERMFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" EXPRESSION ="STORE_KEY" EXPRESSIONTYPE ="GENERAL" NAME ="STORE_KEY" PICTURETEXT ="" PORTTYPE ="LOCAL VARIABLE" PRECISION ="15" SCALE ="0"/>
</TRANSFER>
<TRANSFER DESCRIPTION ="Tags errors" NAME ="ERROR_TYPE" OBJECTVERSION ="1" REUSABLE ="NO" TYPE ="Expression" VERSIONNUMBER ="9">
            <TRANSFERMFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" NAME ="ERROR_KEY" PICTURETEXT ="" PORTTYPE ="INPUT" PRECISION ="15" SCALE ="0"/>
            <TRANSFERMFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" EXPRESSION ="ERROR_KEY" EXPRESSIONTYPE ="GENERAL" NAME ="ERROR_KEY" PICTURETEXT ="" PORTTYPE ="LOCAL VARIABLE" PRECISION ="15" SCALE ="0"/>
</TRANSFER>
<INSTANCE DESCRIPTION ="This component pulls all the records from the FACT_TABLE" NAME ="FACT_TABLE" REUSABLE ="NO" INPUT_NAME ="FACT_TABLE" INPUT_TYPE ="Lookup Procedure" TYPE ="INPUT"/>
<INSTANCE DESCRIPTION ="This component pulls all the records from the DIM_TABLE" NAME ="DIM_TABLE" REUSABLE ="NO" INPUT_NAME ="DIM_TABLE" INPUT_TYPE ="Source Qualifier" TYPE ="INPUT">
<ASSOCIATED_INSTANCE NAME ="FACT_TABLE"/>
<ASSOCIATED_INSTANCE NAME ="ANOTHER_TABLE"/>
        </INSTANCE>
<CONNECTOR FROMFIELD ="FROM_FIELD_1" FROMINSTANCE ="COMP_INSERT" FROMINSTANCETYPE ="Router" TOFIELD ="TO_FIELD_1" TOINSTANCE ="FACT_TABLE" TOINSTANCETYPE ="Target Definition"/>
        <CONNECTOR FROMFIELD ="FROM_FIELD_2" FROMINSTANCE ="COMP_INSERT" FROMINSTANCETYPE ="Router" TOFIELD ="TO_FIELD_1" TOINSTANCE ="ANOTHER_TABLE" TOINSTANCETYPE ="Target Definition"/>
<TARGETLOADORDER ORDER ="1" TARGETINSTANCE ="MAIN_TABLE"/>
        <TARGETLOADORDER ORDER ="1" TARGETINSTANCE ="ERROR_TABLE"/>
        <ERPINFO/>
        <METADATAEXTENSION COMPONENTVERSION ="1111111" DATATYPE ="STRING" DESCRIPTION ="name" DOMAIN ="User Defined Metadata Domain" ISEDITABLE ="YES" ISVISIBLE ="YES" ISREUSABLE ="YES" ISSHAREREAD ="NO" ISSHAREWRITE ="NO" MAXLENGTH ="256" NAME ="kk" VALUE ="" VENDORNAME ="ABC"/>
    </MAP>
</FOLDER>
</DATAMART>
</POWERHOUSE>

Results into

 
<INPUT DATABASE ="ORACLE" DBNAME ="UNIX" NAME ="FACT_TABLE" OWNERNAME ="DIPS">
        <INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" NAME ="STORE_KEY" PICTURETEXT ="" PORTTYPE ="INPUT" PRECISION ="15" SCALE ="0"/>
        <INPUTFIELD DATATYPE ="double" DEFAULTVALUE ="" DESCRIPTION ="" EXPRESSION ="VAR_STORE_KEY" EXPRESSIONTYPE ="GENERAL" NAME ="VAR_PREV_STORE_KEY" PICTURETEXT ="" PORTTYPE ="LOCAL VARIABLE" PRECISION ="15" SCALE ="0"/>
</INPUT>
<INSTANCE DESCRIPTION ="This component pulls all the records from the FACT_TABLE" NAME ="FACT_TABLE" REUSABLE ="NO" INPUT_NAME ="FACT_TABLE" INPUT_TYPE ="Lookup Procedure" TYPE ="INPUT"/>
<INSTANCE DESCRIPTION ="This component pulls all the records from the DIM_TABLE" NAME ="DIM_TABLE" REUSABLE ="NO" INPUT_NAME ="DIM_TABLE" INPUT_TYPE ="Source Qualifier" TYPE ="INPUT">
<ASSOCIATED_INSTANCE NAME ="FACT_TABLE"/>
<ASSOCIATED_INSTANCE NAME ="ANOTHER_TABLE"/>
        </INSTANCE>
<CONNECTOR FROMFIELD ="FROM_FIELD_1" FROMINSTANCE ="COMP_INSERT" FROMINSTANCETYPE ="Router" TOFIELD ="TO_FIELD_1" TOINSTANCE ="FACT_TABLE" TOINSTANCETYPE ="Target Definition"/>
        <CONNECTOR FROMFIELD ="FROM_FIELD_2" FROMINSTANCE ="COMP_INSERT" FROMINSTANCETYPE ="Router" TOFIELD ="TO_FIELD_1" TOINSTANCE ="ANOTHER_TABLE" TOINSTANCETYPE ="Target Definition"/>
<TARGETLOADORDER ORDER ="1" TARGETINSTANCE ="MAIN_TABLE"/>
        <TARGETLOADORDER ORDER ="1" TARGETINSTANCE ="ERROR_TABLE"/>
        <ERPINFO/>
        <METADATAEXTENSION COMPONENTVERSION ="1111111" DATATYPE ="STRING" DESCRIPTION ="name" DOMAIN ="User Defined Metadata Domain" ISEDITABLE ="YES" ISVISIBLE ="YES" ISREUSABLE ="YES" ISSHAREREAD ="NO" ISSHAREWRITE ="NO" MAXLENGTH ="256" NAME ="kk" VALUE ="" VENDORNAME ="ABC"/>
    </MAP>
</FOLDER>
</DATAMART>
</POWERHOUSE>

Please suggest how to select only the lines enclosed in tags <INPUT> & <\INPUT> and having NAME as "FACT_TABLE".
-dips

Try /<INPUT.*NAME ="FACT_TABLE"/ instead of /NAME ="FACT_TABLE"/

1 Like

Hi Scrutinizer,

Neat. It worked perfectly fine!! Thankyou.

This kind of "awk" construct means that the lines between FROM KEYWORD & TO KEYWORD be selected. Am I right?

 
awk '/FROM KEYWORD/,/TO KEYWORD/' infile 

-dips

Yes, it is called a "range pattern"