UNIX shell script required to read two columns from xml

Hello All,

I am new to unix scripting. I need to read FROMINSTANCE, FROMFIELD from below XML code and need to write a script for

insert into SQ_EPIC values( "DID", "PROJECT_NAME")
Select DID, PROJECT_NAME from EPIC
<CONNECTOR TOINSTANCETYPE="Source Qualifier" TOINSTANCE="SQ_EPIC" TOFIELD="DID" FROMINSTANCETYPE="Source Definition" FROMINSTANCE="EPIC" FROMFIELD="DID"/>

<CONNECTOR TOINSTANCETYPE="Source Qualifier" TOINSTANCE="SQ_EPIC" TOFIELD="PROJECT_NAME" FROMINSTANCETYPE="Source Definition" FROMINSTANCE="EPIC" FROMFIELD="PROJECT_NAME"/>

<CONNECTOR TOINSTANCETYPE="Target Definition" TOINSTANCE="STAGE_EPIC" TOFIELD="DID" FROMINSTANCETYPE="Source Qualifier" FROMINSTANCE="SQ_EPIC" FROMFIELD="DID"/>

<CONNECTOR TOINSTANCETYPE="Target Definition" TOINSTANCE="STAGE_EPIC" TOFIELD="PROJECT_NAME"FROMINSTANCETYPE="Source Qualifier" FROMINSTANCE="SQ_EPIC" FROMFIELD="PROJECT_NAME"/>

Kindly some help me how to proceed for this requirement.

Thanks in Advance!!!

Regards,
Sekhar Lekkala.

Handling XML is not trivial, but we get asked for it a lot, and I have a script for the purpose:

# yanx.awk v0.0.9, Tyler Montbriand, 2019.  Yet another noncompliant XML parser
###############################################################################
# XML is a pain to process in the shell, but people need it all the time.
# I've been using and improving this kludge since 2014 or so.  It parses and
# stacks tags and digests parameters, allowing simple XML processing and
# extraction to be managed with a handful of lines addendum.
#
# I've restricted my use of GNU features enough that this script will run on
# busybox's awk.  I think it works with mawk except -e is unsupported.
# You can work around that by running multiple files, i.e.
# mawk -f yanx.awk -f mystuff.awk inputfile
###############################################################################
# Basic use:
#
# Fed this XML, <body><html a="b">Your Web Browser Hates This</html></body>
# yanx will read it token-by-token as so:
#     Line 1:  Empty, skipped
#     Line 2:  $1="body"
#     Line 3:  $1="html a="b"", $2="Your web browser hates this"
#     Line 4:  $1="/html"
#     Line 5:  $1="/body", $2="\n"
#
# The script sets a few new "special" variables along the way.
# TAG           The name of the current tag, uppercased.
# CTAG          If close-tag, name in uppercase.
# TAGS          List of nested tags, like HTML%BODY%, including current tag
# LTAGS         List of nested tags, not including current tag
# ARGS          Array of tag parameters, uppercased.  i.e. ARGS["HREF"]
# DEP           How many tags deep it's nested, including current tag.
#
###############################################################################
# Examples:
# # Rewrite cdata of all divs
# awk -f yanx.awk -e 'TAGS ~ /^DIV%/ { $2="quux froob" } 1' input
# # Extract href's from every link
# awk -f yanx.awk -e 'TAGS~/^A%/ && ("HREF" in ARGS) {
#       print ARGS["HREF"] }' ORS="\n" input
###############################################################################
# Known Bugs:
# A short XML script can't possibly handle DOD, etc.  Entities a la <
# are not translated either.
#
# I've done my best to make it swallow <!--, <? ?> and other such fancy
# XML syntax without choking, but that doesn't mean it handles them
# properly either.
#
# It's an XML parser, not an HTML parser.  It probably won't swallow a
# wild-from-the internet HTML web page without some cleanup first:
# javascript, tags inside comments, etc will be mangled instead of ignored.
#
# Last: Because of its design, when printing raw HTML, yanx adds an extra <
# to the end of the file.  This is because < belongs at the beginning of
# a token but awk is told it's printed at the end.  There is no equivalent
# "line prefix" variable that I know of, if you want it to print smarter
# you'll have to print the <'s yourself, by setting ORS=" and
# printing lines like print "<" $0
###############################################################################
BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# After match("qwertyuiop", /rty/)
#       rbefore("qwertyuiop") is "qwe",
#       rmid("qwertyuipo")    is "r"
#       rall("qwertyuiop")    is "rty"
#       rafter("qwertyuiop")  is "uiop"

# !?!?!
# function rbefore(STR)   { return(substr(STR, N, RSTART-1)); }# before match
function rbefore(STR)   { return(substr(STR, 0, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        sub(/\/$/, "", STR);    # Self-closing tags, mumblegrumble
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
(!SPEC) && match($1, /^[^\/ \r\n\t>]+/) {
        CTAG=""
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS, "", "", "");
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        CTAG=toupper($1)
        TAG=""
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        # Update TAG with tag on top of stack, if any
#       if(DEP < 0) {   DEP=0;  TAG=""  }
#       else { TAG=TA[DEP]; }
}

Case in point: Congratulations, this specific XML found a bug in yanx, hence v 0.0.9.

$ awk -f yanx-0.0.9.awk -e 'TAG=="CONNECTOR" { print ARGS["FROMINSTANCE"], ARGS["FROMFIELD"] }' ORS="\n" OFS="\t" input.xml
EPIC    DID
EPIC    PROJECT_NAME
SQ_EPIC DID
SQ_EPIC PROJECT_NAME

$
2 Likes

Thank you so much for your quick response !!!!
I need to pass below output into insert query dynamically.

$ awk -f yanx-0.0.9.awk -e 'TAG=="CONNECTOR" { print ARGS["FROMINSTANCE"], ARGS["FROMFIELD"] }' ORS="\n" OFS="\t" input.xml
EPIC    DID
EPIC    PROJECT_NAME
SQ_EPIC DID
SQ_EPIC PROJECT_NAME
insert into SQ_EPIC values( "DID", "PROJECT_NAME")
Select DID, PROJECT_NAME from EPIC

Please help me
Regards,
Sekhar

From your output into the SQL query: what goes where?

Hello,

Here is my question
In my XML code below I need to fetch TOINSTANCETYPE , FROMINSTANCE , FROMFIELD then need create a Insert statement.
For example if TOINSTANCETYPE="Source Qualifier" then corresponding FROMINSTANCE is a source else if TOINSTANCETYPE="Target Definition" then corresponding FROMINSTANCE is a target.

My insert query looks like

insert into SQ_EPIC values( "DID", "PROJECT_NAME")
Select DID, PROJECT_NAME from EPIC

Here table / column names I need to pass it as dynamically.

Below is XML code I am using

<CONNECTOR TOINSTANCETYPE="Source Qualifier" TOINSTANCE="SQ_EPIC" TOFIELD="DID" FROMINSTANCETYPE="Source Definition" FROMINSTANCE="EPIC" FROMFIELD="DID"/>

<CONNECTOR TOINSTANCETYPE="Source Qualifier" TOINSTANCE="SQ_EPIC" TOFIELD="PROJECT_NAME" FROMINSTANCETYPE="Source Definition" FROMINSTANCE="EPIC" FROMFIELD="PROJECT_NAME"/>

<CONNECTOR TOINSTANCETYPE="Target Definition" TOINSTANCE="STAGE_EPIC" TOFIELD="DID" FROMINSTANCETYPE="Source Qualifier" FROMINSTANCE="SQ_EPIC" FROMFIELD="DID"/>

<CONNECTOR TOINSTANCETYPE="Target Definition" TOINSTANCE="STAGE_EPIC" TOFIELD="PROJECT_NAME"FROMINSTANCETYPE="Source Qualifier" FROMINSTANCE="SQ_EPIC" FROMFIELD="PROJECT_NAME"/>

Thanks in advance!!!

Regards,
Sekhar Lekkala.

You didn't answer my question but just repeated (and extended) the original request. The TOINSTANCETYPE is an extension to your original request. How and where does it fit?

To slightly narrow down my question: from your four line, eight field xml extraction result, whích goes where in a three "variable" (table, column1, colum2) insert statement? Is the order of the fields and results constant, or how are the fields targeted?

Hi Rudi,

Greetings!!!
I am sorry for not responding to your question.

Yes, you are true the three "variable" (table, column1, colum2)
example: for source [EPIC (table name), DID (coulum1), Project_name (column2)]
for Target [SQ_EPIC (table name), DID (coulum1), Project_name (column2)]
Need a script to generate insert script dynamically.

Thanks in Advance!!!
Regards,
Sekhar Lekkala.

Please, please, please - with your next problem, apply utmost care from the beginning, and start with the ENTIRE picture, provide sufficient info and representative data samples, and DON'T change goals!

You can pipe your above result through another awk script for creating the SQL insert statement, but irrespective of the TOINSTANCETYPE as it is not known in the result:

awk -f ... |
awk '{IX[$1]; VAL[$1,NR%2] = $2} END {for (x in IX) print "INSERT INTO", x, "values (\"" VAL[x,1] "\" , \"" VAL[x,0] "\");"}'
INSERT INTO EPIC values ("DID" , "PROJECT_NAME");
INSERT INTO SQ_EPIC values ("DID" , "PROJECT_NAME");

You could even try to implement the "insert" creation into the first awk script; this is left as an exercise to the reader.