Split large XML into multiple files with header and footer in each file

Don_Cragun · February 10, 2019, 9:20pm

Hi karthik,
As usual, I seem to be lost again trying to understand what you are trying to do.

karthik:

Hi Corona,

I have tried the command below. I can find the job_id values, but when I try to join them with a comma separator, new_var is blank. Kindly assist.
JobId =$(cat Response.xml | awk -F"Job_Id>" '{print $2}' | awk -F"<" '{print $1}')
echo $JobId
JOB_ID output here is :12345
23415

With a <space> between the first word of the above command and the <equals-sign>, that command runs a utility named JobId with one or more arguments (depending on the output from the two awk commands in your script) where the first argument's first character is the <equals-sign>. The echo command after that line will print an empty line unless JobId had been assigned a value somewhere else.
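
A minimal sketch of that difference, using a made-up value of 12345 (the value and the status message are assumptions for illustration, not part of the original script):

```shell
# Correct: no space around '=' makes this a variable assignment.
JobId=12345
echo "assigned: $JobId"

# Incorrect: with a space, the shell looks for a command named JobId
# and passes it the single argument '=12345'.
unset JobId
status=""
if ! JobId =12345 2>/dev/null; then
    status="JobId was treated as a command and could not be run"
fi
echo "$status"
echo "JobId is now: '${JobId:-}'"   # the variable was never assigned
```

This assumes no utility named JobId exists on the search path; if one did, it would be run with =12345 as its operand.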

Furthermore, JOB_ID and JobId are not even close to being the same thing AND with no CODE tags I have no idea what value you are saying the output of that awk command was. And, since you haven't shown us any sample file named Response.xml we have no way of recreating the input you are feeding into that script to try it out ourselves.

The echo of an unquoted variable expansion is only going to produce a single <newline> character at the end of the single line of output it produces; so you can't possibly want tr to change that <newline> character into a <comma>? And, you're invoking sed two more times to enclose your results in <single-quote>s, but you don't show any <single-quote>s in the expected output you say you want to store into the shell variable NEW_VAR . So, again, I'm very confused about how this code might be expected to produce the output you want.
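
A short sketch of why the unquoted echo produces a single line, assuming a variable that holds two job IDs separated by a <newline> (the sample value is hypothetical):

```shell
# Assumed sample value: two job IDs separated by a newline.
JobId='12345
23415'

unquoted=$(echo $JobId)      # word splitting joins the IDs with a single space
quoted=$(echo "$JobId")      # quoting preserves the embedded newline

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

With the unquoted form there is no <newline> left between the IDs for tr to convert; only the one at the end of the line remains.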

Unless you want to run a loop processing the individual job IDs produced by the first awk script above, why not just have it print the results you want in the format in which you want them to be printed instead of producing output you don't want followed by invoking four more utilities to reformat the output?

Making the wild assumption that the awk script you showed us produced two output lines with each containing a single Job ID and that you want to set NEW_VAR to a string containing just those two Job IDs separated by a <comma>, we should be able to produce a MUCH more efficient single awk command to produce the output you want to store into the variable NEW_VAR . But, of course, with no sample input to work with, I can't determine whether the script above that calls awk twice is expecting to find two Job IDs on a single line input file or is expecting to find one Job ID on each of two input lines. Therefore, I can't suggest an awk command that might work for you. Either way there is no need for two awk commands, two sed commands, a tr command, and two command substitutions.

karthik · February 10, 2019, 9:32pm

Hi Don Cragun,

Please forget about the above awk commands; they would only be confusing. Below is the sample XML file.
I want the JOB_ID string values to be extracted and assigned to a variable NEW_VAR

Output Expected:

 NEW_VAR ='30544,30545,30546'

This value i will pass to Database later

 <?xml version='1.0' encoding='UTF-8'?><S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"><S:Body><ns5:DoPublishFromImportResponse xmlns:ns12="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1/response" xmlns:ns11="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1/response" xmlns:ns10="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1/response" xmlns:ns9="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1" xmlns:ns8="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1/request" xmlns:ns7="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1/request" xmlns:ns6="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1" xmlns:ns5="oracle/documaker/schema/ws/publishing" xmlns:ns4="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1" xmlns:ns3="oracle/documaker/schema/common" xmlns:ns2="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1/request" xmlns="oracle/documaker/schema/ws/publishing/common"><ns5:DoPublishFromImportResponseV1><Result>0</Result><ServiceTimeMillis>13</ServiceTimeMillis><ns6:JobResponse CorrelationId="?"><ns11:JobPayloadType>1</ns11:JobPayloadType><ns11:JobPriority>10</ns11:JobPriority><ns11:JobStatus>111</ns11:JobStatus><ns11:JobUnique_Id>010d9363-6362-4f66-a48a-b3a1e4b90bc9</ns11:JobUnique_Id><ns11:Job_Id>30544</ns11:Job_Id></ns6:JobResponse><ns6:ServiceInfo><ns3:Operation>doPublishFromImport</ns3:Operation><ns3:Version><ns3:Number>1</ns3:Number><ns3:Used>true</ns3:Used></ns3:Version></ns6:ServiceInfo></ns5:DoPublishFromImportResponseV1></ns5:DoPublishFromImportResponse></S:Body></S:Envelope><?xml version='1.0' encoding='UTF-8'?><S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"><S:Body><ns5:DoPublishFromImportResponse xmlns:ns12="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1/response" xmlns:ns11="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1/response" 
xmlns:ns10="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1/response" xmlns:ns9="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1" xmlns:ns8="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1/request" xmlns:ns7="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1/request" xmlns:ns6="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1" xmlns:ns5="oracle/documaker/schema/ws/publishing" xmlns:ns4="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1" xmlns:ns3="oracle/documaker/schema/common" xmlns:ns2="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1/request" xmlns="oracle/documaker/schema/ws/publishing/common"><ns5:DoPublishFromImportResponseV1><Result>0</Result><ServiceTimeMillis>14</ServiceTimeMillis><ns6:JobResponse CorrelationId="?"><ns11:JobPayloadType>1</ns11:JobPayloadType><ns11:JobPriority>10</ns11:JobPriority><ns11:JobStatus>111</ns11:JobStatus><ns11:JobUnique_Id>f8268dda-9357-45ec-baab-e6fbb30744bd</ns11:JobUnique_Id><ns11:Job_Id>30545</ns11:Job_Id></ns6:JobResponse><ns6:ServiceInfo><ns3:Operation>doPublishFromImport</ns3:Operation><ns3:Version><ns3:Number>1</ns3:Number><ns3:Used>true</ns3:Used></ns3:Version></ns6:ServiceInfo></ns5:DoPublishFromImportResponseV1></ns5:DoPublishFromImportResponse></S:Body></S:Envelope><?xml version='1.0' encoding='UTF-8'?><S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"><S:Body><ns5:DoPublishFromImportResponse xmlns:ns12="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1/response" xmlns:ns11="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1/response" xmlns:ns10="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1/response" xmlns:ns9="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1" xmlns:ns8="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1/request" xmlns:ns7="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1/request" 
xmlns:ns6="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1" xmlns:ns5="oracle/documaker/schema/ws/publishing" xmlns:ns4="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1" xmlns:ns3="oracle/documaker/schema/common" xmlns:ns2="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1/request" xmlns="oracle/documaker/schema/ws/publishing/common"><ns5:DoPublishFromImportResponseV1><Result>0</Result><ServiceTimeMillis>12</ServiceTimeMillis><ns6:JobResponse CorrelationId="?"><ns11:JobPayloadType>1</ns11:JobPayloadType><ns11:JobPriority>10</ns11:JobPriority><ns11:JobStatus>111</ns11:JobStatus><ns11:JobUnique_Id>35b40e14-77b8-4f63-80c4-6ac0d8020985</ns11:JobUnique_Id><ns11:Job_Id>30546</ns11:Job_Id></ns6:JobResponse><ns6:ServiceInfo><ns3:Operation>doPublishFromImport</ns3:Operation><ns3:Version><ns3:Number>1</ns3:Number><ns3:Used>true</ns3:Used></ns3:Version></ns6:ServiceInfo></ns5:DoPublishFromImportResponseV1></ns5:DoPublishFromImportResponse></S:Body></S:Envelope><?xml version='1.0' encoding='UTF-8'?><S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"><S:Body><ns5:DoPublishFromImportResponse xmlns:ns12="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1/response" xmlns:ns11="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1/response" xmlns:ns10="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1/response" xmlns:ns9="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1" xmlns:ns8="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1/request" xmlns:ns7="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1/request" xmlns:ns6="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1" xmlns:ns5="oracle/documaker/schema/ws/publishing" xmlns:ns4="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1" xmlns:ns3="oracle/documaker/schema/common" xmlns:ns2="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1/request" 
xmlns="oracle/documaker/schema/ws/publishing/common"><ns5:DoPublishFromImportResponseV1><Result>0</Result><ServiceTimeMillis>15</ServiceTimeMillis><ns6:JobResponse CorrelationId="?"><ns11:JobPayloadType>1</ns11:JobPayloadType><ns11:JobPriority>10</ns11:JobPriority><ns11:JobStatus>111</ns11:JobStatus><ns11:JobUnique_Id>9e4e8e04-167f-46dd-9801-27776728fe05</ns11:JobUnique_Id><ns11:Job_Id>30547</ns11:Job_Id></ns6:JobResponse><ns6:ServiceInfo><ns3:Operation>doPublishFromImport</ns3:Operation><ns3:Version><ns3:Number>1</ns3:Number><ns3:Used>true</ns3:Used></ns3:Version></ns6:ServiceInfo></ns5:DoPublishFromImportResponseV1></ns5:DoPublishFromImportResponse></S:Body></S:Envelope><?xml version='1.0' encoding='UTF-8'?><S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"><S:Body><ns5:DoPublishFromImportResponse xmlns:ns12="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1/response" xmlns:ns11="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1/response" xmlns:ns10="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1/response" xmlns:ns9="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1" xmlns:ns8="oracle/documaker/schema/ws/publishing/doGetPublishingInfo/v1/request" xmlns:ns7="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1/request" xmlns:ns6="oracle/documaker/schema/ws/publishing/doPublishFromImport/v1" xmlns:ns5="oracle/documaker/schema/ws/publishing" xmlns:ns4="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1" xmlns:ns3="oracle/documaker/schema/common" xmlns:ns2="oracle/documaker/schema/ws/publishing/doPublishFromFactory/v1/request" xmlns="oracle/documaker/schema/ws/publishing/common"><ns5:DoPublishFromImportResponseV1><Result>0</Result><ServiceTimeMillis>15</ServiceTimeMillis><ns6:JobResponse 
CorrelationId="?"><ns11:JobPayloadType>1</ns11:JobPayloadType><ns11:JobPriority>10</ns11:JobPriority><ns11:JobStatus>111</ns11:JobStatus><ns11:JobUnique_Id>cfd9fba3-bc37-4f2f-936e-7b38f7c59f57</ns11:JobUnique_Id><ns11:Job_Id>30548</ns11:Job_Id></ns6:JobResponse><ns6:ServiceInfo><ns3:Operation>doPublishFromImport</ns3:Operation><ns3:Version><ns3:Number>1</ns3:Number><ns3:Used>true</ns3:Used></ns3:Version></ns6:ServiceInfo></ns5:DoPublishFromImportResponseV1></ns5:DoPublishFromImportResponse></S:Body></S:Envelope>

Don_Cragun · February 10, 2019, 11:24pm

Hi Karthik,
PLEASE pay attention to what you are doing! There cannot be a <space> between the name of a shell variable and the <equals-sign> that follows it if you are trying to assign a value to that variable. This has been said several times in this thread and yet you still write that you want the result to be:

 NEW_VAR ='30544,30545,30546'

which, as stated before tells the shell to run a utility named NEW_VAR with one operand that is the string =30544,30545,30546 and note that that operand does not contain the <single-quote> characters that will be removed by the shell as it prepares the arguments to be passed to the NEW_VAR utility when it is invoked.
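
One way to see this for yourself is to define a shell function that stands in for the NEW_VAR utility and reports its operands. The function below is hypothetical and exists only for this demonstration:

```shell
# Hypothetical stand-in for a utility named NEW_VAR: it just reports its operands.
NEW_VAR() {
    printf 'operand: %s\n' "$@"
}

# With the space before '=', this is a command invocation, not an assignment.
seen=$(NEW_VAR ='30544,30545,30546')
echo "$seen"
```

The reported operand starts with the <equals-sign> and contains no <single-quote> characters, because the shell removes the quotes before the function is called.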

Note also that you have not told us what operating system you're using. With a sample file that is 8,157 bytes long and contains only a single line, that is not a text file on many BSD, Linux, and UNIX systems and the awk , sed , and most other standard text processing utilities have undefined behavior if the input files being processed are not text files.

Note also that you say that the output to be produced from your sample input should have three numbers (Job IDs) in the output, but there are five Job IDs in the sample input? Why shouldn't all five values be extracted from the XML file?

If we assume that the awk utility on your system can handle text files with unlimited line lengths, the following might do what you want:

NEW_VAR=$(awk -v sq="'" -F'<ns11:Job_Id>' '
		{	for(i = 2; i <= NF; i++) {
				sub(/<.*/, "", $i)
				printf("%s%s", cnt++ ? "," : sq, $i)
			}
		}
		END {	print sq
		}' file
	)

printf 'NEW_VAR has been assigned the value: %s\n' "$NEW_VAR"

which, on macOS Mojave version 10.14.3, produces the output:

NEW_VAR has been assigned the value: '30544,30545,30546,30547,30548'

if the file named file contains the sample data you provided in post #22 in this thread.
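
For anyone who wants to try the command without the full response file, here is a small sketch that runs the same awk logic against a trimmed-down, made-up input with two responses on one line. The temporary file and the shortened XML are assumptions for illustration only:

```shell
# Trimmed-down, hypothetical stand-in for the response file: two responses on one line.
sample=$(mktemp)
printf '%s' '<r><ns11:Job_Id>30544</ns11:Job_Id></r>' \
            '<r><ns11:Job_Id>30545</ns11:Job_Id></r>' > "$sample"
printf '\n' >> "$sample"

NEW_VAR=$(awk -v sq="'" -F'<ns11:Job_Id>' '
    {   for (i = 2; i <= NF; i++) {
            sub(/<.*/, "", $i)                      # drop everything after the ID
            printf("%s%s", cnt++ ? "," : sq, $i)    # opening quote first, commas after
        }
    }
    END { print sq }' "$sample")
rm -f "$sample"
printf '%s\n' "$NEW_VAR"
```

Note that if the input contains no Job_Id elements at all, the END block still prints the closing <single-quote>, so the variable would hold a lone quote character; a caller may want to check for that case.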

karthik · February 10, 2019, 11:42pm

I am using Linux and the file is an .xml file. And I need all the job_id values, as you mentioned, not just 3.

--- Post updated at 04:42 AM ---

Thanks a lot it worked and my apologies for all the confusion.

Don_Cragun · February 10, 2019, 11:42pm

OK. So does the code I suggested in post #23 produce the output you want if you change the name of the file in the script to match the name of your input file?

karthik · February 10, 2019, 11:44pm

Yes Don Cragun it worked thanks for your help .

karthik · February 17, 2019, 10:59pm

Hi Don Cragun/Rudic ,

I have built the script based on all the inputs. One last thing is renaming the files: it is still creating just one standard file name. Kindly assist.

The command below is not creating unique names as expected

sample input:

filename: sampletest.xml
          sampletest_111.xml

Actual Output:

Extrfile001.xml just 1 file is getting created

Expected Output:

Extrfile001.xml
Extrfile002.xml

arr=($(ls | grep "../Inbound/Extrfile[0-9]*.xml"))

#!/bin/sh

# Add all Input files to array
FileList=($(ls | grep "../Inbound/sampletest*\\_[0-9]"))
  
echo  "$FileList"  

#loop array for Input files

for x in "${FileList[@]}"
do
 #for each element in array
 

#File Split Begin
awk -f xml_tag_handler.awk -f File_split.awk OUT=$x"" ROWS="500" $x $x
mv $x ../Staging
done

rm Response.xml Extr*.xml 


for f in ../Inbound/sampletest_*
  do    TMP="${f/sampletest_/Extrfile}"
         mv "$f" "${TMP%.*}"
  done

# add all files to array
arr=($(ls | grep "../Inbound/Extrfile[0-9]*.xml"))

Don_Cragun · February 17, 2019, 11:47pm

Hi karthik,
All of the code above that depends on the ls | grep pipelines whose patterns begin with ../Inbound/ (that code was marked in red in the original post) will ALWAYS expand to nothing because the output from ls when invoked with no operands will NEVER yield any string containing ../ . Therefore the script you showed us is logically equivalent to the script:

#!/bin/sh

# Add all Input files to array
FileList=()
  
echo  ""  
#loop array for Input files
#for each element in array
#File Split Begin

rm Response.xml Extr*.xml

for f in ../Inbound/sampletest_*
  do    TMP="${f/sampletest_/Extrfile}"
         mv "$f" "${TMP%.*}"
  done

# add all files to array
arr=()

I assume that you are not getting what you want because you never run any of the awk scripts in your shell script; you only move around and change the names of files that already existed before you started running this script.
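
A sketch of the difference between filtering ls output and letting the shell expand a pathname pattern. The temporary directory layout below is hypothetical and only mimics an Inbound directory next to the directory the script runs in:

```shell
# Hypothetical layout for illustration: work/Inbound holds the input files.
work=$(mktemp -d)
mkdir "$work/Inbound" "$work/scripts"
touch "$work/Inbound/sampletest_111.xml" "$work/Inbound/sampletest_112.xml"
cd "$work/scripts" || exit 1

# ls prints bare names from the current directory, so nothing contains "../".
from_ls=$(ls | grep '\.\./Inbound/' || true)

# A pathname pattern is expanded by the shell itself and keeps the directory prefix.
found=""
for x in ../Inbound/sampletest_[0-9]*.xml; do
    [ -e "$x" ] || continue        # skip the literal pattern when nothing matches
    found="$found ${x##*/}"
done

echo "ls | grep found: '${from_ls}'"
echo "pattern found:${found}"
cd / && rm -rf "$work"
```

Looping over the pattern directly also avoids the word splitting that happens when command output is stored in an array.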

karthik · February 18, 2019, 12:30am

Hi Don,

My apologies for confusing you again. The AWK commands are working perfectly fine and they split the file correctly, as expected.

Hope I am not confusing you further

1) If my input file name is sampletest_111.xml, then after the AWK command the file name will be like sampletest_111.xml.0001
2) sampletest_111.xml.0001 is renamed to Extrfile111.xml
3) When there are multiple input files, AWK splits them and creates unique files, but the piece of code below does not rename the files in sequence; everything ends up in one file.
Expected output: Extrfile111.xml, Extrfile1112.xml, etc. I mean unique names.

for f in ../Inbound/sampletest_*
  do    TMP="${f/sampletest_/Extrfile}"
         mv "$f" "${TMP%.*}"
  done

Total code :

#!/bin/sh

#pass all Input files to array
FileList=($(ls | grep "sampletest*\\_[0-9]"))
  
echo  "$FileList"  

#loop array for Input files

for x in "${FileList[@]}"
do
 #for each element in array
 

#File Split Begin
awk -f xml_tag_handler.awk -f File_split.awk OUT=$x"" ROWS="500" $x $x
mv $x ../
done

rm Response.xml Extr*.xml

for f in sampletest_*
  do    TMP="${f/sampletest_/Extrfile}"
         mv "$f" "${TMP%.*}"
         echo "$f"
  done

# add all files to array
arr=($(ls | grep "Extrfile[0-9]*.xml"))


 #loop array
for i in "${arr[@]}"
do
 #for each element in array
  echo "$i"

   sed -i '/<com1:URI>/c\<com1:URI>file:///tmp/karthik/'$i'</com1:URI>' soaprequest.xml
  
#WebService Call Begin
sleep 5
curl --header "Content-Type: text/xml;charset=UTF-8" --data @soaprequest.xml {WSDLURL} --insecure >> Response.xml
echo ":Webservice call Begin"
done

  sed -i '/<com1:URI>/c\<com1:URI>file:///tmp/karthik/'$i'</com1:URI>' soaprequest.xml
  
echo ":Webservice call End"

NEW_VAR=$(awk -v sq="'" -F'<ns11:Job_Id>' '
		{	for(i = 2; i <= NF; i++) {
				sub(/<.*/, "", $i)
				printf("%s%s", cnt++ ? "," : sq, $i)
			}
		}
		END {	print sq
		}' Response.xml	
	)

printf 'NEW_VAR has been assigned the value: %s\n' "$NEW_VAR"

#End Web Service Call

xml_tag_handler.awk:

###############################################################################
BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# After match("qwertyuiop", /rty/)
#       rbefore("qwertyuiop") is "qwe",
#       rmid("qwertyuipo")    is "r"
#       rall("qwertyuiop")    is "rty"
#       rafter("qwertyuiop")  is "uiop"

# !?!?!
# function rbefore(STR)   { return(substr(STR, N, RSTART-1)); }# before match
function rbefore(STR)     { return(substr(STR, 0, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
(!SPEC) && match($1, /^[^\/ \r\n\t>]+/) {
        CTAG=""
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS, "", "", "");
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        CTAG=toupper($1)
        TAG=""
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        # Update TAG with tag on top of stack, if any
#       if(DEP < 0) {   DEP=0;  TAG=""  }
#       else { TAG=TA[DEP]; }
}
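
A small sketch of what the RS and FS settings at the top of xml_tag_handler.awk do to the input: each record starts right after a "<", and the first field is the tag text up to ">". The one-line input and the output format are made up for illustration:

```shell
# Hypothetical one-line input; no trailing newline so the last record is clean.
tokens=$(printf '%s' '<a><b>text</b></a>' |
    awk 'BEGIN { FS = ">"; RS = "<" }
         NR == 1 { next }                      # empty record before the first "<"
         { printf "tag=[%s] text=[%s]\n", $1, $2 }')
printf '%s\n' "$tokens"
```

Each output line pairs one tag (opening or closing) with the text that follows it, which is the view of the document that the rest of the script works on.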

File_split.awk

BEGIN {
        ORS=""
        #OUT="x."
        ROWS=5
        ROWTAG="^RECIPIENT[0-9]*$"
        HDRTAG="^DOCUMENTSET$"
        FTRTAG="^DOCUMENTSET$"
}

# First pass, remember headers and footers
NR==FNR {
        if(!HDREND)
        {
                HDR=HDR RS $1 OFS $2
                if(TAG ~ HDRTAG) HDREND=FNR
                next
        }

        if(FTRSTART || (CTAG ~ FTRTAG))
        {
                FTR=FTR RS $1 OFS $2
                if(CTAG ~ FTRTAG) FTRSTART=FNR
        }

        next
}

# Skip header and footer
(FNR <= HDREND) || (FNR >= FTRSTART) { next }

# Close output file once enough DOCUMENT records
((XNR%(ROWS+1)) == 0) {
#       printf("FNR==%d XNR==%d FILE=%s\n", FNR, XNR, FILE)>"/dev/stderr"
        if(!length(OUT)) FBASE=FILENAME "."
                else FBASE = OUT "."
				
        if(FILE) {
                print FTR > FILE
                close(FILE);
        }

        FILE=sprintf("%s%04d", FBASE,++FILENUM);
        print HDR > FILE
        XNR++
}

{       print RS $0 > FILE      }

CTAG ~ ROWTAG { XNR++ }

END {   if(FILE) print FTR > FILE       }
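
File_split.awk relies on the input file being named twice on the command line, so that NR==FNR is true only during the first pass. A minimal sketch of that two-pass idiom on a made-up three-line file (the file contents and output format are assumptions for illustration):

```shell
# Hypothetical three-line input file.
data=$(mktemp)
printf '%s\n' alpha beta gamma > "$data"

# Pass 1 (NR == FNR): only count lines. Pass 2: print each line with the total.
numbered=$(awk 'NR == FNR { total++; next }
                { printf "%d/%d %s\n", FNR, total, $0 }' "$data" "$data")
rm -f "$data"
printf '%s\n' "$numbered"
```

The first pass gathers information that is only complete at the end of the file (here the line count; in File_split.awk the header and footer), and the second pass uses it.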

Post #8 in this thread has the sample XML structure for your reference.

Don_Cragun · February 18, 2019, 1:12am

Hi karthik,
I am not confused at all this time. Please go back and read closely what I said in post #28!

I don't care how well your awk script works when you invoke it with the name of a file to be processed. The script you showed us in post #27 NEVER EVER under any circumstances invokes awk ; not even once! And, if your script doesn't run awk , it just plain is not possible that awk is splitting anything.

Showing us a few hundred more lines of awk code doesn't alter the fact that you are never running that code.

Until you alter the code so that it initializes the FileList array correctly, there is nothing else to talk about. If you change the sixth line in your script from:

echo "$FileList"

to:

echo "Files to be processed: ${FileList[@]}"

and look at the output that line produces when you run your script, maybe you'll believe me. And, yes, I noticed that you changed the way you initialized that array from:

FileList=($(ls | grep "../Inbound/sampletest*\\_[0-9]"))

to:

FileList=($(ls | grep "sampletest*\\_[0-9]"))

but it doesn't alter the fact that the FileList array will still be an empty array and your awk script will never be executed. The empty line produced by the echo in your script should have been a strong indication to you that something was wrong, but you seem to be ignoring that fact. With the above change, hopefully it will be crystal clear.

The grep utility takes a basic regular expression as its first operand; not a filename matching pattern. BREs and filename matching patterns have some similarities, but they are not the same. Since none of your filenames contain a literal backslash character (i.e. \ ), the grep can't match any lines in the output produced by ls .

You would do well to change the second line in your script from an empty line to:

set -xv

to enable tracing so you can actually see what your script is doing.
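
A sketch of what tracing reveals when a list turns out to be empty. It uses only set -x (the -v option additionally echoes each input line as the shell reads it); the function name and the sample file name are hypothetical:

```shell
# Hypothetical snippet: the grep pattern can never match, so the loop body never runs.
demo() {
    set -x
    list=$(printf '%s\n' sampletest_111.xml | grep '\.\./Inbound/' || true)
    for x in $list; do
        echo "processing $x"
    done
    set +x
}
trace=$(demo 2>&1)
printf '%s\n' "$trace"
```

In the trace, the assignment to list shows an empty value and no echo command is ever traced, which makes it obvious that the loop body was skipped.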

karthik · February 18, 2019, 1:29am

Hi Don,

See below: it is able to find the input files, and I have pasted my output in debug mode. It is able to rename only one file, Extrfile112.xml, whereas it ignored or was not able to read sampletest_111.xml. That is the issue.

+ FileList=($(ls | grep "sampletest*\\_[0-9]"))
++ ls
++ grep 'sampletest*\_[0-9]'
+ echo sampletest_111.xml
sampletest_111.xml
+ echo 'Files to be processed: sampletest_111.xml' sampletest_112.xml
Files to be processed: sampletest_111.xml sampletest_112.xml
+ for x in '"${FileList[@]}"'
+ awk -f xml_tag_handler.awk -f File_split.awk OUT=sampletest_111.xml ROWS=500 sampletest_111.xml sampletest_111.xml
+ mv sampletest_111.xml ../
+ for x in '"${FileList[@]}"'
+ awk -f xml_tag_handler.awk -f File_split.awk OUT=sampletest_112.xml ROWS=500 sampletest_112.xml sampletest_112.xml
+ mv sampletest_112.xml ../
+ rm Response.xml 'Extr*.xml'
rm: cannot remove `Extr*.xml': No such file or directory
+ for f in 'sampletest_*'
+ TMP=Extrfile112.xml.0001
+ mv sampletest_112.xml.0001 Extrfile112.xml
+ echo sampletest_112.xml.0001
sampletest_112.xml.0001
+ arr=($(ls | grep "Extrfile[0-9]*.xml"))
++ ls
++ grep 'Extrfile[0-9]*.xml'
+ for i in '"${arr[@]}"'
+ echo Extrfile112.xml
Extrfile112.xml
+ sed -i '/<com1:URI>/c\<com1:URI>file:///tmp/karthik/Extrfile112.xml</com1:URI>' soaprequest.xml
+ sleep 5
+ curl --header 'Content-Type: text/xml;charset=UTF-8' --data @soaprequest.xml https://cobodmsoa-vip.dev4.cbd.extnp.national.com.au:8002/DWSAL1/PublishingService --insecure
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3196    0  1631  100  1565    793    761  0:00:02  0:00:02 --:--:--  1603
+ echo ':Webservice call Begin'

Don_Cragun · February 18, 2019, 2:51am

OK. You lucked out... The BRE sampletest*\\_[0-9] tells grep to match and print lines that contain the string sampletes followed by zero or more occurrences of t followed by whatever unspecified characters are matched by the character sequence \_ on the regular expression matching engine used on your operating system followed by a decimal digit. It looks like your operating system's RE matching engine chooses to use that sequence to match an underscore character.
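
A sketch contrasting how the same characters behave as a BRE and as a shell pattern. The names are hypothetical, and the backslash sequence is left out so the example does not depend on unspecified behavior:

```shell
# Hypothetical names: only the last one is a file we actually want.
names='sampletes_1
sampletesttt_2
sampletest_111.xml'

# As a BRE, "t*" means "zero or more t", so all three names match.
bre_hits=$(printf '%s\n' "$names" | grep -c 'sampletest*_[0-9]')

# As a shell pattern, "*" means "any string"; count names matching sampletest_[0-9]*.
glob_hits=0
for n in $names; do
    case $n in
        sampletest_[0-9]*) glob_hits=$((glob_hits + 1)) ;;
    esac
done
echo "BRE matches: $bre_hits, shell-pattern matches: $glob_hits"
```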

To match the filenames you want to process, the following BRE would work more reliably:

grep 'sampletest_[0-9][0-9]*.xml'

If you want to exclude matching filenames like sampletest_112.xml.0001 , you could force the xml to only be matched at the end of a filename with:

grep 'sampletest_[0-9][0-9]*.xml$'
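
A sketch of the effect of anchoring, with one further tightening that is not in the post above: escaping the period so that it matches only a literal dot rather than any character. The sample names are made up:

```shell
names='sampletest_112.xml
sampletest_112.xml.0001
sampletest_112Xxml'

# Unanchored, with "." meaning "any character": all three lines match.
loose=$(printf '%s\n' "$names" | grep -c 'sampletest_[0-9][0-9]*.xml')

# Escaped dot plus "$": only the real .xml input file matches.
strict=$(printf '%s\n' "$names" | grep 'sampletest_[0-9][0-9]*\.xml$')
echo "loose count: $loose"
echo "strict match: $strict"
```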

Now that we have gotten past that... What statement in your script is failing to do what you want it to do? What are the arguments being passed to that command according to the trace output you're seeing? What arguments did you hope would be passed to that command instead of the arguments that are actually being passed to that command?

karthik · February 18, 2019, 6:25pm

Hi Don,

I have corrected the grep command as suggested. Now the issue is that after the split there are multiple files with endings like the ones below, but look at the mv command: it moves all the files to one single file, basically Extrfile110.xml. It should create a new unique file each time. Kindly suggest where I am going wrong.

ex:
sampletest_111.xml.0001
sampletest_112.xml.000i

+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0001
+ mv sampletest_110.xml.0001 Extrfile110.xml
+ echo sampletest_110.xml.0001
sampletest_110.xml.0001
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0002
+ mv sampletest_110.xml.0002 Extrfile110.xml
+ echo sampletest_110.xml.0002
sampletest_110.xml.0002
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0003
+ mv sampletest_110.xml.0003 Extrfile110.xml
+ echo sampletest_110.xml.0003
sampletest_110.xml.0003
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0004
+ mv sampletest_110.xml.0004 Extrfile110.xml
+ echo sampletest_110.xml.0004
sampletest_110.xml.0004
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0005
+ mv sampletest_110.xml.0005 Extrfile110.xml
+ echo sampletest_110.xml.0005
sampletest_110.xml.0005
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0006
+ mv sampletest_110.xml.0006 Extrfile110.xml
+ echo sampletest_110.xml.0006
sampletest_110.xml.0006
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0007
+ mv sampletest_110.xml.0007 Extrfile110.xml
+ echo sampletest_110.xml.0007
sampletest_110.xml.0007
+ for f in 'sampletest_*'
+ TMP=Extrfile110.xml.0008
+ mv sampletest_110.xml.0008 Extrfile110.xml

--- Post updated at 11:25 PM ---

The mv command below is the issue: it creates the same file name every time

for f in sampletest_*
  do    TMP="${f/sampletest_/Extrfile}"
         mv "$f" "${TMP%.*}"
echo "$f"
 done

Don_Cragun · February 18, 2019, 7:45pm

You have an awk script that is creating uniquely named files. You then add another loop following your awk script that removes the final part of those filenames taking away the part that makes them unique and guarantees that only the last file created by each invocation of your awk script will be kept as the renaming loop overwrites each of the output files with the next output file in sequence.

If you want unique names, why do you have the renaming loop that intentionally strips off the part of their names that makes them unique?
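
A sketch of the name collision, computing the rename targets without moving anything. It uses the POSIX expansion ${f#sampletest_} in place of the bash-only ${f/sampletest_/Extrfile} so it also runs under a plain sh; for these names the result is the same. The three file names are hypothetical:

```shell
# Hypothetical split-file names, as produced for one input file.
targets=""
for f in sampletest_110.xml.0001 sampletest_110.xml.0002 sampletest_110.xml.0003; do
    TMP="Extrfile${f#sampletest_}"   # same result as ${f/sampletest_/Extrfile} here
    targets="$targets ${TMP%.*}"     # %.* strips the final ".000N" suffix
done
echo "rename targets:$targets"
```

All three sources map to the same target, so each mv would overwrite the file produced by the previous one.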

karthik · February 18, 2019, 7:56pm

Hi Don,

The reason I am renaming the split files is to give them proper .xml names; after that the script invokes a WSDL.
My expected output should look like the names below (any sequence will do, but .xml should be at the end):

sampletest_110.xml.0004 to Extrfile110_4.xml or Extrfile1104.xml
sampletest_110.xml.0005  to   Extrfile110_5.xml  or Extrfile1105.xml
sampletest_111.xml.0001  to  Extrfile111_1.xml   or Extrfile1111.xml

Don_Cragun · February 18, 2019, 11:06pm

Hi karthik,
You mean that after 34 posts have been entered into this thread, you're now finally going to tell us what output filenames you want your script to create. We would all have saved ourselves a lot of agony if you had told us the names of the files you wanted to create in post #1 in this thread!

One could, of course, change:

for f in sampletest_*
  do    TMP="${f/sampletest_/Extrfile}"
         mv "$f" "${TMP%.*}"
echo "$f"
 done

to:

for f in sampletest_*
do	TMP=${f/sampletest_/Extrfile}
	TMP=${TMP/.xml./_}.xml
	echo mv "$f" "$TMP"
done

and remove the echo if the names match what you're expecting. And, you can modify that to strip off the leading zeros we created for you if you want to, but I'm not going to show you how to do that since I think it would be a mistake that you would later regret (as explained in a couple of earlier posts in this thread). But you'd be much better off changing:

        if(!length(OUT)) FBASE=FILENAME "."
                else FBASE = OUT "."

        ... ... ...

        FILE=sprintf("%s%04d", FBASE,++FILENUM);

in File_split.awk to just set FILE to the name of the output file you really want to create to begin with instead of creating a bunch of files with the wrong names and adding another loop of code to fix what you did incorrectly the first time.
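
One possible shape for that change, as a sketch only: a helper function that builds the final name directly from the input file name and the sequence number. The function out_name() is hypothetical and not part of the original File_split.awk, and it assumes the input name has no directory component:

```shell
# out_name() is a hypothetical helper, not part of the original File_split.awk.
name=$(awk '
    function out_name(input, n,    base) {
        base = input
        sub(/^sampletest_/, "Extrfile", base)   # sampletest_110.xml -> Extrfile110.xml
        sub(/\.xml$/, "", base)                 # Extrfile110.xml    -> Extrfile110
        return sprintf("%s_%04d.xml", base, n)  # keep the zero-padded sequence number
    }
    BEGIN { print out_name("sampletest_110.xml", 4) }')
echo "$name"
```

Inside File_split.awk the sprintf line that sets FILE could then call something like out_name(FILENAME, ++FILENUM) instead, which would make the separate renaming loop unnecessary.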

karthik · February 19, 2019, 12:33am

Hi Don,

Across these 34 posts I have asked several different questions, all related to building one bash script. I am a beginner, and I know that I caused confusion in a couple of posts; hopefully I will not repeat that.

Regards,
Kart