Split a large XML file into multiple files, each with the header and footer

I tried the command below. It splits the file unevenly, and I also need help adding the header and footer to each output file.
Command:

csplit -s -k -f my_XML_split.xml extrfile.xml "/<Document>/" {1}
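
A note on why that command splits unevenly: the trailing {1} tells csplit to repeat the pattern only once more, so the input is cut in at most two places no matter how many <Document> elements follow. The sketch below shows this on a made-up seven-line file; the file name, the piece. prefix, and the contents are illustrative assumptions, not the real data.

```shell
# Work in a scratch directory so nothing real is touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical miniature of the input: header, four documents, footer.
printf '%s\n' '<Recipient>' '<Header>h</Header>' \
    '<Document>1</Document>' '<Document>2</Document>' \
    '<Document>3</Document>' '<Document>4</Document>' \
    '<Footer>f</Footer>' > mini.xml

# Same options as the command above: split at /<Document>/, repeat once.
csplit -s -k -f piece. mini.xml '/<Document>/' '{1}'

pieces=$(ls piece.* | wc -l)
last_lines=$(wc -l < piece.02)
echo "pieces: $pieces, lines in last piece: $last_lines"
```

The header lands in the first piece, one document in the second, and everything else (three documents plus the footer) in the third, which is the uneven result described above. None of the pieces gets a copy of the header or footer.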

Sample XML:

<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
	  ----
	  ---
  </Header>
  
  <Document>
  ---
  ---
  ---
  </Document>
   <Document>
  ---
  ---
  ---
  </Document>
   <Document>
  ---
  ---
  ---
  </Document>
  
 <Footer>
---
-- 
</Footer>

Parsing XML isn't trivial, but we get asked for it all the time, so:

# yanx.awk v0.0.8, Tyler Montbriand, 2017.  Yet another noncompliant XML parser
###############################################################################
# XML is a pain to process in the shell, but people need it all the time.
# I've been using and improving this kludge since 2014 or so.  It parses and
# stacks tags and digests parameters, allowing simple XML processing and
# extraction to be managed with a handful of lines addendum.
#
# I've restricted my use of GNU features enough that this script will run on
# busybox's awk.  I think it works with mawk except -e is unsupported.
# You can work around that by running multiple files, i.e.
# mawk -f yanx.awk -f mystuff.awk inputfile
###############################################################################
# Basic use:
#
# Fed this XML, <body><html a="b">Your Web Browser Hates This</html></body>
# yanx will read it token-by-token as so:
#     Line 1:  Empty, skipped
#     Line 2:  $1="body"
#     Line 3:  $1="html a="b"", $2="Your Web Browser Hates This"
#     Line 4:  $1="/html"
#     Line 5:  $1="/body", $2="\n"
#
# The script sets a few new "special" variables along the way.
# TAG           The name of the current tag, uppercased.
# CTAG          If close-tag, name in uppercase.
# TAGS          List of nested tags, like HTML%BODY%, including current tag
# LTAGS         List of nested tags, not including current tag
# ARGS          Array of tag parameters, uppercased.  i.e. ARGS["HREF"]
# DEP           How many tags deep it's nested, including current tag.
#
###############################################################################
# Examples:
# # Rewrite cdata of all divs
# awk -f yanx.awk -e 'TAGS ~ /^DIV%/ { $2="quux froob" } 1' input
# # Extract href's from every link
# awk -f yanx.awk -e 'TAGS~/^A%/ && ("HREF" in ARGS) {
#       print ARGS["HREF"] }' ORS="\n" input
###############################################################################
# Known Bugs:
# A short XML script can't possibly handle DOD, etc.  Entities a la &lt;
# are not translated either.
#
# I've done my best to make it swallow <!--, <? ?> and other such fancy
# XML syntax without choking, but that doesn't mean it handles them
# properly either.
#
# It's an XML parser, not an HTML parser.  It probably won't swallow a
# wild-from-the internet HTML web page without some cleanup first:
# javascript, tags inside comments, etc will be mangled instead of ignored.
#
# Last: Because of its design, when printing raw HTML, yanx adds an extra <
# to the end of the file.  This is because < belongs at the beginning of
# a token but awk is told it's printed at the end.  There is no equivalent
# "line prefix" variable that I know of, if you want it to print smarter
# you'll have to print the <'s yourself, by setting ORS="" and
# printing lines like print "<" $0
###############################################################################
BEGIN {
        FS=">"; OFS=">";
        RS="<"; ORS="<"
}

# After match("qwertyuiop", /rty/)
#       rbefore("qwertyuiop") is "qwe",
#       rmid("qwertyuiop")    is "r"
#       rall("qwertyuiop")    is "rty"
#       rafter("qwertyuiop")  is "uiop"

# substr() counts from 1.  A start of 0 is not portable: some awks treat it
# as 1, others return one character fewer, so start at 1 explicitly.
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
(!SPEC) && match($1, /^[^\/ \r\n\t>]+/) {
        CTAG=""
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS, "", "", "");
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
        CTAG=toupper($1)
        TAG=""
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        # Update TAG with tag on top of stack, if any
#       if(DEP < 0) {   DEP=0;  TAG=""  }
#       else { TAG=TA[DEP]; }
}
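
The core trick in yanx.awk is its BEGIN block: with RS="<" and FS=">", awk reads one tag plus the text that follows it as a single record. Here is a stripped-down sketch of just that idea, run on the example string from the header comment. It is not yanx.awk itself and does none of its tag stacking or attribute parsing.

```shell
tokens=$(
    printf '%s' '<body><html a="b">Your Web Browser Hates This</html></body>' |
    awk 'BEGIN { RS = "<"; FS = ">" }
         # Record 1 is the empty text before the first "<", so skip it.
         NR > 1 { printf "%d: tag=[%s] text=[%s]\n", NR, $1, $2 }'
)
printf '%s\n' "$tokens"
```

Each printed line corresponds to one of the "Line N" entries in the comment block: $1 holds the tag with its attributes, $2 the text up to the next "<".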

You can use it with this:

# xmlsplit.awk
BEGIN {
        ORS=""
        X="x."
        ROWS=5
}

# First pass, remember headers and footers
NR==FNR {
        if(F || TAG == "FOOTER")
        {
                if(!F) {
                        FTRSTART=FNR
                        F=1
                }
                FTR=FTR "<" $1 OFS $2
        }
        else if((!H) && (TAG == "DOCUMENT"))
        {
                HDREND=FNR
                H=1
        }
        else if(!H)     HDR=HDR "<" $1 OFS $2
        next
}

# Skip header and footer
(FNR < HDREND) || (FNR >= FTRSTART) { next }

# Close output file once enough DOCUMENT records
((XNR%(ROWS+1)) == 0) {
        if(FILE) {
                print FTR > FILE
                close(FILE);
        }
        FILE=sprintf("%s%04d", X,++FILENUM);
        print HDR > FILE
        XNR++
}

{       print "<" $0 > FILE     }

CTAG == "DOCUMENT" { XNR++ }

END {   if(FILE) print FTR > FILE }

Like this:

# Yes, it's fed inputfile twice
awk -f yanx.awk -f xmlsplit.awk X="x." ROWS="10" input input

With this input:

<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
          ----
          ---
  </Header>

<Document>001</Document>
<Document>002</Document>
<Document>003</Document>
<Document>004</Document>
<Document>005</Document>
<Document>006</Document>
<Document>007</Document>
<Document>008</Document>
<Document>009</Document>
<Document>010</Document>
<Document>011</Document>
<Document>012</Document>
<Document>013</Document>
<Document>014</Document>
<Document>015</Document>
<Document>016</Document>
<Document>017</Document>
<Document>018</Document>
<Document>019</Document>
<Document>020</Document>
<Document>021</Document>
<Document>022</Document>
<Document>023</Document>
<Document>024</Document>
<Document>025</Document>
<Document>026</Document>
<Document>027</Document>
<Document>028</Document>
<Document>029</Document>
<Document>030</Document>
<Document>031</Document>
<Document>032</Document>
<Document>033</Document>
<Document>034</Document>
<Document>035</Document>
<Document>036</Document>
<Document>037</Document>
<Document>038</Document>
<Document>039</Document>
<Document>040</Document>
<Document>041</Document>
<Document>042</Document>
<Document>043</Document>
<Document>044</Document>
<Document>045</Document>
<Document>046</Document>
<Document>047</Document>
<Document>048</Document>
<Document>049</Document>
<Document>050</Document>
<Document>051</Document>
<Document>052</Document>
<Document>053</Document>
<Document>054</Document>
<Document>055</Document>
<Document>056</Document>
<Document>057</Document>
<Document>058</Document>
<Document>059</Document>
<Document>060</Document>
<Document>061</Document>
<Document>062</Document>
<Document>063</Document>
<Document>064</Document>
<Document>065</Document>
<Document>066</Document>
<Document>067</Document>
<Document>068</Document>
<Document>069</Document>
<Document>070</Document>
<Document>071</Document>
<Document>072</Document>
<Document>073</Document>
<Document>074</Document>
<Document>075</Document>
<Document>076</Document>
<Document>077</Document>
<Document>078</Document>
<Document>079</Document>
<Document>080</Document>
<Document>081</Document>
<Document>082</Document>
<Document>083</Document>
<Document>084</Document>
<Document>085</Document>
<Document>086</Document>
<Document>087</Document>
<Document>088</Document>
<Document>089</Document>
<Document>090</Document>
<Document>091</Document>
<Document>092</Document>
<Document>093</Document>
<Document>094</Document>
<Document>095</Document>
<Document>096</Document>
<Document>097</Document>
<Document>098</Document>
<Document>099</Document>
<Document>100</Document>

 <Footer>
---
--
</Footer>

To produce output like this:

$ cat x.0001

<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
          ----
          ---
  </Header>

<Document>001</Document>
<Document>002</Document>
<Document>003</Document>
<Document>004</Document>
<Document>005</Document>
<Document>006</Document>
<Document>007</Document>
<Document>008</Document>
<Document>009</Document>
<Document>010</Document>
<Footer>
---
--
</Footer>

$ cat x.0010

<?xml version="1.0" encoding="UTF-8"?><Recipient>
  <Header>
    <tag1></tag1>
    <tag2>1212233</tag2>
      --
          ----
          ---
  </Header>

<Document>091</Document>
<Document>092</Document>
<Document>093</Document>
<Document>094</Document>
<Document>095</Document>
<Document>096</Document>
<Document>097</Document>
<Document>098</Document>
<Document>099</Document>
<Document>100</Document>

 <Footer>
---
--
</Footer>

$
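
The command above names the input file twice on purpose. On the first pass NR==FNR is true, so xmlsplit.awk only collects the header and footer. On the second pass FNR has restarted at 1 while NR keeps counting, so that rule no longer fires and the actual splitting happens. A minimal sketch of the same idiom on a hypothetical three-line file (the file name and contents are made up for illustration):

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'a\nb\nc\n' > demo.txt

# Pass 1 counts the lines; pass 2 prints each line with the total.
result=$(awk '
    NR == FNR { total++; next }      # first time through the file
    { print FNR "/" total ": " $0 }  # second time through the file
' demo.txt demo.txt)
printf '%s\n' "$result"
```

The second pass can use a value (here the total) that is only known once the whole file has been read, which is how the footer at the end of the input can be written into every output file.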

Hi Corona,

Thanks for your quick response with code.
Do I need to install any XML splitter libraries on the Unix box? You have provided two large scripts; which one do I need to use?

I am new to scripting, so kindly assist with the above.

--- Post updated at 11:27 PM ---

I have created two files, yanx.awk and xml_split.awk, and ran the command below. It exited without creating any files. Please tell me where the output file path is specified.

awk -f yanx.awk -f xmlsplit.awk X="x." ROWS="10" input input

--- Post updated 12-14-18 at 06:14 AM ---

Hi Corona,

Please assist with the error below. Some files split fine, but for others I get this error:

awk: xmlsplit.awk:37: (FILENAME=sampletest11.xml FNR=4) fatal: can't redirect to `/0001' (Permission denied)

It creates them in the current directory. If you want it to put them somewhere else, set the value of X.

awk -f yanx.awk -f xmlsplit.awk ROWS="10" X="/path/to/folder/outputname" input input

Please show exactly what you're doing, word for word, letter for letter, keystroke for keystroke. What you have posted is obviously not what you're doing, the filenames differ.

Use nawk on Solaris.

Hi Corona,

Below are the steps I am following. sampletest11.xml is my sample file. Its XML nodes differ slightly: the body node is "Recipient" and party_ID is my footer, so I changed xml_split.awk accordingly.

The script worked fine with 200 records, but with the expected full file of 18k records it throws the error below:

awk: xmlsplit.awk:37: (FILENAME=sampletest11.xml FNR=4) fatal: can't redirect to `/0001' (Permission denied)

Sample xml skeleton:

<?xml version="1.0" encoding="UTF-8"?>
<DocumentSet>
    <Recipient>
        <Context>
            <TESTER>08</TESTER>
            <name>TEST</name>
            <Locale>en_AU</Locale>
            <Channel>kjsdhfuis</Channel>
            <UserId>8</UserId>
            <HLX>000000</HLX>
            <Key1>TEST1</Key1>
            <Key2>TEST2</Key2>
            <Key3>TEST3</Key3>
            <KeyID>hotdirectorytest</KeyID>
            <dummy2222>TEST7</dummy2222>
            <EffectiveFrom>20170612000000</EffectiveFrom>
            <Currency>AUD</Currency>
        </Context>
        <Document>
            <Form>
                <Name>TESTER2</Name>
                <Data>
                    <DocumentSetC>
                        <HeaderData>
                            <TESTER>08</TESTER>
                            <Channel>kjsdhfuis</Channel>
                            <UserId>X009189</UserId>
                            <HLX>000000</HLX>
                            <dummy>08VIC000000</dummy>
                            <Key1>TEST2</Key1>
                            <Key2>TEST3</Key2>
                            <Key3/>
                            <KeyID>TEST70</KeyID>
                            <dummy2222>Approval Letter</dummy2222>
                            <TEST7>APPA08120617206891</TEST7>
                            <EffectiveFrom>20170612000000</EffectiveFrom>
                            <HLX44>12345</HLX44>
                            <SystemDate>20170612</SystemDate>
                        </HeaderData>
                        <FormData>
                            <Name>TESTER2</Name>
                            <Context>
                                <UniqueDocID>1240525</UniqueDocID>
                                <dummy11112233>LEN_APP_0010_OUT</dummy11112233>
                                <TEST2ApprovedAmount>8989</TEST2ApprovedAmount>
                            </Context>
                            <ReceivingParty>
                                <Applicant>
                                    <TEST45456>sfdsfnsdfnff  </TEST45456>
                                </Applicant>
                                <IndividualDemographics>
                                
                                </IndividualDemographics>
                                <DeliveryChannel>POST</DeliveryChannel>
                                <NoOfCopies>1</NoOfCopies>
                            </ReceivingParty>
                            <Application>
                                <ProductGroups>
                            <TEST454567>sfdsfnsdfnff  </TEST454567>

                                </ProductGroups>
                            </Application>
                        </FormData>
                    </DocumentSetC>
                </Data>
            </Form>
            <TYP1>5</TYP1>
        </Document>
    </Recipient>
       <Recipient2> ---</Recipient2>
           ---------------
           -------------- 
           -----------------
            -----------------
          <Recipient18000> ---</Recipient18000>
    <PartyID>12345</PartyID>
 </DocumentSet>

Command :

$ awk -f yanx.awk -f xmlsplit.awk X="x." ROWS="10" sampletest11.xml sampletest11.xml

--- Post updated at 11:44 PM ---

Kindly assist with the above. Is the issue caused by the file size or by the number of records in the file?

X might not be the wisest variable name for the output files' path: yanx.awk also uses X (conditionally) as the index of a for loop, so the value you pass in may be overwritten. That happens IF the input file contains XML specification info, which might be why it works on a test file that lacks it.

Try again but replace the X variable name with another, e.g. FP (for "file path"), in xmlsplit.awk and on the command line, NOT in yanx.awk.
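
The overwrite is easy to reproduce in isolation. Reading the posted yanx.awk by hand, a self-closing tag such as <Key3/> appears to leave an ARGS entry keyed "/", and the next open tag then runs for(X in ARGS), which would set X to "/" and account for the `/0001' in the error message. That is an inference from the code, not something confirmed in this thread. The sketch below imitates only that one step:

```shell
clobber=$(
    printf 'first\nsecond\n' |
    awk '
        { printf "record %d sees X=[%s]\n", NR, X }
        # Stand-in for the loop in yanx.awk: the loop variable is a global,
        # so it replaces whatever X was set to on the command line.
        NR == 1 { ARGS["/"] = ""; for (X in ARGS) delete ARGS[X] }
    ' X="x." -
)
printf '%s\n' "$clobber"
```

The first record still sees the command-line value; every later record sees the leftover loop index. Any name that yanx.awk does not touch avoids the clash.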

Thanks for showing what input you actually have. What output do you actually want?

Code modified to rudic's suggestions:

BEGIN {
	ORS=""
	OUT="x."
	ROWS=5
	ROWTAG="DOCUMENT"
	FTRTAG="FOOTER"
}

# First pass, remember headers and footers
NR==FNR {
	if(F || TAG == FTRTAG)
	{
		if(!F) {
			FTRSTART=FNR
			F=1
		}
		FTR=FTR RS $1 OFS $2
	}
	else if((!H) && (TAG == ROWTAG))
	{
		HDREND=FNR
		H=1
	}
	else if(!H)	HDR=HDR RS $1 OFS $2
	next
}

# Skip header and footer
(FNR < HDREND) || (FNR >= FTRSTART) { next }

# Close output file once enough DOCUMENT records
((XNR%(ROWS+1)) == 0) {
	if(FILE) {
		print FTR > FILE
		close(FILE);
	}
	FILE=sprintf("%s%04d", OUT,++FILENUM);
	print HDR > FILE
	XNR++
}

{	print RS $0 > FILE	}

CTAG == ROWTAG { XNR++ }

END {	if(FILE) print FTR > FILE }

...but it won't work until I know what tags you're actually using for header and footer. Modify ROWTAG and FTRTAG accordingly.

Thanks, Rudic, for the input. After your suggestion Corona updated the code and it worked. I need a small change to it because my footer is different; I will update the thread with the XML input and how the output should look.

--- Post updated at 11:08 PM ---

Hi Corona,

Thank you so much, it worked with your updated code and I am able to split the large file into multiple chunks. I need a small change in the output because my footer is different now. Kindly assist with the below.

The first 2 lines are the header:

<?xml version="1.0" encoding="UTF-8"?>
<DocumentSet>

The last line, at the end of the file, is the footer:
---Footer

 </DocumentSet>

Input :

Header

<?xml version="1.0" encoding="UTF-8"?>
<DocumentSet>

---Body 
    <Recipient>
        <Context>
            <TESTER>08</TESTER>
            <name>TEST</name>
            <Locale>en_AU</Locale>
            <Channel>kjsdhfuis</Channel>
            <UserId>8</UserId>
            <HLX>000000</HLX>
            <Key1>TEST1</Key1>
            <Key2>TEST2</Key2>
            <Key3>TEST3</Key3>
            <KeyID>hotdirectorytest</KeyID>
            <dummy2222>TEST7</dummy2222>
            <EffectiveFrom>20170612000000</EffectiveFrom>
            <Currency>AUD</Currency>
        </Context>
        <Document>
            <Form>
                <Name>TESTER2</Name>
                <Data>
                    <DocumentSetC>
                        <HeaderData>
                            <TESTER>08</TESTER>
                            <Channel>kjsdhfuis</Channel>
                            <UserId>X009189</UserId>
                            <HLX>000000</HLX>
                            <dummy>08VIC000000</dummy>
                            <Key1>TEST2</Key1>
                            <Key2>TEST3</Key2>
                            <Key3/>
                            <KeyID>TEST70</KeyID>
                            <dummy2222>Approval Letter</dummy2222>
                            <TEST7>APPA08120617206891</TEST7>
                            <EffectiveFrom>20170612000000</EffectiveFrom>
                            <HLX44>12345</HLX44>
                            <SystemDate>20170612</SystemDate>
                        </HeaderData>
                        <FormData>
                            <Name>TESTER2</Name>
                            <Context>
                                <UniqueDocID>1240525</UniqueDocID>
                                <dummy11112233>LEN_APP_0010_OUT</dummy11112233>
                                <TEST2ApprovedAmount>8989</TEST2ApprovedAmount>
                            </Context>
                            <ReceivingParty>
                                <Applicant>
                                    <TEST45456>sfdsfnsdfnff  </TEST45456>
                                </Applicant>
                                <IndividualDemographics>
                                
                                </IndividualDemographics>
                                <DeliveryChannel>POST</DeliveryChannel>
                                <NoOfCopies>1</NoOfCopies>
                            </ReceivingParty>
                            <Application>
                                <ProductGroups>
                            <TEST454567>sfdsfnsdfnff  </TEST454567>

                                </ProductGroups>
                            </Application>
                        </FormData>
                    </DocumentSetC>
                </Data>
            </Form>
            <TYP1>5</TYP1>
        </Document>
    </Recipient>
       <Recipient2> ---</Recipient2>
           ---------------
           -------------- 
           -----------------
            -----------------
          <Recipient18000> ---</Recipient18000>
    
---Footer
 </DocumentSet>



Output:
Below is the output I am expecting. It is an example of one file; every file should have the same header and footer.
File1:
<?xml version="1.0" encoding="UTF-8"?>
<DocumentSet>
<Recipient1>  </Recipient1>
<Recipient2>  </Recipient2>
<Recipient3>  </Recipient3>
-------------------
-------------------
-------------------
<Recipient100>  </Recipient100>
</DocumentSet>

That is not a small change. I will have to completely rewrite it.

Do you truly want all the data stripped out of your recipient tags? Really? Show representative output.

xmlsplit2.awk

BEGIN {
        ORS=""
        OUT="x."
        ROWS=5
        ROWTAG="^RECIPIENT[0-9]*$"
        HDRTAG="^DOCUMENTSET$"
        FTRTAG="^DOCUMENTSET$"
}

# First pass, remember headers and footers
NR==FNR {
        if(!HDREND)
        {
                HDR=HDR RS $1 OFS $2
                if(TAG ~ HDRTAG) HDREND=FNR
                next
        }

        if(FTRSTART || (CTAG ~ FTRTAG))
        {
                FTR=FTR RS $1 OFS $2
                if(CTAG ~ FTRTAG) FTRSTART=FNR
        }

        next
}

# Skip header and footer
(FNR <= HDREND) || (FNR >= FTRSTART) { next }

# Close output file once enough DOCUMENT records
((XNR%(ROWS+1)) == 0) {
#       printf("FNR==%d XNR==%d FILE=%s\n", FNR, XNR, FILE)>"/dev/stderr"
        if(FILE) {
                print FTR > FILE
                close(FILE);
        }

        FILE=sprintf("%s%04d", OUT,++FILENUM);
        print HDR > FILE
        XNR++
}

{       print RS $0 > FILE      }

CTAG ~ ROWTAG { XNR++ }

END {   if(FILE) print FTR > FILE       }

input3

<?xml version="1.0" encoding="UTF-8"?>
<DocumentSet>
    <Recipient><Context></Context><Document></Document></Recipient>
    <Recipient2><Context></Context><Document></Document></Recipient2>
    <Recipient3><Context></Context><Document></Document></Recipient3>
    <Recipient4><Context></Context><Document></Document></Recipient4>
    <Recipient5><Context></Context><Document></Document></Recipient5>
    <Recipient6><Context></Context><Document></Document></Recipient6>
    <Recipient7><Context></Context><Document></Document></Recipient7>
    <Recipient8><Context></Context><Document></Document></Recipient8>
    <Recipient9><Context></Context><Document></Document></Recipient9>
    <Recipient10><Context></Context><Document></Document></Recipient10>
    <Recipient11><Context></Context><Document></Document></Recipient11>
    <Recipient12><Context></Context><Document></Document></Recipient12>
    <Recipient13><Context></Context><Document></Document></Recipient13>
    <Recipient14><Context></Context><Document></Document></Recipient14>
    <Recipient15><Context></Context><Document></Document></Recipient15>
    <Recipient16><Context></Context><Document></Document></Recipient16>
    <Recipient17><Context></Context><Document></Document></Recipient17>
    <Recipient18><Context></Context><Document></Document></Recipient18>
    <Recipient19><Context></Context><Document></Document></Recipient19>
    <Recipient20><Context></Context><Document></Document></Recipient20>
    <Recipient21><Context></Context><Document></Document></Recipient21>
    <Recipient22><Context></Context><Document></Document></Recipient22>
    <Recipient23><Context></Context><Document></Document></Recipient23>
    <Recipient24><Context></Context><Document></Document></Recipient24>
    <Recipient25><Context></Context><Document></Document></Recipient25>
    <Recipient26><Context></Context><Document></Document></Recipient26>
    <Recipient27><Context></Context><Document></Document></Recipient27>
    <Recipient28><Context></Context><Document></Document></Recipient28>
    <Recipient29><Context></Context><Document></Document></Recipient29>
    <Recipient30><Context></Context><Document></Document></Recipient30>
    <Recipient31><Context></Context><Document></Document></Recipient31>
    <Recipient32><Context></Context><Document></Document></Recipient32>
    <Recipient33><Context></Context><Document></Document></Recipient33>
    <Recipient34><Context></Context><Document></Document></Recipient34>
    <Recipient35><Context></Context><Document></Document></Recipient35>
    <Recipient36><Context></Context><Document></Document></Recipient36>
    <Recipient37><Context></Context><Document></Document></Recipient37>
    <Recipient38><Context></Context><Document></Document></Recipient38>
    <Recipient39><Context></Context><Document></Document></Recipient39>
    <Recipient40><Context></Context><Document></Document></Recipient40>
    <Recipient41><Context></Context><Document></Document></Recipient41>
    <Recipient42><Context></Context><Document></Document></Recipient42>
    <Recipient43><Context></Context><Document></Document></Recipient43>
    <Recipient44><Context></Context><Document></Document></Recipient44>
    <Recipient45><Context></Context><Document></Document></Recipient45>
    <Recipient46><Context></Context><Document></Document></Recipient46>
    <Recipient47><Context></Context><Document></Document></Recipient47>
    <Recipient48><Context></Context><Document></Document></Recipient48>
    <Recipient49><Context></Context><Document></Document></Recipient49>
    <Recipient50><Context></Context><Document></Document></Recipient50>
    <Recipient51><Context></Context><Document></Document></Recipient51>
    <Recipient52><Context></Context><Document></Document></Recipient52>
    <Recipient53><Context></Context><Document></Document></Recipient53>
    <Recipient54><Context></Context><Document></Document></Recipient54>
    <Recipient55><Context></Context><Document></Document></Recipient55>
    <Recipient56><Context></Context><Document></Document></Recipient56>
    <Recipient57><Context></Context><Document></Document></Recipient57>
    <Recipient58><Context></Context><Document></Document></Recipient58>
    <Recipient59><Context></Context><Document></Document></Recipient59>
    <Recipient60><Context></Context><Document></Document></Recipient60>
    <Recipient61><Context></Context><Document></Document></Recipient61>
    <Recipient62><Context></Context><Document></Document></Recipient62>
    <Recipient63><Context></Context><Document></Document></Recipient63>
    <Recipient64><Context></Context><Document></Document></Recipient64>
    <Recipient65><Context></Context><Document></Document></Recipient65>
    <Recipient66><Context></Context><Document></Document></Recipient66>
    <Recipient67><Context></Context><Document></Document></Recipient67>
    <Recipient68><Context></Context><Document></Document></Recipient68>
    <Recipient69><Context></Context><Document></Document></Recipient69>
    <Recipient70><Context></Context><Document></Document></Recipient70>
    <Recipient71><Context></Context><Document></Document></Recipient71>
    <Recipient72><Context></Context><Document></Document></Recipient72>
    <Recipient73><Context></Context><Document></Document></Recipient73>
    <Recipient74><Context></Context><Document></Document></Recipient74>
    <Recipient75><Context></Context><Document></Document></Recipient75>
    <Recipient76><Context></Context><Document></Document></Recipient76>
    <Recipient77><Context></Context><Document></Document></Recipient77>
    <Recipient78><Context></Context><Document></Document></Recipient78>
    <Recipient79><Context></Context><Document></Document></Recipient79>
    <Recipient80><Context></Context><Document></Document></Recipient80>
    <Recipient81><Context></Context><Document></Document></Recipient81>
    <Recipient82><Context></Context><Document></Document></Recipient82>
    <Recipient83><Context></Context><Document></Document></Recipient83>
    <Recipient84><Context></Context><Document></Document></Recipient84>
    <Recipient85><Context></Context><Document></Document></Recipient85>
    <Recipient86><Context></Context><Document></Document></Recipient86>
    <Recipient87><Context></Context><Document></Document></Recipient87>
    <Recipient88><Context></Context><Document></Document></Recipient88>
    <Recipient89><Context></Context><Document></Document></Recipient89>
    <Recipient90><Context></Context><Document></Document></Recipient90>
    <Recipient91><Context></Context><Document></Document></Recipient91>
    <Recipient92><Context></Context><Document></Document></Recipient92>
    <Recipient93><Context></Context><Document></Document></Recipient93>
    <Recipient94><Context></Context><Document></Document></Recipient94>
    <Recipient95><Context></Context><Document></Document></Recipient95>
    <Recipient96><Context></Context><Document></Document></Recipient96>
    <Recipient97><Context></Context><Document></Document></Recipient97>
    <Recipient98><Context></Context><Document></Document></Recipient98>
    <Recipient99><Context></Context><Document></Document></Recipient99>
    <Recipient100><Context></Context><Document></Document></Recipient100>
</DocumentSet>

awk -f yanx.awk -f xmlsplit2.awk ROWS=10 input3 input3

x.0001, etc

<?xml version="1.0" encoding="UTF-8"?>
<DocumentSet>
    <Recipient><Context></Context><Document></Document></Recipient>
    <Recipient2><Context></Context><Document></Document></Recipient2>
    <Recipient3><Context></Context><Document></Document></Recipient3>
    <Recipient4><Context></Context><Document></Document></Recipient4>
    <Recipient5><Context></Context><Document></Document></Recipient5>
    <Recipient6><Context></Context><Document></Document></Recipient6>
    <Recipient7><Context></Context><Document></Document></Recipient7>
    <Recipient8><Context></Context><Document></Document></Recipient8>
    <Recipient9><Context></Context><Document></Document></Recipient9>
    <Recipient10><Context></Context><Document></Document></Recipient10>
    </DocumentSet>
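
Why the scripts test XNR%(ROWS+1) rather than XNR%ROWS: XNR is bumped once when an output file is opened and once more for every closing row tag, so each full file uses up ROWS+1 counts. Below is a small simulation of just that counter. TOTAL and the printed summary are assumptions for the demo, and no XML is read.

```shell
summary=$(awk -v ROWS=10 -v TOTAL=25 '
BEGIN {
    for (row = 1; row <= TOTAL; row++) {
        if ((XNR % (ROWS + 1)) == 0) {  # same test the split scripts use
            FILENUM++                   # "open" the next output file
            XNR++                       # one count for opening it
        }
        count[FILENUM]++
        XNR++                           # one count per completed row
    }
    for (i = 1; i <= FILENUM; i++)
        printf "file %d: %d rows\n", i, count[i]
}')
printf '%s\n' "$summary"
```

With 25 rows and ROWS=10 the counter yields two full files and a shorter last one, matching the way the last x.NNNN file simply holds whatever rows remain.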

Thanks a lot for your help. It worked with the latest code, and that was my expected output.

Hello Corona,

Happy New Year !!

I need one small input on the same requirement. In the script below I am looping through the input files
and passing each one to the split command.

The issue is that every iteration starts its output names again at x.0001, so the existing x.0001 file gets replaced. Is there a way
to pass a variable as the output file name (X="x."), or can I move the files before the second iteration? Kindly assist.

# Add all Input files to array
FileList=($(ls | grep "sampletest\\.[0-9]"))

#loop array for Input files

for x in "${FileList[@]}"
do
 #for each element in array
 
   echo "$x"

#File Split Begin

awk -f xml_String_split.awk -f xml_split.awk X="x." ROWS="400" $x $x
done

for f in x.*; do mv "$f" "${f/x/Extrfile}.xml";
done
# add all files to array
arr=($(ls | grep "Extrfile\\.[0-9]"))

Thanks .

Firstly, I'm assuming you are using Corona688's code from post #10.

You don't need to specify X on the command line for this version (OUT= was set in the BEGIN block instead).

If you change the code as follows (the changes were highlighted in red in the original post):

BEGIN {
        ORS=""
        # OUT="x."
        ROWS=5
        ROWTAG="^RECIPIENT[0-9]*$"
        HDRTAG="^DOCUMENTSET$"
        FTRTAG="^DOCUMENTSET$"
}

# First pass, remember headers and footers
NR==FNR {
        if(!HDREND)
        {
                HDR=HDR RS $1 OFS $2
                if(TAG ~ HDRTAG) HDREND=FNR
                next
        }

        if(FTRSTART || (CTAG ~ FTRTAG))
        {
                FTR=FTR RS $1 OFS $2
                if(CTAG ~ FTRTAG) FTRSTART=FNR
        }

        next
}

# Skip header and footer
(FNR <= HDREND) || (FNR >= FTRSTART) { next }

# Close output file once enough DOCUMENT records
((XNR%(ROWS+1)) == 0) {
#       printf("FNR==%d XNR==%d FILE=%s\n", FNR, XNR, FILE)>"/dev/stderr"
        if(!length(OUT)) FBASE=FILENAME "."
                else FBASE = OUT "."
        if(FILE) {
                print FTR > FILE
                close(FILE);
        }

        FILE=sprintf("%s%04d", FBASE,++FILENUM);
        print HDR > FILE
        XNR++
}

{       print RS $0 > FILE      }

CTAG ~ ROWTAG { XNR++ }

END {   if(FILE) print FTR > FILE       }

This will create files named after your XML file followed by a .nnnn file number, or you can specify a base name on the command line, e.g.:

awk -f xml_String_split.awk -f xml_split.awk OUT=$x"_split" ROWS="400" $x $x
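
To see why the pieces from different inputs no longer overwrite each other, here is a small sketch of the naming rule in shell. It only mirrors the sprintf("%s%04d", FBASE, ++FILENUM) line from the awk code above, where FBASE is OUT followed by a dot. piece_name is a hypothetical helper, and the input names sampletest.1 and sampletest.2 are assumed examples, not taken from the real data.

```shell
# Hypothetical helper: build the name of piece number $2 for output base $1,
# the same way the awk code does with sprintf("%s%04d", FBASE, ++FILENUM).
piece_name() {
    printf '%s.%04d\n' "$1" "$2"
}

# Assumed example inputs; each one gets its own OUT base, so names never clash.
for x in sampletest.1 sampletest.2; do
    for n in 1 2; do
        piece_name "${x}_split" "$n"
    done
done
```

Because the base name changes with every input file, the counter can restart at 0001 for each input without replacing earlier pieces.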

Thanks a lot Chubler it worked .

Hello All.

$x =sampletest_110.xml
sampletest_111.xml

Command :

awk -f xml_String_split.awk -f xml_split.awk OUT=$x"" ROWS="500" $x $x

For the above, it generates the file names below:

sampletest_110.xml.0001
sampletest_111.xml.0001 

I then used the command below to rename the files:

for f in sampletest_*; do mv "$f" "${f/sampletest_/Extrfile}.xml";

Output file names:

Extrfile110.xml.0001.xml
Extrfile111.xml.0001.xml

I am expecting the file names to be as below. Please help me to achieve this:

Extrfile110.xml
Extrfile111.xml

----

I've got some difficulties understanding what you are trying to do, so maybe just some comments:

  • If you're trying to assign two lines to a variable with
        $x =sampletest_110.xml
        sampletest_111.xml
    you'll have a) a syntax error, b) a logical error and (probably) c) two "command not found" errors. Remove the $ sign and the space, and quote the entire string.
  • awk -f xml_String_split.awk -f xml_split.awk OUT=$x"" ROWS="500" $x $x
    is a very strange construct. If $x really holds two file names, the unquoted expansions are split into separate words: OUT= receives only the first name, and awk is handed several file operands in a single run rather than the same file twice. Is that what you want?
  • for f in sampletest_*; do mv "$f" "${f/sampletest_/Extrfile}.xml";
    (A done is missing here!) You replace the "sampletest_" string with "Extrfile" and append ".xml". The result is exactly that. Your desired target can be achieved in multiple steps, like
        for f in sampletest_*
        do TMP="${f/sampletest_/Extrfile}"
           echo mv "$f" "${TMP%.*}"
        done
    (remove the echo once the printed commands look right).
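
As a rough illustration of the two-step rename, here is a dry-run sketch that uses only POSIX parameter expansion (${var#prefix} and ${var%suffix}) in place of the bash-only ${f/…/…} substitution. new_name is a hypothetical helper, and the file names are the two examples from the question.

```shell
# Hypothetical helper: map a split piece such as sampletest_110.xml.0001
# to the wanted target name Extrfile110.xml.
new_name() {
    tmp="Extrfile${1#sampletest_}"   # strip leading "sampletest_", prepend "Extrfile"
    printf '%s\n' "${tmp%.*}"        # drop the last ".suffix" (the .0001 piece number)
}

# Dry run over the example names: echo prints the mv commands without running them.
for f in sampletest_110.xml.0001 sampletest_111.xml.0001; do
    echo mv "$f" "$(new_name "$f")"
done
```

Note that every piece of the same input maps to the same target name, so this only suits inputs that produce a single piece.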

In addition to what RudiC has already said, note that asking us to tell you why the command:

awk -f xml_String_split.awk -f xml_split.awk OUT=$x"" ROWS="500" $x $x

doesn't work without showing us your code contained in the files xml_split.awk and xml_String_split.awk is rather difficult.

My crystal ball isn't working well enough to spot your problems in these files this morning.


Thanks Rudic it worked.

--- Post updated at 11:06 PM ---

Hi Rudic,

I need one more bit of help. I have an XML file in which job_id repeats: every response I get from the service generates one job_id, and I store the responses in a response.xml file.
My requirement is to pick out the job_id values, store them in a variable separated by commas in the format below, and pass that variable's value to the DB as input.

Required output:

30537,30538,30539 

XML :

JobUnique_Id><ns11:Job_Id>30537</ns11:Job_Id></ns6:JobResponse>
             <ns11:Job_Id>30538</ns11:Job_Id
             <ns11:Job_Id>30539</ns11:Job_Id

--- Post updated at 11:08 PM ---

RudiC's reply helped me. Sorry for not describing the problem in detail.

My generic XML script includes instructions and examples for extracting data. What have you tried?

Hi Corona,

I have tried the below. I can find the job_id values, but when I try to join them with a comma separator, NEW_VAR comes out blank. Kindly assist.

JobId =$(cat Response.xml | awk -F"Job_Id>" '{print $2}' | awk -F"<" '{print $1}')
echo $JobId

JOB_ID output here is :12345
23415

NEW_VAR=$(echo $JobId | sed -e "s/^/'/" | sed -e "s/$/'/" | tr '\n' ',')

Expected output for NEW_VAR: 12345,23415
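
Two things in the attempt above stand out: JobId =$(...) has a space before the =, which makes the shell try to run a command named JobId, and the unquoted echo $JobId joins the lines with spaces, so tr finds no newlines between the IDs. A minimal sketch of one way to get the comma-separated list is below. It assumes a grep that supports -o (GNU, BSD and busybox do) and that each ID is purely numeric; job_ids is a hypothetical helper, and the three sample lines are simplified from the XML in the question.

```shell
# Hypothetical helper: print every Job_Id value found on stdin
# as a single comma-separated line.
job_ids() {
    # grep -o prints each "Job_Id>digits<" match on its own line,
    # sed strips the surrounding tag text, paste -s joins the lines with commas.
    grep -o 'Job_Id>[0-9][0-9]*<' | sed 's/^Job_Id>//; s/<$//' | paste -sd, -
}

# Simplified sample input; with the real file this would be: job_ids < Response.xml
NEW_VAR=$(printf '%s\n' \
    '<ns11:Job_Id>30537</ns11:Job_Id>' \
    '<ns11:Job_Id>30538</ns11:Job_Id>' \
    '<ns11:Job_Id>30539</ns11:Job_Id>' | job_ids)
echo "$NEW_VAR"
```

The closing tags are not matched because no digits follow them, so each ID is picked up exactly once.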