Split xml file into multiple xml based on letterID

vx04 · February 5, 2016, 7:09am

Hi All,
We need to split a large xml into multiple valid xml with same header(2lines) and footer(last line) for N number of letterId.
In the example below we have first 2 lines as header and last line as footer.(They need to be in each split xml file)

Header:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

Footer:

</CustomerBatch>

I tried below command:

awk '!/\<CustomerLetter\>/ && !/\<\/CustomerLetter\>/ && !/\<element4\>/ && !/\<element5\>/ && !/\<element6\>/ {f=f"\n"$0} /\<\/CustomerLetter\>/ {print f > "CustomerLetter"++i".xml";f=""}' File.xml

This splits file for each letter without header and footer lines.

Sample:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>

The above file should be split in to for 500 letterId:

file 1:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
</CustomerBatch>

File2:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>

Thanks

RavinderSingh13 · February 5, 2016, 7:34am

Hello vx04,

Please use code tags as per forum rules, could you please try following and let me know if this helps.

awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">"  -vfooter="</CustomerBatch>" 'BEGIN{file=1;} /<\/CustomerLetter>/{;print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";file++;line="";next} !/<?xml version="1.0"/ && !/<CustomerBatch batchId/{line=line?line ORS $0:$0}'  Input_file

Following will be the 2 output files as per given sample output.

cat 2.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>

cat 1.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
</CustomerBatch>

EDIT: Adding a non-one liner form of solution of same.

awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">"  -vfooter="</CustomerBatch>" 
                                              'BEGIN{
                                                        file=1;
                                                    }
                                               /<\/CustomerLetter>/{;
                                                                        print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";
                                                                        file++;
                                                                        line=""
                                                                        next
                                                                  }
                                               !/<?xml version="1.0"/ && !/<CustomerBatch batchId/{
                                                                                                        line=line?line ORS $0:$0
                                                                                                }
                                              '    Input_file

Thanks,
R. Singh

vx04 · February 5, 2016, 8:25am

Thanks the awk worked but it split for each letter ,we need to split for 700 number of letterid

RudiC · February 5, 2016, 8:35am

There's an unescaped double quote in header2.

RavinderSingh13 · February 5, 2016, 8:36am

Hello vx04,

If I understood your requirement correctly you need to create xml files with name of CustomLetter ids as follows, let me know if you have any queries on same.

awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">"  -vfooter="</CustomerBatch>"                             
                                            'BEGIN{
                                                        file=1;
                                                    }
                                               /<\/CustomerLetter>/{;
                                                                        print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";
                                                                        line=""
                                                                        next
                                                                  }
                                              /CustomerLetter letterId/{
                                                                        gsub(/letterId=|\"/,X,$2);
                                                                        file=$2
                                                                       }
                                              !/<?xml version="1.0"/ && !/<CustomerBatch batchId/{
                                                                                                        line=line?line ORS $0:$0
                                                                                                }
                                              '    Input_file

Output files named 19334.xml and 19333.xml are as follows.

 cat 19333.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
</CustomerBatch>

AND

cat 19334.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>

Please let me know if you have any queries.
EDIT: Adding one liner solution for same now.

awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">"  -vfooter="</CustomerBatch>" 'BEGIN{file=1;} /<\/CustomerLetter>/{;print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";line="";next} /CustomerLetter letterId/{B=$2;gsub(/letterId=|\"/,X,B);file=B} !/<?xml version="1.0"/ && !/<CustomerBatch batchId/{line=line?line ORS $0:$0}'  Input_file

EDIT2: Sorry my copy-paste have issues so attaching the script in case you are not able to copy it properly.

Thanks,
R. Singh

RudiC · February 5, 2016, 9:17am

What does "we need to split for 700 number of letterid" mean? You want groups of 700 letters to be output to a single file (i.e. 3500 letter will be five files)? You want letters whose ID has "700" in them in separate files?

---------- Post updated at 15:17 ---------- Previous update was at 15:00 ----------

Would this do?

awk '
NR == 1                 {FT = $0
                         next   
                        }
NR < 4                  {HD = HD DL $0
                         DL = RS
                         next   
                        }
/letterId/              {if (FN) {print FT > FN}
                         if (!(LCNT%LN))        {FN = "file" ++FCNT ".xml"
                                                 print HD > FN
                                                }
                         LCNT++
                        }
                        {print > FN
                        }
' LN=700 <(tail -1 file1) file1