Split xml file into multiple xml based on letterID

Hi All,
We need to split a large xml into multiple valid xml with same header(2lines) and footer(last line) for N number of letterId.
In the example below we have first 2 lines as header and last line as footer.(They need to be in each split xml file)

Header:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

Footer:

</CustomerBatch>

I tried below command:

awk '!/\<CustomerLetter\>/ && !/\<\/CustomerLetter\>/ && !/\<element4\>/ && !/\<element5\>/ && !/\<element6\>/ {f=f"\n"$0} /\<\/CustomerLetter\>/ {print f > "CustomerLetter"++i".xml";f=""}' File.xml

This splits file for each letter without header and footer lines.

Sample:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>

The above file should be split in to for 500 letterId:

file 1:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
</CustomerBatch>

File2:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>

Thanks

Hello vx04,

Please use code tags as per forum rules, could you please try following and let me know if this helps.

awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">"  -vfooter="</CustomerBatch>" 'BEGIN{file=1;} /<\/CustomerLetter>/{;print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";file++;line="";next} !/<?xml version="1.0"/ && !/<CustomerBatch batchId/{line=line?line ORS $0:$0}'  Input_file

Following will be the 2 output files as per given sample output.

cat 2.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>
cat 1.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
</CustomerBatch>

EDIT: Adding a non-one liner form of solution of same.

awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">"  -vfooter="</CustomerBatch>" 
                                              'BEGIN{
                                                        file=1;
                                                    }
                                               /<\/CustomerLetter>/{;
                                                                        print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";
                                                                        file++;
                                                                        line=""
                                                                        next
                                                                  }
                                               !/<?xml version="1.0"/ && !/<CustomerBatch batchId/{
                                                                                                        line=line?line ORS $0:$0
                                                                                                }
                                              '    Input_file
 

Thanks,
R. Singh

Thanks the awk worked but it split for each letter ,we need to split for 700 number of letterid

There's an unescaped double quote in header2.

Hello vx04,

If I understood your requirement correctly you need to create xml files with name of CustomLetter ids as follows, let me know if you have any queries on same.

awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">"  -vfooter="</CustomerBatch>"                             
                                            'BEGIN{
                                                        file=1;
                                                    }
                                               /<\/CustomerLetter>/{;
                                                                        print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";
                                                                        line=""
                                                                        next
                                                                  }
                                              /CustomerLetter letterId/{
                                                                        gsub(/letterId=|\"/,X,$2);
                                                                        file=$2
                                                                       }
                                              !/<?xml version="1.0"/ && !/<CustomerBatch batchId/{
                                                                                                        line=line?line ORS $0:$0
                                                                                                }
                                              '    Input_file

Output files named 19334.xml and 19333.xml are as follows.

 cat 19333.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
</CustomerBatch>

AND

cat 19334.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>
 

Please let me know if you have any queries.
EDIT: Adding one liner solution for same now.

awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">"  -vfooter="</CustomerBatch>" 'BEGIN{file=1;} /<\/CustomerLetter>/{;print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";line="";next} /CustomerLetter letterId/{B=$2;gsub(/letterId=|\"/,X,B);file=B} !/<?xml version="1.0"/ && !/<CustomerBatch batchId/{line=line?line ORS $0:$0}'  Input_file

EDIT2: Sorry my copy-paste have issues so attaching the script in case you are not able to copy it properly.

Thanks,
R. Singh

1 Like

What does "we need to split for 700 number of letterid" mean? You want groups of 700 letters to be output to a single file (i.e. 3500 letter will be five files)? You want letters whose ID has "700" in them in separate files?

---------- Post updated at 15:17 ---------- Previous update was at 15:00 ----------

Would this do?

awk '
NR == 1                 {FT = $0
                         next   
                        }
NR < 4                  {HD = HD DL $0
                         DL = RS
                         next   
                        }
/letterId/              {if (FN) {print FT > FN}
                         if (!(LCNT%LN))        {FN = "file" ++FCNT ".xml"
                                                 print HD > FN
                                                }
                         LCNT++
                        }
                        {print > FN
                        }
' LN=700 <(tail -1 file1) file1