vx04
February 5, 2016, 7:09am
1
Hi All,
We need to split a large xml into multiple valid xml with same header(2lines) and footer(last line) for N number of letterId.
In the example below we have first 2 lines as header and last line as footer.(They need to be in each split xml file)
Header:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
Footer:
</CustomerBatch>
I tried below command:
awk '!/\<CustomerLetter\>/ && !/\<\/CustomerLetter\>/ && !/\<element4\>/ && !/\<element5\>/ && !/\<element6\>/ {f=f"\n"$0} /\<\/CustomerLetter\>/ {print f > "CustomerLetter"++i".xml";f=""}' File.xml
This splits file for each letter without header and footer lines.
Sample:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>
The above file should be split in to for 500 letterId:
file 1:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
</CustomerBatch>
File2:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>
Thanks
Hello vx04,
Please use code tags as per forum rules, could you please try following and let me know if this helps.
awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">" -vfooter="</CustomerBatch>" 'BEGIN{file=1;} /<\/CustomerLetter>/{;print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";file++;line="";next} !/<?xml version="1.0"/ && !/<CustomerBatch batchId/{line=line?line ORS $0:$0}' Input_file
Following will be the 2 output files as per given sample output.
cat 2.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>
cat 1.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
</CustomerBatch>
EDIT: Adding a non-one liner form of solution of same.
awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">" -vfooter="</CustomerBatch>"
'BEGIN{
file=1;
}
/<\/CustomerLetter>/{;
print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";
file++;
line=""
next
}
!/<?xml version="1.0"/ && !/<CustomerBatch batchId/{
line=line?line ORS $0:$0
}
' Input_file
Thanks,
R. Singh
vx04
February 5, 2016, 8:25am
3
Thanks the awk worked but it split for each letter ,we need to split for 700 number of letterid
RudiC
February 5, 2016, 8:35am
4
There's an unescaped double quote in header2.
Hello vx04,
If I understood your requirement correctly you need to create xml files with name of CustomLetter ids as follows, let me know if you have any queries on same.
awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">" -vfooter="</CustomerBatch>"
'BEGIN{
file=1;
}
/<\/CustomerLetter>/{;
print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";
line=""
next
}
/CustomerLetter letterId/{
gsub(/letterId=|\"/,X,$2);
file=$2
}
!/<?xml version="1.0"/ && !/<CustomerBatch batchId/{
line=line?line ORS $0:$0
}
' Input_file
Output files named 19334.xml
and 19333.xml
are as follows.
cat 19333.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19333" dateRequested="2016-02-04" letterCountryCd="US" letterStateCd="FL">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="aFRG@gmail.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="19334"/>
</CustomerLetter>
</CustomerBatch>
AND
cat 19334.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<CustomerBatch batchId="423433" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<CustomerLetter letterId="19334" dateRequested="2016-02-04" langRgnCd="EN-US" letterDate="2016-02-04" letterStateCd="CA">
<Recipient dFlag="N">
<RecipientName fullName="Customer"/>
<RecipientDetails emailAddress="ABC4@yahoo.com"/>
<CoRecipientDetails dFlag="N"/>
</Recipient>
<ReturnAddress brandCD="3"/>
<LoanStructure loanNumber="455020941"/>
</CustomerLetter>
</CustomerBatch>
Please let me know if you have any queries.
EDIT: Adding one liner solution for same now.
awk -vheader1="<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"no\"?>" -vheader2="<CustomerBatch batchId=\"423433\" xmlns:xsi=\"404 not found">" -vfooter="</CustomerBatch>" 'BEGIN{file=1;} /<\/CustomerLetter>/{;print header1 ORS header2 ORS line ORS $0 ORS footer > file".xml";line="";next} /CustomerLetter letterId/{B=$2;gsub(/letterId=|\"/,X,B);file=B} !/<?xml version="1.0"/ && !/<CustomerBatch batchId/{line=line?line ORS $0:$0}' Input_file
EDIT2: Sorry my copy-paste have issues so attaching the script in case you are not able to copy it properly.
Thanks,
R. Singh
1 Like
RudiC
February 5, 2016, 9:17am
6
What does "we need to split for 700 number of letterid" mean? You want groups of 700 letters to be output to a single file (i.e. 3500 letter will be five files)? You want letters whose ID has "700" in them in separate files?
---------- Post updated at 15:17 ---------- Previous update was at 15:00 ----------
Would this do?
awk '
NR == 1 {FT = $0
next
}
NR < 4 {HD = HD DL $0
DL = RS
next
}
/letterId/ {if (FN) {print FT > FN}
if (!(LCNT%LN)) {FN = "file" ++FCNT ".xml"
print HD > FN
}
LCNT++
}
{print > FN
}
' LN=700 <(tail -1 file1) file1