Source xml file splitter

I have a source file that contains multiple XML files concatenated together. The separator string between files is <?xml version="1.0" encoding="utf-8"?>. I want to split it into multiple files with the names shown below. Earlier I used this awk code to split a file every fixed number of lines:

 awk 'NR%"'"${splitCount}"'"==1{x="'"${SrcFileName}_"'" sprintf("%04d",++i) ".txt"}{print > x}' $SrcFileName.txt 

I am not very familiar with awk scripting, so I don't know whether this code can be adapted or whether there is a better or faster way of doing this. Below is an example of the source and the expected target files.

Source File: XML_DUMP.txt
<?xml version="1.0" encoding="utf-8"?>
<XML1 data>
<?xml version="1.0" encoding="utf-8"?>
<XML2 data>
<?xml version="1.0" encoding="utf-8"?>
<XML3 data>
 
Target Files: 
File 1: XML_DUMP_0001.txt
<?xml version="1.0" encoding="utf-8"?>
<XML1 data>
 
File 2: XML_DUMP_0002.txt
<?xml version="1.0" encoding="utf-8"?>
<XML2 data>
 
File 3: XML_DUMP_0003.txt
<?xml version="1.0" encoding="utf-8"?>
<XML3 data>

try

awk '/xml version="1.0"/{x="'"${SrcFileName}_"'" sprintf("%04d",++i) ".txt";}
 {print > x}' $SrcFileName.txt
Or match the full declaration and close each file after writing:
awk '/<\?xml version="1\.0" encoding="utf-8"\?>/{n++}
n{f="XML_DUMP_" sprintf("%04d",n) ".txt"
print >> f
close(f)}' XML_DUMP.txt

Thanks @pamu and @elixir

I am happy with the code below except for two things:
1> I need an awk variable that can be set from a shell parameter.
2> I was not able to assign <?xml version="1.0" encoding="utf-8"?> to a variable.

existing working code 
awk '/xml version="1.0"/{x="'"${SrcFileName}_"'" sprintf("%04d",++i) ".txt";}  {print > x}' $SrcFileName.txt
 
I need two-line code, something like:
word=<?xml version="1.0" encoding="utf-8"?>
 
awk ' $word {x="'"${SrcFileName}_"'" sprintf("%04d",++i) ".txt";}  {print > x}' $SrcFileName.txt
 

try this...

word='<?xml version="1.0" encoding="utf-8"?>'
 
awk -v serch="$word" '{if($0 ~ serch){x="'"${SrcFileName}_"'" sprintf("%04d",++i) ".txt";print > x}else{print > x}}' $SrcFileName.txt

If the word assignment above works properly, great.

If not, try the one below (not tested, I am away from my machine):

word="<?xml version=\"1.0\" encoding=\"utf-8\"?>"

@pamu, the word assignment is working fine, but the awk command gives this error:

awk: 0602-576 A print or getline function must have a file name.
 The input line number is 1. The file is a.txt.
 The source line number is 1.

This might be because word is not being passed into awk properly.
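One likely explanation, sketched here on a single sample line: when serch is used with ~ it is treated as a regular expression, so each "?" in the separator acts as a quantifier rather than a literal character. The pattern then asks for utf-8, an optional quote, and ">", which the real text (utf-8"?>) never satisfies, so x is never set before the first print. The variable names regex_hit and index_hit are only for illustration; index() is shown as one way to test for a literal substring instead.

```shell
# Sample line and separator taken from the thread.
line='<?xml version="1.0" encoding="utf-8"?>'
word='<?xml version="1.0" encoding="utf-8"?>'

# Dynamic regex match: "?" is a quantifier, so the literal "?" before ">" is not matched.
regex_hit=$(printf '%s\n' "$line" |
    awk -v serch="$word" '{ if ($0 ~ serch) print "match"; else print "no match" }')

# Literal substring test: index() does no regex interpretation.
index_hit=$(printf '%s\n' "$line" |
    awk -v serch="$word" '{ if (index($0, serch)) print "match"; else print "no match" }')

printf 'regex: %s\n' "$regex_hit"   # regex: no match
printf 'index: %s\n' "$index_hit"   # index: match
```

With no matching line, the else branch runs first with x still empty, which is consistent with the "must have a file name" message.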

If it is not a problem for you, shorten the word and check again:

word="<?xml version="
 
awk -v serch="$word" '{if($0 ~ serch){x="'"${SrcFileName}_"'" sprintf("%04d",++i) ".txt";print > x}else{print > x}}' $SrcFileName.txt

It will be difficult for me to justify that to the business, but even with the shorter word it gives the same error for a large source file. It runs fine for a small file.

Try this.

We may be trying to write to x before it has been set:

word='<?xml version="1.0" encoding="utf-8"?>'
 
awk -v serch="$word" '{if($0 ~ serch){x="'"${SrcFileName}_"'" sprintf("%04d",++i) ".txt";print > x}else if(x){print > x}}' $SrcFileName.txt

The code ran successfully without writing any new file, so basically no output file is generated.


The earlier code that elixir gave works, but I want to parameterise word. Can the awk script be changed to use the word parameter?

SrcFileName=XML_DUMP
word='<?xml version="1.0" encoding="utf-8"?>'
 
awk '/<\?xml version="1\.0" encoding="utf-8"\?>/{n++} 
n{f="'"${SrcFileName}_"'" sprintf("%04d",n) ".txt"
print >> f
close(f)}' $SrcFileName.txt

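You need to use a double backslash. awk processes escape sequences in -v assignments, so \\? reaches the regular expression as \?, which matches a literal question mark.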
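A small check of that behaviour, assuming a POSIX-style awk: the first command prints the value awk actually stores after -v escape processing, and the second confirms that the stored pattern matches the separator line. The names seen and hit are illustrative only.

```shell
# The shell passes the backslashes through unchanged inside single quotes.
word='<\\?xml version="1\\.0" encoding="utf-8"\\?>'

# awk collapses each \\ to a single \ while processing the -v assignment.
seen=$(awk -v serch="$word" 'BEGIN { print serch }')
printf '%s\n' "$seen"

# Used as a dynamic regex, \? and \. now match the literal characters.
hit=$(printf '%s\n' '<?xml version="1.0" encoding="utf-8"?>' |
    awk -v serch="$word" '$0 ~ serch { print "match" }')
printf '%s\n' "$hit"
```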

try

word='<\\?xml version="1\\.0" encoding="utf-8"\\?>'

awk -v serch="$word" '{if($0 ~ serch){x="'"${SrcFileName}_"'" sprintf("%04d",++i) ".txt";print > x}else if(x){print > x}}' $SrcFileName.txt
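For comparison, here is a minimal sketch that passes both the file name prefix and the separator with -v, so no shell quoting has to be spliced into the awk program. It is not the code from the replies above: the variable names prefix, sep, and out are made up for the example, the sample input is the one from the question, and index() is swapped in for the regex match so the separator needs no escaping. It also closes each output file before opening the next, which may help on inputs that produce many files.

```shell
# Work in a scratch directory so the sketch does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

SrcFileName=XML_DUMP
word='<?xml version="1.0" encoding="utf-8"?>'

# Sample input mirroring the example in the question.
cat > "$SrcFileName.txt" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<XML1 data>
<?xml version="1.0" encoding="utf-8"?>
<XML2 data>
<?xml version="1.0" encoding="utf-8"?>
<XML3 data>
EOF

awk -v prefix="${SrcFileName}_" -v sep="$word" '
    index($0, sep) {                  # literal substring test, no regex involved
        if (out) close(out)           # release the previous output file
        out = prefix sprintf("%04d", ++n) ".txt"
    }
    out { print > out }               # ignore any lines before the first separator
' "$SrcFileName.txt"

ls "$workdir"
```

The guard on out plays the same role as else if(x) in the replies above: nothing is written until the first separator line has been seen.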