Help with Splitting a Large XML file based on size AND tags

Hi All,

This is my first post here. Hoping to share and gain knowledge from this great forum!!!!

I've scanned this forum before posting my problem here, but I'm afraid I couldn't find any thread that addresses this exact problem.

I'm trying to split a large XML file (with multiple tag sets) into smaller files of a given maximum size, in such a way that the splitting doesn't happen in the middle of a tag set, i.e. each smaller file contains only complete tag sets. The size limit of the smaller files is specified in a parameter file. For example, if the size limit is 100 KB and the large file is 440 KB, I should get five smaller files of sizes 100 KB, 100 KB, 100 KB, 100 KB and 40 KB.

My initial approach was to create the large file with each complete tag set on a single line, and then to use the split command based on the size limit. However, the complete tag sets are not fitting on single lines, since the XMLs are themselves huge. So I'm now thinking of splitting the large file on tag boundaries while also staying within the size limit.
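
For what it's worth, if each complete tag set really did fit on one line, I believe GNU split's -C (--line-bytes) option would do more or less what I want, packing as many whole lines as fit under the size limit into each output file. A sketch only (assuming GNU coreutils; the file name is just an example):

# Sketch: only valid if every <URL>...</URL> record sits on a single line.
# -C packs complete lines into each part without exceeding 100 KB per part.
split -C 100K datafile_oneline.txt part_

But since the records don't fit on one line, that doesn't help here.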

Below is what I tried to do so far

#!/bin/bash
export ORACLE_HOME=.........
export ORACLE_SID=...........
export PATH=........
. ./params       # contains the parameter sizelimit
FILE="datafile.txt"
sqlplus -s userid/password@DB <<EOF
SET HEADING OFF
SET PAGESIZE 0
SET LINESIZE 32000
SET LONG 32000
SET NEWPAGE NONE
SET FEEDBACK OFF
SET TRIMSPOOL ON
SET DEFINE ON     
SET VERIFY OFF
SET SERVEROUTPUT OFF
SPOOL $FILE
[....query to create the master file...]
SPOOL OFF
EXIT
EOF
filesize=$(ls -l "$FILE" | awk '{print $5}')
#echo $filesize
#echo $sizelimit
if ! echo "$filesize - $sizelimit" | bc | grep '^-' > /dev/null
then split -b "$sizelimit" "$FILE" part
else echo "less than the limit"
fi

This was my first attempt, using the split command. However, I don't think it can be used, given my criterion that a tag set must not be cut in half. Assuming the tag sets look like <URL>...</URL>, can anyone suggest another way out?

Thanks a lot,

  • Avik

Welcome Aviktheory11,

Please change your post to wrap your code/output in [CODE] tags rather than [ICODE] tags. It makes it far easier to read.

Thanks for clearly putting in some effort before posting and for posting a good amount of information. Just a few questions:-

  • What OS and version are you running?
  • What are your preferred tools to work in? e.g. just ksh/bash, awk, etc.
  • Can you post a small sample of input and the expected output, or perhaps (with a width of 32000) just a representative sample with fewer 'columns'?
  • How close have you got with your requirement?
  • Is the data you want to split just a single column for each record? If not, then perhaps a tweak to your SELECT statement may be enough.

Thanks again,
Robin


how about this:

#!/bin/bash
export ORACLE_HOME=.........
export ORACLE_SID=...........
export PATH=........
. ./params        # contains the parameter sizelimit
...

if [ $(stat -c%s $FILE) -gt $sizelimit ]
then
    awk -v limit=$sizelimit '
        BEGIN { num=1 }
        {
          if ((bytes+=length)>limit) {
             close(FILENAME "." num)
             num++
          }
          printf "%s%s",$0,RS > FILENAME "." num
        } ' RS="</URL>" $FILE
else
   echo "$FILE: already less than the limit of $sizelimit"
fi

Just be careful: awk and many other Unix utilities have limits on the length of a single line, so you may be better off putting a newline character after each </URL>.
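
Something along these lines would do it (a sketch only, assuming GNU sed, which accepts \n in the replacement text; the output file name is just an example, and it is of course subject to the same long-line caveat):

# Put a newline after every closing tag so each complete
# <URL>...</URL> record ends up on its own line.
sed 's|</URL>|</URL>\n|g' datafile.txt > datafile_oneline.txt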

---------- Post updated at 10:17 AM ---------- Previous update was at 10:06 AM ----------

Depending on your OS, the stat command I used above may not be available. A much more portable (but possibly less efficient) version would be:

if [ $(wc -c < $FILE) -gt $sizelimit ]

Hi Robin,

Thanks for the reply. Below is the information you asked for:

I'm running Linux kernel 2.6.39.

The script I'm trying to write is for bash. I'm comfortable working with Awk/sed etc.

The actual table that I'm querying is something like below

ID (PK, NUMBER)      URL (the queried column, XMLTYPE)

1		     <URL><A>v1</A><B>v2</B><C>v3</C></URL>
2		     <URL><A>x1</A><B>x2</B><C>x3</C></URL>
3		     <URL><A>y1</A><B>y2</B><C>y3</C></URL>
4		     <URL><A>z1</A><B>z2</B><C>z3</C></URL>

There are about 10K rows in this table, and only the URL column needs to be dumped into a file and sent to an FTP server. There should be a check whether the file is more than 10 MB in size, and if it is, the file needs to be split into smaller files, each 10 MB or smaller.
The structure of the output file will be just something simple like this:

<URL><A>v1</A><B>v2</B><C>v3</C></URL>
<URL><A>x1</A><B>x2</B><C>x3</C></URL>
<URL><A>y1</A><B>y2</B><C>y3</C></URL>
<URL><A>z1</A><B>z2</B><C>z3</C></URL>

In case the size exceeds 10 MB, splitting this file would be easy with the split command, based on the size limit. But the problem is that each URL XML is extremely large, and the spooled output has a limit on the length of a single line, so the output file is being generated as something like the below:

<URL><A>v1</A><B>v2</B>
<C>v3</C></URL>
<URL><A>x1</A><B>x2</B>
<C>x3</C></URL>
<URL><A>y1</A><B>y2</B>
<C>y3</C></URL>
<URL><A>z1</A><B>z2</B>
<C>z3</C></URL>

I found it tricky to split this file, since the split command won't understand XML tags.

I have been successfully splitting files with sample data in my table, where the XML length is much smaller. As in the above example, my script works perfectly there. However, it's the production data that's causing the problem.

Thanks a lot Chubler_XL. I'll surely try out your idea; it looks good to me. I've found that the stat command is available in my version of Linux. I'll try this out and let you know.

Thanks again,

  • Avik

Sorry, I should have tried my code on more than one large URL, as I had forgotten to reset the bytes variable. Please accept this updated version:

#!/bin/bash
export ORACLE_HOME=.........
export ORACLE_SID=...........
export PATH=........
. ./params        # contains the parameter sizelimit
...

if [ $(stat -c%s $FILE) -gt $sizelimit ]
then
    awk -v limit=$sizelimit '
        BEGIN { num=1 }
        {
          if ((bytes+=length)>limit) {
             close(FILENAME "." num)
             bytes=length
             num++
          }
          printf "%s%s",$0,RS > FILENAME "." num
        } ' RS="</URL>" $FILE
else
   echo "$FILE: already less than the limit of $sizelimit"
fi
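
If you want a quick sanity check on the result (a sketch only; the part names assume FILE=datafile.txt as above), you could list each generated part with its size and its last few bytes, to confirm it stays under the limit and still ends with the closing tag:

# Show the size and trailing bytes of every part produced by the awk above.
for f in datafile.txt.*; do
    printf '%s: %s bytes, ends with %s\n' "$f" "$(wc -c < "$f")" "$(tail -c 7 "$f")"
done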

Hi Chubler_XL,

I tried out your code snippet, and there are a couple of observations I made.

Firstly, the split files are being generated, but the size limit being applied is not the one in the parameter file; it is effectively the size of the initial file itself. For example, the initial file ("output") was created with a size of 324010 bytes, whereas the parameter file specified a size limit of 10240 bytes. Yet there end up being just two files: "output" (the initial file) with size 324010 bytes, and "output.2" with size 324017 bytes, both with the same data.

I guess there might be something amiss with the bytes variable assignment, but I'm not sure.

Secondly, just for my knowledge, is your script supposed to append </URL> to the end of every file that gets generated?

Thanks,

  • Avik

This is the exact symptom you would see if your file doesn't contain </URL> at all.

Note: awk is case-sensitive; could it be that the file actually contains </url>? If this is the issue, you can put IGNORECASE=1 (a gawk feature) just before the RS="</URL>".
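
A quick way to check is to count the closing tags with and without regard to case (a sketch, assuming your grep supports -o, as GNU grep does; substitute your actual spool file name for datafile.txt):

# Number of exact </URL> tags vs. closing tags in any case.
grep -o '</URL>' datafile.txt | wc -l
grep -io '</url>' datafile.txt | wc -l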

No; if your file ends in </URL>, as it should from your definition, then nothing should change.

You're absolutely right, and I believe I've found the exact cause of this problem. The XML for even one record in the table is so large that it's not being spooled into the file completely. To give an example, the first row of the table contained

<URL><A>v1</A><B>v2</B><C>v3</C></URL>

But it's getting spooled into the file as

<URL><A>v1</A><B>v2</B><C>v3

Is there any limit to the file size in Unix?

---------- Post updated at 12:54 PM ---------- Previous update was at 12:14 PM ----------

Hi All,

I think this problem is resolved. Before spooling the file from the query, I just set LONG and LONGCHUNKSIZE to their maximum values, and the script is working fine now. I just don't know what will happen if the production data gets even more massive in the future!!!
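
For the record, the spool settings ended up looking roughly like this (a sketch only; the maximum values here are illustrative, so check the limits documented for your SQL*Plus version, and the connection details and query are as in my original script):

# Same structure as the original script, with the LONG settings raised.
sqlplus -s userid/password@DB <<EOF
SET HEADING OFF
SET PAGESIZE 0
SET LINESIZE 32000
SET LONG 2000000000
SET LONGCHUNKSIZE 2000000000
SET TRIMSPOOL ON
SET FEEDBACK OFF
SPOOL $FILE
[....query to create the master file...]
SPOOL OFF
EXIT
EOF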

Thanks to Robin & Chubler_XL.

Chubler_XL, your script is very clean and efficient!! Thanks again!!

  • Avik