Split a 30GB XML file into 16 pieces

I have a 30 GB XML file which looks like this:

<page>
<title>APRIL</title>
.........(text contents that I need to extract and store in 1.dat including the <title> tag)
</page>
<page>
<title>August</title>
....(text contents that I need to store in 2.dat including the <title> tag)
</page>

I want to split this XML file into 16 pieces.

I used "split" command on my Linux to break this file into 16 files but what I found was that the tags were not intact. For example, the in below code, I found one file had half the content and other file had other half.

<page>
<title>August</title>
....(text contents that I need to store in 2.dat including the <title> tag)
</page>

Something like this:

<page>
<title>August</title>
....(text contents that I need to store in 2.dat including the <title> tag)
</page>

Can anybody please help me out with this?

Are there only 16 <page>...</page> segments in your big file? And what is the rule for splitting: for each

<page>
<title>foo</title>
[whatever]
</page>

do you want to get a file containing:

<title>foo</title>
[whatever]

?


Thanks for replying. My only criterion is to split the big 30 GB file into 16 pieces of around 1.8 GB each, so the files could be named 1.part, 2.part, and so on up to 16.part.
What you are describing is extracting each XML segment and the text between the tags and storing them in separate files. That is not my requirement: there are over 3.5 million such page tags in the big XML file.

If you use "split" utility in Linux, it splits the file based on certain options that you give like number of lines, size etc. I used split too but that broke out some of the tags as I have shown in my example above.
So, if I use split and break it into 1.8 Gb each, this is what I would have done:

split -b=18000000 BIG_XMl_FILE

If the big file is a well-formed XML file, how do you want to handle the root element?

Well, that is not a strict criterion for me. I just want to split the file into 16 pieces while keeping the tags intact. :slight_smile:

OK, so you only want the <page>...</page> blocks kept intact.

Last question: is the order important? Say in the big file you have

<page>(1) <page>(2) ..... <page>(n)

Do you want

page(1) page(2) page(3) .. page(k) in file.part1
page(k+1) page(k+2) page(k+3) .. page(j) in file.part2
...

or is it OK if

page(1), page(17), page(23), .... in file.part1
page(2), page(18), ...           in file.part2
...

?

Yes, the order is not important either, but the tags must be kept intact.

It's easy enough with low-level programming in C (or Perl). The principle is the same as in the "tail" command: seek the file pointer to FILESIZE/16, scan from there for the first <page>, and remember its byte position; then seek to 2*FILESIZE/16, and so on. Once you have the positions, you split the file with dd (or in the program itself).
If no one comes up with a solution, I'll write a program, but a little later.
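In the meantime, the copy step itself (the "or in this program" option) could look roughly like this. An untested sketch, assuming _FILE_OFFSET_BITS is set to 64 and <stdio.h> is included, and that the start/end byte positions have already been found:

/* untested sketch: copy bytes [start, end) of the big file into outname */
void copy_range(FILE *in, const char *outname, long long start, long long end) {
    char buf[1 << 16];                       /* 64 KB copy buffer */
    long long left = end - start;
    FILE *out = fopen(outname, "w");
    if (!out)
        return;
    fseeko(in, start, SEEK_SET);
    while (left > 0) {
        size_t want = left < (long long)sizeof(buf) ? (size_t)left : sizeof(buf);
        size_t got = fread(buf, 1, want, in);
        if (got == 0)                        /* EOF or read error */
            break;
        fwrite(buf, 1, got, out);
        left -= got;
    }
    fclose(out);
}

dd with its skip and count options would do the same job from the shell.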


So it seems this task cannot be handled very well with simple shell scripting. It will take some time, but it will be worth trying in C. I'll write a program for it and put it here. :slight_smile:

I wrote an awk script to do the job; check if it is what you need.

In the directory of your big file, run:

touch {1..16}.txt

This creates 16 empty files, 1.txt through 16.txt. Then run this:

awk 'BEGIN{ flag=0; file=1 }
{
    if ($0 ~ /<page>/) flag=1;
    if ($0 ~ /<\/page>/) {
        buf = buf $0 "\n";                 # append the closing </page> line
        flag = 0;
        printf "%s", buf >> (file".txt")   # flush the whole page to the current part
        buf = "";
        file++;                            # round-robin over the 16 parts
        file = (file <= 16) ? file : 1;
    }
    if (flag == 1) {
        buf = buf $0 "\n";                 # accumulate lines inside <page>...</page>
    }
}' your30G_BIG.xml

The code can still be optimized, but first try whether it works for you. (You can change the output file names in the code.)


Quick & dirty and not tested thoroughly. But it prints something. :slight_smile: If there will be problems with '\0' or with 64-bit sizes or offsets it's better to translate this to perl.

#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NCHUNKS 16
#define NBUF  512
#define WORD "<page>"


void usage(char *progname) {
    printf("Usage: %s FILENAME\n", progname);
}

long long find_pos(FILE *fd, long long pos) {
    /* scan forward from pos and return the absolute byte offset
       of the first occurrence of WORD */
    char BUF[NBUF+1];
    BUF[NBUF+1] = '\0';  // BUG !!!
    
    fseeko(fd, pos, SEEK_SET);
    int count = -1;
    char *found = NULL;

    while (!found) {
        fread(BUF, NBUF, 1, fd);
        found = strstr(BUF, WORD);
        count++;
    }
    int offset = found - BUF;

    return pos + NBUF*count + offset;
}

int main(int argc, char** argv) {
    FILE *fd = fopen(argv[1], "r"); /* with _FILE_OFFSET_BITS 64, fopen handles large files */
    if (! fd) {
        fprintf(stderr, "Couldn't open %s\n", argv[1]);
        usage(argv[0]);
        exit(-1);
    }

    long long fs;
    if (fseeko(fd, 0, SEEK_END)) {
        fprintf(stderr, "Couldn't go to the end of the file\n");
        exit(-1);
    }

    fs = ftello(fd);

    int count = 0;
    while (count < NCHUNKS-1) {
        printf("%lld ", find_pos(fd, ++count * fs/NCHUNKS));
    }
    printf("\n");
    
    fclose(fd);
    return 0;
}

You can test this with:

xxd -s NUM testfile | head -1

where NUM is one of the printed offsets; the dump should start right at a <page> tag.

Yes, there is a bug. Too quick... ))) Thanks, Corona688!


Nice-looking program, though I would note one problem:

char BUF[NBUF+1];
BUF[NBUF+1] = '\0';

Replace NBUF with 4 and follow along:

char BUF[4+1];
BUF[0]=0; // first element
BUF[1]=1; // second element
BUF[2]=2; // third element
BUF[3]=3; // fourth element
BUF[4]=4; // fifth element
BUF[5]=5; // SIXTH element!  BUF[4+1] is beyond the end!

If you're lucky, this will do nothing.

If you're unlucky, it will crash your program.

If you're very unlucky, it will corrupt stack values in strange ways that alter other local variables and cause unpredictable misbehavior.

This often results in programs that work fine when compiled for debugging, but do strange things when optimized -- suddenly memory values which didn't matter get stripped out and you're only stomping on ones that do.

char BUF[NBUF+1];
BUF[NBUF] = '\0';

That should be all it needs, I think.


Yes, a baby mistake. It has one more problem: if there are not enough WORDs in the file, it will crash (or loop infinitely), so you cannot test it on an arbitrary file. But I think in your situation that's impossible.
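A sketch of a guard for that case: check what fread() returns and give up at end of file (like the original, this can still miss a WORD that straddles two reads):

    /* in find_pos(): stop at EOF instead of looping forever */
    while (!found) {
        size_t n = fread(BUF, 1, NBUF, fd); /* (1, NBUF) so n counts bytes */
        if (n == 0)
            return -1;                      /* EOF or error: WORD not found */
        BUF[n] = '\0';                      /* terminate exactly what was read */
        found = strstr(BUF, WORD);
        count++;
    }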

    long long fs;
    if (fseeko(fd, 0, SEEK_END)) {
        fprintf(stderr, "Couldn't go to the end of the file\n");
        exit(-1);
    }

    fs = ftello(fd);

To make your code more portable, you should use off_t instead of long long.
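For example (a sketch: with _FILE_OFFSET_BITS set to 64, fseeko() and ftello() already work in terms of off_t, and find_pos() would then take and return off_t as well):

#include <sys/types.h>  /* off_t */

    off_t fs;           /* matches what ftello() returns on any platform */
    if (fseeko(fd, 0, SEEK_END)) {
        fprintf(stderr, "Couldn't go to the end of the file\n");
        exit(-1);
    }
    fs = ftello(fd);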