<page>
<title>APRIL</title>
.........(text contents that I need to extract and store in 1.dat including the <title> tag)
</page>
<page>
<title>August</title>
....(text contents that I need to store in 2.dat including the <title> tag)
</page>
I want to split this XML file into 16 pieces.
I used the "split" command on my Linux box to break this file into 16 files, but I found that the tags were not kept intact. For example, in the code below, one file ended up with half of a page's content and the next file had the other half.
<page>
<title>August</title>
....(text contents that I need to store in 2.dat including the <title> tag)
</page>
Thanks for replying. My only criterion is to split the big 30 GB file into 16 pieces of around 1.8 GB each. This means the files could be named 1.part, 2.part and so on up to 16.part.
What you are referring to is extracting each of the XML page segments, tags and the text between them, and storing them in separate files. That is not my requirement. There are over 3.5 million such page tags in the big XML file.
The "split" utility in Linux splits a file based on the options you give it, such as number of lines or size. I used split too, but it broke some of the tags apart, as I have shown in my example above.
So if I used split to break the file into 1.8 GB pieces, that is what would happen.
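For reference, the split behaviour described here can be reproduced in miniature (the file name and the tiny 4-byte chunk size are made up for the demo; the real invocation would use something like -b 1800M):

```shell
# a miniature "big" file: two pages' worth of bytes
printf 'AAAA<page>BBBB' > big.xml
# split into fixed-size 4-byte pieces: part_aa, part_ab, part_ac, ...
split -b 4 big.xml part_
cat part_aa   # AAAA
cat part_ab   # <pag  -- the <page> tag is cut in the middle
```

split knows nothing about the XML structure, so any tag that happens to straddle a chunk boundary gets cut, exactly as in the example above.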
It's easy enough with low-level programming in C (or Perl). The principle is the same as in the "tail" command. You seek the file pointer to FILESIZE/16 and scan forward to find the first <page> and remember its byte position, then seek to 2*FILESIZE/16, and so on. When you have your positions you split the file with dd (or in this program).
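Once you have the byte positions, the dd step can be sketched like this. This is a miniature demo with a made-up test file and offsets; for the real 30 GB file you would use a large block size with GNU dd's iflag=skip_bytes,count_bytes options instead of the slow bs=1:

```shell
# demo file: three "pages" worth of bytes
printf 'AAAA<page>BBBB<page>CCCC' > testfile
# byte offsets where parts 2..N begin, as printed by the offset-finding program
offsets="4 14"
total=24          # total file size in bytes; the last part ends here
prev=0
n=1
for off in $offsets $total; do
    # bs=1 keeps skip/count in bytes: fine for a demo, far too slow for 30 GB
    dd if=testfile of=$n.part bs=1 skip=$prev count=$((off - prev)) 2>/dev/null
    prev=$off
    n=$((n + 1))
done
cat 2.part   # <page>BBBB
```

Each part then starts exactly at a <page> tag, and concatenating 1.part..N.part reproduces the original file byte for byte.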
If no one comes with a solution, I'll write a program but a little later.
So, it means that this task cannot be handled very well with simple shell scripting. It will take some time but will be worth trying in C. I'll write a program for it and put it here.
Quick & dirty and not tested thoroughly, but it prints something. If there are problems with '\0' bytes or with 64-bit sizes or offsets, it may be better to translate this to Perl.
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define NCHUNKS 16
#define NBUF 512
#define WORD "<page>"
void usage(char *progname) {
    printf("Usage: %s FILENAME\n", progname);
}

long long find_pos(FILE *fd, long long pos) {
    char BUF[NBUF + 1];
    BUF[NBUF + 1] = '\0'; // BUG !!! writes one element past the end of BUF; should be BUF[NBUF]
    fseeko(fd, pos, SEEK_SET);
    int count = -1;
    char *found = NULL;
    while (!found) {
        fread(BUF, NBUF, 1, fd);
        found = strstr(BUF, WORD);
        count++;
    }
    int offset = found - BUF;
    return pos + NBUF * count + offset;
}

int main(int argc, char **argv) {
    FILE *fd = fopen(argv[1], "r"); // with _FILE_OFFSET_BITS defined as 64, plain fopen handles large files
    if (!fd) {
        fprintf(stderr, "Couldn't open %s\n", argv[1]);
        usage(argv[0]);
        exit(-1);
    }
    long long fs;
    if (fseeko(fd, 0, SEEK_END)) {
        fprintf(stderr, "Couldn't go to the end of the file\n");
        exit(-1);
    }
    fs = ftello(fd);
    int count = 0;
    while (count < NCHUNKS - 1) {
        printf("%lld ", find_pos(fd, ++count * fs / NCHUNKS));
    }
    printf("\n");
    fclose(fd);
    return 0;
}
You can test this with:
xxd -s NUM testfile | head -1
where NUM is one of the offsets the program printed; the dump should start right at a <page> tag.
Yes, there is a bug. Too quick... ))) Thanks, Corona688!
Nice-looking program, though I would note one problem:
char BUF[NBUF+1];
BUF[NBUF+1] = '\0';
Replace NBUF with 4 and follow along:
char buf[4+1];
buf[0]=0; // first element
buf[1]=1; // second element
buf[2]=2; // third element
buf[3]=3; // fourth element
buf[4]=4; // fifth element
buf[5]=5; // SIXTH element! buf[4+1] is beyond the end!
If you're lucky, this will do nothing.
If you're unlucky, it will crash your program.
If you're very unlucky, it will corrupt stack values in strange ways that alter other local variables and cause unpredictable misbehavior.
This often results in programs that work fine when compiled for debugging, but do strange things when optimized -- suddenly memory values which didn't matter get stripped out and you're only stomping on ones that do.
Yes, a baby mistake. It has one more problem: if there are not enough WORDs in the file, it will crash (or loop infinitely), so you cannot test it on an arbitrary file. But I think in your situation that's impossible.