Multithreading in reading file

arunkumar_mca · November 23, 2011, 1:32pm

Dear all,

I have a huge XML file with the structure below:
<EMPLOYEE>
<RECORD id =aaa>
<Salary>99999</Salary>
<section>ssss</section>
</RECORD>
<RECORD id =bbb>
<Salary>77777</Salary>
<section>ssss</section>
</RECORD>
</EMPLOYEE>

This is a 50 GB file. I want to read it with multiple threads and write the salary and section (salary~section) to multiple files, one for each thread. I am trying to do this in C++.

After that I will have to merge these files. I am pretty new to threading concepts in C++. Can anyone please suggest a way of doing this?

Thanks
Arun

Corona688 · November 23, 2011, 2:48pm

For processing as simple as this, a single-threaded program is going to be far faster than your disk.

You can only process a file as fast as the disk can read it no matter how many threads you have. Files don't have a multithreading "go-faster" mode.
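
One way to check this on your own machine is to time a program that does nothing but read the file. Below is a minimal sketch in C, assuming the data arrives on a FILE stream. The function name count_bytes and the 1 MiB buffer size are illustrative choices, not anything posted in this thread. If this baseline takes about as long as the real program, the disk is the limit and extra threads will not help.

```c
#include <stdio.h>

/* Hypothetical helper: read a stream to the end in large blocks and
 * return the number of bytes seen. It does no parsing at all, so timing
 * it gives a rough ceiling for how fast the disk can deliver the file. */
unsigned long long count_bytes(FILE *in)
{
        static char buf[1 << 20];        /* 1 MiB read buffer (arbitrary size) */
        unsigned long long total = 0;
        size_t n;

        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
                total += n;

        return total;
}
```

Calling count_bytes(stdin) from main and running the result under time, with the file redirected to standard input, gives the number to compare against. The operating system may cache a file that was read recently, so a second run can look faster than the disk really is.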

What exactly are you trying to do? What do you mean by "merge"? Describe what you're doing in more detail and we may be able to help track down the slow step.

P.S: Is this really what your XML looks like, or did you pretty it up for posting?

arunkumar_mca · November 23, 2011, 2:54pm

Thanks for your reply. Below are the tasks I have been asked to do:

1. The file is about 100 GB. I was asked to read it with multiple threads in a producer-consumer arrangement and write salary~section to multiple files, to split the 100 GB load.
2. After that is done, I need to merge the files and report the data in order of Section.

Thanks,
Arun

Corona688 · November 23, 2011, 3:47pm

Your CPU's going to be faster than your disk whether you have one thread or 100 threads. Disks do not have a multithreading "go faster" mode.

If you were doing some difficult processing on the data, multithreading would make sense -- a multicore CPU can process several sets of data at once -- but you aren't. The toughest part is just splitting the file into records, which can't be multithreaded anyway, since you must read the file in order to figure out where records begin and end.

Sounds a bit contrived honestly. Is this homework?

arunkumar_mca · November 23, 2011, 3:53pm

Thanks for your reply, and BTW this is not homework. If reading with multiple threads is not a feasible solution, can you suggest a way and an algorithm to do this in single-threaded mode?

Thanks,
Arun

Corona688 · November 23, 2011, 3:58pm

The data you have right now is extremely easy. If the data's actually different that could make it very hard. So please answer this question from earlier:

"Is this really what your XML looks like, or did you pretty it up for posting?"

arunkumar_mca · November 23, 2011, 4:06pm

I know that by reading the file record by record and using an XML parser we can print this into another file. The real issue is that I was given this file and asked to do it in multithreaded mode using an XML parser. Later I will be given the original XML file, and I will have to modify this code to make it work with that.

Also, note that this is a 100 GB file, so if we process it with a single reader it will take time. That is my thinking; I may be wrong.

Corona688 · November 23, 2011, 4:32pm

How, exactly, is a multithreaded CPU going to speed up your disk? Disks don't multithread, CPUs do. You have to read the records one at a time no matter what.

Also, please answer the question I already asked twice, and now ask a third time:

"Is this really what your XML looks like, or did you pretty it up for posting?"

I can't do anything useful unless you start answering my questions.

arunkumar_mca · November 23, 2011, 4:44pm

Thanks for your reply. Below is the record in my XML file. I am also new to XML parsers in C++.

Thanks,
Arun

Corona688 · November 23, 2011, 5:08pm

$ cat getsecsal.c

#include <stdio.h>
#include <string.h>

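int main(void)
{
        char buf[16384];
        char section[16384] = "";   /* last <Section> seen; empty until one is read */

        /* Read one line at a time; each tag is expected on its own line. */
        while(fgets(buf, sizeof(buf), stdin))
        {
                char *substr;
                if((substr=strstr(buf,"<Section>")) != NULL)
                {
                        substr += strlen("<Section>");
                        /* The sample data closes this tag as "</ Section>", with a space. */
                        char *end=strstr(substr, "</ Section>");
                        if(end) (*end)='\0';
                        strcpy(section, substr);
                }
                else if((substr=strstr(buf,"<Salary>")) != NULL)
                {
                        substr += strlen("<Salary>");
                        char *end=strstr(substr, "</Salary>");
                        if(end) (*end)='\0';

                        /* Salary follows Section in each record, so print the pair here. */
                        printf("%s~%s\n", section, substr);
                }
        }
        return 0;
}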

$ gcc getsecsal.c
$ ls -lh data[12]
-rw-r--r-- 1 tyler users 1.7G Nov 23 15:57 data1
-rw-r--r-- 1 tyler users  208 Nov 23 15:57 data2
$ cat data2

<EMPLOYEE>
<RECORD id =XYZ >
<SSN>123</SSN>
<Section>dfdf</ Section>
<Salary>34343</Salary>
</RECORD>
<RECORD id =XZY >
< SSN >321</ SSN >
<Section>dfd</ Section>
<Salary>34343</Salary>
</RECORD>
</EMPLOYEE>

$ ./a.out < data2
dfdf~34343
dfd~34343

$ time ./a.out < data1 > /dev/null

real    0m46.661s
user    0m42.820s
sys     0m2.221s

$

1.7 gigabytes of data in 46.7 seconds is 37 megabytes per second -- the fastest speed my disk controller gets. I'd need a faster disk controller to do any better. More threads wouldn't do a thing.
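
The arithmetic behind that figure, and what it would imply for the 100 GB file mentioned earlier in the thread, can be sketched as below. The function names are made up for illustration and sizes are taken as binary units (1 GB = 1024 MB). The estimate assumes the larger file streams at the same rate as data1, which is a guess rather than a measurement.

```c
#include <stdio.h>

/* Illustrative helpers, not part of getsecsal.c. */

/* Throughput in MB/s for `gigabytes` read in `seconds` (1 GB = 1024 MB). */
double mb_per_second(double gigabytes, double seconds)
{
        return gigabytes * 1024.0 / seconds;
}

/* Seconds needed to read `gigabytes` at `rate` MB/s. */
double seconds_needed(double gigabytes, double rate)
{
        return gigabytes * 1024.0 / rate;
}

/* Print the figures for the run above and for a 100 GB file. */
void print_estimate(void)
{
        double rate = mb_per_second(1.7, 46.661);     /* about 37.3 MB/s */
        double secs = seconds_needed(100.0, rate);    /* about 2745 seconds */

        printf("%.1f MB/s, %.0f s (%.1f min) for 100 GB\n",
               rate, secs, secs / 60.0);
}
```

At that rate a 100 GB file would take roughly 46 minutes per pass, however many threads are reading it.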

If the tags in any of your records are actually different from what you posted, this program won't work.
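
For the later step of reporting the output in order of section, one possible approach is to sort the extracted lines. The sketch below assumes the section~salary line format printed by the program above, and it assumes the extracted lines fit in memory. The names section_len, by_section and sort_by_section are hypothetical. If the extracted output is too large for memory, an external sort would be needed instead.

```c
#include <stdlib.h>
#include <string.h>

/* Length of the section field: everything before the first '~'. */
static size_t section_len(const char *line)
{
        const char *sep = strchr(line, '~');
        return sep ? (size_t)(sep - line) : strlen(line);
}

/* qsort comparator for an array of char* lines: order by section field. */
static int by_section(const void *a, const void *b)
{
        const char *x = *(const char *const *)a;
        const char *y = *(const char *const *)b;
        size_t nx = section_len(x), ny = section_len(y);
        size_t n = nx < ny ? nx : ny;
        int c = memcmp(x, y, n);

        if (c != 0)
                return c;
        return (nx > ny) - (nx < ny);   /* shorter section sorts first */
}

/* Sort `count` lines of the form section~salary in place. */
void sort_by_section(char **lines, size_t count)
{
        qsort(lines, count, sizeof *lines, by_section);
}
```

Note that qsort does not promise to keep lines with the same section in their original order.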