URGENT Help required regarding the use of FORK system call

I desperately need one of the UNIX gurus to help me resolve my problem ASAP (I have to deliver the code to the client by Monday, 08 Oct).

I have a file with around 5 million records (50 lakh). My original process takes around 30 hours to read the complete file, process every record, and write the results to another file. We do a lot of calculations for each record, which is why it takes that long.

Now I plan to implement parallel processing in my program. I am dividing the complete input file into 5 chunks (1 million records each) and handing one chunk to each child process. Every child process will process its own chunk and write the results to its own temporary file. Finally, in the parent process, I plan to merge all the temporary files together. I believe this will save a lot of processing time.

What I basically want to know is: what are the side effects of using fork() in C programs? Are there any system-level impacts of using fork()? Is there a system call to merge multiple files into one? What happens if a child is killed, and how can I reprocess the chunk of a killed child? How do I ensure that no zombie or orphan processes are created?

Can someone briefly advise how I should proceed with my logic? I have already written it, but I want to cross-check whether I am missing something.

Thanks,
Kumar

Well, a few general comments... cracking a program into 5 subprocesses like this makes sense only if you have 5 or more CPUs available.

To have children, a process must fork(). To have no zombies, the parent must issue a wait() each time a child dies. A process can catch SIGCHLD to be notified of the death of a child.
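To make the fork()/wait() pattern concrete, here is a minimal sketch (not the poster's actual code; the chunk-processing work is elided as a comment). The parent reaps every child with waitpid(), so none is left as a zombie, and it can tell which children failed:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Spawn nchildren workers, wait for each one, and return how many
 * exited normally with status 0.  In the real program each child
 * would process its own chunk of the input file. */
int spawn_and_reap(int nchildren)
{
    pid_t pids[64];
    int i, ok = 0;

    if (nchildren > 64)
        nchildren = 64;

    for (i = 0; i < nchildren; i++) {
        pids[i] = fork();
        if (pids[i] < 0) {
            perror("fork");
            break;                /* reap whatever we already started */
        }
        if (pids[i] == 0) {
            /* child: real code would read its chunk and write a temp file */
            _exit(0);
        }
    }

    /* parent: reap every child so none is left as a zombie */
    while (i-- > 0) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
        /* a child that was killed or exited non-zero could be
         * re-forked here to reprocess its chunk */
    }
    return ok;
}
```

Checking WIFEXITED/WEXITSTATUS after waitpid() is what answers the "what if a child is killed" question: a killed child shows up here, and the parent can fork a replacement for that chunk.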

I would try to drive the processing time per record down and use buffered I/O. Make sure that you don't fork a process per record.
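"Buffered I/O" here just means letting stdio read and write in large blocks instead of issuing a system call per record. A sketch, assuming one record per line (the per-record calculations are elided; the 1 MiB buffer size is illustrative, not a recommendation):

```c
#include <stdio.h>

/* Read records line by line from inpath, write them to outpath,
 * using large fully buffered streams.  Returns the number of
 * records copied, or -1 on error. */
long copy_records(const char *inpath, const char *outpath)
{
    char line[4096];
    long count = 0;
    FILE *in = fopen(inpath, "r");
    FILE *out = fopen(outpath, "w");

    if (in == NULL || out == NULL) {
        if (in)  fclose(in);
        if (out) fclose(out);
        return -1;
    }

    /* big stdio buffers: one read() brings in thousands of records */
    setvbuf(in,  NULL, _IOFBF, 1 << 20);
    setvbuf(out, NULL, _IOFBF, 1 << 20);

    while (fgets(line, sizeof line, in) != NULL) {
        /* ...the heavy per-record calculations would go here... */
        fputs(line, out);
        count++;
    }

    fclose(in);
    if (fclose(out) != 0)        /* flush failure means lost data */
        return -1;
    return count;
}
```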

Finally, if your program sucks, the client will not give you any brownie points for delivering it on time. And if it's a few days late but correct, he will eventually forget the lateness. Don't sacrifice quality for the timetable; it's not worth it.
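On the merge question from the original post: there is no single system call that merges files; you just append each temporary file to the final output with an ordinary read/write loop (which is all `cat file1 file2 > out` does). A minimal sketch:

```c
#include <stdio.h>

/* Append the entire contents of the file named src to the
 * already-open stream dst.  Returns 0 on success, -1 on error. */
int append_file(FILE *dst, const char *src)
{
    char buf[65536];
    size_t n;
    FILE *in = fopen(src, "rb");

    if (in == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, dst) != n) {
            fclose(in);
            return -1;
        }
    }
    fclose(in);
    return 0;
}
```

The parent would fopen the final output once and call append_file() for each of the five temporary files in order.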

Having said that... it is still not clear how the poster's proposed algorithm will greatly speed up the processing. As stated, running 5 parallel tasks on a platform with one CPU does not offer much improvement over running one task on the same platform.

Prior to diving into coding and programming, it would be wise to develop an architecture and/or processing algorithm that works. I don't think we have got to that point yet, have we?

For example:

What is the CPU? Is the system CPU constrained?

How much memory is on the platform? Is the system memory (and/or swap) constrained?

Questions like these need to be addressed before looking at system calls. It is quite possible the system is simply memory constrained and thrashing due to swap problems... (or simply needs more memory).

The previous gurus are quite correct.

I might even suggest that you forget about fork(), wait(), SIGCHLD, etc. and just launch 5 different instances of the same program with, say, a file name as an argument...

#!/bin/ksh

myprog file1 > /tmp/file1.$$ &
kid1pid=$!
myprog file2 > /tmp/file2.$$ &
kid2pid=$!
# ... and so on for file3 and file4 ...
myprog file5 > /tmp/file5.$$ &
kid5pid=$!

# at this point you wait on the kids
wait

# here you can assemble the files...
Cwd=`pwd`
cd /tmp
for i in file?.$$
do
    cat $i >> $Cwd/newfile.$$
done
rm -f file?.$$

cd $Cwd

exit 0

...so... it ain't pretty, it ain't slick or elegant, hell, it ain't even C, but it is simple and will work by Monday :P

Thanks a lot rwb1959!!

Your suggestion looks great. I would rather try this than use fork(). I had already implemented the fork mechanism, but I started to worry a bit after reading your responses.

Anyway thanks a lot for the help.