concurrent processes

We have a very large text file that contains almost 100K lines.
We want to process this file to generate another text file as per our data requirements.
At the moment, parsing the data takes 20-25 minutes for the 100K lines.

The current script uses:

while read line
do
    # parsing ...
done < inputfile

We want to cut the runtime down from 20-25 minutes to around 5-10 minutes.
Hence it was decided that we should split the large input file into 10 files of 10K lines each and run the parsing script on all 10 files at the same time, i.e. something like a concurrent run.

How do I achieve this? I am new to the concept of concurrent runs.
Please guide.

Hi,
Since the file is so large, it is advisable to use the C language.
You can make use of pthreads to read the files, and you can manage the threads easily.

Thanks
Raghuram

Do consider Perl for the job you are doing. Perl is well suited to handling large files and should improve the runtime if implemented properly. Read this if you are interested.

By concurrency, I assume you are referring to the simultaneous execution of processes. Unless you are on a multi-processor system, truly concurrent execution of processes is not possible.
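
If you are not sure how many processors the box has, something like one of these should tell you (the exact command depends on your Unix flavour):

# Linux:
grep -c "^processor" /proc/cpuinfo
# Solaris:
psrinfo | wc -l
# AIX:
lsdev -Cc processor | wc -l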

In your case, you could try something like this:

Split the file so that there are 10 files of 10K lines each (a small sketch of the split step follows below).
Run the script on each file individually as a background process:
<script> for the first 10K file &    (the & runs it in the background)

Loop through all of the files this way.
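
For the split step, something like this should do (just a sketch; bigfile and the chunk. prefix are placeholder names):

split -l 10000 bigfile chunk.    # produces chunk.aa, chunk.ab, ... each with 10000 lines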

Hi MatrixMadhan,

I am not sure how to code this exactly.
Also, from what I have read, do I have to stop the background processes explicitly?
What if there are problems with the data parsing? How will the error log be created?
Can you explain more and guide me with some sample scripts?

There is no need to stop your background processes unless there is a situation that requires it!
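
For completeness, if you ever do need to stop one, $! and kill are all it takes (just an illustrative sketch, using the placeholder script name from the loop below):

/somedir/process chunk1 &
pid=$!           # $! holds the PID of the most recently started background process
# ... later, only if something goes wrong:
kill $pid        # terminate that background process
# or, interactively:
jobs             # list the shell's background jobs
kill %1          # kill job number 1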

Would something like this be of help?

Split the file: 100K lines into 10 chunks of 10K lines each.
Now the process that was used to run against the 100K sample should be run against each of the smaller chunks.

i=1
while [ $i -le 10 ]
do
    /somedir/process chunk$i &    # make that a background process
    i=$(($i + 1))
done

With the above loop, the smaller chunks are fed to individual processes, which all start processing at the same time.

By default, background processes have a lower priority than foreground processes.
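
If you want to control the priority explicitly rather than rely on the shell's default, nice can set it when the chunk is started (a small variation on the loop above):

nice -n 10 /somedir/process chunk$i &    # run this chunk at a lower priority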

You need to determine a threshold value (more of a benchmarking exercise) at which running several processes on smaller chunks actually performs no worse than running a single process on a single chunk.
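
One way to find that threshold is simply to time the whole run for a few different chunk counts (a rough sketch with placeholder names; it assumes a 100K-line input and the /somedir/process script from above):

#!/bin/ksh
# Time the end-to-end run for different numbers of chunks.
run_chunks() {
    for f in /tmp/chunk.*
    do
        /somedir/process "$f" &     # one background process per chunk
    done
    wait                            # block until every background process finishes
}

for N in 2 5 10 20                  # candidate chunk counts to try
do
    rm -f /tmp/chunk.*
    split -l $((100000 / N)) bigfile /tmp/chunk.
    echo "Timing $N chunks:"
    time run_chunks
done

Whichever chunk count gives the lowest wall-clock time on your machine is the one to use.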

Creating the error logs works the same way as you had been doing for the foreground process!
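
Concretely, giving each background process its own log files keeps the output from getting mixed up; for example (output$i.log and error$i.log are just example names):

i=1
while [ $i -le 10 ]
do
    # each chunk gets its own output and error log
    /somedir/process chunk$i > output$i.log 2> error$i.log &
    i=$(($i + 1))
done
wait    # wait for all of them before inspecting the logs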

:)

I tried to develop a script based on matrixmadhan's comments.

Assuming the file is Server.log and it is available in the current working directory.

#!/bin/ksh
split -l $(($(wc -l < Server.log)/10)) Server.log /tmp/LogFile.

function ProcessFile {
    echo "Processing file: $1"
    while read line ; do
        echo "$line" >> /tmp/Server.Processed
    done < "$1"
}

for F in /tmp/LogFile.* ; do
    echo "Processing file $F"
    ProcessFile "$F" &
    echo "The last child PID is $!"
done

echo "Waiting for children"
wait
echo "All child processes are done"
exit

Experts, please comment on this approach. Is this true concurrent processing, and will it increase the performance?

Thanks
Nagarajan Ganesan