Break up file into n number of subsets and run in parallel

Hi Guys,

I want to break one of my input files into, say, 25 parts, run the same script on each part in parallel, and then merge the output into a single file.
I have access to computing resources that can handle 25 files; if I just run the script on the original file, each run takes about 15 days.

Is this possible? So if I have an awk script gina.awk, these would be the steps.

  1. Split Input.file into Input1.file, Input2.file, ..., Input25.file

for file in Input*
do
  ./gina.awk "$file" > "out_$file"
done
cat out* > Output.file

Is this possible, and will it actually speed things up? I have access to 25 CPU cores.
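
For step 1, something like this is what I have in mind (a rough sketch; I'm assuming GNU split is available, and that its -n l/25 option splits the file into 25 pieces without breaking lines):

# split Input.file into 25 line-aligned pieces named Input.file.00 ... Input.file.24
# -d gives numeric suffixes instead of aa, ab, ...
# (the original Input.file would need moving aside so Input* only matches the pieces)
split -d -n l/25 Input.file Input.file.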

Try:

LINES=`wc -l Input.file | awk '{print int($1/25)}'`
split -dl$LINES Input.file Input.file
mv Input.file /somewhere/else

for file in Input*; do 
  ./gina.awk $file > out_$file &
done

cat out* > Output.file

Is it just a really slow awk script, or are there hundreds of gigabytes of data? Splitting it into 25 parts won't speed up a slow disk.

Does it make sense to split the input data into sections? Is each line considered individually, or does context matter?

Certainly it's possible to do what you want... Whether it's a good idea we don't know enough to say yet.


@bartus11

Don't we need a wait before the last statement?
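
Something like this, I think, so the cat doesn't run before the background jobs have finished (same loop as above, just with a wait added before the final cat):

for file in Input*; do
  ./gina.awk $file > out_$file &
done
wait        # blocks until every background job has finished
cat out* > Output.file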

--ahamed

Indeed, it might be useful. Alternatively, the OP could check the status of the jobs with:

jobs

And run the last part manually once no jobs are still running.
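
If it has to be unattended, a rough sketch of the same idea, polling jobs from the launching shell until nothing is left running (bash assumed):

# re-check once a minute; jobs -r lists only jobs that are still running
while [ -n "$(jobs -r)" ]; do
  sleep 60
done
cat out* > Output.file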

Yes, each line in the input is independent, and the awk script does a series of greps against another file and writes to the output.

It may be possible to speed up the awk script...
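
For instance, if the greps are exact key lookups, one common trick is to load the other file into an awk array once and test membership in memory instead of spawning grep for every line. A minimal sketch, assuming the keys sit in field 1 of both files (lookup.file and the field numbers are placeholders, since we haven't seen gina.awk):

# first pass (NR==FNR): remember every key from lookup.file
# second pass: print Input.file lines whose first field was seen in lookup.file
awk 'NR==FNR { seen[$1]; next }
     $1 in seen' lookup.file Input.file > Output.file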