Break up file into n number of subsets and run in parallel

Hi Guys,

I want to break one of my input files into, say, 25 parts, run the same script on each part in parallel, and then merge the output into a single file.
I have access to computing resources that can handle 25 files; if I just run the script on the original file, each run takes about 15 days.

Is this possible? So if I have an awk script gina.awk, these would be the steps.

  1. Split Input.file into Input1.file, Input2.file, ..., Input25.file

for file in Input*
do
  ./gina.awk "$file" > "out_$file"
done
cat out* > Output.file

Is this possible, and will it actually speed things up? I have access to 25 CPU cores.
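
For step 1, something like this is what I have in mind (a rough sketch; I'm assuming GNU split is available, and that its -n l/25 option splits the file into 25 pieces without breaking lines):

# split Input.file into 25 line-aligned pieces named Input.file.00 ... Input.file.24
# -d gives numeric suffixes instead of aa, ab, ...
# (the original Input.file would need moving aside so Input* only matches the pieces)
split -d -n l/25 Input.file Input.file.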

Try:

LINES=`wc -l Input.file | awk '{print int($1/25)}'`
split -dl$LINES Input.file Input.file
mv Input.file /somewhere/else

for file in Input*; do 
  ./gina.awk $file > out_$file &
done

cat out* > Output.file

Is it just a really slow awk script, or are there hundreds of gigabytes of data? Splitting it into 25 parts won't speed up a slow disk.

Does it make sense to split the input data into sections? Is each line considered individually, or does context matter?

Certainly it's possible to do what you want... Whether it's a good idea we don't know enough to say yet.


@bartus11

Don't we need a wait before the last statement?
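
Something like this, I think, so the cat doesn't run before the background jobs have finished (same loop as above, just with a wait added before the final cat):

for file in Input*; do
  ./gina.awk $file > out_$file &
done
wait        # blocks until every background job has finished
cat out* > Output.file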

--ahamed

Indeed, it might be useful. Alternatively, the OP could check the status of the jobs with:

jobs

And run the last part manually once no jobs are still running.
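
If it has to be unattended, a rough sketch of the same idea, polling jobs from the launching shell until nothing is left running (bash assumed):

# re-check once a minute; jobs -r lists only jobs that are still running
while [ -n "$(jobs -r)" ]; do
  sleep 60
done
cat out* > Output.file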

Yes, each line in the input is independent, and the awk script does a series of greps against another file and writes to the output.

It may be possible to speed up the awk script...
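
For instance, if the greps are exact key lookups, one common trick is to load the other file into an awk array once and test membership in memory instead of spawning grep for every line. A minimal sketch, assuming the keys sit in field 1 of both files (lookup.file and the field numbers are placeholders, since we haven't seen gina.awk):

# first pass (NR==FNR): remember every key from lookup.file
# second pass: print Input.file lines whose first field was seen in lookup.file
awk 'NR==FNR { seen[$1]; next }
     $1 in seen' lookup.file Input.file > Output.file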