help to parallelize work on thousands of files

I need to find a smarter way to process about 60,000 files in a single directory.

Every night a script runs on each file, generating an output file in another directory; this used to take 5 hours, but as the data has grown it now takes 7 hours.

The files are of different sizes, but there are 16 cores on the box, so I want to run at least 10 parallel processes (the report-generating script is not very CPU intensive).

I can manually split the output of "ls -1" into 10 lists, then run a foreach over every file in each list in the background. This gets the run down to 2 hours, but it isn't the smartest way, because the list with the largest files (some over a gig) always takes the longest while the list with the small files finishes first.

One way of solving the problem would be to list the files in order of size and put every 10th file into a list.

Another way could be to start processing the files one after the other, while never keeping more than 10 processes running at a time.

Finally, I was also thinking of keeping a gzipped tarball of the directory. gtar, or tar piped through gzip, takes over 12 hours to run! It would be good to be able to create 10 smaller tarballs in a shorter time.
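For the tarball idea, something like this is what I had in mind (untested sketch; the /tmp and /backup paths are just placeholders, and it relies on gtar's -T option to read file names from a list):

#!/bin/sh
ls > /tmp/allfiles                          # names of all ~60,000 files
split -l 6000 /tmp/allfiles /tmp/tarlist.   # roughly 10 lists of 6,000 names each
for list in /tmp/tarlist.*
do
  gtar -czf "/backup/$(basename "$list").tar.gz" -T "$list" &   # one gzipped tarball per list, in the background
done
wait                                        # wait for all tarballs to finish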

thanks!
-VH

Hi VH

Could you give us a sample layout of your script? I mean, how is it provided with the input file: through command-line arguments or directly?

Guru.

You could use something like this to split the files into 10 lists with an even size distribution:

#!/bin/ksh
for i in {1..10}
do
  list[$i]=$(ls -lS | awk "NR>$i" | awk 'NR%10==1{print $9}')
done
echo "${list[1]}" #prints contents of the first list (just to show how to get to the filenames contained there)

Something like this?

par=10
for i in *                                        # For every file in the directory
do 
  while [ $(ps |grep -c "[c]ommand") -ge $par ]   # wait until a slot is free if there are 10 or more processes 
  do   
     sleep 1
  done
  command "$i" &                                  # run "command" on file "$i" in background
done
wait                                              # wait for last background processes to finish

replace "command" with the command you are actually using.

This creates a crude set of 10 slots in which the commands can run in parallel (a single-queue, 10-server model).

This is my script to list the files and split them:

#!/bin/sh
dest=/my/destination/directory
ls -l | sed 1d | sort -k5,5n | awk '{print $9}' > /tmp/all   # skip the "total" line, sort by size (field 5), keep only file names
cd /tmp ; split -l 10000 /tmp/all

Then I can go to /tmp and run

for file in `cat /tmp/xaa`
do
  /usr/local/bin/genrep.pl "$file" /my/destination/directory
done

and the same for /tmp/xab, /tmp/xac, etc.
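To run all of the lists at once, I could put each of those loops in the background, something like this (rough sketch, using the xaa/xab/... names that split creates in /tmp):

#!/bin/sh
dest=/my/destination/directory
for list in /tmp/x??                        # xaa, xab, xac, ... produced by split
do
  (
    for file in `cat "$list"`
    do
      /usr/local/bin/genrep.pl "$file" "$dest"
    done
  ) &                                       # one background worker per list
done
wait                                        # wait for all lists to finish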

So I think it would become:

#!/bin/sh
dest=/my/destination/directory
par=10
for i in *                                        # For every file in the directory
do 
  while [ $(ps |grep -c "[g]enrep") -ge $par ]    # wait until a slot is free if there are $par or more processes 
  do   
     sleep 1
  done
  /usr/local/bin/genrep.pl "$i" "$dest" &         # run genrep.pl on file "$i" in the background
done
wait                                              # wait for last background processes to finish

#!/bin/bash
for i in {1..10}
do
  list[$i]=$(ls -lS | awk "NR>$i" | awk 'NR%10==1{print $8}')   # every 10th entry of the size-sorted listing goes into list $i
done
for i in {1..10}
do
  for file in ${list[$i]}; do genrep.pl "$file" ../test2; done &   # one background loop per list
done

The above is working for me now. Thanks
The only drawback is that when I stop the script, I need another script to kill the background processes, but that's OK. :)


LOL! Now I have to try Scrutinizer's post and compare results!


Thanks All,

Initially Scrutinizer's script was slower, but I got rid of the sleep and it runs like a charm. The advantage is that cancelling the script is easy.

Cheers!

-VH

Hi.

The Linux xargs has a feature to perform this kind of task:

       --max-procs=max-procs
       -P max-procs
              Run up to max-procs processes at a time; the default is  1.   If
              max-procs  is 0, xargs will run as many processes as possible at
              a time. 

See man xargs for details ... cheers, drl
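For example, something like this should do it (a sketch assuming GNU find and xargs, and the genrep.pl call from the earlier posts):

# run at most 10 genrep.pl processes at a time, one file per process
find . -maxdepth 1 -type f -print0 |
  xargs -0 -P 10 -I{} /usr/local/bin/genrep.pl {} /my/destination/directory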

BTW, ksh93t has built-in support for automatically limiting the number of background jobs that run at the same time. See the MAXJOBS parameter. You can also use the SIGCHLD trap to find out which background job has completed and get its exit status.


Thanks! A small correction: I think it is ksh93t+ and higher, and the variable is called JOBMAX. It works nicely. If you do:

#!/bin/ksh
JOBMAX=10  # works in ksh93t+
for i in *
do
  sleep 10 &
done
wait

ps will show at most 11 processes at any given moment, including the parent.

My mistake. It was late. I took it from a Dave Korn email without checking the sources.

https://mailman.research.att.com/pipermail/ast-users/2010q2/002931.html

For completeness, here is the ksh93 release note on JOBMAX.

08-12-04 +SHOPT_BGX enables background job extensions. Noted by "J" in
the version string when enabled. (1) JOBMAX=n limits the number
of concurrent & jobs to n; the n+1 & job will block until a
running background job completes. (2) SIGCHLD traps are queued
so that each completing background job gets its own trap; $! is
set to the job pid and $? is set to the job exit status at the
beginning of the trap. (3) sleep -s added to sleep until the time
expires or until a signal is delivered.
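
As an illustration of points (1) and (2), here is a minimal sketch (assuming a ksh93t+ build with SHOPT_BGX, and the genrep.pl invocation from the earlier posts):

#!/bin/ksh
JOBMAX=10                                          # at most 10 background jobs at a time
trap 'print "job $! exited with status $?"' CHLD   # queued: fires once per completed job
for f in *
do
  /usr/local/bin/genrep.pl "$f" /my/destination/directory &
done
wait                                               # wait for the remaining jobs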