I need to find a smarter way to process about 60,000 files in a single directory.
Every night a script runs on each file, generating an output file in another directory; this used to take 5 hours, but as the data grows it is taking 7 hours.
The files are of different sizes, but there are 16 cores on the box, so I want to run at least 10 parallel processes (the report-generating script is not very CPU-intensive).
I can manually split "ls -1" into 10 lists, then loop over every file in each list in the background. This brings the run down to 2 hours, but it isn't the smartest way, because the list with the largest files (some over a gig) always takes the longest while the list with small files finishes first.
One way of solving the problem is to list the files in order of size and somehow put every 10th file into a list.
Another way could be to start processing the files one after the other while maintaining no more than 10 concurrent processes.
Finally, I was also thinking of keeping a zipped-up tarball, but gtar or tar piped through gzip takes over 12 hours to run! It would be good to be able to create 10 smaller tarballs in a shorter time.
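For the tarball part, one option is to split the file list into chunks and run one tar per chunk in parallel. A rough sketch (the "chunk." and "archive-" names are illustrative, and this does not balance the chunks by size):

```shell
#!/bin/sh
# Split the file list into ~10 chunks and tar each chunk in parallel.
printf '%s\n' * > filelist                 # one filename per line (assumes no newlines in names)
lines=$(wc -l < filelist)
split -l $(( (lines + 9) / 10 )) filelist chunk.
for f in chunk.*
do
    tar czf "archive-$f.tar.gz" -T "$f" &  # -T reads the member list from a file (gtar)
done
wait                                       # wait for all the tar processes to finish
```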
You could use something like this to split files into 10 lists with even size distribution:
#!/bin/ksh
for i in {1..10}
do
    # Sort by size (largest first), skip the "total" header plus the first
    # i-1 files, then take every 10th line; $9 is the filename field of "ls -l"
    list[$i]=$(ls -lS | awk "NR>$i" | awk 'NR%10==1{print $9}')
done
echo "${list[1]}"    # prints the contents of the first list (just to show how to get at the filenames)
par=10
for i in *                                           # for every file in the directory
do
    while [ $(ps | grep -c "[c]ommand") -ge $par ]   # wait until a slot is free if there are 10 or more processes
    do
        sleep 1
    done
    command "$i" &                                   # run "command" on file "$i" in the background
done
wait                                                 # wait for the last background processes to finish
Replace "command" with the command you are actually using.
This creates 10 crude slots in which the commands can run in parallel (a single-queue, ten-server model).
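A variant of the same idea that counts the shell's own background jobs instead of grepping ps (a sketch; "command" is still a stand-in for the real program, and it relies on bash/ksh reporting the parent's job table inside `$( )`):

```shell
#!/bin/bash
par=10
for i in *                                        # for every file in the directory
do
    while [ "$(jobs -pr | wc -l)" -ge "$par" ]    # wait until fewer than $par jobs are running
    do
        sleep 1
    done
    command "$i" &                                # stand-in: replace with your real program
done
wait
```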
#!/bin/sh
dest=/my/destination/directory
par=10
for i in *                                          # for every file in the directory
do
    while [ $(ps | grep -c "[g]enrep") -ge $par ]   # wait until a slot is free if there are $par or more processes
    do
        sleep 1
    done
    /usr/local/bin/genrep.pl "$i" "$dest" &         # run genrep.pl on file "$i" in the background
done
wait                                                # wait for the last background processes to finish
#!/bin/bash
for i in {1..10}
do
    # $8 is the filename field of "ls -l" output on this system
    list[$i]=$(ls -lS | awk "NR>$i" | awk 'NR%10==1{print $8}')
done
for i in {1..10}
do
    for file in ${list[$i]}; do genrep.pl "$file" ../test2; done &
done
The above is working for me now. Thanks!
The only drawback is that when I stop the script, I need another script to kill the background processes, but that's OK.
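That second kill script could be avoided with a trap that kills whatever background jobs are still running when the script is interrupted. A sketch, assuming bash (the slot-limiting logic from the earlier posts would go inside the loop):

```shell
#!/bin/bash
# On Ctrl-C or termination, kill any remaining background jobs before exiting.
trap 'kill $(jobs -p) 2>/dev/null; exit 130' INT TERM

for i in *
do
    genrep.pl "$i" ../test2 &    # same per-file job as above
done
wait
```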
LOL! Now I have to try Scrutinizer's post and compare results!
Thanks, all.
Initially Scrutinizer's script was slower, but I got rid of the sleep and it runs like a charm. The advantage is that cancelling the script is easy.
The Linux xargs has a feature to perform this kind of task:
--max-procs=max-procs, -P max-procs
       Run up to max-procs processes at a time; the default is 1.
       If max-procs is 0, xargs will run as many processes as
       possible at a time.
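Applied to this thread's job, that could look something like the following (a sketch; assumes an xargs with -P support, and filenames without embedded newlines):

```shell
#!/bin/sh
dest=/my/destination/directory
# -P 10 keeps up to ten copies of genrep.pl running at once;
# -I{} substitutes each filename read from stdin
ls | xargs -P 10 -I{} /usr/local/bin/genrep.pl {} "$dest"
```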
BTW, ksh93t has built-in support for automatically limiting the number of background jobs that run at the same time. See the JOBMAX parameter. You can also use the SIGCHLD trap to find out which background job has completed and get its exit status.
For completeness, here is the ksh93 release note on JOBMAX.
08-12-04 +SHOPT_BGX enables background job extensions. Noted by "J" in
the version string when enabled. (1) JOBMAX=n limits the number
of concurrent & jobs to n; the n+1 & job will block until a
running background job completes. (2) SIGCHLD traps are queued
so that each completing background job gets its own trap; $! is
set to the job pid and $? is set to the job exit status at the
beginning of the trap. (3) sleep -s added to sleep until the time
expires or until a signal is delivered.
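With that feature, the slot logic above collapses to a few lines (a sketch; requires a ksh93 built with SHOPT_BGX):

```shell
#!/bin/ksh93
dest=/my/destination/directory
JOBMAX=10                       # the 11th & job blocks until a running one completes
for i in *
do
    /usr/local/bin/genrep.pl "$i" "$dest" &
done
wait
```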