Running 4 jobs across 4 machines simultaneously using 'parallel' command

ckmehta · August 22, 2023, 8:27pm

I have tried running the following parallel command (and variations of it) to call a script, hoping to run 4 items truly in parallel, but my results are decidedly serial. For example purposes, I made the following test script, which runs for roughly 5 seconds on each host so the timing is easy to read.

${testDir}/script/test.sh 
#!/bin/bash
hostname;date;sleep 5;date;crontab -l

PARALLEL CMD:

#!/bin/bash
export testDir=/my/test/dir
parallel --no-run-if-empty --will-cite --onall --keep-order --joblog ${testDir}/jobLog.txt --results ${testDir}/resultsDir  --jobs 0 "ssh -qt {} ${testDir}/script/test.sh" ::: LNXVP51 LNXVP52 LNXVP62 LNXVP63

RESULTS:


LNXVP51
Fri Aug 18 15:09:11 EDT 2023
Fri Aug 18 15:09:16 EDT 2023
50 2 * * 0 /home/user/script/svcCtrl.sh stop
LNXVP52
Fri Aug 18 15:09:19 EDT 2023
Fri Aug 18 15:09:25 EDT 2023
50 2 * * 0 /home/user/script/svcCtrl.sh stop
LNXVP62
Fri Aug 18 15:09:29 EDT 2023
Fri Aug 18 15:09:34 EDT 2023
50 2 * * 0 /home/user/script/svcCtrl.sh stop
LNXVP63
Fri Aug 18 15:09:35 EDT 2023
Fri Aug 18 15:09:40 EDT 2023
50 2 * * 0 /home/user/script/svcCtrl.sh stop

I was expecting all 4 nodes to show approximately the same results like the following, give or take 1-2 seconds depending on system load

<HOSTNAME>
Fri Aug 18 15:09:11 EDT 2023
Fri Aug 18 15:09:16 EDT 2023
50 2 * * 0 /home/user/script/svcCtrl.sh stop

I tried the parallel CMD with:
--jobs 0 (supposedly uses max # of processes, I would think 4 is safe)
--jobs 4 (the exact # of nodes)
(omitted --jobs argument)

But every time I run this, it serially goes through one node at a time.

Am I missing something?

drysdalk · August 23, 2023, 7:02am

Hello,

I have to admit, I've never really used the parallel command before, so I've been playing about this morning for a while to see if I can replicate the issue. And I think I might know what's going on here.

So, to test, I used a command very similar to your own, specifically this:

/usr/bin/parallel --no-run-if-empty --will-cite --onall --keep-order --jobs 0 'ssh {} "date ; sleep 10"' ::: host1 host2 host3 host4

And initially, using that command, I do indeed see the commands starting one after the other, like you observe yourself:

$ time /usr/bin/parallel --no-run-if-empty --will-cite --onall --keep-order --jobs 0 'ssh {} "date ; sleep 10"' ::: host1 host2 host3 host4
Wed Aug 23 07:34:38 BST 2023
Wed Aug 23 07:34:48 BST 2023
Wed Aug 23 07:34:58 BST 2023
Wed Aug 23 07:35:08 BST 2023

real    0m41.452s
user    0m0.429s
sys     0m0.157s
$

Here, we can clearly see that each command started ten seconds after the previous one. In other words, parallel only executed the next command after the previous command had finished. Otherwise, we'd expect the output of date to be identical (give or take), since the date command would have run near-simultaneously on each host.

And just in case we were in any doubt, my use of time here to show the total execution time clearly shows this entire command taking just over 40 seconds: exactly what we'd expect if it ran them all serially, and not the roughly 10 seconds we'd expect if they were truly being run in parallel.
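If you want to see the same serial-versus-parallel timing difference without involving any remote hosts, here's a minimal local sketch. The job function is just a hypothetical stand-in (a 1-second sleep rather than ssh plus a 10-second sleep), and it uses bash's built-in SECONDS counter, which only has whole-second resolution:

```shell
#!/bin/bash
# Hypothetical local stand-in for one remote job (the thread uses ssh + sleep 10).
job() { sleep 1; }

# Serial: each job starts only after the previous one has finished.
SECONDS=0
for i in 1 2 3 4; do job; done
serial_elapsed=$SECONDS

# Parallel: background all four jobs, then wait for every one of them.
SECONDS=0
for i in 1 2 3 4; do job & done
wait
parallel_elapsed=$SECONDS

echo "serial:   ${serial_elapsed}s"
echo "parallel: ${parallel_elapsed}s"
```

The serial figure should come out at around four times the length of one job, and the parallel figure at around the length of a single job, which is the same pattern as the 40-versus-10-second expectation above.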

So I played about with the options to parallel for a while, and I think I've found the culprit. Let's see what we get if we omit the --onall flag:

$ time /usr/bin/parallel --no-run-if-empty --will-cite --keep-order --jobs 0 'ssh {} "date ; sleep 10"' ::: host1 host2 host3 host4
Wed Aug 23 07:42:00 BST 2023
Wed Aug 23 07:42:00 BST 2023
Wed Aug 23 07:42:00 BST 2023
Wed Aug 23 07:42:00 BST 2023

real    0m10.434s
user    0m0.217s
sys     0m0.095s
$

And there we go - true near-parallel execution happening there. So if you could please try the same and let us know the outcome, hopefully that will do the trick for you.

As to why that flag causes this serial rather than parallel execution: let's see what the man page for parallel has to say about this flag:

  --onall (beta testing)
                Run all the jobs on all computers given with --sshlogin. GNU parallel will log into --jobs number of computers in parallel and run one job at
                a time on the computer. The order of the jobs will not be changed, but some computers may finish before others.

                When using --group the output will be grouped by each server, so all the output from one server will be grouped together.

                --joblog will contain an entry for each job on each server, so there will be several job sequence 1.

So I think what's happening here is that with --onall, parallel runs only one job at a time on each computer it has been told about. We haven't given it any hosts via --sshlogin (our ssh is inside the command we're parallelising), so the only computer it knows about is the local machine, and it runs our four ssh commands there one after the other. When we omit this flag, parallel treats them as ordinary local jobs and starts them all at once, like we'd expect.

As an aside: is there a reason you don't want to use something simpler and solely shell-driven, like (for example) this:

#!/bin/bash

for host in host1 host2 host3 host4
do
        /usr/bin/ssh "$host" "/bin/date ; /usr/bin/sleep 10" &
done

wait

This script executes the SSH sessions to the four hosts in parallel (since it backgrounds each ssh command when it runs it), but still waits for all child processes to completely exit before the script itself exits (the meaning of the wait command at the end).
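If you also want each host's output kept separate and printed in host order, plus an exit status per host, the same backgrounding idea can be extended along these lines. This is only a sketch: run_on_host is a hypothetical stand-in that sleeps locally so the script can be tried without any remote machines, and the temporary output directory is my own assumption about where you'd want the files.

```shell
#!/bin/bash
# Hypothetical stand-in for: ssh "$host" "date ; sleep 10"
# Swap the body for the real ssh call when using actual hosts.
run_on_host() {
    local host=$1
    echo "$host started"
    sleep 1
    echo "$host finished"
}

hosts=(host1 host2 host3 host4)
outdir=$(mktemp -d)     # one stdout and one stderr file per host
pids=()

# Start every job in the background, remembering its PID.
for host in "${hosts[@]}"; do
    run_on_host "$host" >"$outdir/$host.out" 2>"$outdir/$host.err" &
    pids+=("$!")
done

# Wait for each PID in turn so each job's exit status can be recorded,
# then print that host's output. Printing in array order keeps the
# output in the order the hosts were listed.
failed=0
for i in "${!hosts[@]}"; do
    wait "${pids[$i]}"
    status=$?
    echo "== ${hosts[$i]} (exit $status) =="
    cat "$outdir/${hosts[$i]}.out"
    [ "$status" -eq 0 ] || failed=$((failed + 1))
done

echo "failed jobs: $failed"
```

Waiting on each PID individually, rather than using a bare wait, is what makes the per-host exit status available. All the jobs still run at the same time, because they were all started before the first wait.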

Anyway, hope at least some of this helps ! If you could let us know how you get on, then we can take things from there.

bendingrodriguez · August 23, 2023, 7:43am

Hi @ckmehta,

GNU parallel has tons of options and is not a simple tool. For ssh there are simpler ones, like parallel-ssh from the pssh package (available on Red Hat, Debian, etc.):

$ time parallel-ssh -o outdir -H host1 -H host2 -H host3 "hostname; date; sleep 4.2; date"

For other options see parallel-ssh --help and man parallel-ssh.

Of course you can do this with pure shell, too, as @drysdalk already suggested.

ckmehta · August 23, 2023, 5:06pm

@drysdalk , THANKS!!

Removing "--onall" made /bin/parallel work as I needed it to.

Agreed that /bin/parallel is a rather complicated beast, but it came up in my search for ways to run things in parallel in my scripts (including some that aren't SSH-based). It also has some other features that have proved genuinely useful, such as:
--keep-order : Prints results in the order of the requested parallel elements (in this case machines)
--joblog : Summary job log file with start time, runtime, and exit value and signal for each job
--results : Captures stdout and stderr for each job in separate files per parallel element (in this case machines)
