Running jobs in parallel

gctex · April 6, 2011, 5:29pm

I need to process 50 sqlplus scripts which are listed in a text file. I need to develop a shell script that'll read this file and run these sqlplus scripts. At any point of time, the number of sqlplus scripts running shouldn't exceed 6. If any of the sqlplus scripts completes successfully then another one needs to kick off, till all scripts are processed.

How would I go about doing this please?

My idea is to have 2 loops. One looping through the sql list and another one checking the number sql sessions opened at any time.

Say, the first loop kicks off the first 6 sqls. Now the sqls read and running are the same : 6. After some time, 2 sqls get done. Now the sqls read is 6 but the sqls running is 4. So the first loop should read the next two sqls, 7 & 8, and execute them.

However, I am not able to put together a shell script for this. Any help would be much appreciated!

Corona688 · April 6, 2011, 5:41pm

What's your system? What's your shell?

If you have bash or ksh, you can do something like:

#!/bin/bash

PROCS=5       ;       WAIT=0  ;       END=0

# Wait for next thread.
function waitnext
{       # Needs BASH/KSH
        wait "${PIDS[$(( (WAIT++) % PROCS))]}"
}

while read LINE
do
        [ "$((END-WAIT))" -ge "$PROCS" ] && waitnext
        stuff "$LINE" &
        PIDS[$(( (END++) % PROCS ))]=$!
done < textfile
# wait for ALL remaining processes
wait

I don't know of a way to make a shell wait for any particular single process though. Either you give it the PID and it waits for that exact process, or you do a general 'wait' and wait for ALL of them. So this has to wait for them in order.

Meaning if one program takes 10x longer than the rest it might end up waiting for that long one before it launches another 5 jobs.

methyl · April 6, 2011, 6:11pm

Further to Corona688's last word. See the Shell "jobs" command. You can use this in most Bourne-type Shells to determine how many background jobs are running.

Please state what Shell you are using.

gctex · April 6, 2011, 6:16pm

Wow, that was a super fast response! thank you ...

System is SunOS 5.10 Generic_144488-06, shell is ksh.

Yeah, I tried with "wait" before but I found that since "wait" waits for the successful completion of jobs, if a script takes too long, then it will block the next batch of scripts from getting executed. That's not what I am looking for.

Btw, what does "stuff "$LINE" &" mean?

methyl · April 6, 2011, 6:41pm

It is pseudocode to describe your process which runs a sqlplus program in your environment. We cannot guess what it is called or what it contains.
The "&" of course backgrounds the command.

My turn for a Btw.
A sequence of commands written in sqlplus is called a program.
A sequence of commands written in unix Shell is called a script.
There is no such thing as a sqlplus script.

Chubler_XL · April 6, 2011, 7:26pm

Using jobs to count background jobs running:

#!/bin/ksh
PROCS=6
SLEEP=20
while read LINE
do
     while true
     do
         NUM=$(jobs | wc -l)
         echo NUM=$NUM
         if [ $NUM -lt $PROCS ]
         then
             stuff "$LINE" &
             break
         else
             sleep $SLEEP
         fi
    done
done < textfile
# wait for ALL remaining processes
wait

gctex · April 6, 2011, 7:32pm

methyl, you are absolutely wrong about this: "A sequence of commands written in sqlplus is called a program."

There were always sql scripts and always will be. A set of sql commands run through sql*plus is called by us Oracle DBAs as sqlplus script (or sql script for short).

But that's besides the point. Do you have any sample code to post like Corona688 did?

Chubler_XL · April 6, 2011, 7:38pm

Well I think post #6 is pretty much what methyl suggested.

gctex · April 6, 2011, 8:09pm

Thanks Chubler for the code. I ran it a couple of times and I got the same issue. Hits a long running sql script and ignores the others.

This is the output I was getting:

NUM= 0
[1] 3278
NUM= 1
[2] 3282
NUM= 2
[3] 3286
NUM= 3
[4] 3290
NUM= 4
[5] 3294
NUM= 5
[6] 3298
NUM= 6
NUM= 6
NUM= 6
NUM= 6
NUM= 6
NUM= 6
..........

It gets stuck at 6 and doesn't proceed further. I check the sqlplus jobs and I find nothing is running:

ps -ef|grep "sqlplus -s"

oracle 4341 15634 0 20:08:09 pts/1 0:00 grep sqlplus -s

methyl · April 6, 2011, 8:28pm

@gctex
Please post the complete and unedited code (after blotting confidential information with X's) which produced the output in post #9.

Ps. I suspect that I first became a certified Oracle DBA before you were born. There is much popular usage of basic computer terms in general circulation. In the context of your posts it is impossible to determine whether you are referring to a program written in Oracle SQL or a unix Script. In my above request I use the word "code" to refer to either.

Chubler_XL · April 6, 2011, 8:39pm

@gctex, From the output (I left the NUM=$NUM debugging line in by mistake), we can see it has fired off 6 jobs and the fact that NUM=6 is poping up means the jobs are still running.

The process ID's started were 3278, 3282, 3286, 3290, 3294 and 3299

I dont know why your ps -ef | grep "sqlplus -s" didn't show them. Next time check the individual process with something like ps -fp 3278

gctex · April 6, 2011, 9:47pm

@chubler, is there a quick way to trap "ksh: cnt=: bad number" errors?

My code is working just fine, just a couple of places it throws this bad number error, I need to see what value is being assigned, if it is not a number.

Chubler_XL · April 6, 2011, 9:55pm

Would help if you could show the code your running. I'd assume you are trying to do something with a variable called cnt, to test if cnt has a valid numeric try something like:

[ $cnt -gt 0 -o $cnt -le 0 ] 2> /dev/null || echo "cnt has bad numeric value of: $cnt"

gctex · April 6, 2011, 10:39pm

@Chubler, you are right, I was trying to find the number of sql processes running and assigning that to cnt variable. Here's the code:

while read sql
do

stuff "sqlplus"

let cnt=$cnt\+1

    if \(\( $cnt == 6 \)\) ; then
        while true
        do
            if \(\( $\(ps -ef|grep "sqlplus -s"|grep -v grep|wc -l\) &gt;= 6 \)\); then
            sleep 5
            elif \(\( $\(ps -ef|grep "sqlplus -s"|grep -v grep|wc -l\) &lt; 6 \)\); then
            let cnt=$\(ps -ef|grep "sqlplus -s"|grep -v grep|wc -l\)
            [ $cnt -gt 0 -o $cnt -le 0 ] 2&gt; /dev/null | echo "cnt has bad numeric value of : $cnt"
            break
            fi
        done
    fi

done < sqlscript.lst

(In your code snippet, do we need 2 pipes? I removed one, it worked fine)

This was the output:

[1] 5970
[2] 5972
[3] 5974
[4] 5975
[5] 5978
[6] 5980
ksh: cnt=: bad number
cnt has bad numeric value of : 0
[7] 6000
[8] 6002
[9] 6004
[10] 6005
[11] 6007
[12] 6011
ksh: cnt=: bad number
cnt has bad numeric value of : 0

But our error trap should print $cnt only if $cnt -gt or -lt 0, right? Not sure why it is printing when $cnt = 0.

And it looks like the error appears when the next batch of 6 sqls are processed, starting with the second batch.

Chubler_XL · April 6, 2011, 11:03pm

The below code is still returning zero when you have 6 processes running

ps -ef|grep "sqlplus -s"|grep -v grep|wc -l

You can see that the number of jobs goes to [6] in the output, you then get the error about cnt=0 and then in jumps to [12].

My code was doing the correct thing to start with. It started 6 jobs and was was waiting for one of them to finish. You did ps -ef | grep "sqlplus -s" and didn't see anything so you assumed no jobs were started. The issue was that the processes aren't being picked up by the grep search.

As I suggested before start a job and then do ps -fp <pid> to find out what the command line is, this will help debug why grep "sqlplus -s" is not matching it.

The test on numeric value should have || this is "or" in shell not "pipe".

gctex · April 6, 2011, 11:22pm

Oh ok, got it about || ... forgot about it ...

For me "ps -ef|grep "sqlplus -s"|grep -v grep|wc -l" is showing the number of sessions running correctly. It shows 6 processes or less when the script is running. Not sure why you are getting 0.

Your code with "jobs" and "wait" just didn't go beyond the first 6 sqls for me. This whole job involving 50 tables takes only 2 minutes but when running your code it processed only 6 sqls and just didn't move further even after 5 minutes. I had to kill (ctrl+C) it everytime. I just tried it again and it gives the same results, only 6 sqls are processed and 6 logfiles are generated.

Chubler_XL · April 6, 2011, 11:32pm

If you feel like debugging it just run jobs with output to the terminal before the sleep:

#!/bin/ksh
PROCS=6
SLEEP=20
while read LINE
do
     while true
     do
         NUM=$(jobs | wc -l)
         echo NUM=$NUM
         if [ $NUM -lt $PROCS ]
         then
             stuff "$LINE" &
             break
         else
             jobs
             echo "========"
             sleep $SLEEP
         fi
    done
done < textfile
# wait for ALL remaining processes
wait

It works fine if I run the sleep command as my background task, but I'm curious about what in your environment is causing jobs to still keep reporting stuff thats finished.

gctex · April 7, 2011, 8:48am

Yeah Chubler, something in our envirorment does behave weird. If I replace "ps -ef" with "jobs" it just doesn't go beyond the first 6 sqls. Anyways, "ps -ef" works fine for me, occasional bad number error is there, when the "ps -ef" count is 0.

Thanks a great lot