ksh behavior in scripts spawned w/nohup

I have a need to run any number of identical scripts simultaneously, so I've created a driver script which reads a template script, edits it appropriately, and then submits the copies via nohup. At some point each spawned script checks how many of its kind are running, and once the count has dropped to 2 (I'm grepping, so that means only 1 is still running) it should spawn the final script, which should only run once.

The variable holding the count is obtained with:

number_running=`ps -ef | grep simple | -wc l`

This number_running is always zero somehow.

Should it be:

number_running=`ps -ef | grep simple | wc -l`

Expanding a bit on what Jim has already said...

Unless you redirected diagnostic output to /dev/null, one would assume that the code highlighted above (the -wc l at the end of your pipeline) would produce a message similar to:

-ksh: -wc: not found

since most systems do not have a utility named -wc. One would guess that you intended to use:

number_running=`ps -ef | grep simple | wc -l`

Note that the output produced by wc -l may include leading <space>s (depending on what version of wc you're using). This may make a difference in the way you write the test that examines the value returned. And, although you could look for 2 instead of 1 to account for the grep command itself showing up in the ps output, there are ways to avoid that. One way is to add another grep to filter out the grep:

number_running=`ps -ef | grep simple | grep -v grep | wc -l`

but that is rather inefficient. A preferred way to do that is to use a BRE that will match the lines you want to match and avoid matching the grep command too:

number_running=`ps -ef | grep '[s]imple' | wc -l`

which will just return the number of processes that contain the string simple in their argument lists.
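
Putting those two points together, a sketch of the counting and test code (assuming ksh, where an integer assignment via typeset -i quietly absorbs any leading spaces wc produces) might look like:

# the bracket expression keeps grep's own entry in the ps listing from matching,
# and typeset -i forces an arithmetic assignment, so leading spaces from wc don't matter
typeset -i number_running=$(ps -ef | grep '[s]imple' | wc -l)
if [[ $number_running -eq 1 ]]
then
    echo "only one instance still running"
fi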

Ok... I ran a test, and on my Mac everything works as it should, but not when it's spawned via dbms_scheduler on Exadata....

This is the logic (it works).
This is the template: test.ksh

#!/bin/ksh
set -x
export PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin
MAIN_DIR="/var/root/scripts"
cd $MAIN_DIR
DATE=$(date)
HOSTNAME=$(hostname)
FILE_CFG=$MAIN_DIR/simple_INSTANCE.cfg
FILE_LOG=$MAIN_DIR/simple_INSTANCE.log
sleep $(print $((RANDOM%100+1)))
typeset -i number_running=`ps -ef | grep runit_ | grep -v grep | wc -l` 
if [[ $number_running -eq 1 ]]
  then
    echo $number_running > $FILE_LOG
    nohup ./run_once_last.ksh &
fi

This is the cfg file (simple.cfg)

NCDP x y z
NCDT q r s
EDBP a b c
JUNO d e f

This code reads the template, creates one ksh script per line in the cfg, and spawns each of them (runit.ksh):

#!/bin/ksh
set -x
export PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin
MAIN_DIR="/var/root/scripts"
cd $MAIN_DIR
DATE=$(date)
HOSTNAME=$(hostname)
FILE_CFG=$MAIN_DIR/simple.cfg
FILE_LOG=$MAIN_DIR/simple.log

function doit {
sed 's/INSTANCE/$1/g' < test.ksh > runit_$1.ksh
echo ${arr_cfg[@]}  > runit_$1.cfg
chmod +x runit_$1.ksh
nohup ./runit_$1.ksh &
}

cat ${FILE_CFG}|grep -v '#'|while read PARAMS
do
  set -A arr_cfg $PARAMS
  doit ${arr_cfg[0]} ${arr_cfg[@]}
done

Here's one genned:

#!/bin/ksh
set -x
export PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin
MAIN_DIR="/var/root/scripts"
cd $MAIN_DIR
DATE=$(date)
HOSTNAME=$(hostname)
FILE_CFG=$MAIN_DIR/simple_$1.cfg
FILE_LOG=$MAIN_DIR/simple_$1.log
sleep $(print $((RANDOM%100+1)))
typeset -i number_running=`ps -ef | grep runit_ | grep -v grep | wc -l` 
if [[ $number_running -eq 1 ]]
  then
    echo $number_running > $FILE_LOG
    nohup ./run_once_last.ksh &
fi

This is run_once_last.ksh

#!/bin/ksh
set -x
export PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin
MAIN_DIR="/var/root/scripts"
cd $MAIN_DIR
DATE=$(date)
HOSTNAME=$(hostname)
echo imdone > imdone.txt

It appears to work here, but when I schedule the job through Oracle's dbms_scheduler, which kicks off the base job, everything works except the count check that should kick off run_once_last.ksh.

Obviously this is just an example, but the real goal is to run the RMAN backups for all Oracle instances simultaneously while running only one tape backup, and only after the last backup finishes....

BTW, I appreciate the sounding board. I'm not much of a scripter, I just hack away...

There are lots of scripts involved here. All of them include set -x to enable tracing, but you haven't shown us any of the results of those traces.

You say it doesn't work, but you don't explain the symptoms of how it fails. Is the count falling to 0 without kicking off the last job? Does the count never fall below 2?

After you start this job, what does the output from ps -ef look like at some point before your code fails? (Is there some oracle job running with an argument that contains the string you're counting?)
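
If it helps, one way to capture that evidence (a hypothetical tweak to the generated scripts, not something they already contain) is to save the lines being counted right before the test:

# snapshot the processes the count is based on, for later inspection
ps -ef | grep runit_ | grep -v grep > $MAIN_DIR/ps_snapshot_$$.txt
typeset -i number_running=$(wc -l < $MAIN_DIR/ps_snapshot_$$.txt)

If some Oracle or dbms_scheduler process carries the string runit_ in its argument list, it will show up in that snapshot and explain why the count never drops to 1.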

Why use nohup?
Why generate scripts from scripts?

A simple while loop reading one configuration file in one script should (see the sketch after the list):

Spawn the required rman backup processes in the background.
After the while loop has finished (spawned the processes), use wait to wait for them to finish.
After that, invoke the backup to tape.
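
A minimal sketch of that approach, assuming the same simple.cfg layout and hypothetical backup_instance.ksh / tape_backup.ksh scripts standing in for the real rman and tape steps:

#!/bin/ksh
MAIN_DIR="/var/root/scripts"
cd $MAIN_DIR

# one background backup per non-comment line of the cfg
while read INSTANCE PARAMS
do
  [[ -z $INSTANCE || $INSTANCE == \#* ]] && continue   # skip blank and comment lines
  ./backup_instance.ksh $INSTANCE $PARAMS &             # hypothetical per-instance rman wrapper
done < simple.cfg

wait                 # returns only after every background backup has exited
./tape_backup.ksh    # hypothetical tape backup, run exactly once at the end

Because wait does not return until all of the shell's background children have exited, the counting logic and nohup are no longer needed.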

This is most rudimentary, and it will not check success of background processes.
But the code you posted only counts processes, so I guess that is not the requirement.

As I see it, you wish your program to know when the processes that back up a database on some filesystem are done or no longer exist, and then issue a tape backup of that filesystem.

Is this correct?

Regards
Peasant.

That's an excellent suggestion. I'm already using a wait; I'll try that Monday. That would simplify things. Oh, and the spawned scripts check their own success by checking the logs they've produced, grepping for errors and emailing a success or failure message. The only failures I've run into of late are with NetBackup's media manager hiccoughing and trying on a second channel.

Since we use ASM, we can only push to NetBackup; the people here didn't want to get the ZFS backup appliance, there's no NFS mount which can handle the backups, and we only have a 1Gb pipe to back up tens of terabytes of data. So I'm having to use a single channel per node to back up to ASM, compress that, and then push to NetBackup. If I ramp up 8 channels per node I end up creating hundreds of pieces in the backup sets, and then NetBackup's media manager treats each piece as a separate job (no multiplexing that way), and the positioning/repositioning greatly slows down the backup to tape (which is the biggest choke point). ...I added that as FYI background...

You appear to be running these scripts on a Mac. I believe OSX supports the pgrep/pkill commands, so you could simplify your process counting to

pgrep runit_ | wc -l
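
If pgrep is available on both the Mac and the Exadata box, that could drop straight into the existing counting line; a possible variant (pgrep never reports its own process, so the grep -v grep step isn't needed, and -f makes it match against the full argument list in case the scripts are listed under the interpreter's name):

typeset -i number_running=$(pgrep -f runit_ | wc -l)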

Andrew