I have built the following script to check if processes supplied by the argument are running or not.
#!/bin/bash
PROCLIST=$1
PROCESS="0"
ERROR_PROCS=""
IFS='+'
read -ra ADDR <<< "$PROCLIST"
for PROC in "${ADDR[@]}"; do
if [ `ps ax | grep $PROC | grep -v grep | wc -l` -lt 1 ]; then
PROCESS=1
ERROR_PROCS="$ERROR_PROCS""$PROC ";
fi
done
if [ $PROCESS -eq 1 ]; then
echo "CRITICAL - One or more processes ($ERROR_PROCS) not running"
exit 2
fi
echo "OK - All monitored processes are running. Process: $PROCLIST"
exit 0
It seems to work fine, apart from the fact that the ps ax | grep "process" | wc -l pipeline gives a higher count than expected.
For example, if we take the process named "test" (which doesn't exist), it returns a count of 2.
Did you consider "false positives"? Processes with the search string as part of the command (fittest, hottest, testcase)? E.g. grep man would show mman and manager on my system.
If we run the script, I get a value of 3: 1 for the process itself, plus 2 more. I'm not sure why the bash script is producing the extra count.
I could simply subtract the extra value, but that doesn't make sense.
Strange. I vaguely remember we had a similar problem quite some time ago, but can't find the solution.
For debugging, in the script, echo the variables, and run the ps ax | ... pipe on its own to see its result.
So, if master is running, you get a count of 1 for the process you're looking for, an additional 1 because bash is running ./test2.sh master, and a third 1 because you're running bash -x ./test2.sh master.
Sometimes it is easier to debug things like this by changing:
[ `ps ax | tee step1 | grep $PROC | tee step2 | grep -v grep | tee step3 | wc -l` -lt 1 ]
and examine the contents of the files step1, step2, and step3 to see what processes were matched that you hadn't expected.
As RudiC suggested, using ps -ax -ocomm gets rid of the problem here. But adding tees in a pipeline frequently helps when shortcuts like -o comm don't apply.
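To make the -o comm idea concrete, here is a minimal sketch of the check using whole-line matching on command names. The helper name missing_procs is hypothetical, and reading the process names from stdin is my own choice so the function can be tried against canned input; neither comes from the original script.

```bash
#!/bin/bash
# Sketch: print the names from a '+'-separated list that have no
# exactly matching command name. missing_procs is a hypothetical helper.
missing_procs() {
    local list=$1 running name missing=""
    local -a names
    running=$(cat)                       # command names, one per line, on stdin
    IFS='+' read -ra names <<< "$list"   # IFS is changed for this read only
    for name in "${names[@]}"; do
        # -F fixed string, -x whole line must match, -q no output
        if ! grep -Fxq -- "$name" <<< "$running"; then
            missing+="$name "
        fi
    done
    printf '%s\n' "${missing% }"
}

# Typical call against the live process table:
#   missing_procs "sshd+cron" < <(ps -e -o comm=)
```

Because only the command name is compared, the script's own command line (./test2.sh master) no longer counts as a match, and neither does fittest when looking for test. One caveat: on Linux the comm field is cut off at 15 characters, so longer names need a different approach.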
#!/bin/bash
plist=${1//+/$'\n'}
TMP=$(fgrep -vxf <(ps -eo comm= | fgrep -x "$plist") <<< "$plist")
if [ -n "$TMP" ]
then
echo "processes not running:"
echo "$TMP"
else
echo "Ok"
fi
Note that [ $TMP ] is not robust in case $TMP contains shell-special characters or test operators like -n or =.
So it should be quoted and prefixed with the -n operator. [[ $TMP ]] might be safe as well.
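A small illustration of that pitfall, using a made-up value for TMP that happens to look like a comparison:

```bash
#!/bin/bash
# Made-up value: non-empty, but it word-splits into "a", "=", "b"
TMP="a = b"

# Unquoted: the test becomes [ a = b ], a string comparison that is false
if [ $TMP ]; then unquoted=nonempty; else unquoted=empty; fi

# Quoted with -n: a plain "is the string non-empty?" test
if [ -n "$TMP" ]; then quoted=nonempty; else quoted=empty; fi

echo "unquoted: $unquoted, quoted: $quoted"
# prints: unquoted: empty, quoted: nonempty
```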
If pgrep isn't what you want, could you make an expression from the item you are searching for? I avoid using a construct like ps -ef | grep this | grep -v grep by writing it as ps -ef | grep -E "thi[s]", so the expression does not match its own process. If you are passing it a loop of items to check, it could get a bit fiddly, but with variable substitution you could achieve it, perhaps like this:
set -x
for PROC in "${ADDR[@]}"; do
   PROC_a="${PROC%?}"                     # Chop off last character
   PROC_b="${PROC#"$PROC_a"}"             # Work out the last character
   PROC_E="${PROC_a}[${PROC_b}]"          # Assemble expression, e.g. thi[s]
   if ! ps ax | grep -Eq "$PROC_E"; then  # True when grep finds no matching process
      PROCESS=1
      ERROR_PROCS="${ERROR_PROCS} ${PROC}"
   fi
done
echo "Failed to find ${ERROR_PROCS}"
set +x
You still might have to be careful because there is a risk of false positives, e.g. someone stops a service called MAINPROC (so no processes with that name are running) but then edits the file /var/log/MAINPROC; the editor command shows up as a process matching your search, and you therefore think the service is still running.
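Since pgrep came up, here is a rough sketch of how it could sidestep that particular false positive. The helper name is_running is hypothetical, and MAINPROC is just the example name from above.

```bash
#!/bin/bash
# Sketch only: is_running is a hypothetical helper name.
is_running() {
    # -x requires the process name to match the pattern exactly;
    # without -f, pgrep looks at the process name, not the full command line.
    pgrep -x "$1" > /dev/null
}

if is_running "MAINPROC"; then
    echo "MAINPROC is running"
else
    echo "MAINPROC is not running"
fi
```

An editor opened on /var/log/MAINPROC would not match here, because its process name is the editor's, not MAINPROC. Keep in mind that pgrep treats the argument as a regular expression, so names containing characters like . or + may need escaping.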
Can you tell us more about the processes you are looking for? There might be a better way to check for them. Perhaps if they write their process ID to a file in /var/run/name, then you can read that file and make sure the process is what it should be.
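A minimal sketch of such a PID-file check, assuming the file holds just the PID on its first line. The function name pidfile_ok and the optional expected-name argument are my own invention for illustration.

```bash
#!/bin/bash
# Sketch: succeed if the PID stored in a pid file is alive and,
# optionally, carries the expected command name.
# pidfile_ok is a hypothetical helper; adjust the path to your service.
pidfile_ok() {
    local pidfile=$1 expected=${2:-} pid="" actual
    [ -r "$pidfile" ] || return 1
    read -r pid < "$pidfile"
    [[ $pid =~ ^[0-9]+$ ]] || return 1         # reject empty or garbage content
    # kill -0 sends no signal; it only tests that the PID exists
    # and that we are allowed to signal it
    kill -0 "$pid" 2>/dev/null || return 1
    [ -z "$expected" ] && return 0
    # Compare the command name of that PID with what we expect
    actual=$(ps -p "$pid" -o comm= 2>/dev/null)
    [ "$actual" = "$expected" ]
}

# Example call (path and name are placeholders):
#   pidfile_ok /var/run/name MAINPROC || echo "CRITICAL - MAINPROC not running"
```

Checking the name as well as the PID guards against a stale file whose PID has since been reused by an unrelated process. Note that kill -0 also fails for a live process owned by another user, so the check should run with suitable privileges.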
It depends how far you want to push this. Your processes might respond to a signal to say that they are running okay, for instance, and you could actually give them a nudge to make sure that they are happy and not stuck in a loop. Or they could frequently rewrite a file with the current date (best in date +%s format); if it is out of date by too long (you decide what is too long, compared to the current date +%s value), then raise an alert.
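The timestamp-file idea could look roughly like this, assuming the monitored process writes the output of date +%s to a file of your choosing. The function name heartbeat_fresh and the example path are hypothetical.

```bash
#!/bin/bash
# Sketch: succeed if the epoch timestamp stored in a heartbeat file
# is no older than max_age seconds. heartbeat_fresh is a hypothetical helper.
heartbeat_fresh() {
    local file=$1 max_age=$2 stamp="" now
    [ -r "$file" ] || return 1
    read -r stamp < "$file"
    [[ $stamp =~ ^[0-9]+$ ]] || return 1    # reject empty or garbage content
    now=$(date +%s)
    (( now - stamp <= max_age ))            # exit status is the comparison result
}

# Example call (path and limit are placeholders):
#   heartbeat_fresh /var/run/name.heartbeat 300 || echo "CRITICAL - heartbeat is stale"
```

Unlike a process-table lookup, this also catches a process that is still present but stuck, since a hung process stops updating the file.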