Check hung process and restart

Hi all

I have NetWorker running on RHEL 5.7, and over time it hangs. The solution the backup team proposed is to check whether the process is hung and, if it is, to stop and start it.

Unfortunately for me, the rc script only accepts three commands: start, stop and status (there is no restart option). I put together the following script, but when I run it, even when NetWorker has been stopped, I get the OK message in my /var/log/messages. Why is that? Can someone please help me look into this? Where did I go wrong? Sorry for rushing this; they need this set up on the prod servers by COB today...

#!/bin/bash
cmdstop='/etc/rc.d/init.d/networker stop'
cmdstart='/etc/rc.d/init.d/networker start'

if [ "${?}" != 0 ] ; then
  echo "`date` CRITICAL:Networker hung, will be restarted" >>/var/log/messages
  $cmdstart
else
  echo "`date` OK:Networker running" >>/var/log/messages
fi
exit

This variable assignment:

cmdstart='/etc/rc.d/init.d/networker start'

always succeeds, so $? is always 0 when the if statement tests it, and the script always takes the else branch.
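
To see this for yourself, here is a minimal sketch. The variable names assign_status and false_status are made up for the demonstration, and false merely stands in for any command that fails:

```shell
#!/bin/bash
# An assignment only stores a string; the init script is NOT executed here.
cmdstart='/etc/rc.d/init.d/networker start'
assign_status=$?        # exit status of the assignment itself

# A command that really fails does leave a non-zero status behind.
false_status=0
false || false_status=$?

echo "status after assignment: $assign_status"
echo "status after a failing command: $false_status"
```

So $? only tells you something if a command that actually checks NetWorker ran immediately before the test.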

@Scrutinizer

But what if the process hangs? Will it still show 0?

It does not matter whether the process hangs. The script does not check for that, and it does not restart the process either; all it will ever do is write "`date` OK:Networker running" into the log.

Well, how do I incorporate stop and start into a script that looks for a hung process?
Sorry I am so lost!!

How would you characterize the state as "hung"?

Maybe a better approach is to stop and start it on a regular basis.

@otheus

I've attached a screenshot of what was done when we last found that the process was hung...

Is it a good idea to stop and start it at a regular interval instead? But how do I check with a command whether the process is hung? I use

ps auxw | grep db2vend

What does the status option do? Can you post the networker stop / start script?

@Scrutinizer

[root@H99 bin]# /etc/rc.d/init.d/networker start

[root@H99 bin]# /etc/rc.d/init.d/networker status
+--o nsrexecd (10762)

[root@H99 bin]# ps auxw | grep nsrexecd
root     10762  0.1  0.0 219884  8436 ?        Ssl  14:26   0:00 /usr/sbin/nsrexecd
root     11002  0.0  0.0  62924   776 pts/3    S+   14:26   0:00 grep nsrexecd

[root@H99 bin]# ps auxw | grep db2vend
db2s12     807  0.0  0.0 28549844 57192 ?      S    12:10   0:01 db2vend (db2logmgr.meth125S12))
root     11302  0.0  0.0  62928   784 pts/3    S+   14:34   0:00 grep db2vend
db2s12   30835  0.0  0.0 292396 49596 ?        S    11:25   0:00 db2vend (PD Vendor Process - 1)  

[root@H99 bin]# /etc/rc.d/init.d/networker stop
[root@H99 bin]# /etc/rc.d/init.d/networker status
nsr_shutdown: There are currently no running NetWorker processes.

and here's the script

[root@H99A100 bin]# more /etc/rc.d/init.d/networker
#! /bin/sh

# Copyright (c) 1990-2011, EMC Corporation 

# All rights reserved.

# chkconfig: 35 95 05
# description: EMC Networker. A backup and restoration software package.

### BEGIN INIT INFO
# Provides: networker
# Required-Start: syslog network
# Required-Stop: syslog network
# X-UnitedLinux-Should-Start: portmap
# Should-Start: portmap
# Default-Start: 3 5
# Default-Stop: 0 1 2 6
# Description: EMC Networker. A backup and restoration software package.
### END INIT INFO

case $1 in
    start)
        (echo 'starting NetWorker daemons:') > /dev/console
        LD_LIBRARY_PATH=/usr/lib/nsr/lib64:$LD_LIBRARY_PATH
        export LD_LIBRARY_PATH
        if [ -f /usr/sbin/nsrexecd ]; then
                if [ -f /usr/sbin/NetWorker.clustersvr ]; then
                        if [ -f /nsr.NetWorker.local -o \
                            -h /nsr.NetWorker.local ]; then
                                if [ -h /nsr ]; then
                                        rm -f /nsr
                                        ln -s /nsr.NetWorker.local /nsr
                                fi
                        fi
                fi
                (/usr/sbin/nsrexecd) 2>&1 | /usr/bin/tee /dev/console
                (echo ' nsrexecd') > /dev/console
        fi
        if [ -f /usr/sbin/lgtolmd ]; then
                (/usr/sbin/lgtolmd -p /nsr/lic -n 1) 2>&1 | \
                        /usr/bin/tee /dev/console
                (echo ' lgtolmd') > /dev/console
        fi
        if [ -f /usr/sbin/nsrd -a \
             ! -f /usr/sbin/NetWorker.clustersvr ]; then
                (/usr/sbin/nsrd) 2>&1 | /usr/bin/tee /dev/console
                (echo ' nsrd') > /dev/console
        fi
        ;;
    stop)
        (echo 'stopping NetWorker daemons:') > /dev/console
        if [ -f /usr/sbin/nsr_shutdown ]; then
                if [ -f /usr/sbin/NetWorker.clustersvr ]; then
                        (/usr/sbin/nsr_shutdown -q) 2>&1 | \
                                /usr/bin/tee /dev/console
                        (echo ' nsr_shutdown -q') > /dev/console
                else
                        (/usr/sbin/nsr_shutdown -q) 2>&1 | \
                                /usr/bin/tee /dev/console
                        (echo ' nsr_shutdown -q') > /dev/console
                fi
        fi
        ;;
    status)
        if [ -f /usr/sbin/nsr_shutdown ]; then
                /usr/sbin/nsr_shutdown -l
        fi
        ;;
    *)
        echo "usage: `basename $0` {start|stop|status}"
        ;;
esac

  1. That depends on the operating characteristics of the program. Drop a stop/start script in cron.daily/ and go from there.

  2. Again, how do you characterize when it is hung? That is, what symptoms indicate to you that it is hung?


I meant the content of the start/stop script...

I just added it...

---------- Post updated at 01:56 PM ---------- Previous update was at 01:39 PM ----------

Well, if we look at the screenshot I attached earlier, the process had been running since the 1st of May. That didn't look right when we grepped for it, because we had expected it to run and complete on the 1st itself. That is why NetWorker was restarted today at around 11 am.

I'm with you on the cron job. The idea is appealing to me more and more, and I have spoken to the backup chap about it, so we will definitely look into implementing this on the 3 servers.

/etc/rc.d/init.d/networker status

appears to just list the processes. Probably the best thing to do in your script is to just issue a

/etc/rc.d/init.d/networker stop
/etc/rc.d/init.d/networker start

If that does not work, then perhaps /usr/sbin/nsr_shutdown has a -f option, which would probably be better than kill. Consult your manual and/or your NetWorker support organization.
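
As a rough sketch of that stop/start approach for cron, something like the following could work. The function name restart_networker, the INIT and RESTART_PAUSE variables and the 5-second pause are my own assumptions, not anything NetWorker provides; only the init script path comes from this thread:

```shell
#!/bin/bash
# Path of the init script posted above; can be overridden from the environment.
INIT=${INIT:-/etc/rc.d/init.d/networker}

# Stop the daemons, wait a little, then start them again.
# The pause length is an assumption; adjust it to how long shutdown takes.
restart_networker() {
    "$INIT" stop
    sleep "${RESTART_PAUSE:-5}"
    "$INIT" start
}
```

Add a final line that calls restart_networker and drop the file into cron.daily/ to get an unconditional daily restart.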


Hi @scrutinizer, otheus

Thank you so much for your help

I am putting this in a cron job:

#!/bin/bash
STOPCMD='service networker stop'
STARTCMD='service networker start'
PROCESS='nsrexecd'

# Start NetWorker only if no process matching $PROCESS shows up in ps
if ps auxw | grep -v grep | grep "$PROCESS" > /dev/null
then
  echo "`date` Process Networker is running" >>/var/log/messages
else
  echo "`date` Process Networker not running and will be started" >>/var/log/messages
  $STARTCMD
fi
exit

Since this is Linux, you should be able to just do:

if pgrep "$PROCESS" > /dev/null

Also, I highly recommend you avail yourself of the logger command, since writing directly to the messages file bypasses syslog and is a breach of Unix best practices.
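
Putting pgrep and logger together, a minimal sketch might look like this. The function names is_running and ensure_running and the tag networker-check are made up for illustration; pgrep -x and logger -t are the standard options for an exact name match and a message tag:

```shell
#!/bin/bash
# Return 0 if a process whose name is exactly $1 exists.
is_running() {
    pgrep -x "$1" > /dev/null
}

# ensure_running NAME CMD [ARGS...]
# Runs CMD only when NAME is not running; messages go through syslog.
ensure_running() {
    local name=$1
    shift
    if is_running "$name"; then
        logger -t networker-check "OK: $name is running"
    else
        logger -t networker-check "CRITICAL: $name not running, starting it"
        "$@"
    fi
}
```

From the cron script you would then call, for example, ensure_running nsrexecd service networker start. Note that this only detects a missing process, not a hung one.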

thanks!

I wasn't aware of logger
