Background process, return code and pid.

RECrerar · August 12, 2011, 12:38pm

Hey all,

Okay, this one is tricky and I'm not sure there is a nice way to do it, or indeed any way to do it. The main issue revolves around timing out a hung ssh. I am doing this by creating a wrapper script for the ssh.

My requirements are:

1. Definable timeout period.
2. If the timeout period completes and ssh is still running, kill it.
3. Provide a return code as if the ssh had not been run from a wrapper script.
4. Multiple instances of this wrapper script can run at the same time.

By point 3 I mean, say the original ssh was

$REMOTE_INT_IP ls $ms_billing_dir/$reg_file

I want my wrapped ssh (currently called safe_ssh) to return the return code of the ls, not just whether or not the ssh completed.

Sounds simple, right? However, what I have found is that I can either kill the ssh or get a meaningful return code, but trying to do both is nigh impossible (at least with my level of scripting).

The status so far
I currently have a script that will time out the ssh but can only return whether I killed the ssh or whether it exited of its own accord.

###############################################################################
# safe_ssh                                                                    #
#                                                                             #
# Wrapper script for ssh providing a timeout function in the situation that   #
# the ssh hangs.                                                              #
#                                                                             #
# $1    - Timeout duration in seconds.                                        #
# $2+   - ssh parameters.                                                     #
###############################################################################
safe_ssh()
{
  #############################################################################
  # Check that the first parameter is an integer.                             #
  #############################################################################
  if [[ ! -z $(echo $1 | sed 's/[0-9]//g') ]]
  then
    echo "Usage error safe_ssh"
    echo "The first parameter must be an integer representing the timeout "
    echo "duration in seconds."
    exit 1
  fi

  #############################################################################
  # Set up a sleep thread that will run in the background and simply sleep    #
  # for the requisite number of seconds.                                      #
  #############################################################################
  sleep $1 &
  sleep_pid=$!

  #############################################################################
  # Set up the SSH thread to run the command. This will also run in the       #
  # background.                                                               #
  #############################################################################
  shift
  ssh "$@" &
  ssh_pid=$!

  #############################################################################
  # Loop until either thread has completed.  ps -p returns 0 while the given  #
  # PID is still running.  If a thread has completed, check whether the other #
  # is still running and if so terminate it.                                  #
  #############################################################################
  while :
  do
    ps -p $ssh_pid > /dev/null 2>&1
    if [ 0 = $? ]
    then
      #########################################################################
      # The ssh command is still running, check if sleep has exited and if so #
      # kill the ssh command.                                                 #
      #########################################################################
      ps -p $sleep_pid > /dev/null 2>&1
      if [ 1 = $? ]
      then
        kill -15 $ssh_pid
        exit 1
      fi
    else
      #########################################################################
      # ssh has exited.  If the sleep thread is still running, kill it.       #
      #########################################################################
      ps -p $sleep_pid > /dev/null 2>&1
      if [ 0 = $? ]
      then
        kill -15 $sleep_pid
      fi
      # Exit whether or not sleep was still running; otherwise the loop would
      # spin forever if both processes had already finished.
      exit 0
    fi
  done
}

The problem
I can get the return code of the ssh if I echo it into a temporary file from the background process and then read that file in the main process. For example something like:

(ssh "$@"; exit_code=$?; echo $exit_code > /tmp/ssh_exit_code) &
ssh_pid=$!

However,

In this case, $ssh_pid is no longer the PID of the ssh itself but of the whole background subshell, meaning that I can no longer cleanly kill the ssh as I don't know its PID.
I need the file to have a unique file name in case there are multiple instances of the script running so that I can read the correct file from the main script. For this I was thinking of including ssh_pid in the file name.

I thought about echoing the ssh PID into the temp file as well, but this will not work: the steps that write to the temp file are not executed until the ssh has completed, and of course it won't have completed if it has hung, which is exactly the situation in which we want to kill it.
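One possible way around both problems, sketched here with a stand-in command rather than a real ssh: if the command is started directly in the background (without a wrapping subshell), `$!` is the PID of the command itself, and the shell's `wait` builtin returns that command's exit status, so no temporary file is needed. The `sh -c '...'` child below is only a placeholder for the ssh call, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Stand-in for "ssh host command": a child that exits with status 3.
sh -c 'sleep 1; exit 3' &
cmd_pid=$!          # PID of the command itself, so it can be used with kill

# wait blocks until that PID has exited and returns its exit status.
wait "$cmd_pid"
cmd_status=$?

echo "pid=$cmd_pid status=$cmd_status"
```

Because `wait` is given a specific PID, other background jobs in the same script do not affect the status it returns.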

I hope this vaguely makes sense. Sorry it is a bit convoluted. If you need any clarifications please do ask.

Thanks a lot
Robyn

---------- Post updated at 05:38 PM ---------- Previous update was at 04:41 PM ----------

Okay,

I think I have come up with an idea and it is as follows.

1. Set a sleep thread running in the background. (When this sleep thread completes, it reads a temporary file for the ssh_pid and kills the ssh.)
2. Set up the ssh thread in the background.
3. Echo the PID of the background ssh to the temporary file.
4. Wait on the ssh.
5. Once the wait is complete, kill the sleep thread if it exists.

The temporary file will be named with the parent PID so all child processes can determine what its name is.

This way:
If the ssh finishes without hanging, the wait will provide the background process return code.
If the ssh hangs there will have been plenty of time to write its PID to the temporary file and hence the sleep thread can kill it when it exits.
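The plan above can be sketched roughly as follows. This is a minimal illustration under a few assumptions, not a tested replacement for safe_ssh: the function name `run_with_timeout` and its variables are hypothetical, the command is started before the watchdog so the watchdog subshell simply inherits the PID instead of reading it from a temporary file, and a killed command is assumed to show up as a status above 128 (the exact number depends on the shell).

```shell
#!/bin/sh
# run_with_timeout SECONDS COMMAND [ARGS...]   (hypothetical helper name)
# Variables are global here because plain sh has no "local".
run_with_timeout() {
  limit=$1
  shift

  # Start the real command in the background; $! is its own PID.
  "$@" &
  cmd_pid=$!

  # Watchdog: sleep for the limit, then terminate the command if it is
  # still running. The trap lets us stop the watchdog's sleep early.
  (
    trap 'kill "$nap_pid" 2>/dev/null; exit 0' TERM
    sleep "$limit" &
    nap_pid=$!
    wait "$nap_pid"
    kill -15 "$cmd_pid" 2>/dev/null
  ) >/dev/null 2>&1 &
  watchdog_pid=$!

  # wait returns the command's own exit status, or a signal-based status
  # if the watchdog had to kill it.
  wait "$cmd_pid"
  status=$?

  # The command is done; stop the watchdog if it is still sleeping.
  kill -15 "$watchdog_pid" 2>/dev/null
  wait "$watchdog_pid" 2>/dev/null

  return "$status"
}
```

With this shape, something like `run_with_timeout 30 ssh "$host" ls "$file"` would return the status that ssh itself returned when it finishes in time.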

I think I can do most of this but wanted to run the idea past you in case there is some obvious flaw I haven't spotted.

ALSO: How do I get the PID of the running process? That is, say I call my script safe_ssh, how do I get the PID of safe_ssh from within safe_ssh? I assume it must be straightforward but I do not currently know.

Thanks a lot
Robyn

purdym · August 12, 2011, 1:10pm

Q: ALSO: How do I get the PID of the running process? That is, say I call my script safe_ssh, how do I get the PID of safe_ssh from within safe_ssh? I assume it must be straightforward but I do not currently know.

A: $$
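A short illustration of how `$$` differs from `$!`, and why a name such as /tmp/ssh.$$ is the same for the script and everything it starts: `$$` keeps the value of the main script even inside a subshell. The `sleep 1` is just a placeholder background job.

```shell
#!/bin/sh
echo "this script's PID:    $$"

sleep 1 &
echo "last background PID:  $!"

# $$ is not updated inside a subshell: it still reports the main script.
( echo "inside a subshell:    $$" )

wait
```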

If I were writing this I wouldn't try to get the return code from ssh. It is quite hard. Why not redirect all the output from ssh to a log file? Then examine the log file for errors; if there are errors, set the return code to non-zero.

---------- Post updated at 12:10 PM ---------- Previous update was at 11:44 AM ----------

Consider these examples:

safe_ssh 5 wpgux001_sw sleep 20

job should not run for more than 5 seconds
command is: sleep 20

SSH appears to be hung.
kill -15 934072
Output is:
This is a private computer facility.  Access to the facility must be
specifically authorized.  If you are not authorized, your continued
access and further inquiry expose you to criminal and/or civil
proceedings.

RET_CODE: 255

safe_ssh 25 wpgux001_sw sleep 20

Output is:
This is a private computer facility.  Access to the facility must be
specifically authorized.  If you are not authorized, your continued
access and further inquiry expose you to criminal and/or civil
proceedings.

RET_CODE: 0

safe_ssh 25 wpgux00a_sw sleep 20

host wpgux00a_sw does not exist.

Output is:
ssh: Could not resolve hostname wpgux00a_sw: Hostname and service name not provided or found
RET_CODE: 1

Consider this code:

safe_ssh () {

  SLEEP_WAIT=$1
  shift

  #############################################################################
  # Check that the first parameter is an integer.                             #
  #############################################################################
  if [[ ! -z $(echo $SLEEP_WAIT | sed 's/[0-9]//g') ]]
  then
    echo "Usage error safe_ssh"
    echo "The first parameter must be an integer representing the timeout "
    echo "duration in seconds."
    exit 1
  fi

  #############################################################################
  # Set up the SSH thread to run the command. This will also run in the       #
  # background.                                                               #
  #############################################################################
  ssh "$@" 1>/tmp/ssh.$$ 2>&1 &
  ssh_pid=$!

  # sleep
  sleep $SLEEP_WAIT

  #############################################################################
  # check if ssh is still running, if it is, kill it
  #############################################################################
  if ps -p $ssh_pid > /dev/null 2>&1
  then
     echo "SSH appears to be hung."
     echo "kill -15 $ssh_pid"
     kill -15 $ssh_pid
     RET_CODE=255

  else
     # Count the lines that look like errors; 0 means none were found.
     RET_CODE=$(egrep -ci "error|fail|ssh:" /tmp/ssh.$$)

  fi

  echo "Output is:"
  cat /tmp/ssh.$$
  rm -f /tmp/ssh.$$

  return $RET_CODE
}

Notice the change in logic. Sleep is not in the background. I don't kill the sleep. Simply sleep and then check if ssh is still running.
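One consequence of sleeping in the foreground is that the function always takes the full timeout, even when ssh finishes at once. A possible variation, sketched here with a hypothetical helper called `wait_or_kill`, is to poll once a second and stop as soon as the process has gone. It assumes the PID belongs to a background job of the current shell and that `kill -0` (which sends no signal and only tests for the process) behaves the usual way.

```shell
#!/bin/sh
# wait_or_kill PID SECONDS   (hypothetical helper name)
# Returns 0 if the process ended by itself, 1 if it had to be killed.
wait_or_kill() {
  pid=$1
  remaining=$2

  # kill -0 sends no signal; it only tests whether the PID still exists.
  while kill -0 "$pid" 2>/dev/null
  do
    if [ "$remaining" -le 0 ]
    then
      kill -15 "$pid"
      return 1
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  return 0
}
```

When the helper returns 0, calling `wait` with the same PID afterwards should still give the exit status of the finished job.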

Corona688 · August 12, 2011, 1:18pm

This may be overkill. Why not just use the timeout utility? It seems to do exactly what you ask.

timeout 300 ssh username@host ...

The 300 is a duration in seconds.
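A quick way to see what the utility returns, assuming the GNU coreutils version of timeout is installed (other implementations may differ). `sleep` and `sh -c` stand in for the ssh command here.

```shell
#!/bin/sh
# Assumes GNU coreutils timeout is available on this system.
timeout 1 sleep 5
echo "timed out: $?"      # GNU timeout reports 124 when the limit is hit

timeout 5 sh -c 'exit 3'
echo "finished:  $?"      # otherwise the command's own status, 3 here
```

So a caller can tell a timeout (124) apart from an ordinary failure of the command.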

purdym · August 12, 2011, 1:25pm

Cool, but I don't see that command on AIX or HP-UX.

jim_mcnamara · August 12, 2011, 3:39pm

Or in the remote .profile or .bashrc use TMOUT=n where n is the number of idle seconds before the process gets killed. You probably should set TMOUT as readonly, which is shell dependent.
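As a rough illustration, the lines in the remote startup file might look like this. The value 300 is an arbitrary example, and the sketch assumes a shell such as bash or ksh that honours TMOUT. As far as I know, TMOUT applies to an interactive shell sitting idle at its prompt, so it may not cover a command run as `ssh host command`.

```shell
# In the remote ~/.profile or ~/.bashrc (example value, adjust as needed)
TMOUT=300         # idle seconds before the shell exits
readonly TMOUT    # stop later code from changing or unsetting it
export TMOUT
```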

RECrerar · August 13, 2011, 5:30am

It was my belief that the timeout only works if the ssh has properly connected, not if it has got stuck somehow. If I am wrong please do correct me.

@jim mcnamara thanks for the suggestion, but this is for a fix that will go out to multiple different systems, and I need a fix that lives in our code rather than only on our own machines, so I don't think changing the .profile file is a possibility.