Trace / Debug Howto?

Can anyone recommend any good guides on how to investigate what a hanging process is doing?

In fact I would be interested in any online guides that would improve my forensic skills on the Linux platform.

I have a script that occasionally hangs. Strace shows:

[root@cfg01o ~]# strace -p 32370
Process 32370 attached - interrupt to quit
select(14, [3 6 8 11 13], [], NULL, NULL) = 1 (in [3])
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
read(3, "\344u\245%\6U\216\307\276\355\213p\376\203}\2617\251\320\301}\5\376Y(\235]K\222\301\304\370"..., 16384) = 64
write(3, "K\t\26O\323\344\214\341\247W\346\\*e\330\304\372\323O\356q\34\360\327\350\345*\274\35(Q'", 32) = 32
select(14, [3 6 8 11 13], [], NULL, NULL^C <unfinished ...>

I think the select corresponds to a case statement inside a while loop that is reading from a file. The loop is waiting for an exit message, and that is the point where it seems most likely to be hanging.

I don't know how to drill down further than this, so any suggestions or pointers to good guides online would be appreciated.

Thanks :slight_smile:

Well -- looking at the shell script's code might be a good start!

A select call is rather strange to find in a shell. The contents of those read/write calls look odd too -- pure binary data. That sort of thing almost never passes through the shell itself (and arguably can't, since the shell can't handle things like binary nulls). I suspect you're tracing a subprocess, not the shell. Either that, or you're just seeing it load its libraries.

So please tell us, in detail, what your script actually is; show us its contents; and tell us which process you're tracing.

I was more interested in finding some online tutorials on how to trace into a hanging job effectively...

But as you ask:

The script is running jobs over ssh on remote servers.

A file based messaging service is used for communication between the remote server and the master.

The script seems to hang at a point where it is waiting for a msg containing the word COMPLETED.

This might be as easy as checking for a corrupt msg file the next time it hangs. But I am really hoping for some more generic tips-and-tricks type answers.

m_process_msg_queue()
{
    local TMPFILE1="$(m_get_tmp_file ${FUNCNAME})" LINE CTL_FILE STATUS PIPE PID JOB_FILE TO=${C_MSG_PIPE_TO}
    [[ -f ${TMPFILE1} ]] || m_fail 1 "Error: Failed to create tmp file (${FUNCNAME})" 

    m_check_file -frw "${M_MSG_QUEUE}" s || m_fail 1 "Error: Msg queue not found (${FUNCNAME})" 

    while read LINE
    do
        #======================================
        # Split the line and get the control file
        #   and the status field
        #======================================
        CTL_FILE="$(echo ${LINE} | cut -d"|" -f1)"
        STATUS="$(echo ${LINE} | cut -d"|" -f2 | awk '{print $NF}' FS="Status=" )"
        
        [[ (-n ${STATUS}) && (-n ${CTL_FILE}) ]] || 
            m_fail 1 "Error: Failed to parse msg ctl file (${FUNCNAME})" 
        PIPE=${CTL_FILE##*/}

        JOB_FILE="$(sed -n '/^JobFile:/p' "${CTL_FILE}" | cut -d":" -f 2)"
        [[ -n ${JOB_FILE} ]] || m_fail 1 "Error: Failed to retrieve job ctl file (${FUNCNAME})" 
        m_check_file -frw "${JOB_FILE}" s || m_fail 1 "Error: job ctl validation failure (${FUNCNAME})" 

        PID="$(sed -n '/^Pid:/p' "${CTL_FILE}" | cut -d":" -f2)"
        [[ ${PID} =~ ^[[:digit:]]+$ ]] || m_fail 1 "Error: PID validation (${FUNCNAME})" 
        m_write_job_field ${C_JOB_PID} "${PID}" "${JOB_FILE}"

        case ${STATUS} in
            "COMPLETED")
                #======================================
                # Completed. Nothing to do.
                #======================================
                m_close_pipe "${FUNCNAME}" "${PIPE}" "${TO}" "${PID}"
                ;;
            "FAILED")
                #======================================
                # Log the error in the run log
                #======================================
                m_log_msg "Non FATAL error in (${CTL_FILE})"
                m_close_pipe "${FUNCNAME}" "${PIPE}" "${TO}" "${PID}"
                ;;
                
            "FATAL")
                #======================================
                    # Remote job flags a fatal error
                # Don't launch any more jobs
                # Wait for all other jobs to complete
                # Only then throw fatal error in master
                    #======================================
                m_log_msg "FATAL error in (${CTL_FILE})"
                M_HALT_ON_ERROR="true"
                m_close_pipe "${FUNCNAME}" "${PIPE}" "${TO}" "${PID}"
                ;;
            "MANUAL")
                #======================================
                # Manual intervention requested
                # Inform the user
                #======================================
                m_log_msg "Manual request in (${CTL_FILE})"
                M_HALT_ON_ERROR="true"
                m_close_pipe "${FUNCNAME}" "${PIPE}" "${TO}" "${PID}"
                ;;
            *)
                m_log_msg "Unrecognised request (${STATUS}) in (${CTL_FILE})"
                M_HALT_ON_ERROR="true"
                m_close_pipe "${FUNCNAME}" "${PIPE}" "${TO}" "${PID}"
                ;;
        esac

        m_write_job_field ${C_JOB_FINISH} "$(date)" "${JOB_FILE}"

    done < "${M_MSG_QUEUE}"

}

It's not a matter of "how to" -- it's detective work.

select() is hanging on file descriptors 3, 6, 8, 11, and 13. OK, so, what are those...? If you trace the program early enough you can look for the open calls which return these file descriptor numbers and see. ls -l /proc/<PID>/fd may help if you didn't trace the program in time, as might lsof.
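For example, with the PID from your strace output (32370 -- substitute whatever PID you're actually tracing):

ls -l /proc/32370/fd                                  # one symlink per open descriptor
ls -l /proc/32370/fd | grep -E ' (3|6|8|11|13) ->'    # just the ones select() is waiting on
lsof -p 32370                                         # same information, with file types and names

Each symlink points at the file, pipe or socket behind that descriptor.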

Knowing what it's hanging on is halfway to knowing why it's hanging.

You should also trace it from the other end. What things are happening which this application is not seeing? If you can pinpoint the exact thing which fails -- "app x does y via z, but select() in app q does not see this change" -- then you know enough to actually begin asking questions!

The script you posted is too incomplete to be useful -- it's full of variables and functions which aren't explained, only used. You also haven't explained which application you traced.

Thanks Corona

I'm running it under strace now and of course it has stopped failing....

Yes, the reason I didn't post any code was because it is a distributed application with multiple libraries included. Several thousand lines of code.

That's why my OP was a request for pointers to good tutorials, not a request to help debug this application, :slight_smile:

But thanks for taking an interest anyway.

A while loop reading from stdin can be obscured by commands or functions that read from stdin.
A hardened loop is

while read LINE <&3
do
  ...
done 3< "${M_MSG_QUEUE}"

--
A well-known shell-debugging technique is to put

set -x

in the script, or run it with

/bin/bash -x scriptname
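
An addition of my own that makes the -x output much easier to follow: set PS4 (the prefix bash prints before each traced command) so every traced line shows the script name, line number and function.

# PS4 is expanded each time a traced command is printed;
# export it so "bash -x scriptname" picks it up as well
export PS4='+ ${BASH_SOURCE##*/}:${LINENO}:${FUNCNAME[0]:-main}: '
bash -x scriptname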

Hi Mig

I don't get your comments about the while loop.

The example I posted is reading from a file that contains messages from remote servers.

Thanks for the -x tip but I was looking for something a little deeper than that. :slight_smile:

Like I said in the OP, I'm really looking for online tutorials. Especially anything that can give me more insight into the output from strace for example. I still don't know what that 'select' is about.

Cheers

"man select" will tell you all about the select API.

The following script demonstrates the effect; please test it yourself:

{
suck_stdin() {
local my_line
read my_line
echo "suck_stdin: $my_line"
}

while read line
do
 echo "passwd: $line"
 suck_stdin
done < /etc/passwd
} < /etc/group

You can see that the suck_stdin() function reads a line, so it interferes with the while loop: here suck_stdin consumes every second line of /etc/passwd.
Now please harden it with the suggested descriptor magic:

{
suck_stdin() {
local my_line
read my_line
echo "suck_stdin: $my_line"
}

while read line <&3
do
 echo "passwd: $line"
 suck_stdin
done 3< /etc/passwd
} < /etc/group

and suck_stdin() no longer interferes with the while loop - instead it reads from the outer stdin (here: /etc/group).
Commands like ssh behave like suck_stdin (ssh can usually be fixed with the -n option), and maybe your m_* functions contain read or ssh as well.
Conclusion:
without knowing your m_* commands, I suggest hardening your while loop.
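
To illustrate the ssh case (hosts.txt is just a made-up example file, not from your harness):

# Problem: ssh shares stdin with the loop, reads the rest of hosts.txt
# itself, and the loop stops after the first host.
while read HOST
do
  ssh "$HOST" uptime
done < hosts.txt

# Fix: -n makes ssh read from /dev/null instead of the loop's stdin
# (the 3< descriptor trick above works just as well).
while read HOST
do
  ssh -n "$HOST" uptime
done < hosts.txt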

The output of strace really doesn't mean much out of context. Have you followed any of my advice at all yet? Do you know what those file descriptors are?

The manpages contain much information about it.

Consider the bashdb debugger as well. It has some useful features, but requires serious reading beforehand.

BASH with Debugger and Improved Debug Support and Error Handling


Thanks very much guys

That's the kind of stuff I have been looking for.

I'll certainly play around with the code posted. I wasn't aware of the select call before -- I assumed it was something low level in the kernel maybe :slight_smile:

I still don't know what the file descriptors are, because since running it under strace it hasn't failed. But if it does, I should be able to discover where the problem is.

I don't know if the -n will help with this version of the harness. But if things get bad I'll try it.

The harness uses ssh multiplexing to attach the ssh connections to UNIX domain sockets and then echoes stdout from the remote server into a local report file in real time.

The new version I'm working on uses a messaging framework to send a report file so shouldn't have any issues like this.

I hope :slight_smile:

Thanks for the feedback, really helpful stuff.

---------- Post updated at 05:37 PM ---------- Previous update was at 09:50 AM ----------

That's really interesting MIG :slight_smile:

I wasn't aware of that behaviour before.

Thanks

---------- Post updated 03-07-14 at 02:18 PM ---------- Previous update was 02-07-14 at 05:37 PM ----------

Thanks to all of you for your help.

The job eventually failed again and I was able to trace the problem to a write hanging on a named pipe whose client process had died.

I've put in a rather ugly fix that does the write in the background and then kills $! if it is still running.
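
Roughly along these lines (a sketch with made-up names, not the actual code):

echo "${MSG}" > "${PIPE}" &            # write to the named pipe in the background
WRITER=$!
for i in 1 2 3 4 5                     # poll for up to ~5 seconds
do
    kill -0 "${WRITER}" 2>/dev/null || break
    sleep 1
done
if kill -0 "${WRITER}" 2>/dev/null     # still blocked? nothing reading the other end
then
    kill "${WRITER}"
    m_log_msg "Write to ${PIPE} timed out"
fi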

It would be better if I could test whether a process is listening on the other end of the pipe, but I haven't found a way of doing so.

Does anyone know if that can be done?

Thanks again for all the helpful input

Some Linux distributions ship with a timeout command.
Otherwise this function can be used:

timeout () {
    # Accept a timeout of 1-9999 seconds as the first argument
    case $1 in
    [1-9]|[1-9][0-9]|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9])
        to=$1
        shift
        ;;
    *)
        echo "usage: timeout timeoutseconds command..."
        return 1
        ;;
    esac
    # Set an alarm, then replace perl with the command;
    # SIGALRM arrives after $to seconds if the command is still running
    perl -e "alarm $to; exec @ARGV" "$@"
    # without perl one can try
    # ("$@"; kill 0) & (sleep $to; kill 0) &
}

Now, replace your

hanging_command

by e.g.

timeout 5 hanging_command

and hanging_command will get a SIGALRM after 5 seconds, which hopefully kills it.

If you can use an unnamed pipe and keep it open, then when the client dies the writer will receive SIGPIPE.
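
A small demo of that behaviour, if you want to see it (this sketch uses a throw-away FIFO so it can be run from a single shell; once both ends are open the SIGPIPE semantics are the same as for an unnamed pipe):

trap 'echo "writer caught SIGPIPE" >&2' PIPE

FIFO=/tmp/demo.$$.fifo       # throw-away FIFO, name made up for the demo
mkfifo "$FIFO"
cat "$FIFO" > /dev/null &    # a reader, standing in for the client process
READER=$!

exec 4> "$FIFO"              # open the writing end and keep it open
echo "message 1" >&4         # delivered while the reader is alive

kill "$READER"; wait "$READER" 2>/dev/null
echo "message 2" >&4         # reader gone: write fails with SIGPIPE / "Broken pipe"

exec 4>&-
rm -f "$FIFO"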