SIGCHLD interrupts its own handler

Hi. I have a program whose job is to manage 15 child processes. Sometimes these children die (sometimes deliberately, other times with a SEGV). This causes a SIGCHLD to be sent to my program, which uses waitpid() in the signal handler to gather information and, in most cases, restart the child.
The problem I am having is that under very high loads I am seeing SIGCHLDs sent while the first is still being processed. I can see from my log that the intention was to restart both, but only the last one actually gets restarted, leading to a gradual haemorrhage of children which eventually causes the whole system to stop responding.

I am using posix threads and have omitted any mutex in the signal handler because I thought it would be atomic. Obviously not. I am a bit scared to put a mutex there; what happens if the same thread is interrupted again while the mutex is held? Deadlock would be worse than the current situation.

I don't want to ignore signals while in the handler either; it is most important that all SIGCHLDs are honoured and the child restarted. I have seen a deferred solution where one thread is dedicated to catching signals and the main program looks at this from time to time. I don't think this will work too well though because I need to call waitpid straight after I get the signal; it needs to wait for the right child status after all.

Any pointers in the right direction would be most welcome.

Cheers;
Jeremy

Make your SIGCHLD handler a loop that calls waitpid with WNOHANG, in case any signals were missed while you were processing, and that only quits when it finds there are none left to process. Otherwise it's hard to guarantee you catch them all.

Hi Corona688. You are right. I already use WNOHANG with waitpid in a loop but only until that child has been dealt with. This is still liable to be interrupted though.

However you gave me an idea: I am using threads, I don't really need signals anyway and they mess with my head. What I did was start a thread and get it to sit in a loop with a blocking call to waitpid. As children die it gets unblocked, deals with it and loops back to blocking again. All in a nice serial manner.

I can simply ignore SIGCHLD now, that thread will pick up any children needing attention via waitpid.
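For anyone wanting to try the same thing, here is a rough sketch of such a reaper thread. It is not the code from my program: `reaper_thread`, `handle_child_exit` and `reaped_total` are hypothetical names, the restart logic is replaced by a counter, and it assumes SIGCHLD is left at its default disposition (not explicitly set to SIG_IGN), so that waitpid() still sees the dead children.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical bookkeeping shared with the rest of the program. */
static pthread_mutex_t reap_lock = PTHREAD_MUTEX_INITIALIZER;
static int reaped_total = 0;

/* Hypothetical stand-in for "gather information and restart the child".
 * It runs in ordinary thread context, so taking a mutex here is safe. */
static void handle_child_exit(pid_t pid, int status)
{
    assert(pid > 0);
    (void)status;
    pthread_mutex_lock(&reap_lock);
    reaped_total++;
    pthread_mutex_unlock(&reap_lock);
}

/* Dedicated thread: block in waitpid() and handle exits one at a time. */
static void *reaper_thread(void *arg)
{
    (void)arg;
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, 0);   /* blocks until a child exits */

        if (pid > 0) {
            handle_child_exit(pid, status);
        } else if (errno == EINTR) {
            continue;                           /* interrupted, just retry */
        } else {
            /* ECHILD (no children left) or an unexpected error. This
             * sketch simply stops; a long-running manager would more
             * likely wait for the next child to be spawned. */
            break;
        }
    }
    return NULL;
}
```

Because every exit is handled by the same thread, one after another, the restart code never runs re-entrantly.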

Now, try as I might, I cannot break the system and I have it running in production today.

Cheers;

Jeremy

Good job, Jeremy!

You just discovered the way we usually deal with asynchronous signals in multi-threaded applications: we dedicate a thread that waits for (and consumes) the signal synchronously.

The function to wait synchronously for a signal is called sigwait(). In addition, you need to ensure that the signal is delivered to the right thread, that is, the one blocked in sigwait(). Pthreads offers this possibility with pthread_sigmask(), which lets you block the signal in every other thread.

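In your particular application, you are luckier. First, you can use waitpid(). Second, you can skip the SIGCHLD "signal redirection", since the default action for SIGCHLD is to ignore the signal. Just leave the disposition at SIG_DFL rather than setting SIG_IGN explicitly: with SIG_IGN, POSIX allows the system to reap children automatically, and waitpid() then fails with ECHILD.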

Cheers,
Loïc.