Executive summary:
Code (posted below) cores in AIX 5.3, despite being compiled and run successfully on several other operating systems. Code is attempting to verify that pthread_mutex_lock can be successfully aborted by siglongjmp. I do not believe this is an unreasonable requirement.
If you could please compile the code below on any operating system supporting pthreads and report whether it runs to completion, I'd really appreciate it. Of course, I would appreciate it more if someone could tell me I'm definitely doing something wrong.
OK, on with the long-winded post...
I have a simple application using siglongjmp and mutexes that is coring on AIX 5.3 and, thus far, on no other operating system. I have compiled and run it successfully on Red Hat Enterprise Linux (kernel 2.6.18, 32 bit), HP-UX 11, Compaq Tru64 V5.1B, and SunOS 5.7.
What seems to be happening is that when the code prematurely exits the pthread_mutex_lock function via the long jump, a subsequent call to pthread_mutex_lock causes the application to segfault (on AIX 5.3). Interestingly, this only seems to occur if the subsequent call is made before the thread holding the lock releases it, and in a real application there is no way to guarantee that the lock has already been released. Further, replacing the subsequent pthread_mutex_lock with pthread_mutex_trylock (in a spin loop) also succeeds without coring. However, the spin lock is wasteful and, unlike most spin locks, which spin for a bit and then block, this one has to keep spinning until the lock is acquired, because any attempt to call the blocking function (pthread_mutex_lock) causes the application to core.
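To make the trylock workaround a little more concrete, here is a minimal sketch of a retry loop that sleeps briefly between attempts instead of spinning flat out. The helper name lock_with_backoff and the delay values are my own assumptions for illustration; they are not part of the test program, and I have not verified that this avoids the core on AIX 5.3.

```c
#define _POSIX_C_SOURCE 200112L   /* for nanosleep under strict compilers */
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical helper (not in the original program): acquire the mutex
   using only pthread_mutex_trylock, sleeping between attempts so the
   wait does not burn a whole CPU. Returns 0 once the lock is held, or
   the first error other than EBUSY. */
static int lock_with_backoff(pthread_mutex_t *mx)
{
    struct timespec delay;
    int rv;

    delay.tv_sec = 0;
    delay.tv_nsec = 1000000L;              /* start at 1 ms (arbitrary) */

    while ((rv = pthread_mutex_trylock(mx)) == EBUSY)
    {
        /* nanosleep may return early if a signal arrives; we simply
           retry the trylock in that case. */
        nanosleep(&delay, NULL);
        if (delay.tv_nsec < 64000000L)     /* cap the delay at 64 ms */
            delay.tv_nsec *= 2;
    }
    return rv;
}
```

In the test program this would stand in for the while loop in the TRYSPIN branch, e.g. rv = lock_with_backoff(&mx). It trades some latency in noticing the unlock for much lower CPU use, but it still never calls the blocking pthread_mutex_lock.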
When the core occurs, dbx shows an AIX library function at the top of the stack. Here is the stack according to dbx when it cores:
Segfault at _usched_dispatch_front stack is:
_usched_dispatch_front
_usched_swtch
_waitlock
_local_lock_common
_mutex_lock
main
What I am really looking for here is ammunition to determine whether my code (which so far works on 4 of 5 operating systems) or IBM's libraries are at fault. To that end, if people can compile and run this (successfully or not) and report their results, that would be awesome! Of course, if anyone has insight regarding something I am doing wrong, I'd love to hear it!
Here is the code, heavily commented to explain the problem. The two variants that run successfully on AIX 5.3 can be tested by compiling with either -DNO_CORE (which delays the relock until the interference thread has released the mutex) or -DNO_CORE -DTRYSPIN (which replaces the failing pthread_mutex_lock with a spinning pthread_mutex_trylock). Neither of these is, in my opinion, a viable "solution" to the problem, but I do find it interesting that either averts the core.
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <signal.h>
#include <setjmp.h>
#include <unistd.h> /* sleep */
#include <sched.h>  /* sched_yield */
static pthread_mutex_t mx = PTHREAD_MUTEX_INITIALIZER;
static pthread_t tidMain;
static sigjmp_buf env;
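/* printf wrapper that prefixes each message with [M] (main thread) or
   [I] (interference thread) */
static int t_printf(const char *fmt, ...)
{
    va_list va;
    char buf[256];
    int n;

    sprintf(buf, "[%c] ",
            pthread_equal(tidMain, pthread_self()) ? 'M' : 'I');
    va_start(va, fmt);
    n = vsnprintf(buf+4, sizeof(buf)-4, fmt, va);
    va_end(va);
    /* write buf as data, never as a format string */
    fputs(buf, stdout); fflush(stdout);
    return n;
}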
static void *thread_fn(void *data)
{
    int rv;

    t_printf("thread started\n");
    /* lock mutex so main thread must block waiting on us to release it */
    rv = pthread_mutex_lock(&mx);
    t_printf("mutex was %slocked\n",
             rv == 0 ? "" : "NOT ");
    /* sleep to "assure" main thread will be blocked in pthread_mutex_lock
       call when we signal it. */
    sleep(5);
    t_printf("signaling main\n");
    /* send signal to main thread causing it to long jump out of the
       pthread_mutex_lock function */
    pthread_kill(tidMain, SIGALRM);
    /* sleep to "assure" main thread will be attempting to reacquire the
       mutex when we unlock it. */
    sleep(2);
    rv = pthread_mutex_unlock(&mx);
    t_printf("mutex lock %sreleased\n",
             rv == 0 ? "" : "NOT ");
    return NULL;
}
static void alarm_fn(int sig)
{
    /* Since we directed this signal to the main thread via pthread_kill,
       verify that is where we get it! If we do not get it there, we
       may expect issues trying to long jump across a thread stack.
       Fortunately this is working as expected. */
    t_printf("signal received %s\n",
             pthread_equal(tidMain, pthread_self()) ? "OK" : "IN WRONG THREAD");
    siglongjmp(env, 1);
    puts("!!! LONG JUMP FAILED !!!");
    exit(-2);
}
int main(int argc, char **argv)
{
    pthread_t tid;
    int rv;
    int cancelled = 0;

    tidMain = pthread_self();
    signal(SIGALRM, alarm_fn);
    /* create "interference" thread */
    pthread_create(&tid, NULL, thread_fn, NULL);
    if (sigsetjmp(env, 1) == 0)
    {
        /* sleep a bit to "assure" the thread we created executes and
           grabs the lock before we can so we have to block. */
        sleep(3);
        /* this is where we want to be when we get signalled to test
           that we can be broken out of pthread_mutex_lock successfully
           by a long jump */
        t_printf("blocked locking mutex\n");
        rv = pthread_mutex_lock(&mx);
    }
    else
    {
        /* we expect the signal to be delivered forcing us into this
           block of code. */
        rv = -1;
        cancelled = 1;
    }
    /* print rv and cancelled to show us what path we took above -- we
       expect to be cancelled with rv == -1 */
    t_printf("rv: %d; %scancelled\n", rv, cancelled ? "" : "!NOT! ");
    /* Verify that we can re-acquire the lock after pthread_mutex_lock was
       jumped out of by the long jump -- this is where we die in AIX 5.3
       but nowhere else! Oddly, we only core if the interference thread
       still has the mutex locked when we get here; if it unlocks first
       then this call succeeds. We can test this by sleeping for a bit
       before making this call (to allow the interference thread to
       unlock); define NO_CORE to demonstrate this behavior.
       It also happens that a pthread_mutex_trylock succeeds if written
       in a spin loop, which is also odd. Define both NO_CORE and
       TRYSPIN to demonstrate this behavior. */
#ifdef NO_CORE
#ifdef TRYSPIN
    t_printf("attempting relock via trylock\n");
    while ((rv = pthread_mutex_trylock(&mx)) != 0)
    {
        sched_yield();
    }
#else
    sleep(5);
    t_printf("attempting relock after sleep\n");
    rv = pthread_mutex_lock(&mx);
#endif
#else
    t_printf("attempting relock\n");
    rv = pthread_mutex_lock(&mx);
#endif
    t_printf("lock was %sacquired\n",
             rv == 0 ? "" : "NOT ");
    sleep(1);
    t_printf("goodbye...\n");
    return EXIT_SUCCESS;
}
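For comparison, here is a rough sketch of a different way to get an abortable lock attempt without jumping out of pthread_mutex_lock at all: wait in short slices with pthread_mutex_timedlock and check a flag that the signal handler sets. This is a swapped-in technique, not what the program above does. It assumes the platform provides pthread_mutex_timedlock and clock_gettime, and the names abort_requested, request_abort and abortable_lock are hypothetical. I have not tested it on AIX 5.3.

```c
#define _POSIX_C_SOURCE 200112L   /* for clock_gettime and timedlock */
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <time.h>

/* Hypothetical flag set from the signal handler instead of long jumping. */
static volatile sig_atomic_t abort_requested = 0;

/* Hypothetical handler: only records that an abort was requested. */
static void request_abort(int sig)
{
    (void)sig;
    abort_requested = 1;
}

/* Hypothetical helper: try to lock mx in 100 ms slices, giving up as
   soon as abort_requested is seen. Returns 0 if the lock was acquired,
   ECANCELED if the attempt was aborted, or another error code from
   pthread_mutex_timedlock. */
static int abortable_lock(pthread_mutex_t *mx)
{
    struct timespec deadline;
    int rv;

    for (;;)
    {
        if (abort_requested)
            return ECANCELED;
        /* pthread_mutex_timedlock takes an absolute CLOCK_REALTIME time */
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 100000000L;    /* 100 ms slice (arbitrary) */
        if (deadline.tv_nsec >= 1000000000L)
        {
            deadline.tv_sec += 1;
            deadline.tv_nsec -= 1000000000L;
        }
        rv = pthread_mutex_timedlock(mx, &deadline);
        if (rv != ETIMEDOUT)
            return rv;                     /* locked, or a real error */
    }
}
```

With this approach the handler would be installed with signal(SIGALRM, request_abort) and main would call abortable_lock(&mx) in place of the sigsetjmp block. The cost is that an abort is only noticed at the end of the current slice, so it is a trade of responsiveness for never leaving the mutex code mid-call.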