Interesting issue with pthread_mutex_lock and siglongjmp on AIX 5.3 (and no other OS)

Executive summary:

The code (posted below) cores on AIX 5.3, despite compiling and running successfully on several other operating systems. The code attempts to verify that pthread_mutex_lock can be successfully aborted by siglongjmp. I do not believe this is an unreasonable requirement.

If you could please compile the code below on any operating system supporting pthreads and report whether it runs to completion, I'd really appreciate it. Of course, I would appreciate it even more if someone could tell me I'm definitely doing something wrong.

OK, on with the long-winded post...

I have a simple application using siglongjmp and mutexes that is coring on AIX 5.3 and, thus far, on no other operating system. I have compiled and run it successfully on Red Hat Enterprise Linux (kernel 2.6.18, 32-bit), HP-UX 11, Compaq Tru64 V5.1B, and SunOS 5.7.

What seems to be happening is that when the code prematurely exits the pthread_mutex_lock function via the long jump, a subsequent call to pthread_mutex_lock causes the application to segfault (on AIX 5.3). Interestingly, this only seems to occur if the subsequent call is made before the thread holding the lock releases it, a condition that could not be guaranteed in a real application. Further, replacing the subsequent pthread_mutex_lock with pthread_mutex_trylock in a spin loop (sketched just below) also succeeds without coring. However, the spin lock is wasteful and, unlike most spin locks, which spin for a bit and then block, this one has to keep spinning until the lock is acquired, because any attempt to call the blocking function (pthread_mutex_lock) cores the application.
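
For clarity, the trylock workaround I mean is essentially this fragment (it appears in the full listing below under the TRYSPIN option):

	while ((rv = pthread_mutex_trylock(&mx)) != 0)
	{
		sched_yield();
	}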

When the core occurs, dbx shows an AIX library function at the top of the stack:

Segfault at _usched_dispatch_front stack is:
_usched_dispatch_front
_usched_swtch
_waitlock
_local_lock_common
_mutex_lock
main

What I am really looking for here is ammunition to show whether my code (which so far works successfully on 4 of 5 operating systems) or IBM's libraries are at fault. To that end, if people can compile and run this successfully (or not) and report their results, that would be awesome! Of course, if anyone has insight regarding something I am doing wrong, I'd love to hear it!

Here is the code, overly commented to explain the problem. The two options that make it run successfully on AIX 5.3 can be tested by compiling with either -DNO_CORE (which changes the order of operations to a successful one) or -DNO_CORE -DTRYSPIN (which replaces the failing pthread_mutex_lock with a spinning pthread_mutex_trylock). As I said, though, while neither of these two options is, in my opinion, a viable "solution" to the problem, I do find it interesting that either averts the core.
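
(For anyone willing to test: I compile it roughly as below; the compiler names and source file name are just what I'd expect to use on each platform, so adjust for yours. On AIX, the _r compiler variants pull in the thread-safe runtime.)

	xlc_r mutexjmp.c -o mutexjmp
	gcc -pthread mutexjmp.c -o mutexjmp

Add -DNO_CORE or "-DNO_CORE -DTRYSPIN" to exercise the two workaround paths described above.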

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <signal.h>
#include <setjmp.h>
#include <sched.h>	/* sched_yield */
#include <unistd.h>	/* sleep */

static pthread_mutex_t	mx = PTHREAD_MUTEX_INITIALIZER;

static pthread_t	tidMain;

static sigjmp_buf 	env;

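/* printf-style helper: prefixes each message with [M] (main thread) or
     [I] (interference thread) so the interleaved output is easy to follow */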
static int t_printf(const char *fmt, ...)
{
	va_list	va;
	char	buf[256];
	
	sprintf(buf, "[%c] ",
		pthread_equal(tidMain, pthread_self()) ? 'M' : 'I');
	
	va_start(va, fmt);
	vsnprintf(buf+4, sizeof(buf)-4, fmt, va);		
	va_end(va);
	
	fputs(buf, stdout); fflush(stdout);
	
	return 0;
}

static void *thread_fn(void *data)
{
	int		rv;
	
	t_printf("thread started\n");
	
	/* lock mutex so main thread must block waiting on us to release it */
	
	rv = pthread_mutex_lock(&mx);
	
	t_printf("mutex was %slocked\n",
		rv == 0 ? "" : "NOT ");
	
	/* sleep to "assure" main thread will be blocked in pthread_mutex_lock
	     call when we signal it. */
	
	sleep(5);
	
	t_printf("signaling main\n");
	
	/* send signal to main thread causing it to long jump out of the
	     pthread_mutex_lock function */
	
	pthread_kill(tidMain, SIGALRM);
	
	/* sleep to "assure" main thread will be attempting to reacquire the
	     mutex when we unlock it. */
	
	sleep(2);
	
	rv = pthread_mutex_unlock(&mx);
	
	t_printf("mutex lock %sreleased\n",
		rv == 0 ? "" : "NOT ");
	
	return NULL;
}

static void alarm_fn(int sig)
{
	/* Since we directed this signal to the main thread via pthread_kill,
	     verify that is where we get it!  If we do not get it there, we
	     may expect issues trying to long jump across a thread stack.
	     Fortunately this is working as expected. */
	     
	t_printf("signal received %s\n",
		pthread_equal(tidMain, pthread_self()) ? "OK" : "IN WRONG THREAD");
	
	siglongjmp(env, 1);
	
	puts("!!! LONG JUMP FAILED !!!");
	exit(-2);
}

int main(int argc, char **argv)
{
	pthread_t	tid;
	int		rv;
	int		cancelled = 0;
	
	tidMain = pthread_self();
	
	signal(SIGALRM, alarm_fn);
	
	/* create "interference" thread */
	
	pthread_create(&tid, NULL, thread_fn, NULL);

	if (sigsetjmp(env, 1) == 0)
	{
		/* sleep a bit to "assure" the thread we created executes and
		     grabs the lock before we can so we have to block. */
		
		sleep(3);
		
		/* this is where we want to be when we get signalled to test
		     that we can be broken out of pthread_mutex_lock successfully
		     by a long jump */
			 
		t_printf("blocked locking mutex\n");
		
		rv = pthread_mutex_lock(&mx);
	}
	else
	{
		/* we expect the signal to be delivered forcing us into this
		     block of code. */
		
		rv = -1;
		cancelled = 1;
	}
	
	/* print rv and cancelled to show us what path we took above -- we
	     expect to be cancelled with rv == -1 */

	t_printf("rv: %d; %scancelled\n", rv, cancelled ? "" : "!NOT! ");
	
	/* Verify that we can re-acquire the lock after pthread_mutex_lock was
	     jumped out of by the long jump -- this is where we die in AIX 5.3
	     but nowhere else!  Oddly, we only core if the interference thread
	     still has the mutex locked when we get here; if it unlocks first
	     then this call succeeds.  We can test this by sleeping for a bit
	     before making this call (to allow the interference thread to
	     unlock); define NO_CORE to demonstrate this behavior.
	     
	   It also happens that a pthread_mutex_trylock is successful too if
	     written in a spin loop; which is also odd.  Define both NO_CORE
		 and TRYSPIN to demonstrate this behavior. */
	
	#ifdef NO_CORE
		#ifdef TRYSPIN
			t_printf("attempting relock via trylock\n");
		
			while((rv = pthread_mutex_trylock(&mx)) != 0)
			{
				sched_yield();
			}
		#else
			sleep(5);
			
			t_printf("attempting relock after sleep\n");
			
			rv = pthread_mutex_lock(&mx);
		#endif
	#else
		t_printf("attempting relock\n");
		
		rv = pthread_mutex_lock(&mx);
	#endif
	
	t_printf("lock was %sacquired\n",
		 rv == 0 ? "": "NOT ");
	
	sleep(1);
	
	t_printf("goodbye...\n");
	
	return EXIT_SUCCESS;
}

I would try increasing the thread stack size and see what happens. You may have a yellow or red (guard) zone stack overflow.
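
For example, for the created thread, something along these lines (the 512 KB figure is arbitrary, just comfortably above the default; note this only sizes the stack of the thread created with the attribute, not the main thread's):

	pthread_attr_t	attr;
	
	pthread_attr_init(&attr);
	pthread_attr_setstacksize(&attr, 512 * 1024);	/* arbitrary "large" stack */
	pthread_create(&tid, &attr, thread_fn, NULL);
	pthread_attr_destroy(&attr);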