Solaris - BUS error with optimize mode

Hi,

I'm facing a BUS error (invalid address alignment).

The application works correctly with binaries compiled without optimization.
When "-O3" (or -O2, -O1) is used, the application aborts with a BUS error (invalid address alignment).

The problem appears in the following situation:

struct _a
{
	int _a1;
	uint64_t _a2;
}

int f2(uint64_t *obj)
{
	printf("%llu", *obj); <--- BUS ERROR
	
	return 0;
}

void f1()
{
	...
	_a* obj_a = malloc();
	
	printf("%llu", obj_a->_a2); <--- OK
	f2(&obj_a->_a2);
}

Solaris Sparc
GCC 4.2.1

Thanks

Is that really the entire code? Does it still cause that crash when minimized like that?

Thanks for the reply.
No, it's just an example. It's a big system, so it's hard to present the problem in detail.

For some reason, when a member of the struct is passed by reference to a function and that function dereferences the pointer, it causes a BUS error.

A test function with the parameter "struct _a *obj" (so a pointer to the whole struct is passed) works correctly.

int f3(struct _a *obj)
{
	printf("%llu", obj->_a2); <--- OK
	return 0;
}

As I wrote, without optimization it works correctly.

Optimization often causes subtle bugs from anything which uses undefined values -- things like pointing to a stack variable which went out of scope, overrunning the end of an array, etc. The crash can happen quite a distance from whatever caused it -- the corruption could've happened long ago. Code which happens to "just work" unoptimized may run far differently when the compiler starts removing in-between steps and squeezes out the empty spaces, letting things start bumping into each other.

So I'd start by logging the values fed into that function. They're probably getting corrupted somewhere. Then follow it backwards from there until you find out where the corruption is happening.
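A minimal sketch of that kind of logging, assuming the suspect value is the uint64_t pointer handed to f2(). The helper names is_aligned() and log_u64_ptr() are made up for illustration; the idea is to print the address and check its alignment before the pointer is ever dereferenced.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if p is a multiple of `align` bytes (align must be a power of two). */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical logging helper: call it at the top of f2(), before *obj is read.
 * It only inspects the address, so it is safe even if obj is misaligned. */
static int log_u64_ptr(const char *where, const uint64_t *obj)
{
    int ok = is_aligned(obj, sizeof(uint64_t));
    fprintf(stderr, "%s: obj=%p 8-byte aligned=%s\n",
            where, (const void *)obj, ok ? "yes" : "NO");
    return ok;
}
```

If the log shows an address that is not a multiple of 8 just before the crash, the next step would be to follow that pointer back to wherever it was computed.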

Change

struct _a
{
    int _a1;
    uint64_t _a2;
}

to

struct _a
{
    uint64_t _a2;
    int _a1;
}

Note that I swapped the order of the fields in the structure.

By defining the structure with the smaller elements first, if the structure is packed tightly, as it likely is under optimization, you're likely violating an address restriction on the value when you pass it like that.

It's unlikely that the compiler would build itself a data structure it cannot use...
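One way to settle this for a particular compiler and set of flags is to print the layout. This is only a sketch: the struct names a_small_first and a_large_first are stand-ins for the two field orders discussed above, and print_layout() is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the two field orders from the posts above. */
struct a_small_first { int a1; uint64_t a2; };
struct a_large_first { uint64_t a2; int a1; };

/* Prints where the 64-bit member lands in each layout and the total size. */
static void print_layout(void)
{
    printf("small first: offsetof(a2)=%lu sizeof=%lu\n",
           (unsigned long)offsetof(struct a_small_first, a2),
           (unsigned long)sizeof(struct a_small_first));
    printf("large first: offsetof(a2)=%lu sizeof=%lu\n",
           (unsigned long)offsetof(struct a_large_first, a2),
           (unsigned long)sizeof(struct a_large_first));
}
```

If the compiler inserts padding after the int, the first line reports an offset of 8 rather than 4. Building this with and without -O3 would show whether optimization changes the layout at all.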

GCC on SPARC?

Given that Sun's own compilers did it with an OS base utility - Solaris 10 mkfs - you'd better believe it's quite likely that GCC on SPARC will do it, too.

That code forces the alignment with a #pragma; otherwise, that'd be extremely difficult.

He didn't mention that about his code, but you're right, it could be something to look for.
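For anyone checking their own code for this, here is a sketch of what a packing #pragma does to the layout and how to read the member safely anyway. The struct packed_a and both helper functions are hypothetical, and the push/pop form of the pragma is assumed to be accepted by the compiler (GCC accepts it).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* With 1-byte packing there is no padding after a1, so a2 sits at offset 4. */
#pragma pack(push, 1)
struct packed_a {
    int      a1;
    uint64_t a2;
};
#pragma pack(pop)

/* Reads a uint64_t from an address that may be misaligned by copying bytes,
 * instead of dereferencing a uint64_t pointer directly. */
static uint64_t load_u64_unaligned(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Computes the member address from the struct base rather than taking
 * &s->a2, which would produce a possibly misaligned uint64_t pointer. */
static uint64_t read_packed_a2(const struct packed_a *s)
{
    const unsigned char *base = (const unsigned char *)s;
    return load_u64_unaligned(base + offsetof(struct packed_a, a2));
}
```

Passing &s->a2 from a struct like this to a function taking uint64_t * is exactly the pattern that can trap on SPARC, because the callee has no way to know the pointer is misaligned.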

The code is badly written:

struct _a
{
	int _a1;
	uint64_t _a2;
}


int f2(uint64_t *obj)
{
        // obj is of type uint64_t. Not an unsigned long long!
        // this could be any value, including NULL! 
        // no wonder a BUS ERROR in certain circumstances  
	printf("%llu", *obj); <--- BUS ERROR
	
	return 0;
}

void f1()
{
	...
        // how is malloc() expected to know what size space to allocate
        // no test for NULL!
	_a* obj_a = malloc();
	
        // a random value will be returned.  Should zeroize structure
        // why the use of %llu. _a2 is a uint64_t  
	printf("%llu", obj_a->_a2); <--- OK
        // you should use an appropriate cast here
	f2(&obj_a->_a2);
}
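A cleaned-up version addressing those comments might look like the sketch below. It is not the OP's real code: the names a_rec, a1 and a2 replace the originals (identifiers starting with an underscore are reserved at file scope), the value 42 is arbitrary, and f1() returns a status so the example can be checked.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct a_rec {
    int      a1;
    uint64_t a2;
};

/* Prints the value with the format macro that matches uint64_t. */
int f2(const uint64_t *obj)
{
    if (obj == NULL)
        return -1;
    printf("%" PRIu64 "\n", *obj);
    return 0;
}

int f1(void)
{
    /* calloc() is given the object size and returns zeroed memory. */
    struct a_rec *obj_a = calloc(1, sizeof *obj_a);
    int rc;

    if (obj_a == NULL)
        return -1;

    obj_a->a2 = 42;          /* arbitrary example value */
    rc = f2(&obj_a->a2);     /* no cast needed: the type is already uint64_t * */
    free(obj_a);
    return rc;
}
```

These fixes remove the undefined behavior in the example as posted, but they would not by themselves explain a misaligned access in the real system.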

The point was that even Sun messed up alignment in basic operating system tools such as mkfs on SPARC. SPARC has some very strict alignment restrictions, and the OP's code is one of the archetypes of ways to violate those restrictions: a malloc'd block that's used by a structure that has a small member with looser alignment restrictions declared before a larger member that can have much stricter alignment restrictions.

As far as I can tell, GCC on SPARC has no equivalent of the "-xmemalign" argument the Solaris Studio compilers have:

Man Page cc.1

Solaris supports memalign() - a malloc variant that allows specification of alignments on a per object basis. We have had to take that approach with some code.
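A sketch of the per-object approach. Note the swap: it uses POSIX posix_memalign() as a stand-in for Solaris memalign(), on the assumption that it is available on the target system, and alloc_aligned() is a hypothetical wrapper name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <stdlib.h>

/* Returns `size` bytes aligned to `alignment`, or NULL on failure.
 * posix_memalign() requires alignment to be a power of two and a
 * multiple of sizeof(void *). Release the block with free(). */
static void *alloc_aligned(size_t alignment, size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, alignment, size) != 0)
        return NULL;
    return p;
}
```

This only controls where the block starts; the offsets of members inside the block are still decided by the structure layout.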

memalign() isn't going to help in this case - malloc() itself has to return "memory suitably aligned for any use". The problem comes from using a structure in such a way that a member that has a strict alignment is offset from the beginning of the structure by an amount that causes the alignment to be wrong. Per the "-xmemalign" documentation from the Solaris Studio "cc" man page:

And that's why the OP's code that accesses the structure via malloc()'d memory causes a bus error.

It's pretty much the archetype of how to get a SIGBUS error on SPARC.
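The arithmetic behind that can be sketched without touching any memory: a member's address is the structure's base address plus the member's offset, and both have to cooperate. The names a_rec, a2_address() and a2_would_be_aligned() are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

struct a_rec {
    int      a1;
    uint64_t a2;
};

/* Address member a2 would have if a struct a_rec started at `base`. */
static uintptr_t a2_address(uintptr_t base)
{
    return base + offsetof(struct a_rec, a2);
}

/* Returns 1 if that address is a multiple of 8, i.e. safe for a 64-bit load
 * on a machine that requires natural alignment. */
static int a2_would_be_aligned(uintptr_t base)
{
    return a2_address(base) % sizeof(uint64_t) == 0;
}
```

With an 8-byte-aligned base and an offset that is a multiple of 8, the member address is a multiple of 8. If either one is off, for example a base carved out of a byte buffer at an odd position, or padding removed by packing, the sum is not a multiple of 8 and a 64-bit load from it can raise SIGBUS.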

struct _a
{
	int _a1;
	uint64_t _a2;
}

Please explain how you come to the conclusion that the 2nd member of the structure requires strict alignment. I do not see that requirement.

There are three ways to get SIGBUS on SPARC:

  1. Writing to memory mapped with mmap() using the MAP_NORESERVE flag: the initial write to a virtual page requires that page to actually be created, but no swap space is available as a backing store.

  2. Hardware error.

  3. Misaligned memory access.

This isn't 1 or 2.
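If there is any doubt about which case applies, the signal itself can say so. The sketch below installs a SIGBUS handler with sigaction() and reports the si_code value; BUS_ADRALN is the POSIX code for invalid address alignment, matching the OP's message. The function names are hypothetical, and which code the other two cases report is not something I'd assume.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Maps the POSIX SIGBUS si_code values to readable names. */
static const char *bus_code_name(int code)
{
    switch (code) {
    case BUS_ADRALN: return "BUS_ADRALN (invalid address alignment)";
    case BUS_ADRERR: return "BUS_ADRERR (nonexistent physical address)";
    case BUS_OBJERR: return "BUS_OBJERR (object-specific hardware error)";
    default:         return "other";
    }
}

/* Writes a string to stderr using write(), avoiding stdio in the handler. */
static void put(const char *s)
{
    ssize_t r = write(STDERR_FILENO, s, strlen(s));
    (void)r;
}

static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    put("SIGBUS: ");
    put(bus_code_name(info->si_code));
    put("\n");
    _exit(1);
}

/* Returns 0 on success, -1 on failure (same as sigaction()). */
int install_sigbus_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

Calling install_sigbus_handler() early in main() would be enough; info->si_addr also holds the faulting address if more detail is needed.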

The Oracle/Sun docs (these are super short paragraphs) basically say that a 64-bit integer has to be aligned on a 64-bit boundary, a 32-bit integer on a 32-bit boundary, and a 16-bit integer on a 16-bit boundary.

http://docs.oracle.com/cd/E19253-01/816-4854/hwovr-2/index.html

http://docs.oracle.com/cd/E19253-01/816-4854/hwovr-3/index.html

x86 is one of the few common architectures that does not have these limitations, and even there, accessing memory in that manner still comes at a performance cost.