Free() corrupted unsorted chunks

Karunx · August 13, 2013, 6:21am

We are migrating Pro*C code from SOLARIS to LINUX-Redhat.

While migrating, we intermittently hit a memory de-allocation error when processing a large volume of data.

Below is the relevant part of the code (the full source is large, so I am posting only the part where the issue appears):

------------------------------------------------------------

void free_frm_nde(node * nde_ptr)
{
    static int num_ndes;
    int i;

    if (nde_ptr == NULL)
        return;

    if (nde_ptr->ptr_nxt != NULL)
        free_frm_nde(nde_ptr->ptr_nxt);

    for (i = 0; i < 7; i++)
    {
        if (nde_ptr->ptr_dwn != NULL)
            free_frm_nde(nde_ptr->ptr_dwn);
    }
    free(nde_ptr->ptr_dta); /* Error occurs here */
    free(nde_ptr);
}

------------------------------------------------------------

Below is the error message:

*** glibc detected *** : ./filename: free(): corrupted unsorted chunks: 0x0a068b80 ***
======= Backtrace: =========
/lib/libc.so.6[0xa044a5]
/lib/libc.so.6(cfree+0x59)[0xa048e9]
./filename[0x8051180]
./filename[0x805115c]
./filename[0x805115c]
./filename[0x805115c]
./filename[0x805115c]
./filename[0x804cb31]
./filename[0x804eea7]
./filename[0x804ed8f]
./filename[0x8049a27]
./filename[0x804966f]
/lib/libc.so.6(__libc_start_main+0xdc)[0x9b0e9c]
./[0x8048f21]
======= Memory map: ========
0097c000-00997000 r-xp 00000000 fd:03 229417 /lib/ld-2.5.so
00997000-00998000 r--p 0001a000 fd:03 229417 /lib/ld-2.5.so
00998000-00999000 rw-p 0001b000 fd:03 229417 /lib/ld-2.5.so
0099b000-00aef000 r-xp 00000000 fd:03 230066 /lib/libc-2.5.so
00aef000-00af0000 ---p 00154000 fd:03 230066 /lib/libc-2.5.so
00af0000-00af2000 r--p 00154000 fd:03 230066 /lib/libc-2.5.so
00af2000-00af3000 rw-p 00156000 fd:03 230066 /lib/libc-2.5.so
00af3000-00af6000 rw-p 00af3000 00:00 0
00af8000-00b0d000 r-xp 00000000 fd:03 230096 /lib/libpthread-2.5.so
00b0d000-00b0e000 ---p 00015000 fd:03 230096 /lib/libpthread-2.5.so

Any idea about this issue would be appreciated.

Thanks in advance.

Don_Cragun · August 13, 2013, 7:49am

The spot that you have marked in red is not where the error occurs; it is where the error is detected. This error occurs because you have corrupted a pointer the system uses to keep track of space that has been malloc()ed. The most common causes for this type of corruption are (1) using an uninitialized pointer and (2) writing more data into memory than was allocated for the buffer into which the data is being written.
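The gap between where a heap overrun happens and where it is noticed can be shown with a small guard-byte wrapper. This is only a sketch: `guarded_alloc`, `guard_intact`, `guarded_free`, `GUARD_LEN`, and `GUARD_BYTE` are hypothetical names that do not come from the posted code or from glibc. The wrapper over-allocates by a few bytes so that a short overrun lands in memory the program still owns, and the damage is reported only when the buffer is released.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  8      /* hypothetical: number of guard bytes after the buffer */
#define GUARD_BYTE 0xA5   /* hypothetical: pattern written into the guard area */

/* Allocate n usable bytes followed by GUARD_LEN guard bytes. */
void *guarded_alloc(size_t n)
{
    unsigned char *p;

    if (n > SIZE_MAX - GUARD_LEN)
        return NULL;                      /* size computation would overflow */
    p = malloc(n + GUARD_LEN);
    if (p == NULL)
        return NULL;
    memset(p + n, GUARD_BYTE, GUARD_LEN); /* fill the guard area */
    return p;
}

/* Return 1 if the guard bytes after the n usable bytes are untouched, else 0. */
int guard_intact(const void *buf, size_t n)
{
    const unsigned char *p = buf;
    size_t i;

    for (i = 0; i < GUARD_LEN; i++)
        if (p[n + i] != GUARD_BYTE)
            return 0;
    return 1;
}

/* Check the guard, then free. Returns 0 if clean, -1 if an overrun was found.
 * The overrun itself happened earlier, wherever the bad write was made. */
int guarded_free(void *buf, size_t n)
{
    int ok;

    if (buf == NULL)
        return 0;
    ok = guard_intact(buf, n);
    if (!ok)
        fprintf(stderr, "overrun detected at free time for buffer %p\n", buf);
    free(buf);
    return ok ? 0 : -1;
}
```

Copying six bytes into a buffer obtained with `guarded_alloc(4)` raises no complaint at the copy; `guarded_free(buf, 4)` is the first call that reports it, which mirrors how glibc's `free()` reports corruption that was caused elsewhere.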

jim_mcnamara · August 13, 2013, 8:19am

Solaris has a high-performance malloc library for threading. Its behavior may be different enough from Linux that the problem did not occur on Solaris but does show up on Linux.

But that is grabbing at straws.

I would suspect that while porting someone altered code. Is this a verbatim copy of the Solaris code? i.e., do checksums match? If you want to fix this mess, don't simply say 'yes', do a checksum.

If checksums match for the code, then grep the make files for the word malloc to see if the library I mentioned (specifically Solaris (lib)mtmalloc) is being used. If there is a Linux version of it with the same name, it probably has nothing in common with Solaris libmtmalloc. Do not use it. RHEL links threaded Pro*C just fine normally.

alister · August 13, 2013, 8:46pm

Or, glibc free() could be detecting a problem that the Solaris counterpart does not.

I've experienced a similar situation: OpenBSD segfaulting on code that ran just "fine" under Linux/glibc. After some investigation, OpenBSD's implementation had exposed a years-old bug.

I am not familiar with the Solaris implementation you mentioned, but perhaps it supports runtime configuration (config file or env variable) to enable additional checks? If so, they may trigger the bug (if indeed there is one).

Regards,
Alister

Karunx · August 14, 2013, 4:24am

@jim mcnamara
Thanks for your reply..
There is no code change, and the make file does not use the libmtmalloc library on either SOLARIS or LINUX.

---------- Post updated at 03:24 AM ---------- Previous update was at 01:44 AM ----------

@Don Cragun,jim mcnamara,Alister

The existing SOLARIS environment is 32-bit and the target LINUX-Redhat environment is 64-bit.
We are generating 32-bit executables only, and the verbatim code sometimes works fine on LINUX.
We are not sure why the code fails intermittently.
Is there any specific RPM that needs to be installed?
Any ideas would be appreciated.
Thanks

maverick_here · August 14, 2013, 5:46am

Hi,

I had a similar issue when I was migrating application code (C) from SCO to Solaris. Support informed me that the issue was big endian versus little endian, and they had a script in place to remedy it. I may be wrong, but it is a likely guess, since Linux is little endian and Solaris is big endian.

Don_Cragun · August 14, 2013, 8:28am

There is nothing magical about a bad pointer problem going undetected for years. Depending on what the problem is, the same source code built using a different compiler, running on different hardware, or even running it at a different time of day may mask a problem until the right sequence of events happens in your program to expose the problem. 99.44% of the time (at least in my experience looking at bug reports against Solaris systems), the bug is in your code. The rest of the time, it may be a bug in the kernel, in a system library, or a hardware problem.

Without carefully analyzing your code, there is no way to guess at what the problem might be. You basically need to look at every line of code that allocates space, every line of code that uses a pointer (or an array subscript), and every line of code that frees space to verify that the pointer or array subscript is in bounds for the space allocated to that buffer or array, and that allocated space is not referenced after it is freed.
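As one illustration of that kind of audit, here is a sketch of a tree-freeing routine in which every pointer is initialized at allocation and cleared after it is freed. The `node` layout, `NUM_DOWN`, `new_node`, and `free_from_node` are all assumptions made for the example; the real `node` type was not shown in the thread. In particular, the posted loop never uses its index `i`, so this sketch guesses that `ptr_dwn` is an array of seven child pointers, which may not match the real code.

```c
#include <stdlib.h>

#define NUM_DOWN 7   /* assumption: taken from the loop bound of 7 in the posted code */

/* Hypothetical layout; the real node type was not shown. */
typedef struct node {
    struct node *ptr_nxt;
    struct node *ptr_dwn[NUM_DOWN];
    void        *ptr_dta;
} node;

/* Allocate a node with every pointer explicitly initialized, so no
 * uninitialized pointer can ever reach free(). */
node *new_node(size_t dta_len)
{
    node *n = malloc(sizeof *n);
    int i;

    if (n == NULL)
        return NULL;
    n->ptr_nxt = NULL;
    for (i = 0; i < NUM_DOWN; i++)
        n->ptr_dwn[i] = NULL;
    n->ptr_dta = NULL;
    if (dta_len > 0) {
        n->ptr_dta = malloc(dta_len);
        if (n->ptr_dta == NULL) {
            free(n);
            return NULL;
        }
    }
    return n;
}

/* Free a node and everything reachable from it; return the number of nodes
 * released. Each link is cleared right after its subtree is freed, so a
 * repeated visit to the same slot cannot free the same memory twice. */
size_t free_from_node(node *n)
{
    size_t count = 0;
    int i;

    if (n == NULL)
        return 0;
    count += free_from_node(n->ptr_nxt);
    n->ptr_nxt = NULL;
    for (i = 0; i < NUM_DOWN; i++) {
        count += free_from_node(n->ptr_dwn[i]);
        n->ptr_dwn[i] = NULL;
    }
    free(n->ptr_dta);   /* free(NULL) is a no-op */
    n->ptr_dta = NULL;
    free(n);
    return count + 1;
}
```

Returning the count gives a cheap cross-check: if the number of nodes freed does not match the number allocated, some part of the structure was either leaked or shared between two parents.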

If you show us your code, we might spot the problem in seconds, or we might never figure it out. If you don't show us your code, there isn't anything we can do to help you other than suggest that you set breakpoints in your code, dump variables that seem to be corrupted, and add debugging statements until you identify the problem and fix it. But, be aware that adding a line of debugging code can easily change the way your program runs just enough to hide a problem. I.e., debugging bad pointers can be really hard.

You have not said anything yet that sounds like there was a bug on your old Solaris system nor that there is a bug on your new Linux system (although subtle differences in the ways functions are defined to behave on the two systems may well be your problem).

alister · August 14, 2013, 11:40am

Endianness is a property of the underlying hardware's architecture, not the operating system. Saying that Linux is little endian or that Solaris is big endian is nonsensical. Both Solaris and Linux run on little, big, and bi endian architectures.

You did not provide any specifics, but my guess is that your situation involved serializing/deserializing across endianness, between Linux on a little-endian architecture and Solaris on a big-endian architecture, and that the script was a workaround which swapped bytes in a data file.
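A minimal sketch of what such byte swapping looks like in C, and of how to sidestep it by decoding a file format byte by byte. The function names `host_is_little_endian`, `swap32`, and `read_be32` are hypothetical and are not taken from any script or code mentioned in this thread.

```c
#include <stdint.h>
#include <string.h>

/* Report the byte order of the machine this code is running on. */
int host_is_little_endian(void)
{
    uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);   /* inspect the lowest-addressed byte */
    return first == 1;
}

/* Reverse the byte order of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}

/* Decode a 32-bit big-endian value from raw bytes. This gives the same
 * result on any host, because it never reinterprets memory as an integer. */
uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] <<  8) |
            (uint32_t)p[3];
}
```

Reading a length field written on a big-endian machine with a plain `fread` into a `uint32_t` on a little-endian machine yields the swapped value; if that value is then used as an allocation size or a copy length, the result can be exactly the kind of heap corruption discussed above.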

Perhaps improperly deserialized data is a factor in the OP's issue.

Regards,
Alister

Karunx · August 14, 2013, 12:18pm

Hi Maverick,

We have checked many areas, including bit size, endianness, etc. But it is quite surprising that the program runs fine in the Linux environment a few times and then intermittently throws the error. We are now trying the Valgrind tool. If you have any other leads, keep me posted.

Thanks