AIX - remote shell (sudo) - signal 11 core system 50

brjohnsmith · November 17, 2014, 6:32pm

Hi,
I am running a remote shell from site A to site B; both are AIX. The remote shell starts another application, and when it finishes, control returns to site A.
The problem is that I am receiving an error: signal 11 and system core error 50 (segmentation fault).
Does anyone know whether there is some configuration for the remote shell channel, the thread size, or anything else that I need to resize in order to run it? BTW, sometimes it works and sometimes it does not, and it seems to depend on the size of the application running on site B (I am not sure of that yet).
The person responsible for the operating system says he cannot see anything wrong and that the problem is in the application. The fact is that when the application is run locally on the same site, it works every time.
It is a weird situation and I don't know what to do to trace it or to see the problem. I hope someone can suggest something, such as operating system parameters related to the thread size, the remote shell, etc. Oh, I forgot to say: the remote shell is executed through sudo, i.e., a sudo is necessary to run the remote shell on the other side.
I would appreciate any help or hints for this issue.
tks.

Don_Cragun · November 17, 2014, 7:32pm

The most common reasons for an application to die with a segmentation fault (assuming no one explicitly sent it a SIGSEGV signal) are:

using an uninitialized pointer,
buffer overflow (allocating a buffer of size x and writing into buffer[n] where n >= x or n < 0 [in C, valid array offsets are 0 to x-1]), or
searching for the end of a string in a character array that does not include a terminating null byte.
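
As an illustration of the last two items, here is a minimal C sketch of the checks that prevent them. The function names `checked_store` and `bounded_length` are made up for this example and are not from the application being discussed; the point is only that every index is compared against the buffer size, and that a string search never runs past the end of the array.

```c
#include <stddef.h>
#include <string.h>

/* Store value at buf[index] only if index is inside the buffer.
 * Returns 0 on success, -1 if the index is out of range. */
int checked_store(char *buf, size_t buf_size, long index, char value)
{
    if (index < 0 || (size_t)index >= buf_size)
        return -1;              /* this write would have overflowed buf */
    buf[index] = value;
    return 0;
}

/* Length of the string in buf, or -1 if no terminating null byte
 * is found within the first buf_size bytes. */
long bounded_length(const char *buf, size_t buf_size)
{
    const char *end = memchr(buf, '\0', buf_size);
    return end ? (long)(end - buf) : -1;
}
```

With a 4-byte buffer, a store at index 4 or -1 is rejected, and `bounded_length` reports -1 instead of reading beyond the array when the terminator is missing.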

achenle · November 17, 2014, 8:07pm

Check for environment variable differences when the app is run remotely. You could be loading the wrong shared object because of a different environment, for example. You might also be running into resource limits on the remote invocation, such as max memory usage.

Can you get a stack trace from a core file? If so, what is the app trying to do when it SEGVs? If you can get a stack trace, can you get a memory map? Where is the code being executed from? What shared library or executable?

Intermittent SEGVs can be tremendously hard to track down. Heap corruption from buffer overflows - the most common cause of intermittent SEGVs - tends to appear almost random at times because of the way heap memory tends to work. (Because of hardware alignment requirements, heap memory from "malloc()/calloc()/etc" and/or "new" tends to be parceled out in 8-byte blocks. So if you malloc() a 25-byte buffer, on most of today's hardware you really get 32 bytes...)
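
The rounding arithmetic can be sketched in a few lines of C. This is only the arithmetic, under the assumption of a fixed block granule; it is not how the AIX allocator is implemented, and real allocators also add their own bookkeeping. The names `rounded_request` and `slack_bytes` are invented for the example.

```c
#include <stddef.h>

/* Round a request up to the next multiple of granule (granule must be > 0). */
size_t rounded_request(size_t request, size_t granule)
{
    return (request + granule - 1) / granule * granule;
}

/* Bytes past the requested size that still fall inside the rounded block. */
size_t slack_bytes(size_t request, size_t granule)
{
    return rounded_request(request, granule) - request;
}
```

For a 25-byte request and an 8-byte granule this gives a 32-byte block with 7 bytes of slack. An overflow that stays inside the slack may go unnoticed, while a slightly longer one lands in whatever follows the block, which is one plausible reason the failure looks random and size-dependent.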

Are you responsible for developing this application? If so, have you ever tried something like Purify? Go look at the cost of that tool, then calculate how much time you've already spent trying to run down this ONE problem...

brjohnsmith · November 18, 2014, 6:31pm

Hi,

Since analysing the trace and the memory map is a byte-by-byte process, I would first like to know more about resizing or parameters. To answer the question of whether I can change the application: yes, I can.

I have heard some other hints. One of them was to identify the limits of the user that runs the remote command. To do that, I used the command ulimit, and the answer was: unlimited. The other hint was to check the limits of each machine; in /etc/security/limits all of the parameters are set to -1, which seems to mean unlimited as well.
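
A caveat on this check: in most shells, ulimit with no option reports only the file size limit, and limits are inherited per process, so an interactive login may not show what the process started through the remote shell and sudo actually gets. A rough C sketch of reporting the limits from inside the process itself follows; it assumes the application can be modified or that a small helper can be launched the same way. The function names are made up for the example; `getrlimit` and the `RLIMIT_*` constants are standard POSIX.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Render one limit value as text; RLIM_INFINITY is shown as "unlimited". */
const char *limit_text(rlim_t value, char *out, size_t out_size)
{
    if (value == RLIM_INFINITY)
        snprintf(out, out_size, "unlimited");
    else
        snprintf(out, out_size, "%llu", (unsigned long long)value);
    return out;
}

/* Print the soft and hard value of one resource.
 * Returns 0 on success, -1 if getrlimit() fails. */
int print_limit(const char *name, int resource)
{
    struct rlimit rl;
    char soft[32], hard[32];

    if (getrlimit(resource, &rl) != 0) {
        perror(name);
        return -1;
    }
    printf("%-6s soft=%s hard=%s\n", name,
           limit_text(rl.rlim_cur, soft, sizeof soft),
           limit_text(rl.rlim_max, hard, sizeof hard));
    return 0;
}

/* Report the limits most likely to matter for a crash under load. */
int print_crash_related_limits(void)
{
    int rc = 0;
    rc |= print_limit("data", RLIMIT_DATA);
    rc |= print_limit("stack", RLIMIT_STACK);
    rc |= print_limit("core", RLIMIT_CORE);
    return rc;
}
```

Calling `print_crash_related_limits()` once locally and once through the remote shell, then comparing the two outputs, would show whether the remote invocation really runs with the same limits.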

So I am assuming that there is no problem with the limits or with the machines, and that the problem could be the size of the application, which handles some (or many) variables, contents, etc. (I am not sure about this yet).

If someone knows of other parameters, files, or any operating system issue that I should check first, please let me know; otherwise, I will start trying changes to the application.

tks.

achenle · November 18, 2014, 7:54pm

Are you saying you're going to start making changes without knowing exactly what's failing?

That's known as Easter-egging.

brjohnsmith · November 19, 2014, 12:34pm

Hi achenle,

Yes, it is Easter-egging. The fact is that no one knows what is going on, not even the operating system guy. It seems that there is no limit to resize, and when the amount of instructions is not too large, it works; that is why we are considering that the size of the application could be one of the reasons for the error. Since the application can be changed without depending on anyone else, that would be the easier thing to do. However, as I said, if anyone can say something about operating system parameters or the like, I would appreciate it. (My knowledge of Unix is zero.)
One thing I need to know is whether the process is executed inside a thread, because it could be that this thread needs to be bigger than it is defined.
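
On the thread question: nothing in this discussion shows whether the application uses threads. If it does, the stack of the main thread is normally governed by the process stack limit, while every other thread gets the size given in its pthread attributes (or a platform default). A minimal C sketch with the standard pthread attribute calls, assuming the application creates its threads with `pthread_create`; the two function names are invented for the example.

```c
#include <pthread.h>
#include <stddef.h>

/* Default stack size a new thread would get, or 0 on error. */
size_t default_thread_stack_size(void)
{
    pthread_attr_t attr;
    size_t size = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    if (pthread_attr_getstacksize(&attr, &size) != 0)
        size = 0;
    pthread_attr_destroy(&attr);
    return size;
}

/* Prepare attributes for a thread with `wanted` bytes of stack.
 * Returns 0 on success, or the pthread error code. The caller passes
 * attr to pthread_create() and later calls pthread_attr_destroy(). */
int init_attr_with_stack(pthread_attr_t *attr, size_t wanted)
{
    int rc = pthread_attr_init(attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setstacksize(attr, wanted);
    if (rc != 0)
        pthread_attr_destroy(attr);
    return rc;
}
```

Logging `default_thread_stack_size()` at startup would at least show what the threads are given before deciding that anything needs to be bigger.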
tks.

achenle · November 19, 2014, 1:41pm

Easter-egging an intermittent SEGV is not easy.

If you don't know what's going on, you're not looking at the information you have.

Have you even tried to examine a core file from one of the SEGVs?
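
A core file is only written when the core size limit of the crashing process is not zero. Since the application can be changed, one option is to raise that limit at startup. The sketch below is an assumption about how that could be done, not something taken from the application; `allow_core_dumps` is a made-up name, while `getrlimit`, `setrlimit` and `RLIMIT_CORE` are standard POSIX.

```c
#include <sys/resource.h>

/* Raise the soft core-file limit to the hard limit for this process.
 * Returns 0 on success, -1 on failure (errno is set by the system call). */
int allow_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;   /* the soft limit may go up to the hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Once a core file exists, it can be opened with a debugger together with the executable (on AIX that is usually dbx, where the `where` subcommand prints the stack trace) to see what the program was doing when it died.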

Have you even looked into trying any of the many memory corruption tools that are available?

https://www.google.com/search?q=aix+memory+corruption+tools