help required - stack trace

ranj1 · August 10, 2006, 5:23am

Hi all,
One of our programs, written in Java, produced this log file. The job runs 48 threads, and only one thread failed with this error. The code is a black box (an external product), so we can't look at the source. From what I can infer from the log, the job was trying to write log messages to a file but failed to do so. The error is below.

There was no core dump either. We were having swap space issues on this box previously, and our admins increased the swap two days ago. The next day the job ran fine, but the subsequent run failed with the error above. Can anyone throw some light on this?
Box: HP-UX B.11.11
swapinfo -t gives

jim_mcnamara · August 10, 2006, 7:34am

Was swapinfo run during the time the Java app was running?

ranj1 · August 10, 2006, 7:50am

Sorry for the delayed reply. Was not at desk.

I ran it separately. But once, while monitoring the job, we got a red alert in 'glance' saying 'Global swap space is nearly full'. It was after we pointed this out that the swap was increased.

jim_mcnamara · August 10, 2006, 9:26am

That means the app is requesting lots of virtual memory. All I get is:

it's a 64 bit app
sigreturn (didn't know it was in HP-UX) switches context (like longjmp) after receiving a signal.

So, what signal triggered the problem... did you check syslog?

ranj1 · August 10, 2006, 9:45am

No signal in syslog. Is there any command to see which job is taking up the most virtual memory? If we know which application is using the resources, we can at least ask for the job to be rescheduled.

Regards,
Ranjith
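
For readers with the same question, here is a minimal sketch of one way to rank processes by virtual size. It assumes the XPG4 form of `ps` (on HP-UX 11.x that means running it with `UNIX95` set in the environment) and that a Python interpreter is available on the box. The helper names `parse_ps_vsz`, `top_by_vsz` and `run_ps` are made up for illustration, and the units of the `vsz` column should be checked against the local ps(1) man page.

```python
import os
import subprocess


def parse_ps_vsz(ps_output):
    """Parse 'vsz pid args' lines into (vsz, pid, command) tuples."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header and any line that does not start with two integers.
        if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        command = parts[2].strip() if len(parts) == 3 else ""
        rows.append((int(parts[0]), int(parts[1]), command))
    return rows


def top_by_vsz(ps_output, n=10):
    """Return the n processes with the largest virtual size, largest first."""
    return sorted(parse_ps_vsz(ps_output), key=lambda r: r[0], reverse=True)[:n]


def run_ps():
    """Run ps in XPG4 mode (assumption: UNIX95 enables the -o option on HP-UX)."""
    env = dict(os.environ, UNIX95="1")
    result = subprocess.run(
        ["ps", "-e", "-o", "vsz,pid,args"],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Something like `for vsz, pid, cmd in top_by_vsz(run_ps()): print(vsz, pid, cmd)` would then list the largest processes; running it a few times while the job is active gives a rough picture of who is growing.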

jim_mcnamara · August 10, 2006, 10:59am

The lsp_engine - looks like a virtual machine - is the one with the problem. If the virtual memory problem started with running this app, you don't need to worry about who/what is using memory. Try to run the app as close to solo as possible.

I have not done sysadmin work for a long time - maybe somebody like Perderabo can give you tuning help. What I see is that you have a lot of free pagefile space.

It may also be that you will have to reconfigure the app, if that is possible.

jim_mcnamara · August 11, 2006, 10:09am

Are you anywhere further along on solving this problem?

ranj1 · August 11, 2006, 10:33am

No Jim;

The job has been running fine for the past 2 days. We raised the issue with our sysadmins and they haven't got back to us. It looks like they are running 2 similar applications (a new project on the Acceptance box), and both have multiple threads running at the same time. Our DBA's advice was that after the testing one app would be switched off, and this issue shouldn't occur.

So, what we have determined to do in case of job failure is to check the system status at that moment and the jobs currently running and try to rerun the job after some time when the activity on the system has decreased.

I hope that will help.

Thanks for your help on this. I will get back if we find anything else.
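
The rerun-when-quiet procedure described in this post could be sketched roughly as follows. This is only an illustration under stated assumptions: `run_job` and `swap_used_pct` are hypothetical callables (the second might wrap whatever swap figure your monitoring already provides), and the 80% threshold and 10-minute wait are arbitrary placeholders, not recommended values.

```python
import time


def rerun_when_quiet(run_job, swap_used_pct, threshold_pct=80.0,
                     wait_seconds=600, max_attempts=5, sleep=time.sleep):
    """Rerun a failed job once swap usage drops below a threshold.

    run_job       -- hypothetical callable, returns True if the job succeeded
    swap_used_pct -- hypothetical callable, returns current swap usage in percent
    Returns the number of times the job was run on success, or None if we gave up.
    """
    runs = 0
    for _ in range(max_attempts):
        if swap_used_pct() < threshold_pct:
            runs += 1
            if run_job():
                return runs
        # System still busy, or the rerun failed: wait before checking again.
        sleep(wait_seconds)
    return None
```

Passing `sleep` as a parameter keeps the loop easy to dry-run without actually waiting.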

ranj1 · September 5, 2006, 8:46am

This issue arose again. The job abended today with the same issue. I requested a rerun for the job. It runs 48 threads. After 4 minutes, one of the threads aborted with the error mentioned in this thread. All the other threads ran fine and completed. Below are the snapshots of the commands run while the job was executing. I have put the details in the attachment. Hope it throws some light.

ranj1 · September 26, 2006, 8:20am

This issue occurred again. Can anyone throw some light on this, or suggest what other statistics I can gather for analysis?

Regards,
Ranjith