ps, uptime and w commands reply "Killed" for non-root users

Hi,

After running the ps, uptime or w command as a normal user (dba and staff groups), I get the reply "Killed".

As root, every command works fine.
I checked all the user settings against other servers and could not find any errors or differences.

The oslevel is 5300-10-01-0921.

Any idea what to check?

Can you paste the exact terminal output that you see?

Hi,
Yes, for example:

[root@fihelor03:/root/]#su - test
[test@fihelor03:/home/test]$ps
Killed

Just a guess: could it be that the user is missing access to the shared libraries somehow?

I hope this helps.

bakunin

What do you mean by shared libraries? Which ones exactly?

On an AIX 5.3 system /usr/bin/ps uses libc.a, libwlm.a and libaacct.a.

Try comparing all ulimit values for root and the user.
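
One possible way to do that comparison is sketched below. It assumes you are root on the affected box (so su does not ask for a password) and that the ulimit shell builtin still works for the user; the function name and the /tmp file names are made up for illustration.

```shell
# Sketch: diff the saved "ulimit -a" output of two accounts.
# limits_diff and the file names below are hypothetical helpers.
limits_diff() {
    # $1, $2: files holding the "ulimit -a" output of each account
    if diff "$1" "$2"; then
        echo "limits are identical"
    fi
}

# Usage as root (user name "test" as in the session shown earlier):
#   ulimit -a > /tmp/limits.root
#   su - test -c "ulimit -a" > /tmp/limits.test
#   limits_diff /tmp/limits.root /tmp/limits.test
# Repeat with "ulimit -Ha" to compare the hard limits as well.
```

Any line that diff prints is a limit that differs between the two accounts.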

I have the same problem on AIX 6.1.
The problem appears after two or three days of workload.
The ulimit values of the users are the same as root's.
Any clues?

Check the error log with errpt. You may have rootvg corruption or something really whacky going on.

Try running a trace (with truss) on it and see if this gives any clues:

$ truss ps

Sorry, but when the problem occurs, truss is killed as well.
There are no errors in errpt and no sign of rootvg corruption.

This symptom appears after two or three days of work. A reboot solves the problem, and the system then works as normally as any other AIX box I have.

Can I reinstall AIX in upgrade mode when the installed version is newer (because of the fixes) than the version on the DVD?

IBM Lab says that they can see the kill signal, but they cannot identify why this happens.

This reminds me of a machine I once had, which ran a memory hog. The application would slowly allocate all the memory in the machine, thereby filling up the swap.

Once swap utilization on AIX exceeds roughly 96%, it can no longer reorganize its swap and the system starts to react unpredictably. One sign that the machine is close to hanging is that commands get killed the way you describe. In my experience it was usually a matter of minutes before the final crash set in.

Since you say that a reboot remedies the problem temporarily, this seems to fit. Perhaps you could monitor memory and swap utilization and correlate them with the times the problem happens? Just an idea.
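
A minimal sketch of such monitoring, assuming the usual two-line output of lsps -s (a header line, then total size and percent used). The function names and the 90% warning threshold are arbitrary choices for illustration, not anything AIX prescribes.

```shell
# Extract the "Percent Used" figure from lsps -s style output on stdin.
# Assumes line 2 looks like: "      4096MB              12%"
paging_pct() {
    awk 'NR == 2 { sub(/%/, "", $2); print $2 }'
}

# Print one timestamped sample; warn above a threshold (default 90,
# an arbitrary value chosen for this sketch).
log_paging() {
    pct=$(lsps -s | paging_pct)
    echo "$(date '+%Y-%m-%d %H:%M:%S') paging_used=${pct}%"
    if [ "$pct" -ge "${1:-90}" ]; then
        echo "WARNING: paging space above ${1:-90}%"
    fi
    return 0
}

# Usage, e.g. from cron every few minutes:
#   log_paging >> /tmp/paging.log
```

Comparing the timestamps in the log with the times the commands start getting killed should show whether the two are related.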

I hope this helps.

bakunin

Bakunin,

If you were right, wouldn't this happen to all users, including root? At least on my boxes, all user processes are impacted when a rogue process eats up the memory and the paging space.

My best guess would be the number of runnable processes, which is likely unlimited for root but certainly not for the other users (there it is 2000 by default, and this value is set system-wide per user, no matter how many processes the user himself runs). On a very busy box it can add up very fast, since every Oracle query forks several processes. Changing the value for one of the impacted users as a test when the problem occurs would prove me right or wrong.
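
To check this, something along the following lines might do, run as root since ps still works there. It is only a sketch: procs_per_user is a made-up helper, and it assumes the per-user limit in question is the maxuproc attribute of sys0.

```shell
# Count processes per user from a list of user names on stdin
# (one name per line, as printed by "ps -e -o user=").
# procs_per_user is a hypothetical helper name.
procs_per_user() {
    awk '{ n[$1]++ } END { for (u in n) print n[u], u }' | sort -rn
}

# Usage as root:
#   ps -e -o user= | procs_per_user | head
#   lsattr -El sys0 -a maxuproc        # current per-user process limit
# To raise the limit for a test (pick a value that suits the box):
#   chdev -l sys0 -a maxuproc=<new value>
```

If the busiest non-root user sits close to the maxuproc value when the problem starts, that would support this theory.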

My thoughts would also go in the memory direction, but toward pinned memory rather than paging. How much memory is pinned on the systems? AIX can only pin about 85% in total, and the longer the kernel is up, the more memory it pins. If you already have a high amount of pinned memory right after the reboot, for example by pinning your databases into memory (which is, by the way, really bad practice and just not required on AIX unless you are using very large pages), you can sometimes see this kind of issue without having full paging areas ...
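
A rough way to read the pinned share from vmstat -v, as a sketch only: it assumes the output contains the "memory pages", "pinned pages" and "maxpin percentage" lines with the number in the first column, and pinned_pct is a made-up helper name.

```shell
# Compute the pinned share of real memory from vmstat -v style output on stdin.
# Assumes lines such as "  2097152 memory pages" and "  345678 pinned pages".
pinned_pct() {
    awk '
        / memory pages/      { mem = $1 }
        / pinned pages/      { pin = $1 }
        / maxpin percentage/ { maxpin = $1 }
        END {
            if (mem > 0)
                printf "pinned=%.1f%% maxpin=%s%%\n", pin * 100 / mem, maxpin
        }'
}

# Usage as root:
#   vmstat -v | pinned_pct
```

Taking one reading right after a reboot and another when the commands start getting killed would show whether pinned memory creeps toward the maxpin value.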

Kind regards
zxmaus