Defunct Processes

Hi,

Can anyone help me get rid of defunct processes on an IBM AIX box? They first appeared after the system was rebooted following almost a year and a half of uptime. Once one defunct process is created, all the user IDs seem to get infected and in turn create numerous defunct processes. We have tried rebooting the system many times, but the defunct processes start appearing about 20 hours after each reboot. Right after a reboot we don't see any defunct processes piling up, but they show up 15-20 hours later and keep increasing thereafter. Does anyone have any idea what could be triggering this and what measures can be taken to get rid of it? This is affecting the whole system and the applications on it, and we have not been able to find a solution. Any help will be greatly appreciated.

Thanks.

I have bad, worse and plain ugly news for you:

The bad news is that rebooting is the only way to get rid of zombies (this is what the "defunct" processes are usually called).

The worse news is that AIX is not to blame for these processes. The OS (read: the kernel) maintains a so-called "process table", where every running process gets an entry. All the processes are additionally organised in a hierarchy. At the root of this hierarchy is a process called "init", which has process ID 1. In fact, when the machine is booted the first process to be started is this init, which in turn works through /etc/inittab (hence the name), and so on and so on until the system is up and alive.

When a user logs on, there is a getty process (one of the children of init) which starts a shell (now a child of getty) for the user, who can now start processes (children of his shell), and so on. If a process is declared to run independently of the user's shell (a background process or a process started with nohup), it is "passed over" to the init process as a child.

A zombie is now a process whose parent has died without cleaning up its leftovers. It still has an entry in the process table, but this entry cannot be removed because the parent process which would normally control it is gone. A branch has been cut off the tree, but a leaf from that branch is left dangling in the middle of nowhere.

Now, after the bad and the worse, the plain ugly news: zombies are, exclusively, a symptom of sloppy programming. A process is responsible for cleaning up its environment, and if it doesn't do so on a regular basis it's a case of "programmer has not found out how to program in the Unix environment yet". PEBKAC on behalf of your software vendor.

I hope this helps.

bakunin

Like bakunin, I would suggest something that may not sit well either. If it started recently, the first and only place to look is at:

patches, upgrades, or new software

added to the system just before the problem started.

Any new or updated cron entries, and any new or updated scripts that run under "at", also need to be checked out, if you allow either of these on your system.
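Since the zombies show up 15-20 hours after each reboot, one way to narrow down the cron entries is to list only those that can fire in that window. Here is a rough Python sketch of that idea. The function names and the sample crontab are made up for illustration, and it only looks at the hour field (plain numbers, ranges, lists and `*`); anything it cannot parse is reported rather than silently skipped.

```python
def hour_matches(field, hour):
    """Return True if a crontab hour field can match the given hour (0-23)."""
    try:
        for part in field.split(","):
            if part == "*":
                return True
            if "-" in part:
                low_text, high_text = part.split("-", 1)
                if int(low_text) <= hour <= int(high_text):
                    return True
            elif int(part) == hour:
                return True
    except ValueError:
        # Unknown syntax: be conservative and treat it as a possible match.
        return True
    return False


def entries_in_window(crontab_text, boot_hour, min_offset=15, max_offset=20):
    """List crontab lines that can run min_offset..max_offset hours after boot."""
    hours = {(boot_hour + off) % 24 for off in range(min_offset, max_offset + 1)}
    hits = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # not a "min hour dom mon dow command" entry
        if any(hour_matches(fields[1], h) for h in hours):
            hits.append(line)
    return hits


# Hypothetical crontab, for illustration only.
sample = """# nightly cleanup
0 3 * * * /opt/app/cleanup.sh
30 18 * * * /opt/app/report.sh
0 8-10 * * 1 /opt/app/weekly.sh
15 * * * * /opt/app/poll.sh
"""

for entry in entries_in_window(sample, boot_hour=2):
    print(entry)
```

You would feed it the output of `crontab -l` for each user and the hour of the last reboot. Entries that run every hour will always show up, so they are only suspects if they are new or recently changed.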

I have seen this type of problem after a reboot in the following circumstances (granted, it was Solaris, but the principle is the same):

  1. Server is booted
  2. A new filesystem is mounted, but not updated in vfstab
  3. A utility to do some task (like cleanup) is called from the new filesystem
  4. Server is rebooted
  5. The filesystem is not mounted at boot time
  6. The utility is called to perform the cleanup task - but there's nothing there.
  7. The application crashes quietly, you end up with defunct processes.
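A quick way to test for that scenario is to check that every command referenced from cron still exists after the reboot. A minimal Python sketch, assuming the commands are given as absolute paths in the sixth crontab field; the function name and the sample paths are hypothetical:

```python
import os


def missing_commands(crontab_text, exists=os.path.exists):
    """Return absolute command paths in a crontab that do not exist on disk."""
    missing = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # not a "min hour dom mon dow command" entry
        command = fields[5].split()[0]  # first word of the command field
        if command.startswith("/") and not exists(command):
            missing.append(command)
    return missing


# Hypothetical crontab, for illustration only.
sample_crontab = """0 3 * * * /newfs/bin/cleanup.sh -q
0 4 * * * /usr/bin/true
"""

if __name__ == "__main__":
    for path in missing_commands(sample_crontab):
        print("missing:", path)
```

The `exists` parameter is only there so the check can be tried against a fake file list. On the real box you would pass in the output of `crontab -l` and leave it at the default.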

Zombies have a living parent who can't be bothered to clean up. If that parent dies, its surviving child processes, including zombies, will get a new ppid of 1 and the parent will now be init. init always reaps all child zombies.
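To make the life cycle concrete, here is a minimal Python sketch, assuming a POSIX system where `os.fork` is available. The function name and the timings are made up for illustration. The child exits at once and stays a zombie for as long as the parent sleeps; the `waitpid` call is the "cleaning up" that a misbehaving parent never does.

```python
import os
import time


def spawn_and_reap(exit_code=7, linger=0.2):
    """Fork a child that exits immediately, then reap it after a short delay."""
    pid = os.fork()
    if pid == 0:
        # Child: exit right away without running any parent cleanup code.
        os._exit(exit_code)

    # Parent: while we sleep, the dead child is a zombie (<defunct> in ps),
    # because nobody has collected its exit status yet.
    time.sleep(linger)

    # Reaping: collecting the exit status removes the process table entry.
    reaped_pid, status = os.waitpid(pid, 0)
    if reaped_pid == pid and os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None


if __name__ == "__main__":
    print("child exit status:", spawn_and_reap())
```

A parent that forks in a loop and never calls `waitpid` (or otherwise handles SIGCHLD) piles up one zombie per dead child, which matches the symptom of a steadily growing count.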

So...want a zombie to go away? Look it up in ps and note the ppid. That is the pid of the misbehaving parent. Kill the parent. In a second or two the zombies should be gone.
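With hundreds of zombies it is easier to let a script do the lookup. A rough Python sketch that counts zombies per parent from `ps -ef` output, assuming the usual layout where the PID is the second column, the PPID is the third, and zombies show `<defunct>` in the command column. The function name and the sample lines are made up for illustration:

```python
from collections import Counter


def zombie_parents(ps_ef_output):
    """Return (ppid, zombie_count) pairs, worst offender first."""
    counts = Counter()
    for line in ps_ef_output.splitlines():
        if "<defunct>" not in line:
            continue  # header and ordinary processes
        fields = line.split()
        # Columns: UID PID PPID ...; the PPID is the third field.
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        counts[int(fields[2])] += 1
    return counts.most_common()


# Hypothetical ps -ef output, for illustration only.
sample_ps = """     UID   PID  PPID   C    STIME    TTY  TIME CMD
    root     1     0   0   Jan 01      -  0:05 /etc/init
  appusr  4100     1   0 09:12:01      -  0:02 /opt/app/bin/server
  appusr  4200  4100   0                  0:00 <defunct>
  appusr  4201  4100   0                  0:00 <defunct>
  batch   5300  5200   0                  0:00 <defunct>
"""

for ppid, count in zombie_parents(sample_ps):
    print(ppid, count)
```

The PPIDs at the top of the list are the processes to look up with `ps -fp <pid>`. Whatever program they turn out to be is the one that is not reaping its children.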

At least this is the way unix is supposed to work. Is AIX really different than this?

In (more than) one word: you don't want to know. ;-)) *)

Some (few) zombies can be cleaned up this way, most can't. I don't know why, but common knowledge and my experience is "zombies can only be cleaned by a reboot".

bakunin

___________
*) Years ago, when AIX 3.2.5 was current, there was a saying that this was the best MVS IBM had ever built. Go figure....

Thanks all for your thoughts ...
Can anyone give me a hint as to why these processes start 15-20 hours after the reboot? This has been happening for the last few days. Is there any script or anything else I should check that could be generating them at that time?