Orphaned process "D" state

Hello,

How can we clear a process stuck in D state (orphaned)? I have tried to kill it with kill -9, but that does not work.

The server is critical, so is there any way to clear the D-state process without rebooting the server?

You can check what the parent process is and, if possible, kill or restart the parent (as long as the parent is not the root process, PID 1).

In the case of remote mounts causing the D state, you can check the parent networking process and decide how to proceed.

Some people have tried to be creative as follows:

  1. Determine the zombie and parent PIDs (in this example, let's say the zombie's PID is 3200 and the parent's PID is 3100).
  2. Start gdb and attach to the parent; in this example, attach 3100
  3. Call waitpid for the zombie process; for example, call waitpid(3200,0,0)
  4. Detach from the parent (detach) and exit the debugger.
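
For a true zombie (Z state, not D), the steps above could be scripted roughly as follows. This is only a sketch: the helper name reap_zombie_cmd is made up for illustration, and it prints the gdb command instead of running it, so nothing gets attached by accident. The -p, -batch and -ex options are standard gdb options; the (int) cast is there because gdb may not know the return type of waitpid without debug info.

```shell
# Dry-run helper (hypothetical name): prints the gdb command, does not run it.
# Usage: reap_zombie_cmd PARENT_PID ZOMBIE_PID
reap_zombie_cmd() {
    parent=$1
    zombie=$2
    # gdb attaches to the parent, makes it call waitpid() on the zombie,
    # and -batch makes gdb detach and exit afterwards.
    printf "gdb -p %s -batch -ex 'call (int)waitpid(%s, 0, 0)'\n" "$parent" "$zombie"
}

# Example with the PIDs from the steps above (parent 3100, zombie 3200):
reap_zombie_cmd 3100 3200
```

Review the printed command before running it as root; it makes the parent execute a call it did not ask for.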

Update: Fixed typos (I think!)


D state is "device waiting" and is a bit nasty.
Such a process cannot be killed.
It makes sense to work out which device is blocking and fix it. Once it is fixed, the processes will leave the D state and continue.


Here are the process state codes and their descriptions:

D    Uninterruptible sleep (usually IO)
R    Running or runnable (on run queue)
S    Interruptible sleep (waiting for an event to complete)
T    Stopped, either by a job control signal or because it is being traced.
W    paging (not valid since the 2.6.xx kernel)
X    dead (should never be seen)
Z    Defunct ("zombie") process, terminated but not reaped by its parent.

As you can see, D means uninterruptible sleep, usually due to I/O.

You can check the wchan (the name of the kernel function in which the process is sleeping) to understand what exactly is going on:

ps -eo pid,ppid,state,wchan=WIDE-WCHAN-COLUMN,comm,args | ( read -r; printf "  %s\n" "$REPLY"; grep <your process name/pid> )

Often it will be exit_mm(), the function that releases the process's memory descriptor and related data structures.

As per the Linux kernel documentation, it first checks whether the mm->core_waiters flag is set. If it is, the process is dumping the contents of memory to a core file (I/O). In that case, I believe it will not respond to a KILL signal until the core dump has completed, to avoid corruption.
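
If you do not know yet which processes are stuck, a small filter over the ps output shows only the D-state ones together with their wchan. A minimal sketch, assuming the state is the third column as in the ps format above; the function name d_state_only and the sample rows are just for illustration.

```shell
# Keep the header line plus every row whose state column (3rd field) starts with D.
d_state_only() {
    awk 'NR == 1 || $3 ~ /^D/'
}

# Real use (wchan:32 widens the column so the function name is not cut off):
#   ps -eo pid,ppid,state,wchan:32,comm | d_state_only

# Demo on canned, illustrative input:
printf '%s\n' \
    'PID PPID S WCHAN COMMAND' \
    '13613 1 D cifs_reconnect_tcon dsmc' \
    '17067 12166 S pipe_wait grep' | d_state_only
```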


Hi Neo,

It's an orphan process, not a zombie, and its PPID is 1 :frowning:

[root@xxx:~]# ps -ef | grep dsmc
root     13613     1  0 Apr19 ?        00:00:00 dsmc q systeminfo policy -console
root     17067 12166  0 14:33 pts/2    00:00:00 grep dsmc
root     21870     1  0 Apr22 ?        00:00:00 dsmc

Hi MadeinGermany

Do you mean guessing the I/O devices (disks)? The root cause is that the NFS server was disconnected unexpectedly, which caused the NFS-mounted folder to become unresponsive. I forced an unmount and remounted it once the NFS server was back, but I still cannot kill the process.

Hi Yoda,
I have tried your command

ps -eo pid,ppid,state,wchan=WIDE-WCHAN-COLUMN,comm,args | ( read -r; printf "  %s\n" "$REPLY"; grep <your process name/pid> )

And resulted in as below:

[root@xxx:~]# ps -eo pid,ppid,state,wchan=WIDE-WCHAN-COLUMN,comm,args | ( read -r; printf "  %s\n" "$REPLY"; grep 13613 )
    PID  PPID S WIDE-WCHAN-COLUMN COMMAND         COMMAND
13613     1 D cifs_reconnect_tc dsmc            dsmc q systeminfo policy -console

[root@xxx:~]# ps -eo pid,ppid,state,wchan=WIDE-WCHAN-COLUMN,comm,args | ( read -r; printf "  %s\n" "$REPLY"; grep 21870 )
    PID  PPID S WIDE-WCHAN-COLUMN COMMAND         COMMAND
21870     1 D cifs_reconnect_tc dsmc            dsmc

Looks like it matches my finding above (NFS disconnected). The NFS-mounted folders are back now. Since the processes are in D state and we cannot kill them, is a reboot the only way to clear them?
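
To see which network filesystems are currently mounted (and so which ones a process could be blocked on), the mount table can be filtered by filesystem type. A rough sketch, assuming the usual /proc/mounts layout of device, mount point, type; the function name network_mounts and the sample paths are made up. It only lists mount points and deliberately does not touch them, because accessing a hung mount can block the shell as well.

```shell
# Print the mount point (2nd field) of every NFS or CIFS/SMB mount
# found in mount-table text read from stdin.
network_mounts() {
    awk '$3 ~ /^(nfs|nfs4|cifs|smb3)$/ { print $2 }'
}

# Real use:
#   network_mounts < /proc/mounts

# Demo on canned, illustrative input:
printf '%s\n' \
    '/dev/sda1 / ext4 rw 0 0' \
    '//fs01/share /mnt/backup cifs rw 0 0' \
    'nas:/export /mnt/nfs nfs4 rw 0 0' | network_mounts
```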


Yes, I understand that your process is an orphan in D state, not a zombie (Z).

However, the procedure for attaching gdb to the process is the same.

The "creative process" I suggested using gdb can be tried before rebooting if you absolutely do not want to reboot.

Don't you agree?

Actually, I expect gdb to hang as well when it attaches to a process in D state. But it's worth a try.

If processes are permanently hung in cifs_reconnect_tcon then it looks like a kernel bug (or a missing interrupt/timeout feature).
Is your kernel at the latest patch level?

I agree.

He has nothing to lose to at least try to attach with gdb if he really does not want to reboot.

He might get lucky :slight_smile:

I tried, but no luck :rolleyes:

[root@xxx:~]# ps -ef | grep dsmc
root     10765     1  0 Apr19 ?        00:00:00 dsmc q systeminfo policy -console
root     14196     1  0 Apr23 ?        00:00:03 /usr/bin/dsmc schedule -optfile=/opt/tivoli/tsm/client/ba/bin/dsm.opt
root     27110  2182  0 18:41 pts/0    00:00:00 grep dsmc
[root@xxx:~]# gdb
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
(gdb) attach 10765
Attaching to process 10765
ptrace: Operation not permitted.
(gdb)
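
For what it's worth, ptrace(2) lists several reasons for this EPERM, one of them being that the target is already traced by another process. A quick sketch to rule that out: the TracerPid field in /proc/<pid>/status is non-zero when a tracer is attached. The function name tracer_of is made up, and the sample status text is illustrative, not taken from this server.

```shell
# Print the TracerPid value from /proc/<pid>/status text read on stdin.
# 0 means no tracer is attached.
tracer_of() {
    awk '/^TracerPid:/ { print $2 }'
}

# Real use, with the PID from the gdb session above:
#   tracer_of < /proc/10765/status

# Demo on canned, illustrative input:
printf 'Name:\tdsmc\nState:\tD (disk sleep)\nTracerPid:\t0\n' | tracer_of
```

If this prints 0, the cause is something else, for example a security policy that restricts ptrace.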

Thanks for trying....

It was a long shot, but sometimes we do get lucky :slight_smile:
