Not able to kill a process

There is backup client software on a Solaris 10 server. I wanted to restart that application, but one of its PIDs is not getting killed, even with kill -9. Is there anything else I can try before restarting the server?

root@pdvtil03:/# ps -ealf | grep -i 6177
 0 S     root 28101 10844   0  50 20        ?    220        ? 12:17:50 pts/1       0:00 grep -i 6177
 0 S     root  6177     1   0  40 20        ?  56311        ? 22:00:13 ?           0:39 /opt/simpana/iDataAgent/ifind -j 40
root@pdvtil03:/# ps -Al | grep -i 6177
 0 S      0  6177     1   0  40 20        ?  56311        ? ?           0:39 ifind
root@pdvtil03:/# kill -9 6177
root@pdvtil03:/# echo $?
0
root@pdvtil03:/# ps -ealf | grep -i 6177
 0 S     root 29062 10844   0  50 20        ?    220        ? 12:18:20 pts/1       0:00 grep -i 6177
 0 S     root  6177     1   0  40 20        ?  56311        ? 22:00:13 ?           0:39 /opt/simpana/iDataAgent/ifind -j 40
root@pdvtil03:/#

Try checking with ptree PID to see whether another process is holding on to that process.

root@pdvtil03:/# ptree 6177
6177  /opt/simpana/iDataAgent/ifind -j 4016161 -a 2:6905 -t 2 -d ctpddnas13.webapp
root@pdvtil03:/#

Are you seeing any hardware errors logged on the system console? Resistance to SIGKILL usually indicates that the process is hung in a non-restartable system call, waiting for an interrupt that would normally arrive quickly.
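Worth noting: an exit status of 0 from kill only means the signal was posted, not that the process actually exited. Here is a minimal sketch of a check, assuming bash and a root shell. The helper name kill9_and_check is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: send SIGKILL, then confirm that the PID has really disappeared.
# Assumes the caller may signal the PID (for example, root).

# kill9_and_check PID [SECONDS]   (hypothetical helper name)
# Returns 0 if the process went away, 1 if it is still present after the wait.
kill9_and_check() {
    pid=$1
    tries=${2:-5}
    kill -9 "$pid" 2>/dev/null
    while [ "$tries" -gt 0 ]; do
        # kill -0 delivers no signal; it only tests whether the PID still exists.
        # Note that a defunct (zombie) process still counts as present.
        if ! kill -0 "$pid" 2>/dev/null; then
            echo "pid $pid is gone"
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    echo "pid $pid survived SIGKILL; it is probably blocked inside the kernel"
    return 1
}
```

Called as kill9_and_check 6177 10, a return status of 1 would point at a kernel-level wait rather than anything the application is doing in user space.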

Don, I have not seen any hardware errors on this server or on the console. I have seen processes in an uninterruptible state before (the D state), and in those cases I had to reboot.

Please get used to ps -fp 6177!
Besides ptree 6177, the command lsof -p 6177 gives information about the resources the process holds. (Please download and compile lsof if it is not present.)

There seems to be some issue on this server. I did not know the full path of lsof, so I tried to find it. Now I am not able to kill 2807 either. Here is more output.

root@pdvtil03:/# find / -name lsof
^C
^C^C

From another terminal

root@pdvtil03:/root# ps -ef | grep -i find
    root  2807 12503   0 13:20:55 pts/1       0:00 find / -name lsof
    root 18399 17487   0 13:27:53 pts/2       0:00 grep -i find
    root 25152     1   0 03:44:22 ?           0:39 /opt/simpana/iDataAgent/ifind -j 4016161 -a 2:6905 -t 2 -d ctpddnas13.webapp
    root 21595     1   0 01:15:35 ?           0:39 /opt/simpana/iDataAgent/ifind -j 4016161 -a 2:6905 -t 2 -d ctpddnas13.webapp
    root  6177     1   0 22:00:13 ?           0:39 /opt/simpana/iDataAgent/ifind -j 4016161 -a 2:6905 -t 2 -d ctpddnas13.webapp
root@pdvtil03:/root#
root@pdvtil03:/# ps -fp 6177
     UID   PID  PPID   C    STIME TTY         TIME CMD
    root  6177     1   0 22:00:13 ?           0:39 /opt/simpana/iDataAgent/ifind -j 4016161 -a 2:6905 -t 2 -d ctpddnas13.webapp
root@pdvtil03:/# ptree 6177
6177  /opt/simpana/iDataAgent/ifind -j 4016161 -a 2:6905 -t 2 -d ctpddnas13.webapp
root@pdvtil03:/#

It was going on and on, so I had to kill it.

root@pdvtil03:/# /usr/local/bin/lsof -p 6177
lsof: WARNING: bad section count line in /root/.lsof_pdvtil03: line "4 sections, dev=16b00000000"
lsof: WARNING -- child process 14488 may be hung.
lsof: WARNING -- child process 15836 may be hung.
lsof: WARNING -- child process 17049 may be hung.
lsof: WARNING -- child process 17647 may be hung.
lsof: WARNING -- child process 18291 may be hung.
lsof: WARNING -- child process 19373 may be hung.
lsof: WARNING -- child process 20323 may be hung.
lsof: WARNING -- child process 22133 may be hung.
lsof: WARNING -- child process 23402 may be hung.
lsof: WARNING -- child process 24067 may be hung.
lsof: WARNING -- child process 28478 may be hung.
^C
root@pdvtil03:/#

Run

truss find /

It will hang, but the last syscall shown gives the path with the problem.
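If the trace is long, it may be easier to write it to a file with truss -o and pull out the last path afterwards. A rough sketch, assuming bash and an awk that accepts extended regular expressions (on Solaris, nawk or /usr/xpg4/bin/awk may be the safer choice). The helper name last_truss_path is hypothetical.

```shell
#!/usr/bin/env bash
# last_truss_path [FILE...]   (hypothetical helper)
# Reads truss output from the given files (or stdin) and prints the path
# argument of the last open/openat/stat64/lstat64 call that appears in it.
# find uses fchdir, so the name may be relative to the directory it was in.
last_truss_path() {
    awk -F'"' '
        # With a double quote as the field separator, $2 is the first quoted argument.
        /^(openat|open|lstat64|stat64)\(/ && NF >= 3 { last = $2 }
        END { if (last != "") print last }
    ' "$@"
}
```

For example, after truss -o /var/tmp/find.truss find / > /dev/null & (the file name is only an example), running last_truss_path /var/tmp/find.truss from another terminal shows the last name find touched before it stopped making progress.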

/# truss find /
execve("/usr/bin/find", 0xFFBFFB4C, 0xFFBFFB58)  argc = 2
sysinfo(SI_MACHINE, "sun4u", 257)               = 6
mmap(0x00000000, 32, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF3E0000
mmap(0x00000000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF390000
mmap(0x00000000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF380000
mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF370000
memcntl(0xFF3A0000, 17936, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
memcntl(0x00010000, 4152, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) = 12
resolvepath("/usr/bin/find", "/usr/bin/find", 1023) = 13
stat64("/usr/bin/find", 0xFFBFF610)             = 0
open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
stat64("/etc/emc/rsa/cst/lib/libsec.so.1", 0xFFBFED70) Err#2 ENOENT
stat64("/usr/openwin/lib/libsec.so.1", 0xFFBFED70) Err#2 ENOENT
stat64("/lib/libsec.so.1", 0xFFBFED70)          = 0
resolvepath("/lib/libsec.so.1", "/lib/libsec.so.1", 1023) = 16
open("/lib/libsec.so.1", O_RDONLY)              = 3
mmap(0x00010000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFF360000
mmap(0x00010000, 90112, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF340000
mmap(0xFF340000, 57913, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFF340000
mmap(0xFF350000, 13309, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 65536) = 0xFF350000
mmap(0xFF354000, 5616, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANON, -1, 0) = 0xFF354000
munmap(0xFF360000, 32768)                       = 0
close(3)                                        = 0
memcntl(0xFF340000, 14336, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
stat64("/etc/emc/rsa/cst/lib/libc.so.1", 0xFFBFED70) Err#2 ENOENT
stat64("/usr/openwin/lib/libc.so.1", 0xFFBFED70) Err#2 ENOENT
stat64("/lib/libc.so.1", 0xFFBFED70)            = 0
resolvepath("/lib/libc.so.1", "/lib/libc.so.1", 1023) = 14
open("/lib/libc.so.1", O_RDONLY)                = 3
mmap(0x00010000, 32768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFF360000
mmap(0x00010000, 1368064, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF180000
mmap(0xFF180000, 1247157, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFF180000
mmap(0xFF2C2000, 35965, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 1253376) = 0xFF2C2000
mmap(0xFF2CC000, 1616, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANON, -1, 0) = 0xFF2CC000
munmap(0xFF2B2000, 65536)                       = 0
munmap(0xFF360000, 32768)                       = 0
close(3)                                        = 0
memcntl(0xFF180000, 146148, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
stat64("/etc/emc/rsa/cst/lib/libavl.so.1", 0xFFBFED70) Err#2 ENOENT
stat64("/usr/openwin/lib/libavl.so.1", 0xFFBFED70) Err#2 ENOENT
stat64("/lib/libavl.so.1", 0xFFBFED70)          = 0
resolvepath("/lib/libavl.so.1", "/lib/libavl.so.1", 1023) = 16
open("/lib/libavl.so.1", O_RDONLY)              = 3
mmap(0x00010000, 14372, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFF360000
mmap(0x00010000, 81920, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF320000
mmap(0xFF320000, 3316, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFF320000
mmap(0xFF332000, 296, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 8192) = 0xFF332000
munmap(0xFF322000, 65536)                       = 0
munmap(0xFF360000, 14372)                       = 0
close(3)                                        = 0
memcntl(0xFF320000, 1128, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
mmap(0x00010000, 24576, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF360000
getcontext(0xFFBFF480)
getrlimit(RLIMIT_STACK, 0xFFBFF460)             = 0
getpid()                                        = 16310 [16309]
setustack(0xFF362A88)
brk(0x00028090)                                 = 0
brk(0x0002A090)                                 = 0
stat64("/platform/SUNW,SPARC-Enterprise/lib/libc_psr.so.1", 0xFFBFE9F8) = 0
resolvepath("/platform/SUNW,SPARC-Enterprise/lib/libc_psr.so.1", "/platform/sun4u-opl/lib/libc_psr.so.1", 1023) = 37
open("/platform/SUNW,SPARC-Enterprise/lib/libc_psr.so.1", O_RDONLY) = 3
mmap(0x00010000, 6532, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFF310000
close(3)                                        = 0
stat64("/usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3", 0xFFBFE830) = 0
resolvepath("/usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3", "/usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3", 1023) = 52
open("/usr/lib/locale/en_US.ISO8859-1/en_US.ISO8859-1.so.3", O_RDONLY) = 3
mmap(0x00010000, 26032, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFF300000
mmap(0x00010000, 90112, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF2E0000
mmap(0xFF2E0000, 16093, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFF2E0000
mmap(0xFF2F2000, 10158, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 8192) = 0xFF2F2000
munmap(0xFF2E4000, 57344)                       = 0
munmap(0xFF300000, 26032)                       = 0
close(3)                                        = 0
mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF300000
memcntl(0xFF2E0000, 6624, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
stat64("/usr/lib/locale/en_US.ISO8859-15/en_US.ISO8859-15.so.3", 0xFFBFE830) = 0
resolvepath("/usr/lib/locale/en_US.ISO8859-15/en_US.ISO8859-15.so.3", "/usr/lib/locale/en_US.ISO8859-15/en_US.ISO8859-15.so.3", 1023) = 54
open("/usr/lib/locale/en_US.ISO8859-15/en_US.ISO8859-15.so.3", O_RDONLY) = 3
mmap(0x00010000, 25996, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFF2D0000
mmap(0x00010000, 90112, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF160000
mmap(0xFF160000, 16057, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_TEXT, 3, 0) = 0xFF160000
mmap(0xFF172000, 10122, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_INITDATA, 3, 8192) = 0xFF172000
munmap(0xFF164000, 57344)                       = 0
munmap(0xFF2D0000, 25996)                       = 0
close(3)                                        = 0
memcntl(0xFF160000, 6624, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
time()                                          = 1415570989
getcwd("/", 1024)                               = 0
getcwd("/", 1025)                               = 0
lstat64("/", 0xFFBFF808)                        = 0
openat(-3041965, "/", O_RDONLY|O_NDELAY|O_LARGEFILE) = 3
fcntl(3, F_SETFD, 0x00000001)                   = 0
fstat64(3, 0xFFBFF600)                          = 0
ioctl(1, TCGETA, 0xFFBFE694)                    = 0
fstat64(1, 0xFFBFE5B0)                          = 0
/
write(1, " /\n", 2)                             = 2
fstat64(3, 0xFFBFF770)                          = 0
fchdir(3)                                       = 0
getdents64(3, 0xFF364000, 8192)                 = 3104
lstat64("zcpst01_root_pool", 0xFFBFF640)        = 0
openat(-3041965, "zcpst01_root_pool", O_RDONLY|O_NDELAY|O_LARGEFILE) = 4
mmap(0x00010000, 65536, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF140000
fcntl(4, F_SETFD, 0x00000001)                   = 0
fstat64(4, 0xFFBFF438)                          = 0
/zcpst01_root_pool
write(1, " / z c p s t 0 1 _ r o o".., 19)      = 19
fstat64(4, 0xFFBFF5A8)                          = 0
fchdir(4)                                       = 0
^C
^C
root@pdvtil03:/# df -h | grep -i zcpst01_root_pool
zcpst01_root_pool      4.8G    18K   4.8G     1%    /zcpst01_root_pool
zcpst01_root_pool/zone   8.2G   3.4G   4.8G    42%    /zone/pdvtil03-zcpst01/root
root@pdvtil03:/#

pdvtil03-zcpst01 is a non-global zone on this server. Is this non-global zone causing the issue?

Truss can get stuck, too, if you try to truss a process hung in an unkillable kernel wait state.

What does "pstack PID &" show? (And use the ampersand to background it so the pstack utility doesn't hang your terminal if it gets stuck too.)
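The same backgrounding idea can be wrapped in a small helper that polls the diagnostic tool instead of waiting on it, so the terminal always comes back. This is only a sketch, assuming bash. The name run_with_timeout and the 124 return value are my own choices, not anything Solaris provides.

```shell
#!/usr/bin/env bash
# run_with_timeout SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND in the background and polls it once per second, so the
# calling shell never blocks on it. Returns the command's exit status,
# or 124 if the command was still running when the limit was reached.
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    child=$!
    waited=0
    while kill -0 "$child" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            # Best effort only: a tool stuck in the same kernel wait
            # may ignore SIGKILL just like the original process does.
            kill -9 "$child" 2>/dev/null
            echo "gave up on '$*' (pid $child) after ${limit}s" >&2
            return 124
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # The command has finished; collect its exit status.
    wait "$child"
}
```

Something like run_with_timeout 10 pstack 6177 then either prints the stack or returns after about ten seconds, whichever comes first.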

I cannot see anything with pstack:

root@pdvtil03:/# ps -ef | grep -i 6177
    root 20323     1   0 13:38:19 ?           0:00 /usr/local/bin/lsof -p 6177
    root 22133     1   0 13:38:49 ?           0:00 /usr/local/bin/lsof -p 6177
    root 17049     1   0 13:36:19 ?           0:00 /usr/local/bin/lsof -p 6177
    root 14488     1   0 13:35:18 ?           0:00 /usr/local/bin/lsof -p 6177
    root 19373     1   0 13:37:49 ?           0:00 /usr/local/bin/lsof -p 6177
    root 15836     1   0 13:35:49 ?           0:00 /usr/local/bin/lsof -p 6177
    root 29101     1   0 13:40:49 ?           0:00 /usr/local/bin/lsof -p 6177
    root 24067     1   0 13:39:49 ?           0:00 /usr/local/bin/lsof -p 6177
    root 17647     1   0 13:36:49 ?           0:00 /usr/local/bin/lsof -p 6177
    root 28478     1   0 13:40:19 ?           0:00 /usr/local/bin/lsof -p 6177
    root 18291     1   0 13:37:19 ?           0:00 /usr/local/bin/lsof -p 6177
    root 23402     1   0 13:39:19 ?           0:00 /usr/local/bin/lsof -p 6177
    root  6177     1   0 22:00:13 ?           0:39 /opt/simpana/iDataAgent/ifind -j 4016161 -a 2:6905 -t 2 -d ctpddnas13.webapp
    root 12277  7386   0 14:49:04 pts/3       0:00 grep -i 6177
root@pdvtil03:/# pstack 6177
pstack: cannot examine 6177: no such process
root@pdvtil03:/# 

You can also use mdb to get a kernel stack trace:

https://blogs.oracle.com/jayd/entry/solaris_tip_of_the_week14

It has this nice little script that dumps the kernel stack trace for the process specified:

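#!/bin/sh
# Pass a process name (as understood by pgrep) as the first argument.

for p in `pgrep $1`; do
  echo "-------------------------"
  # Show the arguments the process was started with
  pargs $p
  # 0t marks the PID as decimal for mdb; walk each thread and print its kernel stack
  echo "0t${p} ::pid2proc|::walk thread|::findstack" | mdb -k
done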

Call it with "ifind" as an argument and you'll probably see where your process is stuck.

This is the output I see:

root@pdvtil03:/# cat /var/tmp/kernel_stack.sh
#!/bin/sh

for p in `pgrep $1`; do
  echo "-------------------------"
  pargs $p
  echo "0t${p} ::pid2proc|::walk thread|::findstack" | mdb -k
done
root@pdvtil03:/# /var/tmp/kernel_stack.sh ifind
-------------------------
pargs: cannot examine 25152: no such process
stack pointer for thread 30073f60e80: 2a11f340ac1
[ 000002a11f340ac1 cv_wait+0x38() ]
  000002a11f340b71 dbuf_read+0x25c()
  000002a11f340c21 dmu_buf_hold+0x94()
  000002a11f340ce1 zap_lockdir+0x24()
  000002a11f340da1 zap_cursor_retrieve+0x50()
  000002a11f340e91 zfs_readdir+0x374()
  000002a11f341131 fop_readdir+0x1c()
  000002a11f3411e1 getdents64+0x8c()
  000002a11f3412e1 syscall_trap32+0xcc()
stack pointer for thread 301681e6100: 2a11f268f11
[ 000002a11f268f11 cv_wait+0x38() ]
  000002a11f268fc1 exitlwps+0x11c()
  000002a11f269071 proc_exit+0x20()
  000002a11f269121 exit+8()
  000002a11f2691d1 post_syscall+0x41c()
  000002a11f2692e1 syscall_trap32+0x18c()
-------------------------
pargs: cannot examine 21595: no such process
stack pointer for thread 301863e0a40: 2a11f330ac1
[ 000002a11f330ac1 cv_wait+0x38() ]
  000002a11f330b71 dbuf_read+0x25c()
  000002a11f330c21 dmu_buf_hold+0x94()
  000002a11f330ce1 zap_lockdir+0x24()
  000002a11f330da1 zap_cursor_retrieve+0x50()
  000002a11f330e91 zfs_readdir+0x374()
  000002a11f331131 fop_readdir+0x1c()
  000002a11f3311e1 getdents64+0x8c()
  000002a11f3312e1 syscall_trap32+0xcc()
stack pointer for thread 3018799e7a0: 2a117660f11
[ 000002a117660f11 cv_wait+0x38() ]
  000002a117660fc1 exitlwps+0x11c()
  000002a117661071 proc_exit+0x20()
  000002a117661121 exit+8()
  000002a1176611d1 post_syscall+0x41c()
  000002a1176612e1 syscall_trap32+0x18c()
-------------------------
pargs: cannot examine 6177: no such process
stack pointer for thread 301890a1200: 2a11f120ac1
[ 000002a11f120ac1 cv_wait+0x38() ]
  000002a11f120b71 zio_wait+0x34()
  000002a11f120c21 dmu_buf_hold+0x94()
  000002a11f120ce1 zap_lockdir+0x24()
  000002a11f120da1 zap_cursor_retrieve+0x50()
  000002a11f120e91 zfs_readdir+0x374()
  000002a11f121131 fop_readdir+0x1c()
  000002a11f1211e1 getdents64+0x8c()
  000002a11f1212e1 syscall_trap32+0xcc()
stack pointer for thread 301a7351100: 2a11f200f11
[ 000002a11f200f11 cv_wait+0x38() ]
  000002a11f200fc1 exitlwps+0x11c()
  000002a11f201071 proc_exit+0x20()
  000002a11f201121 exit+8()
  000002a11f2011d1 post_syscall+0x41c()
  000002a11f2012e1 syscall_trap32+0x18c()
root@pdvtil03:/#

You're stuck in ZFS, trying to read the contents of a directory.
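To make that reading of the trace explicit: ::findstack lists the innermost frame first, so the call chain reads bottom to top (getdents64 calls fop_readdir, which calls zfs_readdir, and so on down to cv_wait). A small filter can print each thread's chain on one line, outermost call first. This is a sketch, assuming bash and a POSIX awk (nawk on Solaris). The name stack_chains is made up.

```shell
#!/usr/bin/env bash
# stack_chains   (hypothetical helper)
# Reads "::findstack" output on stdin and prints one line per thread:
#   thread ADDR: outermost -> ... -> innermost
stack_chains() {
    awk '
        function emit() { if (thread != "") print "thread " thread ": " chain }
        /^stack pointer for thread/ {
            emit()                      # finish the previous thread, if any
            thread = $5
            sub(/:$/, "", thread)       # drop the trailing colon
            chain = ""
            next
        }
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /\(\)$/) {     # a frame such as zfs_readdir+0x374()
                    name = $i
                    sub(/\(\)$/, "", name)
                    sub(/\+.*$/, "", name)
                    # frames arrive innermost first, so prepend each one
                    chain = (chain == "") ? name : name " -> " chain
                }
        }
        END { emit() }
    '
}
```

Piping the mdb command from the script into it, for example echo "0t6177 ::pid2proc|::walk thread|::findstack" | mdb -k | stack_chains, gives one line for the thread blocked under zfs_readdir and one for the thread waiting in exitlwps.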

What's the output of "zpool status", "iostat -E", and "iostat -e"? Are there any errors in /var/adm/messages?

You should probably run "zpool scrub [POOL]". Expect to see errors.
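With several pools on the box, a quick filter over the zpool status output can show which ones are not ONLINE. A minimal sketch, assuming bash and the usual "pool:" / "state:" header lines in that output. The name unhealthy_pools is hypothetical.

```shell
#!/usr/bin/env bash
# unhealthy_pools   (hypothetical helper)
# Reads "zpool status" text on stdin and prints "POOL STATE" for every
# pool whose state line reports something other than ONLINE.
unhealthy_pools() {
    awk '
        $1 == "pool:"  { pool = $2 }                        # remember the current pool
        $1 == "state:" && $2 != "ONLINE" { print pool, $2 }
    '
}
```

Usage would be zpool status | unhealthy_pools. The built-in zpool status -x gives a similar summary. Either command may itself block if the pool's I/O is hung, so it is probably wise to run it in the background.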

You are correct. There are some issues on the SAN disks. I was not able to see them in messages, but zpool status shows them. I checked with the storage team, and they had some activity that might have triggered this issue. Thanks for pointing me in the right direction.