unkillable format process

Hi gurus,

I cannot get format to run to completion on my T2000 (Solaris 10); it just hangs and I cannot kill it.

I know it is probably blocked somewhere in the kernel (far from user space), which is why I cannot kill it, but I would like to find out where it hangs ...

truss gives a weird error (truss: unanticipated system error: PID).

So any help will be really appreciated:

root@RANDOMSERVERNAME# ps -ef | grep format
    root  3905     1   0 14:44:36 ?           0:00 format
    root  9943  9939   0 23:55:01 ?           0:00 /usr/sbin/format -l /tmp/format.out9939
    root 24480 23975   0 08:18:05 pts/2       0:00 grep format
    root 13007     1   0 14:58:45 ?           0:00 format

root@RANDOMSERVERNAME# truss -p 3905
truss: unanticipated system error: 3905 <----- ??

root@RANDOMSERVERNAME# truss -p 13007
truss: unanticipated system error: 13007 <----- ??

root@RANDOMSERVERNAME# truss -p 9943
^C                    <---------- gave no result after 15 minutes ...

I tried to kill them gracefully and with -9; no change even after 15 to 30 minutes (in the case of PID 3905 I have tried a dozen times since yesterday).

root@RANDOMSERVERNAME# kill -9 3905
root@RANDOMSERVERNAME# kill -9 9943
root@RANDOMSERVERNAME# kill -9 13007
root@RANDOMSERVERNAME# ps -ef | grep format | grep -v grep
    root  3905     1   0 14:44:36 ?           0:00 format
    root  9943  9939   0 23:55:01 ?           0:00 /usr/sbin/format -l /tmp/format.out9939
    root 13007     1   0 14:58:45 ?           0:00 format


Pstack?

> ::pgrep format
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R   3905      1  22162  22162      0 0x4a004900 0000060020657900 format
R  13007      1  13007  12429      0 0x4a004900 0000060020594508 format
R   9943   9939   9935   9935      0 0x4a004900 0000060020620510 format

> 0000060020657900::thread
            ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL             INTR
0000060020657900 inval/2000 22d0 d158    0     0     0   6                1
> 0000060020657900::walk thread | ::findstack
stack pointer for thread 300050ab180: 2a100a54d41
[ 000002a100a54d41 cv_wait+0x38() ]
  000002a100a54df1 spec_lockcsp+0x60()
  000002a100a54ea1 spec_open+0x4a4()
  000002a100a54f61 fop_open+0x78()
  000002a100a55011 vn_openat+0x500()
  000002a100a551d1 copen+0x260()
  000002a100a552e1 syscall_trap32+0xcc()
> 0000060020594508::thread
            ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL             INTR
0000060020594508 inval/2000 22d0 d4c8    0     0     0   0                1
> 0000060020594508::walk thread | ::findstack
stack pointer for thread 300261d2340: 2a1022f6d41
[ 000002a1022f6d41 cv_wait+0x38() ]
  000002a1022f6df1 spec_lockcsp+0x60()
  000002a1022f6ea1 spec_open+0x4a4()
  000002a1022f6f61 fop_open+0x78()
  000002a1022f7011 vn_openat+0x500()
  000002a1022f71d1 copen+0x260()
  000002a1022f72e1 syscall_trap32+0xcc()
> 0000060020620510::thread
            ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL             INTR
0000060020620510 inval/2000 207b 2168    0     0     0   6                1
> 0000060020620510::walk thread | ::findstack
stack pointer for thread 30008397a80: 2a1004ded41
[ 000002a1004ded41 cv_wait+0x38() ]
  000002a1004dedf1 spec_lockcsp+0x60()
  000002a1004deea1 spec_open+0x4a4()
  000002a1004def61 fop_open+0x78()
  000002a1004df011 vn_openat+0x500()
  000002a1004df1d1 copen+0x260()
  000002a1004df2e1 syscall_trap32+0xcc()

Can anyone who speaks Solaris kernel tell where this is stuck? As far as I can see, all three processes are stuck at exactly the same place, but I do not know where that is.
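A quick way to confirm that the stacks really are identical is to strip the frame addresses from the ::findstack output and count the distinct call chains. The sketch below does that with awk. Note that stack_signatures is a made-up helper name, and it assumes the mdb output has been saved to a file (findstack.out is a hypothetical name).

```shell
#!/bin/sh
# stack_signatures: hypothetical helper, not an mdb feature.
# Reads mdb ::findstack output on stdin and prints each distinct call
# chain once, prefixed by the number of threads that share it.
stack_signatures() {
    awk '
        # A new thread starts at each "stack pointer for thread" line.
        /^stack pointer for thread/ {
            if (sig != "") count[sig]++
            sig = ""
            next
        }
        # Frame lines contain "symbol+offset()"; keep only the symbol.
        /\+0x[0-9a-f]+\(\)/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /\+0x[0-9a-f]+\(\)/) {
                    fn = $i
                    sub(/\+0x.*/, "", fn)
                    sig = (sig == "" ? fn : sig " <- " fn)
                }
            }
        }
        END {
            if (sig != "") count[sig]++
            for (s in count) print count[s], s
        }
    '
}
```

Run as `stack_signatures < findstack.out`. For the three stacks above it should print a single line with a count of 3. Every thread came in through open() (copen, vn_openat, fop_open), reached spec_open, and is sleeping in cv_wait under spec_lockcsp. That suggests they are all waiting on something held during the open of a device special file, though that reading is an assumption, not a diagnosis.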

Thanks gurus!

#kill -15 3905
then try the following:
#kill -9 3905
or
#pkill -9 format

You can also try changing the order; maybe the processes depend on each other.
#kill -9 13007
#kill -9 3905

Thank you for the ideas, getrue, but they didn't work.

I might have another problem ....

root@RANDOMSERVERNAME# iostat -en
  ---- errors ---
  s/w h/w trn tot device
    0   0   0   0 md/d10
    0   0   0   0 md/d11
    0   0   0   0 md/d12
    0   0   0   0 md/d20
    0   0   0   0 md/d21
    0   0   0   0 md/d22
    0   0   0   0 md/d30
    0   0   0   0 md/d31
    0   0   0   0 md/d32
    0   0   0   0 c1t0d0
    0   0   0   0 c1t1d0
  901   0   1 902 c0t0d0
    0   0   0   0 c2t5006048AD530B8D6d19
    0   0   0   0 c2t5006048AD530B8D6d18
    0   0   0   0 c2t5006048AD530B8D6d17
    0   0   0   0 c2t5006048AD530B8D6d15
    0   0   0   0 c2t5006048AD530B8D6d14
    0   0   0   0 c2t5006048AD530B8D6d13
    0   0   0   0 c2t5006048AD530B8D6d12
    0   0   0   0 c2t5006048AD530B8D6d11
    0   0   0   0 c2t5006048AD530B8D6d10
    0   0   0   0 c2t5006048AD530B8D6d9
    0   0   0   0 c2t5006048AD530B8D6d8
    0   0   0   0 c2t5006048AD530B8D6d7
    0   0   0   0 c2t5006048AD530B8D6d6
    0   0   0   0 c2t5006048AD530B8D6d5
    0   0   0   0 c3t5006048AD530B8D9d19
    0   0   0   0 c3t5006048AD530B8D9d18
    0   0   0   0 c3t5006048AD530B8D9d17
    0   0   0   0 c3t5006048AD530B8D9d15
    0   0   0   0 c3t5006048AD530B8D9d14
    0   0   0   0 c3t5006048AD530B8D9d13
    0   0   0   0 c3t5006048AD530B8D9d12
    0   0   0   0 c3t5006048AD530B8D9d11
    0   0   0   0 c3t5006048AD530B8D9d10
    0   0   0   0 c3t5006048AD530B8D9d9
    0   0   0   0 c3t5006048AD530B8D9d8
    0   0   0   0 c3t5006048AD530B8D9d7
    0   0   0   0 c3t5006048AD530B8D9d6
    0   0   0   0 c3t5006048AD530B8D9d5
    0   0   0   0 whms3554:/ip_app_profiles_source
    0   0   0   0 mjhnasv01:/mjh/NR/whms3551/sparc-5.10/sfw
    0   0   0   0 mjhnasv01:/mjh/dr002004/whms3690/dwuser
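With this many LUNs, the one device with errors is easy to miss. A small awk filter over the same output keeps only the rows whose tot column is above zero. This is only a sketch: nonzero_errors is a made-up helper name, and it assumes the five-column layout shown above (s/w, h/w, trn, tot, device).

```shell
#!/bin/sh
# nonzero_errors: hypothetical helper.
# Reads `iostat -en` output on stdin and prints "device total" for every
# row whose tot column (4th field) is a number greater than zero.
# Header lines are skipped because their 4th field is not numeric.
nonzero_errors() {
    awk 'NF == 5 && $4 ~ /^[0-9]+$/ && $4 > 0 { print $5, $4 }'
}
```

Usage: `iostat -en | nonzero_errors`. On the listing above it would leave only c0t0d0.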

root@RANDOMSERVERNAME# iostat -En c0t0d0
c0t0d0           Soft Errors: 901 Hard Errors: 0 Transport Errors: 1
Vendor: MATSHITA Product: CD-RW  CW-8124   Revision: DZ13 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 901 Predictive Failure Analysis: 0

Seems like a CD-ROM error ... and an HBA problem (?!?)

Aug  9 11:32:53 RANDOMSERVERNAME uata: [ID 859416 kern.info] ghd_timer_newstate: HBA reset failed hba 0x60010aa6800 gcmdp 0x60011d83000 gtgtp 0x300015086c0
Aug  9 11:38:53 RANDOMSERVERNAME last message repeated 6 times
Aug  9 11:39:53 RANDOMSERVERNAME uata: [ID 859416 kern.info] ghd_timer_newstate: HBA reset failed hba 0x60010aa6800 gcmdp 0x60011d83000 gtgtp 0x300015086c0
Aug  9 11:42:53 RANDOMSERVERNAME last message repeated 3 times
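To see how often the reset actually failed, the "last message repeated N times" lines have to be added to the explicit ones. A rough awk sketch follows. count_hba_resets is a made-up helper name, and it assumes each "repeated" line refers to the HBA message directly before it, which holds for the excerpt above but may not hold for a full log.

```shell
#!/bin/sh
# count_hba_resets: hypothetical helper.
# Reads syslog lines on stdin and prints the total number of
# "HBA reset failed" events, including syslog "repeated N times" lines
# that directly follow such a message.
count_hba_resets() {
    awk '
        /HBA reset failed/ { total++; prev = 1; next }
        /last message repeated [0-9]+ times/ {
            if (prev) {
                for (i = 1; i <= NF; i++)
                    if ($i == "repeated") total += $(i + 1)
            }
            next
        }
        # Any other message breaks the chain.
        { prev = 0 }
        END { print total + 0 }
    '
}
```

Usage: `count_hba_resets < /var/adm/messages`. For the four lines quoted above this comes to 11 failed resets in about ten minutes.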

Interesting that dmesg showed nothing wrong with the CD-ROM and that iostat shows no errors on the LUNs...

Will keep you posted