T2000 SPARC server fails to boot

I have a T2000 enterprise SPARC server that's no longer on contract with Oracle. It's on old firmware (6.3.x). After a power-down this weekend, it won't boot normally. Boot snapshot at the bottom of the post.

It can boot from CD-ROM, and it will boot to failsafe mode, but it won't do a regular boot from the local disk, nor will it boot to single-user mode. The root filesystem is ZFS.

Nothing has changed, but it appears that the onboard battery has failed, and the time reverted to 1999. ALOM works. I set the date using ALOM and rebooted, and it fails in the same way.

I thought perhaps CPU1 failed, but disabling the CPU in ALOM and rebooting just moved the problem to CPU2.

Loading: /platform/SUNW,SPARC-Enterprise-T2000/kernel/sparcv9/unix
Loading: /platform/sun4v/kernel/sparcv9/unix
SunOS Release 5.10 Version Generic_141414-02 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
os-io panic: failed to stop cpu1

panic[cpu0]/thread=180e000: send_one_mondo: unexpected hypervisor error 0x2 while sending a mondo to cpuid: 0x1

000000000180b460 unix:send_one_mondo+14c (1, 10aeb58, 0, 2, 180c5e8, 1)
  %l0-3: 000000000187c000 0000000001866e18 0000000000000000 0000001d5e36ff20
  %l4-7: 000000000187c2c0 000000000187f980 0000000000000003 0000001d16b07320
000000000180b510 unix:xt_one_unchecked+c8 (1, 100ff74, 70020000, 0, 0, 1)
  %l0-3: 000000000000000b 0000000001866e18 0000000000000000 000000000180b5c0
  %l4-7: 0000000000000000 0000000000000002 0000000000000001 000000000180b5e0
000000000180b5e0 unix:setbackdq+3f0 (2a10121fca0, ffffffffffffffff, 300043cc000, 0, 1, 6d)
  %l0-3: 0000060038f02b10 0000000000000000 0000060039a6d4b8 0000000000000001
  %l4-7: 0000000000000000 0000060039a6ca80 0000000000000002 0000000000000a38
000000000180b690 unix:cpu_pause_start+9c (0, 185e800, 185f400, 1, 1847958, 1)
  %l0-3: 0000000000000001 0000000000000002 0000000001847859 000000000185f768
  %l4-7: 000000000000001b 000002a10121fca0 000000000000006d 000000000000006e
000000000180b740 unix:pause_cpus+6c (0, 1, 5, 1847858, 182ac00, 1847800)
  %l0-3: 0000000000000000 0000000001861000 000000000180c000 000000000186b000
  %l4-7: 0000000000000001 000000000185e800 0000000001064000 ffffffffffffffff
000000000180b7f0 unix:cpu_add_unit+28 (300043d0000, 1826400, a, 187bf60, 5f50, 187bc00)
  %l0-3: 0000000001831c00 0000000001861000 00000000010c9800 000000000186b000
  %l4-7: 0000000000000001 000000000185e800 0000000001064000 ffffffffffffffff
000000000180b8a0 unix:setup_cpu_common+14c (4, 1000, 0, 300043d0000, 1b, 180c000)
  %l0-3: 0000000001831c00 000000000189ec00 00000000010c9800 0000060038f02b70
  %l4-7: 0000000000000001 000000000185e800 0000000001064000 ffffffffffffffff
000000000180b960 unix:start_other_cpus+19c (190c400, 1, 0, 18631e0, 185f770, 186b3e8)
  %l0-3: 0000000000000002 0000000000000002 00000000010ac400 0000000000000000
  %l4-7: 000000000190c400 0000000000000003 000000000101b000 00000000018fe800
000000000180ba10 genunix:main+1e4 (18fe840, 18fa400, 185eb40, 18acc00, 0, 1906c00)
  %l0-3: 0000000000000000 0000000000000001 0000000001906c00 0000000000000002
  %l4-7: 0000000001907aa0 0000000001907800 00000000018fe850 00000000018fe800

syncing file systems... done
skipping system dump - no dump device configured
rebooting...
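
When comparing panics across reboots, the two fields worth tracking in the `send_one_mondo` line are the hypervisor error code and the target cpuid. Here is a minimal Python sketch that pulls them out of a captured console log. The function name `parse_mondo_panic` is hypothetical, and the regex assumes only the line format shown in the snapshot above:

```python
import re

# Matches the panic line format seen in the console output above.
MONDO_RE = re.compile(
    r"panic\[cpu(?P<panic_cpu>\d+)\]/thread=(?P<thread>[0-9a-f]+): "
    r"send_one_mondo: unexpected hypervisor error (?P<err>0x[0-9a-f]+) "
    r"while sending a mondo to cpuid: (?P<target>0x[0-9a-f]+)"
)

def parse_mondo_panic(log_text):
    """Return one dict per send_one_mondo panic found in log_text."""
    results = []
    for m in MONDO_RE.finditer(log_text):
        results.append({
            "panic_cpu": int(m.group("panic_cpu")),   # CPU that panicked
            "hv_error": int(m.group("err"), 16),      # raw hypervisor error code
            "target_cpu": int(m.group("target"), 16), # CPU the mondo was sent to
        })
    return results

sample = (
    "panic[cpu0]/thread=180e000: send_one_mondo: unexpected hypervisor "
    "error 0x2 while sending a mondo to cpuid: 0x1\n"
)
print(parse_mondo_panic(sample))
```

Running this over the logs from several boots makes it easy to see whether the error code stays the same while only the target cpuid changes.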

What is your ALOM version? Did you apply patches to the Solaris installation?

Boot verbosely, both from the CD and normally from the local disk. What's different?

@dukenuke2
No patches. Because I'm not under an Oracle contract, I can't download them. I would love to get my hands on SysFW 6.7.13 and 139434-10.

alom version:

sc> showsc version -v
Advanced Lights Out Manager CMT v1.3.8
SC Firmware version: CMT 1.3.8
SC Bootmon version: CMT 1.3.8

VBSC 1.3.5
VBSC firmware built Apr  6 2008, 15:09:33

SC Bootmon Build Release: 01
SC bootmon checksum: 13AA267E
SC Bootmon built Apr  6 2008, 15:17:23

SC Build Release: 01
SC firmware checksum: 12914608

SC firmware built Apr  6 2008, 15:17:37
SC firmware flashupdate FRI MAY 22 23:55:22 2009

SC System Memory Size: 32 MB
SC NVRAM Version = 12
SC hardware type: 4

FPGA Version: 4.2.4.7


@achnele
The output differs: the local boot panics right after cpu2 comes up, but that doesn't seem to help (I disabled CPU1 because it failed here last time). If I disable CPU2, the panic moves to CPU3.

In case it matters, I'm still seeing the alert below, as I haven't replaced the battery yet. That's next. The date has been set manually through ALOM.

SC Alert: BATTERY at SC/BAT/V_BAT has exceeded low warning threshold.

From CD-ROM (with -v -s)

PCI-device: usb@6, ohci1
ohci1 is /pci@7c0/pci@0/pci@1/pci@0/usb@6
cpu0: UltraSPARC-T1 (cpuid 0 clock 1200 MHz)
cpu2: UltraSPARC-T1 (cpuid 2 clock 1200 MHz)
cpu3: UltraSPARC-T1 (cpuid 3 clock 1200 MHz)
cpu4: UltraSPARC-T1 (cpuid 4 clock 1200 MHz)
cpu5: UltraSPARC-T1 (cpuid 5 clock 1200 MHz)
PCI-device: pci@8, pxb_plx8
pxb_plx8 is /pci@7c0/pci@0/pci@8
cpu6: UltraSPARC-T1 (cpuid 6 clock 1200 MHz)
USB 1.10 device (usb3eb,3301) operating at full speed (USB 1.x) on USB 1.10 root hub: hub@1, hubd1 at bus address 2
hubd1 is /pci@7c0/pci@0/pci@1/pci@0/usb@6/hub@1
/pci@7c0/pci@0/pci@1/pci@0/usb@6/hub@1 (hubd1) online
cpu7: UltraSPARC-T1 (cpuid 7 clock 1200 MHz)
cpu8: UltraSPARC-T1 (cpuid 8 clock 1200 MHz)
cpu9: UltraSPARC-T1 (cpuid 9 clock 1200 MHz)
cpu10: UltraSPARC-T1 (cpuid 10 clock 1200 MHz)
cpu11: UltraSPARC-T1 (cpuid 11 clock 1200 MHz)
cpu12: UltraSPARC-T1 (cpuid 12 clock 1200 MHz)
cpu13: UltraSPARC-T1 (cpuid 13 clock 1200 MHz)
cpu14: UltraSPARC-T1 (cpuid 14 clock 1200 MHz)
cpu15: UltraSPARC-T1 (cpuid 15 clock 1200 MHz)
cpu16: UltraSPARC-T1 (cpuid 16 clock 1200 MHz)
cpu17: UltraSPARC-T1 (cpuid 17 clock 1200 MHz)
cpu18: UltraSPARC-T1 (cpuid 18 clock 1200 MHz)
cpu19: UltraSPARC-T1 (cpuid 19 clock 1200 MHz)
cpu20: UltraSPARC-T1 (cpuid 20 clock 1200 MHz)
cpu21: UltraSPARC-T1 (cpuid 21 clock 1200 MHz)
cpu22: UltraSPARC-T1 (cpuid 22 clock 1200 MHz)
cpu23: UltraSPARC-T1 (cpuid 23 clock 1200 MHz)
cpu24: UltraSPARC-T1 (cpuid 24 clock 1200 MHz)
cpu25: UltraSPARC-T1 (cpuid 25 clock 1200 MHz)
cpu26: UltraSPARC-T1 (cpuid 26 clock 1200 MHz)
cpu27: UltraSPARC-T1 (cpuid 27 clock 1200 MHz)
PCI-device: SUNW,qlc@0, qlc0
qlc0 is /pci@7c0/pci@0/pci@8/SUNW,qlc@0
cpu28: UltraSPARC-T1 (cpuid 28 clock 1200 MHz)
cpu29: UltraSPARC-T1 (cpuid 29 clock 1200 MHz)
PCI-device: pci@9, pxb_plx9
pxb_plx9 is /pci@7c0/pci@0/pci@9
cpu30: UltraSPARC-T1 (cpuid 30 clock 1200 MHz)
cpu31: UltraSPARC-T1 (cpuid 31 clock 1200 MHz)
Booting to milestone "milestone/single-user:default".

From local disk (with -v -s)

PCI-device: usb@5, ohci0
ohci0 is /pci@7c0/pci@0/pci@1/pci@0/usb@5
PCI-device: usb@6, ohci1
ohci1 is /pci@7c0/pci@0/pci@1/pci@0/usb@6
cpu0: UltraSPARC-T1 (chipid 0, clock 1200 MHz)
cpu2: UltraSPARC-T1 (chipid 0, clock 1200 MHz)
panic: failed to stop cpu2

panic[cpu0]/thread=180e000: send_one_mondo: unexpected hypervisor error 0x2 while sending a mondo to cpuid: 0x2

000000000180b460 unix:send_one_mondo+14c (2, 10aeb58, 0, 2, 180c5e8, 1)
  %l0-3: 000000000187c000 0000000001866e18 0000000000000000 0000002316b37abc
  %l4-7: 000000000187c2c0 000000000187f980 0000000000000003 00000022cf2ceebc
000000000180b510 unix:xt_one_unchecked+c8 (2, 100ff74, 70020000, 0, 0, 1)
  %l0-3: 000000000000000b 0000000001866e18 0000000000000000 000000000180b5c0
  %l4-7: 0000000000000000 0000000000000004 0000000000000002 000000000180b5e0
000000000180b5e0 unix:setbackdq+3f0 (2a10121fca0, ffffffffffffffff, 30004504000, 0, 1, 6d)
  %l0-3: 0000060039444b10 0000000000000000 0000060039bc34b8 0000000000000001
  %l4-7: 0000000000000000 0000060039bc2a80 0000000000000002 0000000000000a38
000000000180b690 unix:cpu_pause_start+9c (0, 185e800, 185f400, 1, 1847958, 1)
  %l0-3: 0000000000000002 0000000000000002 000000000184785a 000000000185f770
  %l4-7: 000000000000001b 000002a10121fca0 000000000000006d 000000000000006e
000000000180b740 unix:pause_cpus+6c (0, 1, 5, 1847858, 182ac00, 1847800)
  %l0-3: 0000000000000000 0000000001861000 000000000180c000 000000000186b000
  %l4-7: 0000000000000001 000000000185e800 0000000001064000 ffffffffffffffff
000000000180b7f0 unix:cpu_add_unit+28 (30004508000, 1826400, a, 187bf60, 5f50, 187bc00)
  %l0-3: 0000000001831c00 0000000001861000 00000000010c9800 000000000186b000
  %l4-7: 0000000000000001 000000000185e800 0000000001064000 ffffffffffffffff
000000000180b8a0 unix:setup_cpu_common+14c (4, 1000, 0, 30004508000, 1b, 180c000)
  %l0-3: 0000000001831c00 000000000189ec00 00000000010c9800 0000060039444b70
  %l4-7: 0000000000000001 000000000185e800 0000000001064000 ffffffffffffffff
000000000180b960 unix:start_other_cpus+19c (190c400, 1, 0, 18631e0, 185f778, 186b490)
  %l0-3: 0000000000000003 0000000000000003 00000000010ac400 0000000000000000
  %l4-7: 000000000190c400 0000000000000004 000000000101b000 00000000018fe800
000000000180ba10 genunix:main+1e4 (18fe840, 18fa400, 185eb40, 18acc00, 0, 1906c00)
  %l0-3: 0000000000000000 0000000000000001 0000000001906c00 0000000000000002
  %l4-7: 0000000001907aa0 0000000001907800 00000000018fe850 00000000018fe800

syncing file systems... done
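
To answer the "what's different?" question mechanically, the `cpuN:` lines from the two verbose boots can be compared. This is a rough Python sketch: the helper names are made up for illustration, and it assumes each strand that comes online prints a line starting with `cpuN:` as in the two logs above. The count of 32 strands is taken from the CD boot log, which lists cpu0 through cpu31.

```python
import re

# Matches both "cpu2: UltraSPARC-T1 (cpuid 2 clock 1200 MHz)" and
# "cpu2: UltraSPARC-T1 (chipid 0, clock 1200 MHz)".
CPU_LINE_RE = re.compile(r"^cpu(\d+): ", re.MULTILINE)

def cpus_seen(boot_log):
    """Set of CPU ids that announced themselves in a verbose boot log."""
    return {int(n) for n in CPU_LINE_RE.findall(boot_log)}

def missing_cpus(boot_log, expected=32):
    """CPU ids in range(expected) that never appeared (e.g. disabled in ALOM)."""
    return sorted(set(range(expected)) - cpus_seen(boot_log))

def never_started(good_log, bad_log):
    """CPU ids that came up in the good boot but not in the failing one."""
    return sorted(cpus_seen(good_log) - cpus_seen(bad_log))

# Shortened stand-ins for the two logs above.
cd_log = (
    "cpu0: UltraSPARC-T1 (cpuid 0 clock 1200 MHz)\n"
    "cpu2: UltraSPARC-T1 (cpuid 2 clock 1200 MHz)\n"
    "cpu3: UltraSPARC-T1 (cpuid 3 clock 1200 MHz)\n"
)
local_log = (
    "cpu0: UltraSPARC-T1 (chipid 0, clock 1200 MHz)\n"
    "cpu2: UltraSPARC-T1 (chipid 0, clock 1200 MHz)\n"
    "panic: failed to stop cpu2\n"
)

print(missing_cpus(cd_log, expected=4))   # ids absent from the CD boot
print(never_started(cd_log, local_log))   # ids the local boot never reached
```

With the full logs pasted in, `missing_cpus` should report only the strand disabled in ALOM, and `never_started` shows how far the local boot got before the panic.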

There is something wrong with your hypervisor firmware.
Maybe it is too old, not supported by the current Solaris?
Try to update all firmware:
Firmware Downloads and Release History for Sun Systems

@madeingermany
Firmware updated.

sc> showhost
SPARC-Enterprise-T2000 System Firmware 6.7.12  2011/07/06 20:03
Host flash versions:
   OBP 4.30.4.d 2011/07/06 14:29
   Hypervisor 1.7.3.c 2010/07/09 15:14
   POST 4.30.4.b 2010/07/09 14:24

still no difference.
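
To double-check what was actually flashed against the release mentioned earlier in the thread (SysFW 6.7.13), the `showhost` output can be parsed. This is a small Python sketch; `parse_showhost` and `version_tuple` are hypothetical helpers, and the layout is assumed to match the output pasted above:

```python
import re

def parse_showhost(text):
    """Extract component versions from ALOM `showhost` output."""
    versions = {}
    m = re.search(r"System Firmware (\d+(?:\.\d+)+)", text)
    if m:
        versions["System Firmware"] = m.group(1)
    for name in ("OBP", "Hypervisor", "POST"):
        m = re.search(rf"^\s*{name} (\S+)", text, re.MULTILINE)
        if m:
            versions[name] = m.group(1)
    return versions

def version_tuple(version):
    """Turn a purely numeric dotted version like '6.7.12' into (6, 7, 12)."""
    return tuple(int(part) for part in version.split("."))

showhost_output = (
    "SPARC-Enterprise-T2000 System Firmware 6.7.12  2011/07/06 20:03\n"
    "Host flash versions:\n"
    "   OBP 4.30.4.d 2011/07/06 14:29\n"
    "   Hypervisor 1.7.3.c 2010/07/09 15:14\n"
    "   POST 4.30.4.b 2010/07/09 14:24\n"
)

found = parse_showhost(showhost_output)
wanted = "6.7.13"  # release named earlier in the thread
print(found)
print(version_tuple(found["System Firmware"]) < version_tuple(wanted))
```

Only the System Firmware string is compared numerically here, since the OBP, Hypervisor and POST versions carry letter suffixes.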

See also Known Issues - Oracle VM Server for SPARC 2.0 Release Notes
Maybe you lost your ldm settings during the power-cycle?
See the chapter "Logical Domains Variable Persistence"

Likely yes. The battery failed, the time reverted to 1999, and I'm still getting:

SC Alert: BATTERY at SC/BAT/V_BAT has exceeded low warning threshold.

I'm sure I lost everything, including LDM, if it's held by the battery.

I'm waiting on a battery - I should have one in hand today. In the meantime, any idea what other settings may be lost?

BTW, it will boot to failsafe mode, just not single user.

If you don't have a valid contract you might not be able to get a newer firmware...

Update:
-I was able to update the firmware, including the hypervisor. No change; it still fails.

SunOS Release 5.10 Version Generic_141414-02 64-bit
Copyright 1983-2009 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
panic: failed to stop cpu1
panic[cpu0]/thread=180e000: send_one_mondo: unexpected hypervisor error 0x2 while sending a mondo to cpuid: 0x1
000000000180b460 unix:send_one_mondo+14c (1, 10aeb58, 0, 2, 180c5e8, 1)
  %l0-3: 000000000187c000 0000000001866e18 0000000000000000 000000221aa114c0
  %l4-7: 000000000187c2c0 000000000187f980 0000000000000003 00000021d31a88c0
000000000180b510 unix:xt_one_unchecked+c8 (1, 100ff74, 70020000, 0, 0, 1)

It can boot:
-from CD-ROM, with or without single-user mode
-via failsafe, and it can mount the ZFS boot volume

It cannot boot:
-to single-user mode from the local drive
-normally from the local drive