Booting from a "broken" MirrorDisk UX device

I'm setting up mirroring on five different HP servers (2 Itanium and 3 PA-RISC) this week. The PA-RISC boxes are running HP-UX 11.11 with the latest patches and the Itaniums are on HP-UX 11.23 with the latest patch clusters. This is my first foray into MirrorDisk UX.

I found a handy set of examples as to what to do here and here. After following along with the steps in the script at the first link (I chose to run the commands manually to learn more) it appears that I have successfully installed MirrorDisk and mirrored my root VG (vg00).

Output from lvdisplay for one of the lvols:

[03:08 PM][root@Raver:/]$lvdisplay /dev/vg00/lvol2
--- Logical volumes ---
LV Name                     /dev/vg00/lvol2
VG Name                     /dev/vg00
LV Permission               read/write   
LV Status                   available/syncd           
Mirror copies               1            
Consistency Recovery        MWC                 
Schedule                    parallel     
LV Size (Mbytes)            4096            
Current LE                  512       
Allocated PE                1024        
Stripes                     0       
Stripe Size (Kbytes)        0                   
Bad block                   off          
Allocation                  strict/contiguous         
IO Timeout (Seconds)        default             

And here is output from vgdisplay indicating that all lvols are using 2 PVs:

[03:08 PM][root@Raver:/]$vgdisplay -v vg00
--- Volume groups ---
VG Name                     /dev/vg00
VG Write Access             read/write     
VG Status                   available                 
Max LV                      255    
Cur LV                      8      
Open LV                     8      
Max PV                      16     
Cur PV                      2      
Act PV                      2      
Max PE per PV               4350         
VGDA                        4   
PE Size (Mbytes)            8               
Total PE                    8680    
Alloc PE                    5500    
Free PE                     3180    
Total PVG                   0        
Total Spare PVs             0              
Total Spare PVs in use      0                     

   --- Logical volumes ---
   LV Name                     /dev/vg00/lvol1
   LV Status                   available/syncd           
   LV Size (Mbytes)            304             
   Current LE                  38        
   Allocated PE                76          
   Used PV                     2       

   LV Name                     /dev/vg00/lvol2
   LV Status                   available/syncd           
   LV Size (Mbytes)            4096            
   Current LE                  512       
   Allocated PE                1024        
   Used PV                     2       

   LV Name                     /dev/vg00/lvol3
   LV Status                   available/syncd           
   LV Size (Mbytes)            200             
   Current LE                  25        
   Allocated PE                50          
   Used PV                     2       

   LV Name                     /dev/vg00/lvol4
   LV Status                   available/syncd           
   LV Size (Mbytes)            1024            
   Current LE                  128       
   Allocated PE                256         
   Used PV                     2       

   LV Name                     /dev/vg00/lvol5
   LV Status                   available/syncd           
   LV Size (Mbytes)            4096            
   Current LE                  512       
   Allocated PE                1024        
   Used PV                     2       

   LV Name                     /dev/vg00/lvol6
   LV Status                   available/syncd           
   LV Size (Mbytes)            3584            
   Current LE                  448       
   Allocated PE                896         
   Used PV                     2       

   LV Name                     /dev/vg00/lvol7
   LV Status                   available/syncd           
   LV Size (Mbytes)            4096            
   Current LE                  512       
   Allocated PE                1024        
   Used PV                     2       

   LV Name                     /dev/vg00/lvol8
   LV Status                   available/syncd           
   LV Size (Mbytes)            4600            
   Current LE                  575       
   Allocated PE                1150        
   Used PV                     2       


   --- Physical volumes ---
   PV Name                     /dev/dsk/c2t0d0
   PV Status                   available                
   Total PE                    4340    
   Free PE                     1590    
   Autoswitch                  On        

   PV Name                     /dev/dsk/c3t2d0
   PV Status                   available                
   Total PE                    4340    
   Free PE                     1590    
   Autoswitch                  On

I rebooted the system and set the PRI and ALT boot paths as instructed at the end of the linked script above. PRI points to the secondary mirror device and ALT points to the primary mirror device (the original system disk). The system boots fine from PRI with both disks in the system. If I pull out the original system disk and boot from PRI, it still boots fine but complains about not having a dump area. (I think I can fix that, but one person in the HP ITRC forum thread linked above suggests that you DON'T mirror dump. Anyone know why?) However, if I pull out the secondary disk (the newly created mirror) and try to boot from the ALT path, which points to the original system disk, the boot starts, but after some really nasty messages and a failure to activate the root file system, the system crashes. I think there must be some preparation step that I've skipped, or maybe I need to change something about the boot string when booting from the original system disk?

So I know I have mirroring but it seems pointless if I can't boot off of each mirror independently of the other (simulating a disk failure). Or am I misunderstanding something about how mirroring is supposed to work in HP-UX?

The crash message:

Booting... 
Boot IO Dependent Code (IODC) revision 1


HARD Booted.

ISL Revision A.00.43  Apr 12, 2000 

ISL booting  hpux

Boot
: disk(0/1/1/0.0.0.0.0.0.0;0)/stand/vmunix
11370496 + 2097152 + 3603336 start 0x200f68




alloc_pdc_pages: Relocating PDC from 0xfffffff0f0c00000 to 0x3f901000.
gate64: sysvec_vaddr = 0xc0002000 for 2 pages
NOTICE: autofs_link(): File system was registered at index 3.
NOTICE: cachefs_link(): File system was registered at index 5.
NOTICE: nfs3_link(): File system was registered at index 6.
td: claimed Tachyon XL2 Fibre Channel Mass Storage card at 0/3/1/0
td: claimed Tachyon XL2 Fibre Channel Mass Storage card at 0/4/1/0
igelan0: INITIALIZING HP PCI 1000Base-T Core at hardware path 0/1/2/0
igelan1: INITIALIZING HP A6825-60101 PCI 1000Base-T Adapter at hardware path 0/2/1/0

    System Console is on the Built-In Serial Interface
Logical volume 64, 0x3 configured as ROOT
LVM: VG 64 0x000000: Quorum check failed!
LVM : Activation of root volume group failed
Quorum not present, or some physical volume(s) are missing


        ------------------------------------


Firmware Version  45.11

Duplex Console IO Dependent Code (IODC) revision 1



Stored message buffer up to panic:

[snipped additional SAN device info]

igelan0: INITIALIZING HP PCI 1000Base-T Core at hardware path 0/1/2/0
igelan1: INITIALIZING HP A6825-60101 PCI 1000Base-T Adapter at hardware path 0/2/1/0

    System Console is on the Built-In Serial Interface
Logical volume 64, 0x3 configured as ROOT
LVM: VG 64 0x000000: Quorum check failed!
LVM : Activation of root volume group failed
Quorum not present, or some physical volume(s) are missing


        -----------------------------------------------------
        |                                                   |
        |       SYSTEM HALTING during LVM Configuration     |
        |                                                   |
                Could not configure root VG
            If new kernel, ISL> hpux -lm /stand/vmunix; pvck -y /dev/rdsk/....
        |                                                   |
        -----------------------------------------------------


linkstamp:          Wed Jan 14 12:45:53 EST 2009
_release_version:   @(#)                 $Revision: vmunix:    vw: -proj
selectors: CUPI80_BL2000_1108 -c 'Vw for CUPI80_BL2000_1108 build' -- 
cupi80_bl2000_1108 'CUPI80_BL2000_1108'  Wed Nov  8 19:24:56 PST 2000 $
panic: LVM: Configuration failure

PC-Offset Stack Trace (read across, top of stack is 1st):
  0x00213858  0x00538d84  0x0053c2f8
  0x00538c3c  0x0035c1c8  0x004774fc
  0x0035deb0  0x0017a02c  0x00201564
End Of Stack

Trap Type 15 (Data page fault):
  Instruction Address (pcsq.pcoq) = 0x0.0x47b8b8 
  Instruction (iir) = 0x531a0020 (load/store)
  Target Address (isr.ior) = 0x0.0x0000000000000010
  Base Register (gr24) = 0x0000000000000000
  Savestate Ptr (ssp) = 0xede8000.0x400003ffffff0fa8
  Savestate Return Pointer (ss_rp) = 0x0000000000000000 

System Panic:

linkstamp:          Wed Jan 14 12:45:53 EST 2009

Wed Nov  8 19:24:56 PST 2000 $
panic: Data page fault



*** A system crash has occurred.  (See the above messages for details.)
*** The system is now preparing to dump physical memory to disk, for use
*** in debugging the crash.

ERROR:  Your system crashed before I/O and dump configuration was complete.
        A crash dump cannot be taken under these circumstances without
        special configuration.  Contact your HP support representative.

So, what am I doing wrong? (I assume I must be doing something wrong...)

To have a system that can reboot automatically after the failure of either side of the mirror, you will need to add a third disk to the root volume group. Then, after one disk fails, the volume group will still have quorum because more than half of its disks are available.
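
To make the quorum arithmetic concrete, here is a minimal sketch of the rule as I understand it: activation needs more than half of the volume group's physical volumes to be present. has_quorum is a hypothetical helper written for this post, not an HP-UX command, and it only models the counting, not what LVM does internally.

```shell
#!/bin/sh
# has_quorum TOTAL PRESENT (hypothetical helper, not an HP-UX command):
# succeeds when strictly more than half of the VG's disks are present.
has_quorum() {
    [ $(( $2 * 2 )) -gt "$1" ]
}

# Pairs are "disks in the VG" and "disks present at boot".
for pair in "2 2" "2 1" "3 2" "3 1"; do
    set -- $pair
    if has_quorum "$1" "$2"; then
        echo "$2 of $1 disks present: quorum met"
    else
        echo "$2 of $1 disks present: quorum NOT met"
    fi
done
```

With two disks, losing either one leaves exactly half, which is not more than half, so the root VG fails to activate as in the crash output above. With three disks, any single failure still leaves two of three.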

A more dangerous approach is to override quorum at boot time. There is a switch to do this documented on the hpux(1M) man page. What I did was leave the switch out of the autoboot file. Then, if a disk is missing, the box won't boot all the way up, as you saw. I would notice this and know that I had a bad root disk. Then I would boot manually, adding the switch to override quorum (-lq, I think?), get a new disk, and rebuild the mirrors.

If you boot all the way up with one disk missing, log files will change and the copy on the missing disk will no longer be valid.

Dumps are written when the OS is screwed up and do not use OS I/O routines. The I/O routines used cannot handle a mirror.

The only danger of -lq (no quorum) is that if nobody notices you have a failed disk, you may end up with no disks at all when the last one gives up. (Don't laugh, it has happened to me, and more than once.)

Testing is more delicate, though; your method, as you have noticed, can cause problems.
I usually test whether each disk is bootable by choosing at ISL which disk to boot from. Once the box has booted from both disks and all is fine, I trust it, and that has worked fine so far. Pulling a disk and slipping it back in can end in a system crash, depending on the model (you have not said which one you are on). This is especially true when the disks share the same SCSI interface, because of SCSI resets.