Metastat shows "maintenance" and "last-erred"

Hi All,

Sorry to post a problem as my first post, but I'm in a bit of a pickle at the minute!
I have an Ultra 45 connected to a StorEdge 3100-series array: 2 internal disks and 2 external disks, with a DB application running on the external disks.
Everything has been working fine and we've had no downtime or anything, but I'm due to perform an upgrade of our software, so I ran metastat just to check the mirroring. Below is what I got.

# metastat
d7: Mirror
    Submirror 0: d37
      State: Okay         
    Submirror 1: d47
      State: Needs maintenance 
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 286576512 blocks (136 GB)

d37: Submirror of d7
    State: Okay         
    Size: 286576512 blocks (136 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c2t8d0s7          0     No            Okay   Yes 


d47: Submirror of d7
    State: Needs maintenance 
    Invoke: metareplace d7 c2t9d0s7 <new device>
    Size: 286576512 blocks (136 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c2t9d0s7          0     No     Maintenance   Yes 


d4: Mirror
    Submirror 0: d14
      State: Needs maintenance 
    Submirror 1: d24
      State: Needs maintenance 
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 235557840 blocks (112 GB)

d14: Submirror of d4
    State: Needs maintenance 
    Invoke: metareplace d4 c1t0d0s4 <new device>
    Size: 235557840 blocks (112 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c1t0d0s4          0     No     Maintenance   Yes 


d24: Submirror of d4
    State: Needs maintenance 
    Invoke: after replacing "Maintenance" components:
                metareplace d4 c1t1d0s4 <new device>
    Size: 235557840 blocks (112 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c1t1d0s4          0     No      Last Erred   Yes 


d1: Mirror
    Submirror 0: d11
      State: Needs maintenance 
    Submirror 1: d21
      State: Okay         
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 16382880 blocks (7.8 GB)

d11: Submirror of d1
    State: Needs maintenance 
    Invoke: metareplace d1 c1t0d0s1 <new device>
    Size: 16382880 blocks (7.8 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c1t0d0s1          0     No     Maintenance   Yes 


d21: Submirror of d1
    State: Okay         
    Size: 16382880 blocks (7.8 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c1t1d0s1          0     No            Okay   Yes 


d0: Mirror
    Submirror 0: d10
      State: Needs maintenance 
    Submirror 1: d20
      State: Needs maintenance 
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 235557840 blocks (112 GB)

d10: Submirror of d0
    State: Needs maintenance 
    Invoke: after replacing "Maintenance" components:
                metareplace d0 c1t0d0s0 <new device>
    Size: 235557840 blocks (112 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c1t0d0s0          0     No      Last Erred   Yes 


d20: Submirror of d0
    State: Needs maintenance 
    Invoke: metareplace d0 c1t1d0s0 <new device>
    Size: 235557840 blocks (112 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c1t1d0s0          0     No     Maintenance   Yes 


Device Relocation Information:
Device   Reloc  Device ID
c2t9d0   Yes    id1,sd@SSEAGATE_ST314670LSUN146G2144JJD7____________3KS4JJD7
c2t8d0   Yes    id1,sd@SSEAGATE_ST314670LSUN146G2144KKMM____________3KS4KKMM
c1t1d0   Yes    id1,sd@n5000cca20be55a84
c1t0d0   Yes    id1,sd@n5000cca20bda0fae

I've tried using metareplace -e dy cxtxdxsx on the submirrors that are in the Maintenance state and they are resyncing right now... probably going to take some time!
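For reference, the commands were along these lines (the mirror/component pairs come straight from the Invoke lines in the metastat output above):

# metareplace -e d7 c2t9d0s7
# metareplace -e d4 c1t0d0s4
# metareplace -e d1 c1t0d0s1
# metareplace -e d0 c1t1d0s0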
My question is with regard to the "Last Erred" submirrors... are these goners? I've checked /var/adm/messages and can see no read/write errors, and iostat -En shows no hard errors, only 1 soft error each on the internal disks.
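For what it's worth, the checks were roughly along these lines (nothing fancy, just grepping for errors):

# grep -i error /var/adm/messages
# iostat -En | egrep "Hard|Soft|Transport"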
I also ran fsck on the internal disks.
So should I be looking at replacing the disks, or, if the ones that are resyncing right now come back okay, can I do a metareplace on the "Last Erred" submirrors?
I also have an alternative boot point I can boot to, the mirror on the second internal disk... so if I boot to that and try to sync, will it use the second submirror as the "master" when syncing?

I came across this link and sort of followed the advice there...
Solaris: SDS: Both Metadevices of a mirror have "State: Needs maintenance" | The Solaris Cookbook

Before you go and do anything drastic:
Is this Solaris 10 and have you rebooted the box lately?

If it is Solaris 10 and you rebooted the box lately, make sure the mdmonitor service came back online. If mdmonitor is not running, metastat will report mirrors as needing maintenance when in reality they are ok.
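A quick way to check is something along these lines (and if it's sitting disabled you can just enable it again):

# svcs mdmonitor
# svcadm enable mdmonitor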

Hi thanks for the reply!

Yes and yes; it is Solaris 10 and I've rebooted today...

mdmonitor is online as well!

I have a similar problem with a Solaris 8 server using a 5200 disk array. Every reboot I get a number of MD devices that need maintenance / are Last Erred. Due to how the metadevices are set up, i.e. one of my devices has 7 sub-devices, it's quite worrying.

I tend to just metareplace -e the failed device with the one that last erred, e.g.

# metareplace -e d10 c1t1d2s0    <- uses the last-erred device to sync the main device with

d10 = d11 and d12
d11 = c1t1d1s0 - main
d12 = c1t1d2s0 - failed device 

It's when you have two sub-devices that have erred / need maintenance (failed during a resync) that you need to be careful.
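In the OP's case the metastat output itself spells out the safe order: re-enable the component that is in Maintenance first, let the resync finish, and only then touch the Last Erred one. Roughly, for d4 above:

# metareplace -e d4 c1t0d0s4     (Maintenance component first)
# metastat d4                    (wait until the resync completes and the submirror is Okay)
# metareplace -e d4 c1t1d0s4     (then the Last Erred component)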

You may find it's not the disks that are iffy, but the enclosure.
SBK

Well, after I ran metareplace -e on the submirrors that were in Maintenance they came up Okay again, and the ones that had been "Last Erred" changed to Maintenance, so I ran metareplace -e again on those and they have resynced fine this time.
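For completeness, a quick way to confirm it's all back to normal is something like:

# metastat | grep -i "needs maintenance"     (should return nothing now)
# metadb -i                                  (worth checking the state database replicas look healthy too)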

Seems to have solved the problem!