Replacing a failed disk using SVM

fretagi · March 8, 2017, 6:58am

Hi

Could you please help me replace or remove a faulty disk drive on a Sun Netra X4250 server that has only 4 internal drives?

The format command shows the following:

format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <drive type unknown>
          /pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci1000,3150@0/sd@0,0
       1. c0t1d0 <DEFAULT cyl 36469 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci1000,3150@0/sd@1,0
       2. c0t2d0 <DEFAULT cyl 36469 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci1000,3150@0/sd@2,0
       3. c0t3d0 <DEFAULT cyl 36469 alt 2 hd 255 sec 63>
          /pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci1000,3150@0/sd@3,0
Specify disk (enter its number): ^D
root@mhcominf01:/#

So I believe drive 0, c0t0d0 <drive type unknown>, is faulty.
Also, tail /var/adm/messages shows:

Mar  8 10:19:46 mhcominf01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci1000,3150@0/sd@0,0 (sd1):
Mar  8 10:19:46 mhcominf01      drive offline
Mar  8 10:23:47 mhcominf01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci1000,3150@0/sd@0,0 (sd1):
Mar  8 10:23:47 mhcominf01      drive offline
Mar  8 10:23:57 mhcominf01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci1000,3150@0/sd@0,0 (sd1):
Mar  8 10:23:57 mhcominf01      drive offline
Mar  8 13:45:31 mhcominf01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci1000,3150@0/sd@0,0 (sd1):
Mar  8 13:45:31 mhcominf01      drive offline
Mar  8 13:45:41 mhcominf01 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,25e2@2/pci8086,3500@0/pci8086,3510@0/pci1000,3150@0/sd@0,0 (sd1):
Mar  8 13:45:41 mhcominf01      drive offline

This confirms that the disk is faulty.
metastat shows:

 metastat
d30: Mirror
    Submirror 0: d32
      State: Needs maintenance
    Submirror 1: d31
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 459940950 blocks (219 GB)

d32: Submirror of d30
    State: Needs maintenance
    Invoke: metareplace d30 c0t2d0s3 <new device>
    Size: 459940950 blocks (219 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t2d0s3          0     No     Maintenance   Yes


d31: Submirror of d30
    State: Okay
    Size: 459940950 blocks (219 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s3          0     No            Okay   Yes


d20: Mirror
    Submirror 0: d22
      State: Needs maintenance
    Submirror 1: d21
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 20980890 blocks (10 GB)

d22: Submirror of d20
    State: Needs maintenance
    Invoke: metareplace d20 c0t2d0s1 <new device>
    Size: 20980890 blocks (10 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t2d0s1          0     No     Maintenance   Yes


d21: Submirror of d20
    State: Okay
    Size: 20980890 blocks (10 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s1          0     No            Okay   Yes


d10: Mirror
    Submirror 0: d12
      State: Needs maintenance
    Submirror 1: d11
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 41945715 blocks (20 GB)

d12: Submirror of d10
    State: Needs maintenance
    Invoke: after replacing "Maintenance" components:
                metareplace d10 c0t2d0s0 <new device>
    Size: 41945715 blocks (20 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t2d0s0          0     No      Last Erred   Yes


d11: Submirror of d10
    State: Needs maintenance
    Invoke: metasync d10
    Size: 41945715 blocks (20 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s0          0     No       Resyncing   Yes


Device Relocation Information:
Device   Reloc  Device ID
c0t2d0   Yes    id1,sd@n5000c5003a470243
c0t1d0   Yes    id1,sd@n5000c500741f132b
root@mhcominf01:/#

And from the following command:

 metastat -c
d30              m  219GB d32 (maint) d31
    d32          s  219GB c0t2d0s3 (maint)
    d31          s  219GB c0t1d0s3
d20              m   10GB d22 (maint) d21
    d22          s   10GB c0t2d0s1 (maint)
    d21          s   10GB c0t1d0s1
d10              m   20GB d12 (maint) d11 (maint)
    d12          s   20GB c0t2d0s0 (last-erred)
    d11          s   20GB c0t1d0s0 (resyncing)

I believe I now have to run the following:

metadetach d30 d32

metadetach d20 d22

After that, I am not sure what the next steps are.
Could you please help?

MadeInGermany · March 8, 2017, 2:45pm

The "Last Erred" state means the d10 mirror is already degraded and may contain any kind of data corruption.
I have no experience with how to proceed from there. I hope you have a good data backup.

dn888 · March 11, 2017, 5:45am

Can you show the output of the metadb command?

fretagi · March 13, 2017, 1:19am

Sorry for the late reply:

 metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t1d0s5
     a    p  luo        8208            8192            /dev/dsk/c0t1d0s5
     a    p  luo        16400           8192            /dev/dsk/c0t1d0s5
     a    p  luo        16              8192            /dev/dsk/c0t2d0s5
     a    p  luo        8208            8192            /dev/dsk/c0t2d0s5
     a    p  luo        16400           8192            /dev/dsk/c0t2d0s5
root@mhcominf01:/#

MadeInGermany · March 13, 2017, 8:49am

Is the bad disk still installed?
Do you have a new disk already?
This link seems to be okay for Solaris 10.
Read it twice (at least)!

fretagi · March 13, 2017, 9:07am

Yes, the bad disk is still installed, but at the moment I don't have a new disk. I am stuck on the step that says I have to remove any metadb replicas on the failed disk.
How do I identify the metadb replicas on the failed disk? The failed disk is c0t0d0, but the metadb command shows the following:

 metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t1d0s5
     a    p  luo        8208            8192            /dev/dsk/c0t1d0s5
     a    p  luo        16400           8192            /dev/dsk/c0t1d0s5
     a    p  luo        16              8192            /dev/dsk/c0t2d0s5
     a    p  luo        8208            8192            /dev/dsk/c0t2d0s5
     a    p  luo        16400           8192            /dev/dsk/c0t2d0s5
root@mhcominf01:/#

This does not show c0t0d0.
Please help.

dn888 · March 14, 2017, 11:29am

Your c0t0d0 is not part of the SVM setup, so you don't need to remove it from SVM. It may, however, be in use as a dump device or as swap; check with dumpadm and swap -l.
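A minimal sketch of that check, in case it helps: a small shell function that reports whether a disk name appears anywhere in the text it reads on stdin. The function name mentions_disk is made up for this example, and the sample text is just the device column from the metadb output posted above.

```shell
#!/bin/sh
# mentions_disk DISK: succeed if DISK (e.g. c0t0d0) appears in the text on stdin.
# Hypothetical helper for illustration; it only searches text, it changes nothing.
mentions_disk() {
    grep "$1" >/dev/null
}

# Device column of the metadb output posted earlier in this thread.
sample='/dev/dsk/c0t1d0s5
/dev/dsk/c0t2d0s5'

for disk in c0t0d0 c0t2d0; do
    if printf '%s\n' "$sample" | mentions_disk "$disk"; then
        echo "$disk: referenced"
    else
        echo "$disk: not referenced"
    fi
done
```

On the live system you could feed it the real command output instead of the sample, for example `{ metadb; metastat -c; swap -l; dumpadm; } | mentions_disk c0t0d0`. Swap and dump devices that sit on metadevices show up as /dev/md/dsk/dNN, which is why metastat -c is included to map them back to slices.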

For your SVM devices:

metareplace -e d30 c0t2d0s3
metareplace -e d20 c0t2d0s1
metareplace -e d10 c0t2d0s0
metasync d10
metastat -c
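If you would rather derive those commands from the current state than type them by hand, here is a rough sketch that reads metastat -c output and prints, without running anything, one metareplace -e line per component flagged (maint). The function name suggest_metareplace is invented for this example, and it assumes the column layout shown in the metastat -c output earlier in this thread. It deliberately skips (last-erred) components, since metastat itself says to handle those after the "Maintenance" ones.

```shell
#!/bin/sh
# Print (do not run) a metareplace -e command for every submirror component
# that metastat -c flags as (maint). Hypothetical helper for illustration.
suggest_metareplace() {
    awk '
        # A mirror line: remember its name for the components that follow.
        $2 == "m" { mirror = $1 }
        # A submirror line whose component is in the Maintenance state.
        $2 == "s" && $5 == "(maint)" { print "metareplace -e " mirror " " $4 }
    '
}

# Sample copied from the metastat -c output posted earlier in this thread.
sample='d30              m  219GB d32 (maint) d31
    d32          s  219GB c0t2d0s3 (maint)
    d31          s  219GB c0t1d0s3
d20              m   10GB d22 (maint) d21
    d22          s   10GB c0t2d0s1 (maint)
    d21          s   10GB c0t1d0s1
d10              m   20GB d12 (maint) d11 (maint)
    d12          s   20GB c0t2d0s0 (last-erred)
    d11          s   20GB c0t1d0s0 (resyncing)'

printf '%s\n' "$sample" | suggest_metareplace
```

With the sample above this prints the d30 and d20 commands only. On the real system the input would be `metastat -c | suggest_metareplace`; review the printed lines before running any of them.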

fretagi · March 15, 2017, 1:12am

For dumpadm:

root@mhcominf01:/# dumpadm
      Dump content: kernel pages
       Dump device: /dev/md/dsk/d30 (dedicated)
Savecore directory: /var/crash/mhcominf01
  Savecore enabled: yes
   Save compressed: on

For swap -l:

root@mhcominf01:/# swap -l
swapfile             dev  swaplo blocks   free
/dev/md/dsk/d20     85,20      8 20980880 20980880
root@mhcominf01:/#

When running metastat, I come across the following:

d32: Submirror of d30
    State: Needs maintenance
    Invoke: metareplace d30 c0t2d0s3 <new device>
    Size: 459940950 blocks (219 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t2d0s3          0     No     Maintenance   Yes

and:

d22: Submirror of d20
    State: Needs maintenance
    Invoke: metareplace d20 c0t2d0s1 <new device>
    Size: 20980890 blocks (10 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t2d0s1          0     No     Maintenance   Yes

These are two different statements. What do they mean, and what is the impact on the system?
Please explain.

MadeInGermany · March 15, 2017, 5:33pm

OK, it seems that the c0t0d0 disk is not in use.
One check is missing:

zpool status

should not list c0t0d0.
The next challenge is to find the physical disk that is c0t0d0.
The d30 and d20 mirrors are degraded because the submirrors d32 and d22 failed, presumably because the c0t2d0 disk failed. They still work because the other submirrors are okay.
I would surface-scan the c0t2d0 disk with format: select c0t2d0, then analyze, then read.
If this passes without errors, I would resync them, e.g. with the commands:

metareplace -e d30 c0t2d0s3
metareplace -e d20 c0t2d0s1

Then metastat should show that they are resyncing (data is restored from the other submirror).
Once all submirrors are okay, full redundancy is restored and the mirrors' state becomes Okay.
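To watch for that point without rereading the full metastat output each time, something like the following could work. It is only a sketch: the function name mirrors_healthy is made up, and it assumes that metastat -c marks every non-okay mirror or component with a parenthesised flag such as (maint), as in the output posted earlier, and prints no flag for healthy ones.

```shell
#!/bin/sh
# mirrors_healthy: succeed only if no line of the metastat -c output on stdin
# carries a parenthesised state flag. Hypothetical helper for illustration.
mirrors_healthy() {
    if grep '(' >/dev/null; then
        return 1
    fi
    return 0
}

# Sample copied from the metastat -c output posted earlier in this thread.
sample='d20              m   10GB d22 (maint) d21
    d22          s   10GB c0t2d0s1 (maint)
    d21          s   10GB c0t1d0s1'

if printf '%s\n' "$sample" | mirrors_healthy; then
    echo "all mirrors okay"
else
    echo "not yet fully redundant"
fi
```

On the live system, a loop such as `until metastat -c | mirrors_healthy; do sleep 60; done` would return once no flags remain.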

dn888 · March 15, 2017, 5:41pm

If your disk has failed completely, format would say "drive type unknown", like the c0t0d0 disk.

So at this point, you can try to resync the mirror. If it fails to complete the resync, then replace the disk.