Need assistance replacing the root disk on a Netra X4200

Good day.

I have a Sun Netra X4200 running Solaris 10, with the disks in two RAID 1 mirrors. HDD0 (c2t0d0) seems to have a problem, but HDD1 (c2t1d0) is OK:

# for a in c2t0d0 c2t1d0 ; do raidctl -l $a; done
Volume                  Size    Stripe  Status   Cache  RAID
        Sub                     Size                    Level
                Disk                                    
----------------------------------------------------------------
c2t0d0                  136.5G  N/A     DEGRADED OFF    RAID1
                0.2.0   136.5G          GOOD    
                0.5.0   136.5G          FAILED                   <<< where did 0.5.0 come from and what happened to 0.0.0?


Volume                  Size    Stripe  Status   Cache  RAID
        Sub                     Size                    Level
                Disk                                    
----------------------------------------------------------------
c2t1d0                  136.5G  N/A     OPTIMAL  OFF    RAID1
                0.1.0   136.5G          GOOD    
                0.3.0   136.5G          GOOD    

Is there a fix, or do I have to replace the disk? I have some disks harvested from an old server; however, I need to be sure that if I replace the primary disk with an older one, it won't screw up the mirror disk, which is now the only valid copy. If I do have to replace the disk, does anyone have a procedure?

Regards,
Bjoern

What does this show:

raidctl -S

Thanks for your prompt reply, bartus11!

# raidctl -S
  2 "LSI_1064"
  c2t0d0 2 0.2.0 0.5.0 1 DEGRADED
  c2t1d0 2 0.1.0 0.3.0 1 OPTIMAL
  0.1.0 GOOD
  0.2.0 GOOD
  0.3.0 GOOD
  0.5.0 FAILED
  

Firstly, 0.5.0 refers to the position in the SCSI chain; perhaps SCSI target ID 5?

Don't worry about where 0.0.0 went. The member numbering simply reflects how the array was configured and which disks the sysadmin selected to go into it.

So, your controller knows exactly what's going on; there are two RAID1 arrays and one of the disks has failed.

Do you know exactly which disk it is? Pulling out the wrong disk will be fatal!

Assuming you know EXACTLY which disk is in trouble, the first thing to try is to pull it out and simply push it back in, then check the status again. If it says it's rebuilding, then perhaps it was just a connection problem (poor contacts happen all the time). If it still says FAILED, a replacement is needed.
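For example, you could re-check the degraded volume straight after reseating, something along these lines (using the volume name from your own output; depending on the raidctl version, a rebuilding volume may be reported as SYNC rather than REBUILD):

# raidctl -l c2t0d0

If the 0.5.0 member is no longer FAILED and the volume is resynchronising, it was just the connection.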

Most disks show the number of LBAs on the label (LBA = logical block address; in effect the count of sectors on the disk).

The replacement disk must have the same or a greater number of LBAs for it to work. It stands to reason that you can't completely mirror a drive onto one which is smaller. [Some disks with exactly the same model number have different numbers of LBAs, from different manufacturing revisions, so beware.]
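If the label on the dead drive is unreadable, you can get the size of the surviving side from Solaris itself. A sketch, assuming the healthy mirror is still visible as c2t1d0 (the controller hides the individual members, but for RAID 1 the volume size matches the usable size of each member):

# iostat -En c2t1d0

The "Size" line reports the capacity in bytes; divide by 512 to get the LBA count your replacement must meet or exceed.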

If you do plug in a disk which is smaller, the RAID controller will refuse to do anything with it; that's not because the replacement is faulty.

If you plug in a disk that the controller is happy with, the status will go to REBUILD whilst the remirror is being done, followed by OPTIMAL when the remirror has finished.
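If you want to keep an eye on the remirror without babysitting the console, a simple loop will do (a sketch; adjust the volume name and interval to taste):

# while true; do raidctl -l c2t0d0; sleep 300; done

Stop it with Ctrl-C once the volume reports OPTIMAL again.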

These RAID controllers support hot-swap, so there's no need to take the system down; just make damned sure that you're pulling out the right disk.

Hope that helps.

Thanks for your reply, hicksd8.

All the nodes in my cluster have the same disk configuration. This is what it looked like about a week before the problem arose:

Volume                  Size    Stripe  Status   Cache  RAID
        Sub                     Size                    Level
                Disk
----------------------------------------------------------------
c2t0d0                  136.5G  N/A     OPTIMAL  OFF    RAID1
                0.0.0   136.5G          GOOD
                0.2.0   136.5G          GOOD


Volume                  Size    Stripe  Status   Cache  RAID
        Sub                     Size                    Level
                Disk
----------------------------------------------------------------
c2t1d0                  136.5G  N/A     OPTIMAL  OFF    RAID1
                0.1.0   136.5G          GOOD
                0.3.0   136.5G          GOOD

According to the information at hand, the faulty disk is HDD0 (c2t0d0). This is the one we need to replace. If reseating the drive does not work, and I replace it with another compatible drive, do I have to format or label the spare drive first?

Regards,
Bjoern

The RAID controller only looks at raw storage. It doesn't know anything about formats, partition types, or filesystems. In rebuilding the mirror it will simply copy sector 0 to sector 0, sector 1 to sector 1, through sector n to sector n. It doesn't give a stuff what's on the new drive it sees.

Historically, there were less able RAID controllers which looked for empty drives, i.e. expected no format or partition table. For that reason, if I had to test a recycled drive before using it, I'd blow any partition table away before disconnecting it from my test rig (to make it look empty), but this is really no longer necessary. Your RAID controller is an LSI, and they're extremely good; it will just take care of everything.
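For what it's worth, if you do want to blank a recycled drive on a test rig first, zeroing the first few sectors destroys the label and any partition table. A sketch, and destructive, so triple-check the device name; cXtYd0 is a placeholder for the drive on the test rig, not one of your production disks:

# dd if=/dev/zero of=/dev/rdsk/cXtYd0s2 bs=512 count=16

On Solaris, slice 2 conventionally starts at the beginning of the disk, so this wipes the VTOC label in block 0.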

Since your problem drive is marked FAILED by the controller, it won't even be trying to talk to it any more, so that drive won't be flashing when the system is doing I/O. That should tell you which drive to pull.
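One way to make that unambiguous: generate some read traffic and watch the activity LEDs; the three good members will flash, and the failed one stays dark. A harmless, read-only sketch (assuming slice 2 spans each volume, as is conventional):

# dd if=/dev/rdsk/c2t0d0s2 of=/dev/null bs=1024k count=2000
# dd if=/dev/rdsk/c2t1d0s2 of=/dev/null bs=1024k count=2000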

Don't be surprised if pulling out the drive and reinserting it starts the rebuild. After months or years of operation a poor connection can develop, and reseating cures it. If that doesn't work, insert a replacement with the same or a greater number of LBAs.