RAID5 multi-disk failure

Hi there,

I don't know if my title is accurate, but I'm dealing with something risky that I don't really understand, and I'm very afraid of messing things up.

I have a Debian 5.0.4 server with 4 x 1TB hard drives.

I have the following mdstat

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sda1[0] sdd1[3] sdb1[1] sdc1[2]
      1024896 blocks [4/4] [UUUU]

md5 : active raid1 sda5[0] sdd5[3] sdb5[1] sdc5[2]
      1023872 blocks [4/4] [UUUU]

md6 : active raid1 sda6[0] sdd6[3] sdb6[1]
      1023872 blocks [4/3] [UU_U]

md7 : active raid1 sda7[0] sdd7[3] sdb7[1] sdc7[2]
      1023872 blocks [4/4] [UUUU]

md8 : active raid1 sdd8[3] sdb8[1] sdc8[2]
      1023872 blocks [4/3] [_UUU]

unused devices: <none>

That's odd, because I used to have a huge md10 array here, holding an enormous amount of important files, and it no longer shows up.

I have no idea where to start!

I tried to examine the partitions that belong to the missing array:

root@titan:~# mdadm --examine /dev/sda10
/dev/sda10:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0b972a2e:3aaabcf9:a4d2adc2:26fd5302
  Creation Time : Sat Apr 17 16:30:50 2010
     Raid Level : raid5
  Used Dev Size : 1459502912 (1391.89 GiB 1494.53 GB)
     Array Size : 4378508736 (4175.67 GiB 4483.59 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 10

    Update Time : Sun Jun  5 16:00:41 2011
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ac3fac12 - correct
         Events : 2552115

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     0       8       10        0      active sync   /dev/sda10

   0     0       8       10        0      active sync   /dev/sda10
   1     1       8       26        1      active sync   /dev/sdb10
   2     2       8       42        2      active sync   /dev/sdc10
   3     3       8       58        3      active sync   /dev/sdd10
root@titan:~# mdadm --examine /dev/sdb10
/dev/sdb10:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0b972a2e:3aaabcf9:a4d2adc2:26fd5302
  Creation Time : Sat Apr 17 16:30:50 2010
     Raid Level : raid5
  Used Dev Size : 1459502912 (1391.89 GiB 1494.53 GB)
     Array Size : 4378508736 (4175.67 GiB 4483.59 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 10

    Update Time : Mon Jan 23 12:05:02 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
       Checksum : ade16f37 - correct
         Events : 6224199

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     1       8       26        1      active sync   /dev/sdb10

   0     0       0        0        0      removed
   1     1       8       26        1      active sync   /dev/sdb10
   2     2       0        0        2      faulty removed
   3     3       8       58        3      active sync   /dev/sdd10
root@titan:~# mdadm --examine /dev/sdc10
/dev/sdc10:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0b972a2e:3aaabcf9:a4d2adc2:26fd5302
  Creation Time : Sat Apr 17 16:30:50 2010
     Raid Level : raid5
  Used Dev Size : 1459502912 (1391.89 GiB 1494.53 GB)
     Array Size : 4378508736 (4175.67 GiB 4483.59 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 10

    Update Time : Fri Jan 20 23:16:43 2012
          State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ad7f1c03 - correct
         Events : 6223465

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     2       8       42        2      active sync   /dev/sdc10

   0     0       0        0        0      removed
   1     1       8       26        1      active sync   /dev/sdb10
   2     2       8       42        2      active sync   /dev/sdc10
   3     3       8       58        3      active sync   /dev/sdd10
root@titan:~# mdadm --examine /dev/sdd10
/dev/sdd10:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0b972a2e:3aaabcf9:a4d2adc2:26fd5302
  Creation Time : Sat Apr 17 16:30:50 2010
     Raid Level : raid5
  Used Dev Size : 1459502912 (1391.89 GiB 1494.53 GB)
     Array Size : 4378508736 (4175.67 GiB 4483.59 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 10

    Update Time : Mon Jan 23 12:05:02 2012
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
       Checksum : ade16f5b - correct
         Events : 6224199

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       58        3      active sync   /dev/sdd10

   0     0       0        0        0      removed
   1     1       8       26        1      active sync   /dev/sdb10
   2     2       0        0        2      faulty removed
   3     3       8       58        3      active sync   /dev/sdd10

But that doesn't really help; I have no idea how to interpret the results!
I'm scared by the "faulty" and "removed" states.
Can anyone give me a hint?
Is there any other command I can run to regain access to the data, at least read-only?

Thanks for your help.
Santiago

It's been a while since I worked with md, so I can't help you much there. I would check all the disks and their SMART data for errors.

RAID is not a substitute for backups.

Are you able to mount the file systems that are using those volumes?
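Something like this would cover all four drives in one pass (untested here; smartctl comes from the smartmontools package, and the device names are taken from your mdstat):

```shell
# Print the overall SMART health verdict and the most telling error
# attributes for each of the four drives in the array.
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    echo "=== $d ==="
    smartctl -H "$d"    # overall health self-assessment (PASSED/FAILED)
    # Reallocated, pending, and uncorrectable sectors are the usual
    # early signs of a dying disk.
    smartctl -A "$d" | grep -Ei 'reallocated|pending|uncorrect'
done
```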

OK, thanks to your advice, I got a little further:
I can tell that two of my four disks have been removed from the array.

# mdadm --examine /dev/sda10 | grep 'Update Time'
    Update Time : Sun Jun  5 16:00:41 2011
# mdadm --examine /dev/sdb10 | grep 'Update Time'
    Update Time : Mon Jan 23 12:05:02 2012
# mdadm --examine /dev/sdc10 | grep 'Update Time'
    Update Time : Fri Jan 20 23:16:43 2012
# mdadm --examine /dev/sdd10 | grep 'Update Time'
    Update Time : Mon Jan 23 12:05:02 2012

One failed in June 2011; the second one failed five days ago.
I thought a RAID5 array would turn read-only as soon as one disk failed.
Does anyone know more?
Please, let's not discuss how crazy it was to let my RAID5 run degraded for six months. I didn't know what SMART was until now (believe me, I'm reading the manual).
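The event counters from --examine seem to tell the same story; as far as I understand, mdadm trusts the members with the highest count. A quick check (same devices as above):

```shell
# Compare the superblock event counters across the four members.
# sda10 reports 2552115 while the other three are around 6224199,
# i.e. sda10 is millions of events behind and hopelessly stale.
for p in /dev/sda10 /dev/sdb10 /dev/sdc10 /dev/sdd10; do
    echo -n "$p : "
    mdadm --examine "$p" | grep 'Events'
done
```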

For more information, here is the status of the array:

# mdadm --examine /dev/sdb10 | tail -6
this     1       8       26        1      active sync   /dev/sdb10

   0     0       0        0        0      removed
   1     1       8       26        1      active sync   /dev/sdb10
   2     2       0        0        2      faulty removed
   3     3       8       58        3      active sync   /dev/sdd10

Is there any chance of resyncing the array when only two of the four disks are current?
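From what I've read in the mdadm man page, a forced assembly from the freshest members might be the way to go. Something like this (completely untested on my side, so please correct me before I run it):

```shell
# Stop any half-assembled remnant of the array (harmless if not running)
mdadm --stop /dev/md10

# Force-assemble from the three members with the freshest superblocks.
# sda10 dropped out in June 2011 and is months stale, so it is left out
# on purpose; sdc10 is only slightly behind and --force should accept it.
mdadm --assemble --force /dev/md10 /dev/sdb10 /dev/sdc10 /dev/sdd10

# If it comes up, mount strictly read-only before touching anything
mkdir -p /mnt/recovery
mount -o ro /dev/md10 /mnt/recovery
```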

Any help will be appreciated.

Hi there, me again,

I think my problem lies elsewhere.
I know none of the disks is physically broken, because several other RAID arrays use the same four disks without trouble:

root@titan:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sda1[0] sdd1[3] sdb1[1] sdc1[2]
      1024896 blocks [4/4] [UUUU]

md5 : active raid1 sda5[0] sdd5[3] sdb5[1] sdc5[2]
      1023872 blocks [4/4] [UUUU]

md6 : active raid1 sdc6[2] sda6[0] sdd6[3] sdb6[1]
      1023872 blocks [4/4] [UUUU]

md7 : active raid1 sda7[0] sdd7[3] sdb7[1] sdc7[2]
      1023872 blocks [4/4] [UUUU]

md8 : active raid1 sda8[0] sdd8[3] sdb8[1] sdc8[2]
      1023872 blocks [4/4] [UUUU]

unused devices: <none>

So I thought I should just check the filesystem.
Problem: fsck doesn't work:

root@titan:~# fsck.ext3 /dev/sdc10
e2fsck 1.41.3 (12-Oct-2008)
fsck.ext3: Superblock invalid, trying backup blocks...
fsck.ext3: Bad magic number in super-block while trying to open /dev/sdc10

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

root@titan:~# fsck.ext3 -b 8193 /dev/sdc10
e2fsck 1.41.3 (12-Oct-2008)
fsck.ext3: Bad magic number in super-block while trying to open /dev/sdc10

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

How can I repair the filesystem on /dev/sdc10?
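Or am I pointing fsck at the wrong device? /dev/sdc10 is only one member of the RAID5, so the ext3 data is striped across all four partitions and no single member can hold a valid superblock on its own. Maybe the check has to run on the assembled array instead (an untested guess):

```shell
# fsck must see the whole striped filesystem on /dev/md10,
# not one quarter of it on a raw member partition.
mdadm --assemble --force /dev/md10 /dev/sdb10 /dev/sdc10 /dev/sdd10

# -n: read-only check, answer "no" to every repair prompt
fsck.ext3 -n /dev/md10
```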

Thanks for your help
Santiago