Identify a failed disk in Linux RAID

Good Evening,

Two years ago, I set up an Ubuntu file server for a friend who is an amateur photographer. Basically, the server offers a software RAID-5 that can be accessed remotely from a Mac. Unfortunately, I didn't label the hard drives (i.e. record which physical drive corresponds to which /dev/sdX device).

Now a drive has failed, and the RAID-5 is at risk. I need to find out which physical drive to replace before we can rebuild the array. I have summed up below the procedure I'd follow. It would be great if some Linux software RAID connoisseur could review it. The more eyeballs, the better; besides, Linux RAID is quite new territory for me.

  1. Stop the RAID array
    # umount /dev/md1
    # mdadm -S /dev/md1

  2. Unplug the hard drives one by one. Look in dmesg for the failure events on /dev/sdX. That way the mapping between each physical disk and its /dev/sdX device is revealed step by step (see also the serial-number sketch after this list).

  3. Replace the failed disk, and partition it according to what the array expects.

  4. Rebuild the array with the new disk

    • get the array UUID with mdadm --query --detail /dev/md1
    • assemble the array with the new disk: mdadm --assemble /dev/md1 -u XXX
    • update /etc/mdadm.conf: mdadm --detail --scan >> /etc/mdadm.conf
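
For step 2, before unplugging anything, the mapping can often be read straight off the drives' serial numbers and matched against the stickers on the physical disks. A rough sketch, assuming the smartmontools package is installed:

    # for d in /dev/sd[a-f]; do echo "== $d =="; smartctl -i $d | grep -i serial; done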

You'll find detailed information about the server setup below.

TIA,
Loïc

The setup:

Ubuntu server, 6 SATA hard drives: /dev/sda ... /dev/sdf

Each drive (X=a..f) is partitioned as follows:
/dev/sdX1 type Linux
/dev/sdX2 type swap
/dev/sdX3 type extended
/dev/sdX5 type RAID

The server has 2 software RAID arrays:
/dev/md0 RAID1: /dev/sda1 and /dev/sdb1
/dev/md1 RAID5: /dev/sda5, /dev/sdb5, /dev/sdc5, /dev/sdd5, /dev/sde5, /dev/sdf5

The OS is located on /dev/md0; only application data is located on /dev/md1.

The Failure:

A Fail event had been detected on md device /dev/md1.
It could be related to component device /dev/sdd5.
The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 sde5[4] sdc5[2] sdd5[6](F) sdf5[5] sdb5[1] sda5[0]
      9636429120 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUU_UU]

md0 : active raid1 sdb1[1] sda1[0]
      20506816 blocks [2/2] [UU]

unused devices: <none>

Well, I cannot help you determine which exact physical drive got corrupted, but I can certainly help you figure out which device in the RAID array is faulty.

It's simple; there are several ways to do it:

mdadm --detail /dev/md1 | grep faulty

or,

dmesg | grep -i "disk failure"
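
or, assuming udev has populated /dev/disk/by-id (it does on any recent Ubuntu), you can even map each /dev/sdX to a drive model and serial number without unplugging anything:

ls -l /dev/disk/by-id/ | grep -v part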

For Linux RAID questions, I always consult http://raid.wiki.kernel.org/

Hope this helps!

Gidday,

Just for the record: the procedure to find the physically faulty disk worked. Only step 4 was not correct.

The rebuild was triggered simply with:

# mdadm --manage /dev/md1 --add /dev/sdd5

(disk /dev/sdd had failed, and partition 5 was part of the md1 array)
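
For completeness, the full replacement sequence around that command looks roughly like this (a sketch; /dev/sda stands in for any healthy array member whose partition table can be cloned):

# mdadm --manage /dev/md1 --remove /dev/sdd5    (drop the failed member from the array)
  ... power down, swap the physical disk, reboot ...
# sfdisk -d /dev/sda | sfdisk /dev/sdd          (copy the partition layout onto the new disk)
# mdadm --manage /dev/md1 --add /dev/sdd5       (add the new member; the rebuild starts)
# cat /proc/mdstat                              (watch the recovery progress)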

HTH,
Loïc