failing drive

I posted some errpt output (see Phone Support) that this forum graciously looked at, confirming what we suspected: one of our RAID5 disks is failing. I have a replacement, but I am having trouble taking the old disk offline. If I try to run Remove a Disk from smit, it says the device is busy. The drive has not died yet. We tried pulling it out, but it still showed as Available in List All Defined Disks.

If I recall correctly from your other post, it was hdisk2?
First you need to know whether it is part of a VG:

lspv | grep hdisk2
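
A plain grep for hdisk2 would also match hdisk20, hdisk21 and so on, so an exact match on the first column may be safer. A minimal sketch, assuming the usual lspv column layout (disk name, PVID, VG name, state); the helper name pv_vg and the VG names in the usage line are only for illustration:

```shell
# Print the volume group a disk belongs to, given `lspv` output on stdin.
# Assumed columns: disk name, PVID, VG name, state.
pv_vg() {
    # Exact match on column 1, so hdisk2 does not also match hdisk20.
    awk -v disk="$1" '$1 == disk { print $3 }'
}

# On the AIX box this would be used as:
#   lspv | pv_vg hdisk2
```

If the disk is not assigned to any VG, the third column should read None instead of a VG name.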

Next check if it is mirrored:

lsvg -l <nameofthatvg>

You will get a list with columns called LPs and PPs.
If the PPs are a multiple (2x or 3x) of the LPs, you have set up a mirror, which is good and makes it easy. You'll have to check that for all LVs/FSs that are listed.

If you don't have a mirror, you'll have to varyoffvg the VG and replace the disk, but you'll have to restore from your backup.
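
The LPs/PPs comparison can be scripted instead of read off by eye. This is only a sketch, assuming the standard lsvg -l layout (one VG line, one header line, then LV NAME, TYPE, LPs, PPs, PVs, LV STATE, MOUNT POINT); the function name mirror_report is made up for this example:

```shell
# Read `lsvg -l <vg>` output on stdin and print the copy count per LV.
# Assumed columns: LV NAME, TYPE, LPs, PPs, PVs, LV STATE, MOUNT POINT.
mirror_report() {
    # NR > 2 skips the "<vg>:" line and the column header line.
    awk 'NR > 2 && $3 > 0 {
        if ($4 % $3 == 0)
            printf "%s copies=%d\n", $1, $4 / $3
        else
            printf "%s PPs not a multiple of LPs\n", $1
    }'
}

# On the AIX box this would be used as:
#   lsvg -l <nameofthatvg> | mirror_report
```

copies=1 means that LV is not mirrored; copies=2 or copies=3 means it is.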

If it is mirrored as described above, you can just do the following, as "man unmirrorvg" says:

       3    To replace a bad disk drive in a mirrored volume group, enter:

            unmirrorvg workvg hdisk7
            reducevg workvg hdisk7
            rmdev -l hdisk7 -d
            replace the disk drive, let the drive be renamed hdisk7
            extendvg workvg hdisk7
            mirrorvg workvg
            Note: By default in this example, mirrorvg will try to create 2 copies for logical
            volumes in workvg. It will try to create the new mirrors onto the replaced disk
            drive. However, if the original system had been triply mirrored, there may be no
            new mirrors created onto hdisk7, as other copies may already exist for the logical
            volumes. This follows the default behavior of unmirrorvg to reduce the mirror copy
            count to 1. Note: When unmirrorvg workvg hdisk7 is run, hdisk7 will be the
            remaining drive in the volume group. This drive is not actually removed from the
            volume group. You must run the migratepv command to move the data from the disk
            that is to be removed from the system to disk hdisk7.
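
For a different VG or disk, the same sequence can be printed as a checklist before anything is run. This is a dry-run sketch only: it echoes the commands and executes none of them. The cfgmgr step is my assumption for how the new drive gets rediscovered after the swap (it is not part of the man page excerpt), and the function name replace_steps is made up:

```shell
# Print (do not run) the replacement steps for a mirrored VG.
# Usage: replace_steps <vg> <hdisk>
replace_steps() {
    vg=$1
    disk=$2
    echo "unmirrorvg $vg $disk"
    echo "reducevg $vg $disk"
    echo "rmdev -l $disk -d"
    echo "# swap the physical drive, then rediscover it (assumed step):"
    echo "cfgmgr"
    echo "extendvg $vg $disk"
    echo "mirrorvg $vg"
}

replace_steps workvg hdisk7
```

The new drive may come back under a different hdisk name, so check with lspv before running the extendvg step.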

Exactly how this is done depends on the RAID adapter (more precisely: the adapter's driver software), so I can give you only general directions.

If the failing disk is part of a RAID you will probably not be able to manage the disk device itself. A RAID works like this: there are several disks connected to an adapter. The driver software of the adapter makes one big virtual disk out of the several physical ones and presents this virtual construct as a physical disk to the machine. (This is what is done during the "RAID initialization", or whatever your software calls it. The driver/adapter writes some bookkeeping information onto the physical disks to be able to use them in the described way.)

Only this virtual disk is added to a VG as a "Physical Volume" and from there on normal LVM procedures apply.

Your first task is to make the PV free from OS access. You can do this by either breaking the mirror (if the VG is mirrored) or by varying off the VG as zaxxon suggested. Since the "disk" in the VG is only a virtual construct there is no strict relationship between disks and logical volumes. All the logical volumes on the virtual RAID disk are "smudged across" the physical disks comprising the RAID.

After this you need to use the adapter's driver software (in the case of the IBM SCSI RAID adapter this is plugged into SMITty and the diag utility) to remove the disk from the RAID, after which the RAID is in status "reduced". Then physically change the disks and add the new disk to the RAID. This will probably take some time, as the data has to be rebuilt onto the new disk before it is useful in the RAID. Only then vary on the VG again and start using it.
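
The "device is busy" message usually means something in the VG is still open, so the filesystems have to be unmounted before varyoffvg will succeed. The following sketch prints the commands that would take the VG offline without running them. It assumes the standard lsvg -l column layout; datavg in the usage line and the function name offline_plan are hypothetical:

```shell
# Read `lsvg -l <vg>` output on stdin and print the commands that would
# take the VG offline. Nothing is executed; review the list before use.
# Assumed columns: LV NAME, TYPE, LPs, PPs, PVs, LV STATE, MOUNT POINT.
offline_plan() {
    vg=$1
    # Keep only open LVs that have a real mount point.
    awk 'NR > 2 && $6 ~ /^open/ && $7 != "N/A" { print $7 }' |
        sort -r |                  # nested mount points come before their parents
        while read -r mp; do
            echo "umount $mp"
        done
    echo "varyoffvg $vg"
}

# On the AIX box this would be used as:
#   lsvg -l datavg | offline_plan datavg
```

After the adapter has finished rebuilding, varyonvg on the same VG followed by mounting the filesystems brings it back. An active paging space in the VG would also keep it busy and is not handled by this sketch.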

Do you need to back up? In principle you don't, because a RAID5 stores its data with redundancy (parity) spread across all the disks. The classical case is 5 disks holding the capacity of 4 - for this penalty it is possible to replace any single disk without losing data, because the data it holds can be reconstructed from the other 4. This does NOT mean that a backup would be a bad idea: not at all! It is better to have a backup you don't need than to need a backup you don't have.
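
To illustrate why one disk out of five can be lost: RAID5 parity is an XOR over the blocks of a stripe, and XOR-ing the survivors gives the missing block back. A toy sketch with one small integer per "disk" (the values and the function name raid5_rebuild are arbitrary, and a real adapter rotates the parity across the disks):

```shell
# XOR together all blocks passed as arguments.
# Used both to compute parity and to rebuild a missing block.
raid5_rebuild() {
    result=0
    for block in "$@"; do
        result=$(( result ^ block ))
    done
    echo "$result"
}

# One stripe across 5 disks: 4 data blocks plus their XOR parity.
d1=60 d2=165 d3=15 d4=240
parity=$(raid5_rebuild "$d1" "$d2" "$d3" "$d4")

# Disk 3 fails: XOR of the three surviving data blocks and the parity
# yields the lost block again.
recovered=$(raid5_rebuild "$d1" "$d2" "$d4" "$parity")
echo "parity=$parity recovered=$recovered original=$d3"
```

The same arithmetic also shows why a second failure during the rebuild is fatal: with two blocks missing there is nothing left to XOR them out of.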

I hope this helps.

bakunin

Bakunin is right - I hadn't seen that it's a disk from a RAID5, so forget about checking the number of LPs and PPs.

Thanks a lot for your useful advice. We are really in a bind here. The IBM server this is located on has a RAID adapter, but it turns out the vendor never configured it! I am looking into installing an external SCSI drive and dumping to it.