AIX hard disk failure

Hi all,

I have run into a hard disk problem: the disk has failed and needs to be replaced with a new one.

As I understand it, I just take out the failed disk and insert the new one, and that's all.

But the third-party hardware vendor said there should be an additional procedure in AIX for this.
I have attached a screenshot of the disk check.

Please advise if I am missing something.

I am only guessing, but I think you might need to format the new disk with the correct filesystem format and also correctly partition the disk before you swap them out.

Hi Neo,

On further checking, I can see it is a local disk in a RAID 5 array, under "Change/Show PCI-X SCSI pdisk" in smitty. As shown below, it is a disk inside the hdisk5 SCSI RAID 5 disk array.

xxx@/#lsdev -Cc disk
hdisk0 Available 04-08-ff-0,0 SCSI RAID 10 Disk Array
hdisk2 Available 00-08-02     1814     DS4700 Disk Array Device
hdisk3 Available 00-08-02     1814     DS4700 Disk Array Device
hdisk4 Available 00-08-02     1814     DS4700 Disk Array Device
hdisk5 Available 04-08-ff-0,1 SCSI RAID 5 Disk Array

So I thought I would just take out the missing disk, insert the new disk and format it with "Create an Array Candidate pdisk and Format to 522 Byte Sectors". But I'm not sure about this.

 
Change/Show PCI-X SCSI pdisk

Move cursor to desired item and press Enter. Use arrow keys to scroll.

  0940-038 scsi2: Open not attempted. Device not Available.
  0940-038 scsi3: Open not attempted. Device not Available.
  pdisk0    04-08-00-3,0  Active      Array Member     142.8GB
  pdisk1    04-08-00-4,0  Active      Array Member     142.8GB
  pdisk2    04-08-00-5,0  Active      Array Member     142.8GB
  pdisk8    04-08-00-8,0  Active      Array Member     142.8GB
  pdisk4    04-08-01-3,0  Active      Array Member     142.8GB
  pdisk5    04-08-01-4,0  Missing     Disk             142.8GB
  pdisk7    04-08-01-8,0  Active      Array Member     142.8GB
  pdisk3    04-08-01-5,0  Active      Array Member     142.8GB

Did you let the system discover the new disk (cfgmgr...) ?

I am a bit confused: is this a disk in your AIX system as you said in #1 or a disk in a separate RAID as you said in #2? Usually, if you have an external RAID, formatting it forms one virtual disk (the whole RAID set), which you in turn see in AIX as a single hdisk device. Please explain your hardware setup (what is connected to what, etc.) in a bit more detail.

If your disk is part of a RAID set which is managed by an external device you need to follow the procedures of this external device. That may be anything, you will have to look it up in the respective manual of the device.

Notice that the following only applies if the system is not part of a cluster!

If the disk is directly attached to the system (that basically means you have a hdisk device /dev/hdiskNN for this single disk) you CANNOT simply remove it! Disks are uniquely identified by a "PVID" (physical volume ID) and AIX will notice that the new disk is not the old disk, regardless of them being identical hardware. (To be honest, it is in fact possible to de-configure a physically removed disk, but that is really complicated work including manually patching the ODM - you do NOT want to have to do that if you can avoid it.)

The correct way to remove the disk is: identify which volume group it is part of by using the lspv command. Move all LVs occupying space on that disk to other disks by using the lmigratepp command. (If the disk holds only one copy of mirrored LVs, simply remove that copy with rmlvcopy and remirror once the new disk is in place.)

When the disk has no occupied PPs any more (check with lsvg -p <volume-group> ) remove it from the VG with:

reducevg <vg-name> <hdiskNN>

Now - ONLY NOW! - you can pull the hdisk and replace it with the new one. Then run cfgmgr to discover the new disk. Add it to the VG with:

extendvg <vg-name> <hdiskNN>

This will format the disk and put a PVID on it in the process.

Now you can use the space provided by the new disk. If you have deleted a mirror from a LV before, create a new mirror using the mklvcopy command. If you want to move a whole (unmirrored) LV to the new disk: create a mirror copy the same way and remove the original. This is faster than moving single PPs around.
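The order of the steps above matters more than the individual commands, so here is a minimal dry-run sketch of it for the mirrored-LV case. It only prints the commands and executes nothing. The names datavg, datalv and hdisk9 are made-up placeholders, not values from this thread, and the final syncvg is my own addition (a fresh mirror copy normally has to be synchronized); check each command against the man pages of your AIX level before running anything.

```shell
#!/bin/sh
# Dry-run sketch: prints the planned commands, runs nothing.
# VG, LV, OLD and NEW are hypothetical placeholder names.
VG=datavg
LV=datalv
OLD=hdisk9
NEW=hdisk9   # the new disk may come up under a different hdisk number

plan_disk_replacement() {
    echo "lspv -l $OLD"          # which LVs occupy space on the old disk
    echo "rmlvcopy $LV 1 $OLD"   # keep 1 copy, drop the one on the old disk
    echo "lsvg -p $VG"           # confirm the old disk has no used PPs left
    echo "reducevg $VG $OLD"     # take the disk out of the VG
    echo "# only now: physically swap the disk"
    echo "cfgmgr"                # discover the new disk
    echo "extendvg $VG $NEW"     # add it to the VG (writes a PVID)
    echo "mklvcopy $LV 2 $NEW"   # recreate the mirror copy
    echo "syncvg -v $VG"         # assumption: synchronize the new copy
}

plan_disk_replacement
```

Printing the plan first makes it easy to review that reducevg comes before the physical swap and extendvg after it.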

I hope this helps.

bakunin

Hi Bakunin,

--> My hardware setup is a RAID 5 disk array, hdisk5, made up of the pdisks listed below.

hdisk0    04-08-ff-0,0  Optimal     RAID 10 Array    142.8GB
 pdisk0   04-08-00-3,0  Active      Array Member     142.8GB
 pdisk1   04-08-00-4,0  Active      Array Member     142.8GB

hdisk5    04-08-ff-0,1  Optimal     RAID 5 Array     714.3GB
 pdisk2   04-08-00-5,0  Active      Array Member     142.8GB
 pdisk8   04-08-00-8,0  Active      Array Member     142.8GB
 pdisk4   04-08-01-3,0  Active      Array Member     142.8GB
 pdisk3   04-08-01-5,0  Active      Array Member     142.8GB
 pdisk7   04-08-01-8,0  Active      Array Member     142.8GB
 pdisk5   04-08-01-4,0  Active      Array Member     142.8GB

My volume group

xxx@/#lsvg -p YYYY
SSAMvg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk5            active            532         45          00..00..00..00..45

My raid configuration

0940-038 scsi2: Open not attempted. Device not Available.
0940-038 scsi3: Open not attempted. Device not Available.
------------------------------------------------------------------------
Name      Location      State       Description        Size
------------------------------------------------------------------------
sisioa0   04-08         Available   PCI-X Dual Channel U320 SCSI RAID Adapter
 scsi0    04-08-00-07,0 NoLink      No remote adapter target
 scsi1    04-08-01-07,0 NoLink      No remote adapter target

hdisk0    04-08-ff-0,0  Optimal     RAID 10 Array    142.8GB
 pdisk0   04-08-00-3,0  Active      Array Member     142.8GB
 pdisk1   04-08-00-4,0  Active      Array Member     142.8GB

hdisk5    04-08-ff-0,1  Optimal     RAID 5 Array     714.3GB
 pdisk2   04-08-00-5,0  Active      Array Member     142.8GB
 pdisk8   04-08-00-8,0  Active      Array Member     142.8GB
 pdisk4   04-08-01-3,0  Active      Array Member     142.8GB
 pdisk3   04-08-01-5,0  Active      Array Member     142.8GB
 pdisk7   04-08-01-8,0  Active      Array Member     142.8GB
 pdisk5   04-08-01-4,0  Active      Array Member     142.8GB

We pulled out the failed disk pdisk5 and added a new hard disk.
Everything was fine until we tried to add the new disk to the RAID and got the message below:

hdisk5 changed. hdisk 5 has been expanded. However, hdisk5 needs to be unconfigured and reconfigured prior to the system being able to use the increased capacity.
Note: the volume group, logical volumes, and file systems associated with hdisk5 might need to be changed in order to make use of the increased capacity.

After some checks, I can see that hdisk5 has become bigger, with the new size, but the VG still has the old size.

It seems that the "physical" layer of the RAID is already reconfigured. Perhaps you have re-read the configuration too with the cfgmgr command and hence hdisk5 (this is the "logical" representation of the whole RAID) has become bigger. Anyways, you can make sure that the "new" hdisk5 is identified correctly in all its aspects.

Unmount the filesystems of the VG, then do a varyoffvg <VG> . Then delete the hdisk device and rediscover it:

rmdev -Rl hdisk5
cfgmgr

Now you need to tell the volume manager that the VG has changed. Issue a

chvg -g <volume-groupname>

which should do the trick. I am not sure if the VG needs to be varied on or off for that, so try in varyoffvg mode and if you get an error do a varyonvg <VG> and try again.
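Put together, the sequence would look roughly like the dry-run sketch below. It only prints the commands. The mount point is a placeholder, and running chvg -g after varyonvg is my assumption about the order, since i am not sure in which state the VG has to be.

```shell
#!/bin/sh
# Dry-run sketch: prints the planned commands, runs nothing.
# /your/filesystem is a placeholder; VG and DISK are taken from this thread.
VG=SSAMvg
DISK=hdisk5

plan_rediscover() {
    echo "umount /your/filesystem"   # repeat for every filesystem in the VG
    echo "varyoffvg $VG"
    echo "rmdev -Rl $DISK"           # unconfigure the device, not its contents
    echo "cfgmgr"                    # rediscover the disk with its new size
    echo "varyonvg $VG"              # assumption: chvg -g wants the VG online
    echo "chvg -g $VG"               # let the volume manager pick up the growth
}

plan_rediscover
```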

Ah, a last thing:

DON'T DO THAT!

In this case you were lucky, but generally - as i wrote above - it is a bad idea to remove disks which are still known to the system. Always deconfigure them first and pull them only then.

I hope this helps.

bakunin

Hi Bakunin,

rmdev -Rl hdisk5
cfgmgr

Can we actually remove it, and will it be rediscovered? So this action just deletes the device file, not the data?

We pulled out the failed disk pdisk5 and added a new hard disk.

At first, the newly added disk was recognized as hdisk1. Then I formatted it with

Create an Array Candidate pdisk and Format to 522 Byte Sectors

and it became the array candidate pdisk5. Then I used

Add Disks to an Existing PCI-X SCSI Disk Array

The disk could be added, but I got a warning along the lines of "the disk is not used for parity and not restriped". I did not capture the exact output.
--> Does this mean that the new disk is only used for data, and not for parity and striped data as in normal RAID 5 behavior?

I think I should have tried the option below first instead of "Add Disks to an Existing ...", because after adding the disk, the reconstruct option has no effect.

Reconstruct a PCI-X SCSI Disk Array

Here is the menu list:

  List PCI-X SCSI Disk Array Configuration
  Create an Array Candidate pdisk and Format to 522 Byte Sectors
  Create a PCI-X SCSI Disk Array
  Delete a PCI-X SCSI Disk Array
  Add Disks to an Existing PCI-X SCSI Disk Array
  Configure a Defined PCI-X SCSI Disk Array
  Change/Show Characteristics of a PCI-X SCSI Disk Array
  Reconstruct a PCI-X SCSI Disk Array
  Change/Show PCI-X SCSI pdisk Status
  Diagnostics and Recovery Options

I followed this for what I have done:
IBM Knowledge Center Error

And this for the rebuild --> PCI-X SCSI RAID Controller Reference Guide for AIX. I did not actually get the chance to use it, as mentioned above.
IBM Knowledge Center Error

Do you have any idea about this?

So I just wonder: as you recommended, we should unmount all filesystems, but does this also mean downtime for the application, so not really a "hot swap"? In which cases can we do an online replacement? From what I have read, the disk is hot-swappable and the replacement can be done online, right? Please advise.

If you look at the pdisks in the RAID 5 array hdisk5:

hdisk5    04-08-ff-0,1  Optimal     RAID 5 Array     714.3GB
 pdisk2   04-08-00-5,0  Active      Array Member     142.8GB
 pdisk8   04-08-00-8,0  Active      Array Member     142.8GB
 pdisk4   04-08-01-3,0  Active      Array Member     142.8GB
 pdisk3   04-08-01-5,0  Active      Array Member     142.8GB
 pdisk7   04-08-01-8,0  Active      Array Member     142.8GB
 pdisk5   04-08-01-4,0  Active      Array Member     142.8GB

We can see it has 6 pdisks: 6 x 142.8 = 856.8 GB.
But with RAID 5, the usable size is that of (number of disks - 1), i.e. 5 x 142.8 = 714 GB. That matches the 714.3 GB above.

So the OS should see hdisk5 and its VG as 714 GB instead of only about 540 GB.

xxx@/#lsvg -p SSAMvg
SSAMvg:
PV_NAME           PV STATE          TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk5            active            532         45          00..00..00..00..45


xxx@/#lsvg  SSAMvg
VOLUME GROUP:       SSAMvg                   VG IDENTIFIER:  00096f540000d7000000015371bd6d50
VG STATE:           active                   PP SIZE:        1024 megabyte(s)
VG PERMISSION:      read/write               TOTAL PPs:      532 (544768 megabytes)
MAX LVs:            256                      FREE PPs:       45 (46080 megabytes)
LVs:                8                        USED PPs:       487 (498688 megabytes)
OPEN LVs:           8                        QUORUM:         2
TOTAL PVs:          1                        VG DESCRIPTORS: 2
STALE PVs:          0                        STALE PPs:      0
ACTIVE PVs:         1                        AUTO ON:        yes
MAX PPs per VG:     32512
MAX PPs per PV:     1016                     MAX PVs:        32
LTG size (Dynamic): 256 kilobyte(s)          AUTO SYNC:      no
HOT SPARE:          no                       BB POLICY:      relocatable
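To double-check the arithmetic, here is a small sketch using only the numbers from the listings above. The function names are made up for illustration. One thing it shows, under the assumption that the adapter reports decimal GB while LVM counts binary megabytes: the current VG size comes out at about 571.2 GB, which is what a RAID 5 array of five 142.8 GB members would give. This is only one possible reading and not something the outputs above confirm.

```shell
#!/bin/sh
# Capacity arithmetic with the figures from this thread.
# Function names are hypothetical helpers, not AIX commands.

raid5_usable_gb() {
    # $1 = number of member pdisks, $2 = size of one pdisk in GB
    LC_ALL=C awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", (n - 1) * s }'
}

vg_size_gb() {
    # $1 = total PPs, $2 = PP size in MB; result in binary GB (MB / 1024)
    LC_ALL=C awk -v p="$1" -v m="$2" 'BEGIN { printf "%.1f\n", p * m / 1024 }'
}

vg_size_decimal_gb() {
    # same size in decimal GB, assuming 1 MB = 1048576 bytes
    LC_ALL=C awk -v p="$1" -v m="$2" 'BEGIN { printf "%.1f\n", p * m * 1048576 / 1e9 }'
}

raid5_usable_gb 6 142.8       # prints 714.0 (six members)
raid5_usable_gb 5 142.8       # prints 571.2 (five members)
vg_size_gb 532 1024           # prints 532.0
vg_size_decimal_gb 532 1024   # prints 571.2
```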

It does a bit more than that: it cleans out ODM entries regarding the disk and so on. They are created anew by cfgmgr. But you are right insofar as the device is deleted, not its contents.

Of course it is created as hdisk1 - the system sees a (new) single disk and creates a device file for it.

[quote="phat,post:8,topic:375040"]

Then it became the array candidate pdisk5. Then I used

Add Disks to an Existing PCI-X SCSI Disk Array

[/quote]

Look - i am not all too proficient with the SMITty menus (i use it once a year maybe) and i have no AIX system at hand right now to look it up (save for the fact that in order to get these menus you have to install special software like the RAID driver). When you get a menu, instead of executing it you can press <F6> and display the command (or scriptlet) that would be executed. In addition you can look into the file ~root/smit.script to see what SMITty has executed before. I have no idea what the SMITty entry you quoted does.

Probably - but since this would violate what a RAID does and how it does it i suppose the disk is not used at all, neither for data nor parity.

To be honest: i have no idea. But you probably should "Change/Show PCI-X SCSI pdisk Status", and/or "Diagnostics and Recovery Options"

[quote="phat,post:8,topic:375040"]
I followed this for what I have done:
IBM Knowledge Center Error

And this for the rebuild --> PCI-X SCSI RAID Controller Reference Guide for AIX. I did not actually get the chance to use it, as mentioned above.
IBM Knowledge Center Error

Do you have any idea about this?

[/quote]

Yes: you should have followed the link on exactly this last linked webpage, where it says "Prepare to remove a disk drive from a system or expansion unit controlled by AIX", and followed those instructions first. It says essentially what i told you too: do not pull a disk physically until it is deconfigured/removed from the system.

Yes, it could have been done online but be aware that you already mistreated the system. Maybe it is still possible to do everything online but out of sheer paranoia (sorry - it's a professional trait) i would take a downtime at this point to make sure everything goes well. You haven't told us anything about your configuration but from what i do see in hardware configuration your system isn't exactly brand new (if i had to guess: POWER5 tops, probably running AIX 5.3. ML?, which would make it about 10-15 years old) so i would be even more paranoid. Probably everything is out of support if my wild guess is true. I haven't seen a RAID on any AIX-system perhaps for 15 years now and a RAID made of such small disks is probably quite old too.

Yes, that is all correctly observed - it emphasizes what i said before: the pdisk is probably not used by the RAID at all. Still, i wonder how the size got reduced when the disk failed - this should not be the case with a RAID. If a disk fails, the array still has the same amount of capacity, just nothing to spare any more.

Did you issue the chvg -g SSAMvg already? Or has the size always been that small? IBM RAIDs use not only data/parity disks but also "Hot Spare" disks, which take over once another disk breaks. List the status of your pdisks; i think their current role should be displayed there.
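If the listing is long, a small filter can pick out the pdisks that need attention. This is only a sketch that works on a pasted copy of the listing; the function name is made up, and it assumes the column layout shown earlier in this thread (name, location, state, description, size).

```shell
#!/bin/sh
# Print every pdisk whose state column is not "Active".
# Input: a pasted copy of the array listing on stdin.
non_active_pdisks() {
    awk '$1 ~ /^pdisk[0-9]+$/ && $3 != "Active" { print $1, $3 }'
}

# Sample rows copied from the first listing in this thread.
non_active_pdisks <<'EOF'
pdisk4    04-08-01-3,0  Active      Array Member     142.8GB
pdisk5    04-08-01-4,0  Missing     Disk             142.8GB
pdisk7    04-08-01-8,0  Active      Array Member     142.8GB
EOF
```

With the sample rows it prints "pdisk5 Missing". The role (Array Member, Disk, and so on) is in the fourth column onwards if you want to print that too.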

I hope this helps.

bakunin

if i had to guess: POWER5 tops, probably running AIX 5.3. ML?, which would make it about 10-15 years old) so i would be even more paranoid. Probably everything is out of support if my wild guess is true. I haven't seen a RAID on any AIX-system perhaps for 15 years now and a RAID made of such small disks is probably quite old too.

Yes, that's correct. It's POWER5 and AIX 5.3 ML. We work for a customer who uses old technologies; that is their legacy. We work at risk but have no choice.

I actually executed it, but it did not help. It does not seem to recognize the new size:

xxx@/#chvg -g SSAMvg
0516-1382 chvg: Volume group is not changed. None of the disks in the
        volume group have grown in size.
0516-732 chvg: Unable to change volume group SSAMvg.

I checked the status and it seems to be in a good state.