[Howto] Update AIX in HACMP cluster-nodes

bakunin · March 9, 2013, 7:31pm

As I have updated a lot of HACMP-nodes lately, the question arose how to do it with minimal downtime. Of course it is easily possible to schedule a downtime and do the version update during it. In the best of worlds you always get the downtime you need. Unfortunately we have yet to find this best of worlds.

The following procedure is proven to work with AIX 5.3, 6.x and 7.x and the associated HACMP/PowerHA versions. It needs only one takeover, so the downtime is somewhere between under a minute and a few minutes, depending on the nature of your resource group(s).

Communication in HACMP happens via RSCT, and for a cluster to work the versions of the RSCT-packages have to be in sync. Fortunately it is easy to update RSCT independently of the rest of the OS. This is what this procedure depends on. We will consider a dual-node cluster with an active and a standby-system (rotating cluster), but the procedure can easily be adapted to other cluster-architectures.
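Since everything hinges on the RSCT levels matching, it can help to compare them between the nodes before and after the RSCT update. The following is only a minimal sketch in Python: it assumes you have captured colon-separated fileset listings from both nodes (the kind `lslpp -Lc` prints, with the fileset name in the second field and the level in the third). The function names and the sample levels are made up for illustration.

```python
def rsct_levels(listing):
    """Return {fileset: level} for all rsct.* filesets in a colon-separated listing."""
    levels = {}
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split(":")
        if len(fields) < 3:
            continue  # not a fileset line
        fileset, level = fields[1], fields[2]
        if fileset.startswith("rsct."):
            levels[fileset] = level
    return levels


def rsct_mismatches(listing_a, listing_b):
    """Return {fileset: (level_on_a, level_on_b)} for every rsct fileset that differs."""
    a = rsct_levels(listing_a)
    b = rsct_levels(listing_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real system.
    node_a = "#Package Name:Fileset:Level:State\nrsct.core:rsct.core.rmc:3.1.4.0:C\n"
    node_b = "#Package Name:Fileset:Level:State\nrsct.core:rsct.core.rmc:3.1.2.0:C\n"
    print(rsct_mismatches(node_a, node_b))
```

An empty result means the RSCT filesets are at the same level on both nodes. A fileset that is missing on one node shows up with `None` on that side.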

1. Stop the clustermanager on the standby-node. This will end the cluster-communication. The remaining node will be on its own.
2. Update the RSCT-packages on both nodes. It won't matter that the communication path over the RSCT-daemons is disrupted, because there is nobody to communicate with anyway.
3. Optional step: If you are of the well and truly paranoid type (like me) you can now restart the clustermanager on the standby-node and do a cluster-synchronization. I never experienced any problems when I tried this procedure in a test-environment and skipped this step, but I still feel better doing it when working on a PROD-system.
4. Stop the clustermanager on the standby-system again (if you restarted it in the optional step) and update the rest of AIX and/or HACMP. Because you made sure the RSCT-daemons are already updated and at the same version on both nodes, it won't do any harm if the versions of the other packages differ.
5. Once the standby-system has finished the update, restart cluster-services there and move the resource-group to the standby-system. This takeover will be your downtime.
6. Now shut down cluster-services on the remaining node and update it. After the update has finished, restart cluster-services and do a cluster-synchronization. You are finished.
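To make the ordering of the steps above easier to follow, here is a small Python model of the procedure. It does not touch a real cluster and is not an HACMP tool. It only tracks, per node, the RSCT level, the level of everything else, whether cluster-services are running and who holds the resource-group. After every step it asserts the two conditions the procedure relies on: nodes with running cluster-services share one RSCT level, and the resource-group is always served. All names and version strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    rsct: str          # level of the RSCT filesets
    other: str         # level of the rest of AIX/HACMP
    cluster_up: bool   # are cluster-services running?
    has_rg: bool       # does this node hold the resource-group?


def check(nodes):
    """Assert the invariants the procedure relies on."""
    talking = [n for n in nodes if n.cluster_up]
    # Nodes that talk to each other must run the same RSCT level.
    assert len({n.rsct for n in talking}) <= 1, "RSCT mismatch in running cluster"
    # Exactly one node holds the resource-group and its cluster-services are up.
    owners = [n for n in nodes if n.has_rg]
    assert len(owners) == 1 and owners[0].cluster_up, "resource-group not served"


def rolling_update(active, standby, new_rsct, new_other):
    """Walk through the six steps and return the number of takeovers."""
    nodes = [active, standby]
    takeovers = 0
    check(nodes)

    standby.cluster_up = False       # step 1: stop clustermanager on standby
    check(nodes)

    active.rsct = new_rsct           # step 2: update RSCT on both nodes
    standby.rsct = new_rsct
    check(nodes)

    standby.cluster_up = True        # step 3 (optional): restart and synchronize
    check(nodes)

    standby.cluster_up = False       # step 4: stop again, update the rest
    standby.other = new_other
    check(nodes)

    standby.cluster_up = True        # step 5: restart, move the resource-group
    active.has_rg, standby.has_rg = False, True
    takeovers += 1
    check(nodes)

    active.cluster_up = False        # step 6: update the former active node
    active.other = new_other
    active.cluster_up = True
    check(nodes)
    return takeovers
```

Calling `rolling_update(Node("nodeA", "3.1.2.0", "7.1.1", True, True), Node("nodeB", "3.1.2.0", "7.1.1", True, False), "3.1.4.0", "7.1.2")` runs through all steps without tripping an assertion and reports a single takeover, which is the one in step 5.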

I hope this helps.

bakunin

DGPickett · March 11, 2013, 11:43am

So, is the HACMP cluster as a whole never down, just degraded to fewer nodes? It seems like with an HA cluster, you can update hosts in rotation and return them to the pool, so only one host is down at a time.

bakunin · March 11, 2013, 8:58pm

Yes. Exactly this is the point.

Yes and no. The point is that the HA-communication is done via RSCT and the versions of the RSCT packages have to be consistent throughout the cluster at all times. This is why you have to split up the cluster into single nodes at one point (precisely the point where you update the RSCT). During this phase communication would not be possible. But as each node is on its own at this time, it doesn't notice this inability to communicate.

I hope this helps.

bakunin

DGPickett · March 12, 2013, 11:30am

It is sad that HA version n+ cannot discover and talk to version n as well as, when available, version n+. Backward compatibility has been a pretty common theme in the industry for many decades. Were they sloppy in their requirements? Is there no message version in the messaging?

MichaelFelt · March 12, 2013, 5:04pm

Looks good. However, have you also verified this with an update to SystemMirror (aka PowerHA v7)? As I understand it, SystemMirror is not using (only?) RSCT, but is using CAA (Cluster Aware AIX) for communication, topology and heartbeats. I do not do much with SystemMirror, so I am asking anyone, just to be sure nobody gets surprised when working with or updating to SystemMirror.