Question about HACMP for active-active mode

Hi all,

I am new to HACMP, so sorry for the newbie question. I did search the forum, and it seems that no one has asked this before.

So if a 2-node cluster runs in active-active mode (with the same application on both nodes), what is the benefit of using HACMP?

If it runs in active-standby mode, it is easy to understand, because you need HACMP to do the failover; but in active-active mode there is no failover (right?), so why do we need HACMP?

Besides, I find that the HACMP documents do not use the terms "active-active" or "active-standby" (cold standby or hot standby) often. So does "concurrent resource group" mean "active-active" and "cascading/rotating resource group" mean "active-standby"? Is there a cold standby?

And what is the relation between "concurrent resource group" and "concurrent shared disk access", and between "cascading/rotating resource group" and "non-concurrent shared disk access"?

Thanks!

HACMP terminology has changed now and then. Some terms were used with the older HACMP versions 3 and 4, some came into existence with the current HACMP version 5, while others ceased to exist. The same is true of certain functions/possibilities. IBM-speak itself can be a bit challenging, but when you mix up words from different eras you end up in a mess.

An active-active cluster has (at least) two nodes and two resource groups (RGs). So there is one RG per node, and only during takeover/failover are both RGs on one node. This is a very good setup because during normal operation each RG (i.e. its application) has its own dedicated CPU/RAM/IO. Furthermore, not everybody can afford to run (i.e. pay for) two servers while using only one. It does not at all mean that no takeover/failover is possible.
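A minimal sketch of that takeover behaviour in plain Python (illustrative model only, not HACMP code; the node and RG names are invented):

```python
# Toy model: two nodes, one resource group (RG) per node. When a node
# fails, its RG is acquired by the surviving node, so only then do both
# RGs run on one node -- otherwise each RG has its node to itself.

def takeover(placement, failed_node, surviving_node):
    """Move every RG hosted on failed_node to surviving_node."""
    return {
        rg: (surviving_node if node == failed_node else node)
        for rg, node in placement.items()
    }

# Normal operation: each node runs its own RG.
placement = {"RG_A": "nodeA", "RG_B": "nodeB"}

# nodeA dies: nodeB now hosts both resource groups.
placement = takeover(placement, "nodeA", "nodeB")
print(placement)  # {'RG_A': 'nodeB', 'RG_B': 'nodeB'}
```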

A concurrent HACMP RG means Oracle RAC. I am not aware of another application that is able to run as a concurrent HACMP cluster. While two or more nodes are active at a time, this is not usually called an active-active configuration, as no takeover is involved. In a concurrent cluster a node may die, but the other(s) keep on working regardless.

Cascading/rotating relates to RG behaviour. If there is a fixed order in which the RG is taken over from node to node (usually in a cluster with 3+ nodes that differ, so that using certain nodes is preferable), this is cascading behaviour. If the RG does not care on which node it is active, i.e. it can be active on any node, a rotating RG might be fine.
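A toy sketch of the two policies in plain Python (illustrative only, not HACMP code; the node names and priority list are invented). The key difference: a cascading RG follows a fixed node-priority order and falls back when a higher-priority node rejoins, while a rotating RG stays wherever it landed:

```python
# Toy sketch of cascading vs. rotating fallover behaviour.

def cascading_target(priority_list, up_nodes):
    """Cascading: the RG always runs on the highest-priority node
    that is currently up (so it falls back when that node rejoins)."""
    for node in priority_list:
        if node in up_nodes:
            return node
    return None  # no node available for the RG

def rotating_target(current_node, priority_list, up_nodes):
    """Rotating: the RG stays put if its node is up; otherwise it
    moves to the next available node and does not fall back later."""
    if current_node in up_nodes:
        return current_node
    return cascading_target(priority_list, up_nodes)

prio = ["node1", "node2", "node3"]
# node1 is down, so the RG runs on node2 under either policy:
assert cascading_target(prio, {"node2", "node3"}) == "node2"
# node1 rejoins: a cascading RG falls back to it ...
assert cascading_target(prio, {"node1", "node2", "node3"}) == "node1"
# ... but a rotating RG that moved to node2 simply stays there:
assert rotating_target("node2", prio, {"node1", "node2", "node3"}) == "node2"
```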

Concurrent volume groups are rarely used these days. Today you are more likely to use Enhanced Concurrent Mode (ECM) VGs. And Oracle RAC can use an ECM VG in a concurrent cluster.

As you might have guessed by now: some reading may be required. If you are an absolute beginner with HACMP, you might start by reading the HACMP cookbook from December 2005, followed by the PowerHA cookbook (yes, the name changed with 5.5 ...) from September 2009.

Thanks for answering my question!

But back to my original question: if a concurrent HACMP RG means Oracle RAC, and since no takeover is involved (and RAC 10g has its own clusterware), what is the benefit of using HACMP with RAC?

The other question: you said that during failover in an active-active cluster, two RGs would be on one node. Then I think these two RGs should contain different applications, because if they are the same application (like RAC) I don't see the value of running two instances of the same application on one node, right?

Thanks.

That was your original question? :smiley:

As always ... The decision on whether to use HACMP or not could be based on which filesystem you want to use. For Oracle RAC you need HACMP neither when using ASM nor when using GPFS. However, IIRC you need ECM VGs if you put Oracle datafiles on concurrent raw LVs.
What also might count is your IT personnel's expertise. Do your administrators have good knowledge of GPFS/ASM? Then use it. However, if your administrators have good knowledge of HACMP, then ECM might be preferable.

One application cannot be member of two RGs.

Thanks for the answers, and it looks like I have got a lot of reading to do! Man :smiley:

BTW my original question was "what is the benefit of running HACMP for an active-active cluster with the same app running" :slight_smile: and it looks like RAC may be the only case here, right?

And I did find another term HACMP uses, "mutual takeover", which seems to be relevant to my question. So my understanding of "mutual takeover" is an active-active cluster with different apps running (shared-nothing mode). It makes sense to use HACMP here. (And it also makes sense why they don't use the term "active-active" now.) But is my understanding correct, or did I get it totally wrong? :mad:

Hi,

there is Sybase Cluster Edition, which works similarly to Oracle RAC.
In addition, there are a lot of vendor applications out there that have built-in HA facilities and can/should run in a concurrent resource group - not for availability, but for load-balancing reasons. Here the cluster facilities are not used for the applications themselves but for the infrastructure that points to the application. If one node goes down, the service IP just starts pointing to the other node - the application, though down on the failed node, remains available. A great solution for any kind of application that doesn't produce volatile data at all.
We have several clusters where WebSphere and MQ run active on both nodes - serving different functions - and again fail over to the other node when required - serving different clients and different applications - and yes, we do want different instances of the same application to be up on one node if the other node dies.
We also have lots of Sybase databases in a HACMP setup that have the same content but serve different purposes. Sybase Replication Server replicates the content of database 1 into database 2. As long as both nodes are up, database 1 is used for trading and database 2 for reporting - when one of the nodes dies, it fails over to the other node - replication keeps running - and you can still use both databases for these different purposes. Nothing in the listeners, pointer files or anything else has to change, because the IP of the application doesn't change ... and additional resources are just added via DLPAR scripts if required - so nothing is starving ...
The possibilities are countless, and so are the benefits of having live-live clusters :slight_smile:

Kind regards
zxmaus

Thanks for the reply! Man, "possibilities are countless" but the learning curve is just :eek:

So another newbie question: when failover happens, should I let the failed node auto-reboot, or should I reboot it manually, say in a production environment?

And if I choose auto-reboot, can HA software like HACMP help?

I can hardly think of a scenario where an auto-reboot would fix the problem.
If you have a failing network, a reboot doesn't help - nor with failing storage. And even with a failed application - like e.g. a database - a reboot would only help if someone had already solved the root cause of the failure (that is, btw, also true for auto-failover of a DB - it might not be a good idea at all). If your node fails due to overcommitted memory, the reboot might clear the caches - but the fact that you don't have enough memory, or have bad tunables, would remain, and the problem would soon re-occur.
Kind regards
zxmaus

It seems to me that your real problem is adapting to the terminology of HACMP, in this case namely the term "resource group".

Suppose the classical case with no HACMP involved: you have a system where an application runs. The application has some disks holding its data, and the data are all in a volume group. Further, you have some network adapter (an IP address) over which the users know they can contact the application.

Now let's abstract from this concept: what do we need to transfer to another system in order to transfer the task of serving the application? Answer: the volume group with the data, the IP address the users expect, and some start/stop mechanism for the application - that's it. Exactly this is what is called a resource group. It is a data structure inside the HACMP configuration which binds together exactly these things.
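To make that concrete, here is the same idea as a toy data structure in plain Python (illustrative only, not HACMP configuration; the VG name, service IP and script paths are all invented for the example):

```python
# Toy model of a resource group: the bundle of things that must move
# together for another node to take over serving the application.
resource_group = {
    "name": "RG_A",
    "volume_group": "datavg",          # the VG holding the application's data
    "service_ip": "10.0.0.50",         # the address users connect to
    "app_start": "/usr/local/bin/start_app",
    "app_stop": "/usr/local/bin/stop_app",
}

def takeover_steps(rg, target_node):
    """What a takeover conceptually does on the acquiring node."""
    return [
        f"vary on volume group {rg['volume_group']} on {target_node}",
        f"alias service IP {rg['service_ip']} onto an adapter of {target_node}",
        f"run {rg['app_start']} on {target_node}",
    ]

for step in takeover_steps(resource_group, "nodeB"):
    print(step)
```

The users never notice which node does the work, because the service IP and the data travel with the resource group.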

Resource groups always reside on one node of a cluster and can be transferred to another node in case the active node dies. This is what shockneck tried to explain to you with the active-active configuration: suppose your default is node A with some resource group A' operative and node B with an RG B' operative. If node A dies, RG A' is transferred to node B, which will then run both; vice versa if node B dies.

Yes, "mutual takeover" is just another name for this mechanism. But however you call it, it helps to picture the RGs moving around on the nodes in case of hardware outages.

I hope this helps.

bakunin

Thanks for the answer.

Actually, what I did not understand before was the value of running two instances of the same application on one node after failover. If the resource group contains the same application on both nodes, why not just prevent takeover here? From the client's point of view the service still exists, while running two identical instances on one node may overload it.

As zxmaus pointed out, two identical instances can serve different purposes (say, database 1 for trading, database 2 for reporting). I think that makes sense. But if the two instances serve the same purpose, I still don't see the value in it.