Cluster failure reason

Hi guys!
I'm a French IT student working with AIX, and I'm not very fluent in English.

I have a task: write a script to inform the administrator if one of the cluster UCs (central units, i.e. nodes) is not working. I'm not going to ask you for the script ^^'

But I want to make a list of the failure reasons of a cluster (network, process, ...).

If you have any link or answer, that would be great.

Hi,

as far as I know, each node of a cluster will already be monitored by the cluster software itself (Bull ARF, IBM HACMP). In case of a failure the software will inform you, at least with a mail to root. In a simple solution you just have to forward this mail (e.g. using /etc/aliases).
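
For example, forwarding root's mail could be as simple as the entry below (a minimal sketch; the administrator address is just a placeholder):

```
# /etc/aliases -- send everything addressed to root to the administrator
root: admin@example.com
```

After editing the file, rebuild the alias database (on AIX with sendmail -bi, or the newaliases command).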

Regards

Yes, actually the final aim is not to use the IBM cluster software (HACMP, PowerHA).
The aim is to detect a failure in one of the cluster UCs, and to switch manually to another UC.

That's why I'm making the list of all possible reasons for a cluster UC failover.

Okay,

  • LAN (ping, default gateway, routes, ...)
  • SAN (disk errors, failed paths, IO errors (lvm_io_fail), ..)
  • rootvg (disk errors, mirroring)
  • errpt (permanent hardware errors, ...)

Monitoring the errpt for permanent hardware errors is a good start :wink:
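
A rough sketch of such checks in ksh (illustration only; the errpt time window and the grep patterns would have to be adapted to your environment):

```sh
#!/usr/bin/ksh
# sketch: basic node health checks on AIX -- adapt before using

# LAN: can we reach the default gateway?
GATEWAY=$(netstat -rn | awk '$1 == "default" {print $2; exit}')
ping -c 3 "$GATEWAY" >/dev/null 2>&1 || echo "LAN: default gateway $GATEWAY unreachable"

# SAN: any MPIO path that is not Enabled?
lspath | grep -v Enabled && echo "SAN: failed or missing disk paths (see above)"

# rootvg: stale partitions mean the mirror is out of sync
lsvg -l rootvg | grep -i stale && echo "rootvg: stale partitions"

# errpt: permanent hardware errors logged since midnight (timestamp format mmddhhmmyy)
if [ $(errpt -d H -T PERM -s "$(date +%m%d0000%y)" | wc -l) -gt 1 ]; then
    echo "errpt: permanent hardware errors logged"
fi
```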

Regards

Thank you,

I can i check the gatway because the ping and traceroute command don't specify the gateway

---------- Post updated at 03:41 PM ---------- Previous update was at 03:18 PM ----------

Do you know how the IBM "active dead gateway detection" works?
And what the format of its standard output is?

Keep in mind the server may not respond at all (frozen or crashed), in which case your monitoring also has to take an external point of view into account (can I ping it and connect to it...).
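
A hedged sketch of such an external probe, run from another machine (the hostname and port below are placeholders):

```sh
#!/usr/bin/ksh
# sketch: probe a node from the outside -- run this on a different machine
NODE=node1.example.com      # placeholder hostname
PORT=22                     # any TCP service the node is expected to answer on

ping -c 3 "$NODE" >/dev/null 2>&1 || echo "$NODE does not answer ping"

# also try a TCP connect -- a frozen box sometimes still answers ICMP
perl -e '
    use IO::Socket::INET;
    my $s = IO::Socket::INET->new(PeerAddr => $ARGV[0], PeerPort => $ARGV[1],
                                  Proto => "tcp", Timeout => 5);
    exit($s ? 0 : 1);
' "$NODE" "$PORT" || echo "$NODE does not accept TCP connections on port $PORT"
```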

I'd take a different starting point. Instead of looking for possible failures I'd define what resources have to be up and running to say that the cluster node is OK. If you keep in mind that the final goal of a cluster is to guarantee the availability of a service, and not the detection of errors, your script may look a bit different, while the list of resources is pretty much what XrAy wrote.
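
For example, such a resource-driven check could look like this (a sketch; the IP address, filesystems and process name are invented):

```sh
#!/usr/bin/ksh
# sketch: check "is the service available?" instead of "what broke?"
SERVICE_IP=192.0.2.10            # example address the clients use
REQUIRED_FS="/data /appl"        # example filesystems the application needs
REQUIRED_PROC=appserver          # example process that must be running

STATUS=0

ping -c 2 "$SERVICE_IP" >/dev/null 2>&1 || { echo "service IP not reachable"; STATUS=1; }

for fs in $REQUIRED_FS; do
    # the filesystem counts as "up" only if it is a mount point itself
    df -k "$fs" 2>/dev/null | awk -v m="$fs" 'NR==2 && $NF==m {found=1} END {exit !found}' \
        || { echo "filesystem $fs not mounted"; STATUS=1; }
done

ps -e -o comm | grep -qw "$REQUIRED_PROC" || { echo "process $REQUIRED_PROC not running"; STATUS=1; }

exit $STATUS
```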

Yes, exactly. There are more possible failure modes than one brain can imagine, but a very finite list of things your cluster is supposed to be providing and resources it uses to run.

Make sure you are pinging the Persistent IP and NOT the Service IP, because the Service IP will jump between the nodes, whereas the Persistent IP is hard-bound to the node.

You can check the cluster services and the cluster state, and write a wrapper script to send an email if any of those goes south.
And of course take into consideration all the valuable suggestions given by the forum members.
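
A minimal wrapper along those lines might look like this (a sketch assuming the cluster software registers its subsystems in the SRC group "cluster", as PowerHA/HACMP does; the mail address is a placeholder):

```sh
#!/usr/bin/ksh
# sketch: warn the admin when the cluster subsystems are not active
ADMIN=admin@example.com                      # placeholder address

STATE=$(lssrc -g cluster 2>/dev/null)        # subsystem states of the SRC group "cluster"

if ! echo "$STATE" | grep -q active; then
    {
        echo "Cluster subsystems not active on $(hostname):"
        echo "$STATE"
    } | mail -s "cluster problem on $(hostname)" "$ADMIN"
fi
```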

The backbone of high availability is the nodes in a cluster checking each other, so you should look into heartbeats. They are usually implemented by sharing some disk space, like a rather small, concurrently accessible VG: the nodes write a few bits in there, and from the freshness of those bits they can decide who is still up and alive.
Additionally there is heartbeating via network interfaces. Some implementations even use, or used, serial interfaces, etc.
This is an important part of HACMP/PowerHA and other cluster technologies.
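
The disk part of the idea can be sketched in a few lines (illustration only; /dev/hb_lv is a made-up small LV on a VG both nodes can access concurrently, the slots are 512-byte blocks that should be zeroed once before use, and the 30-second window is arbitrary):

```sh
#!/usr/bin/ksh
# sketch of a disk heartbeat: each node keeps incrementing a counter in its
# own 512-byte slot of a shared LV and watches whether the peer's counter
# still changes.

HB_DEV=/dev/hb_lv
MY_SLOT=0        # block this node writes to
PEER_SLOT=1      # block the other node writes to

# writer side (run periodically): increment our counter in our slot
COUNT=$(dd if=$HB_DEV bs=512 skip=$MY_SLOT count=1 2>/dev/null | tr -dc '0-9')
printf '%-512s' $(( ${COUNT:-0} + 1 )) | \
    dd of=$HB_DEV bs=512 seek=$MY_SLOT count=1 conv=notrunc 2>/dev/null

# checker side: the peer counts as alive only if its counter changes
OLD=$(dd if=$HB_DEV bs=512 skip=$PEER_SLOT count=1 2>/dev/null | tr -dc '0-9')
sleep 30
NEW=$(dd if=$HB_DEV bs=512 skip=$PEER_SLOT count=1 2>/dev/null | tr -dc '0-9')
[ "$OLD" = "$NEW" ] && echo "peer heartbeat stale -- node may be down"
```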

Have a look here:
Heartbeating in HACMP - AIX 6.1 Information Center

So, basically you want to rewrite HACMP, yes? Why not, but be warned: there is a reason good cluster software doesn't come by the dozen.

Let us see: a cluster is a device for making some "service" available even in cases of machines failing. So, what is a service?

A "service" is an application you can reach under a certain network address, therefore you need:

one (or more) network addresses,
some filesystems with data,
some processes serving said service.

This, bound into a group, is called a resource group in HACMP terminology.
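
In a home-grown script such a group could be described with a handful of variables (a sketch; every name and address below is invented):

```sh
#!/usr/bin/ksh
# sketch: description of one "resource group" for a home-grown takeover script

RG_NAME=webapp_rg
RG_SERVICE_IP=192.0.2.10               # address the clients connect to
RG_VG=webappvg                         # shared volume group holding the data
RG_FILESYSTEMS="/webapp /webapp/logs"  # filesystems in that VG
RG_PROCESSES="httpd appserver"         # processes providing the service

# Bringing the group online on a node would roughly mean:
#   varyonvg $RG_VG, mount the filesystems, put $RG_SERVICE_IP as an alias
#   onto a network interface (ifconfig enX alias ...), start the processes --
#   and the reverse order to take it offline before the other node takes over.
```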

You also need some device (say, a script, or whatever) telling you when the service is failing. Just checking some processes is problematic, because in some big software package it could happen that a certain process has to stop and another has to start as part of normal operation. Therefore you need, for every resource group, a customised way of telling whether everything is good or not - a so-called application monitor. In its simplest form it will indeed check some processes, but it can be much more sophisticated than that.
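
A trivial application monitor could be a small script returning 0 for "healthy" (a sketch; the process name and the status command are invented placeholders for whatever the real application provides):

```sh
#!/usr/bin/ksh
# sketch of a trivial application monitor: exit 0 = healthy, exit 1 = failed

APP_PROC=appserver                       # placeholder process name

# 1) are the expected processes there?
ps -e -o comm | grep -qw "$APP_PROC" || exit 1

# 2) does the application actually answer?  A process list alone is not
#    enough, so ask the service itself -- here a placeholder status command
#    standing in for the real application's own health check.
/usr/local/bin/appserver_status >/dev/null 2>&1 || exit 1

exit 0
```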

This was the "internal" supervision, taking place on one node. You also need an "external" supervision, where the passive node checks if the active node is still alive. This is done via heartbeats, but it is not always easy to tell, because if the service is not reachable via, say, the network, this could mean that the node is failing or that the connecting network is failing. Taking over in the first case corrects the problem, while doing so in the second will achieve nothing. HACMP therefore uses network heartbeats, serial heartbeats and heartbeats through shared disks (classically SCSI or SSA, nowadays FC networks) in parallel.
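
The takeover decision should therefore look at all heartbeat channels together (a sketch; the peer hostname and the disk-check script are placeholders for the real tests, e.g. the ping and shared-disk counter examples above):

```sh
#!/usr/bin/ksh
# sketch: only treat the peer as dead when EVERY heartbeat channel is silent

PEER=node2.example.com                                             # placeholder peer name

check_net_heartbeat()  { ping -c 3 "$PEER" >/dev/null 2>&1; }
check_disk_heartbeat() { /usr/local/bin/check_hb_slot "$PEER"; }   # placeholder script

if ! check_net_heartbeat && ! check_disk_heartbeat; then
    echo "peer unreachable on all channels -- consider takeover"
else
    echo "at least one channel alive -- do NOT take over (probably just a network problem)"
fi
```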

The cluster state which has to be avoided at all costs is the "split brain" condition: both nodes thinking they are primary and that the other one is failing. To avoid this you need some means of shutting down a node as fast as possible. shutdown will be too slow, halt -q will be better, and something like cat /etc/hosts > /dev/kmem (not possible any more since AIX 5.3 ML 1) would be best (fastest). Because you need to be able to trigger it from outside, HACMP has the DMS (dead-man switch), a kernel extension which takes down the system really fast under certain conditions. While most of HACMP consists of scripts calling other scripts, this part is kernel software. You will have to create such a thing too.
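
A script-level approximation (nowhere near as fast or reliable as a real kernel-based dead-man switch, and only an illustration; the heartbeat-update command is a placeholder) could be as small as:

```sh
#!/usr/bin/ksh
# sketch of a poor man's dead-man switch: if this node can no longer update
# its own heartbeat slot on the shared disk, it halts itself as fast as
# possible so it cannot cause a split brain.
# /usr/local/bin/update_hb_slot is a placeholder for the real write test.

if ! /usr/local/bin/update_hb_slot; then
    # halt -q brings the system down immediately, without a clean shutdown
    halt -q
fi
```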

So far, off the top of my head. There is probably much more to say than what came to my mind right now, so just ask. I suggest reading the IBM Redbooks about HACMP. Implementing cluster software is a laudable effort, because even if you fail you will get to appreciate the problems it poses. And if you succeed, all the better.

I hope this helps.

bakunin
