HACMP

allwin · July 23, 2012, 3:55pm

Does anyone know what the IBM standard HACMP trip interval is?

Ours is set to 20 seconds.

lssrc -ls topsvcs
Subsystem         Group            PID     Status
 topsvcs          topsvcs          1843200 active
Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 3 Current group: 0
Packets sent    : 2595710 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 3364963 ICMP 0 Dropped: 0
NIM's PID: 1781806
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent    : 774925 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 774962 ICMP 0 Dropped: 0
NIM's PID: 1908932
  2 locally connected Clients with PIDs:
haemd(1961990) hagsd(684036)
  Fast Failure Detection available but off.
  Dead Man Switch Enabled:
     reset interval = 1 seconds
     trip  interval = 20 seconds
  Client Heartbeating Disabled.
  Configuration Instance = 263
  Daemon employs no security
  Segments pinned: Text Data.
  Text segment size: 900 KB. Static data segment size: 1493 KB.
  Dynamic data segment size: 5249. Number of outstanding malloc: 158
  User time 135 sec. System time 152 sec.
  Number of page faults: 182. Process swapped out 0 times.
  Number of nodes up: 2. Number of nodes down: 0.
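
For anyone who wants to pull the relevant numbers out of output like the above, here is a minimal sketch in Python. It assumes the line formats are exactly as pasted in this post; the regular expressions and the function name parse_topsvcs are made up for illustration and are not part of any IBM tool.

```python
import re

# Excerpt of the lssrc -ls topsvcs output pasted above (only the lines of interest).
SAMPLE = """\
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
     reset interval = 1 seconds
     trip  interval = 20 seconds
"""

# Assumed line formats, taken from the paste above rather than from any specification.
HB_RE = re.compile(r"HB Interval = ([\d.]+) secs\. Sensitivity = (\d+) missed beats")
TRIP_RE = re.compile(r"trip\s+interval = (\d+) seconds")


def parse_topsvcs(text):
    """Return ([(hb_interval_secs, sensitivity), ...], trip_interval_secs or None)."""
    networks = [(float(i), int(s)) for i, s in HB_RE.findall(text)]
    match = TRIP_RE.search(text)
    trip = int(match.group(1)) if match else None
    return networks, trip


networks, trip = parse_topsvcs(SAMPLE)
for interval, sensitivity in networks:
    print(f"HB interval {interval}s, sensitivity {sensitivity} missed beats")
print("DMS trip interval:", trip, "seconds")
```

On a live node the text would come from the real command output instead of the SAMPLE string.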

zaxxon · July 23, 2012, 4:25pm

Afaik 20 is the default.

allwin · July 23, 2012, 4:54pm

Thank you, I got that. But I want to know what the best practice or recommended value is.

zaxxon · July 24, 2012, 5:09am

Is there any need for you to tune it? I don't know of a best practice for it, but I would leave it at the default until there is an actual need to tune it.
Did you have any problems with the Dead Man Switch? Did it respond too fast, too late, or not at all?

allwin · July 24, 2012, 10:38am

A recent network change in our environment took more than 25 seconds, and it caused an issue in the cluster.

The secondary server took over leadership because it did not get any response from the primary.

But the primary was actually up and running. After the network was re-established it was able to get a response again, but some of the resources had already moved to the secondary.

So the primary went into a graceful shutdown.

zaxxon · July 24, 2012, 11:50am

At the IBM Info Center there are some related articles about problems with the DMS being triggered while everything is OK node-wise:
Help - AIX 6.1 Information Center: HACMP takeover issues

Check out especially:
Releasing large amounts of TCP traffic causes DMS timeout
Deadman switch causes a node failure
Deadman switch time to trigger

gts1999 · August 10, 2012, 6:47am

Do you have any non-IP networks set up in your HACMP configuration, such as a SAN heartbeat and/or a serial network heartbeat?

These would help to distinguish between a true network problem and a node-down problem.

bakunin · August 10, 2012, 8:57am

OK, but: how often do you change the network in your environment? If this was a one-time occasion it is probably best to leave it as it is. Like zaxxon said already (words to this effect): if it isn't broken, then don't fix it.

It is good practice to have at least one non-IP connection (i.e. disk heartbeat or even the old-style serial connection) to avoid this. But even if it happens: this is what HACMP is expected to do. If you really have a need to change this behavior, it raises the question of whether HACMP/PowerHA is the right tool for your needs.

The shutdown of the primary is also to be expected. You don't want to get a "split-brain condition", that is: two hosts both believing they are (or should be) primary. You simply power on your once-primary host, start the cluster services, do an "extended cluster verification" (if you are really paranoid, which is a good trait for a systems administrator) and, after successfully verifying the cluster, do a resource-group move. Then you have the status quo ante again.

Btw., according to this (rather ancient) HACMP 4.4 document the trip interval is for configuring the DMS (Dead-Man-Switch). If a cluster node doesn't answer for this amount of time (in seconds), the other node considers it to be dead and not only takes over the resources but will also initiate a shutdown if the node comes back.

Keeping this in mind, there can't be any sensible "default" to stick with. You will have to tune this to your needs. It will always be a trade-off between security (if a node becomes unresponsive you want to take it over as fast as possible) and serviceability (you don't want unnecessary takeovers). What your most sensible trade-off value should be can't be determined without extensive knowledge of your site's necessities and the intricacies of your environment.
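
To make the trade-off a bit more tangible, here is a rough sketch in Python. It assumes the rule of thumb from the HACMP documentation that failure detection time is roughly heartbeat interval x sensitivity x 2; treat that factor as an assumption and check it against the documentation for your release. The candidate settings and the function names are hypothetical; only the 25-second outage and the 1 s / 10 beats and 2 s / 5 beats pairs come from this thread.

```python
def detection_time(hb_interval_secs, sensitivity, factor=2):
    """Approximate seconds until a silent peer is declared failed.

    The factor of 2 is an assumed rule of thumb, not something read
    from the cluster itself.
    """
    return hb_interval_secs * sensitivity * factor


def rides_out(outage_secs, hb_interval_secs, sensitivity):
    """True if an outage of this length ends before failure detection fires."""
    return outage_secs < detection_time(hb_interval_secs, sensitivity)


OUTAGE_SECS = 25  # length of the network change reported earlier in the thread

# (heartbeat interval in seconds, sensitivity in missed beats).
# The first two pairs are from the posted output; the third is hypothetical.
candidates = [(1.0, 10), (2.0, 5), (2.0, 12)]

for interval, sensitivity in candidates:
    secs = detection_time(interval, sensitivity)
    verdict = "no takeover" if rides_out(OUTAGE_SECS, interval, sensitivity) else "takeover"
    print(f"{interval}s x {sensitivity} beats -> about {secs:.0f}s detection: {verdict}")
```

The longer the detection time, the more short network blips are tolerated, but the longer a really dead node goes unnoticed. That is exactly the security versus serviceability decision described above.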

You're officially YOYO (you're on your own).

I hope this helps.

bakunin