Does anyone know what the IBM standard HACMP trip interval is?
We have 20 seconds.
lssrc -ls topsvcs
Subsystem Group PID Status
topsvcs topsvcs 1843200 active
Network Name Indx Defd Mbrs St Adapter ID Group ID
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
Missed HBs: Total: 3 Current group: 0
Packets sent : 2595710 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 3364963 ICMP 0 Dropped: 0
NIM's PID: 1781806
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent : 774925 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 774962 ICMP 0 Dropped: 0
NIM's PID: 1908932
2 locally connected Clients with PIDs:
haemd(1961990) hagsd(684036)
Fast Failure Detection available but off.
Dead Man Switch Enabled:
reset interval = 1 seconds
trip interval = 20 seconds
Client Heartbeating Disabled.
Configuration Instance = 263
Daemon employs no security
Segments pinned: Text Data.
Text segment size: 900 KB. Static data segment size: 1493 KB.
Dynamic data segment size: 5249. Number of outstanding malloc: 158
User time 135 sec. System time 152 sec.
Number of page faults: 182. Process swapped out 0 times.
Number of nodes up: 2. Number of nodes down: 0.
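A side note on reading the numbers above: a commonly cited rule of thumb for topology services is that the failure detection time of a network is roughly heartbeat interval x sensitivity x 2. That formula is an assumption here, not something the output itself states, so check it against the documentation for your release. A minimal Python sketch that applies it to the two "HB Interval" lines and pulls out the trip interval for comparison (the function names are made up for this example):

```python
import re

# Excerpt of the `lssrc -ls topsvcs` output shown above.
SAMPLE = """\
HB Interval = 1.000 secs. Sensitivity = 10 missed beats
HB Interval = 2.000 secs. Sensitivity = 5 missed beats
trip interval = 20 seconds
"""

HB_RE = re.compile(r"HB Interval = ([\d.]+) secs\. Sensitivity = (\d+) missed beats")
TRIP_RE = re.compile(r"trip interval = (\d+) seconds")


def detection_times(text):
    """Return interval * sensitivity * 2 for every heartbeat line.

    ASSUMPTION: the factor of 2 is the commonly cited rule of thumb,
    not a value read from the output.
    """
    return [float(interval) * int(sens) * 2 for interval, sens in HB_RE.findall(text)]


def trip_interval(text):
    """Return the reported Dead Man Switch trip interval in seconds, or None."""
    match = TRIP_RE.search(text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print("detection times (s):", detection_times(SAMPLE))
    print("trip interval (s):  ", trip_interval(SAMPLE))
```

Under that assumption both networks come out at 20 seconds, the same as the reported trip interval, which suggests the 20 seconds simply follow from the heartbeat settings rather than being tuned separately.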
Is there any need for you to tune it? I don't know of a best practice for it, but I would leave it at the default until there is an actual need to tune it.
Did you have any problems with the Dead Man Switch? Responding too fast or too late or not at all?
OK, but: how often do you change the network in your environment? If this was a one-time occasion, it is probably best to leave it as it is. As zaxxon already said (in words to this effect): if it isn't broken, then don't fix it.
It is good practice to have at least one non-IP connection (e.g. disk heartbeat or even the old-style serial connection) to avoid this. But even if it happens: this is what HACMP is expected to do. If you really need to change this behavior, that raises the question of whether HACMP/PowerHA is the right tool for your needs.
This is also to be expected. You don't want to get a "split-brain condition", that is: two hosts both believing they are (or should be) primary. You simply power on your once-primary host, start the cluster services, do an "extended cluster verification" (if you are really paranoid, which is a good trait for a systems administrator) and, after successfully verifying the cluster, do a resource-group move. Then you have the status quo ante again.
Btw., according to this (rather ancient) HACMP 4.4 document, the trip interval is for configuring the DMS (Dead Man Switch). If a cluster node doesn't answer for this amount of time (in seconds), the other node considers it dead and not only takes over the resources but will also initiate a shutdown if the node comes back.
Keeping this in mind, there can't be any sensible "default" to stick with. You will have to tune this to your needs. It will always be a trade-off between security (if a node becomes unresponsive, you want to take it over as fast as possible) and serviceability (you don't want unnecessary takeovers). What your most sensible trade-off value should be can't be determined without extensive knowledge of your site's necessities and the intricacies of your environment.
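To make the trade-off a bit more tangible, here is a toy Python model of a generic watchdog timer: something resets the timer periodically, and the switch trips if no reset arrives within the trip interval. This is only a sketch of the general idea, not how HACMP implements the DMS, and the function name `first_trip` and the stall scenario are invented for the example. Only the 1 second reset interval and the 20 second trip interval come from the output above.

```python
def first_trip(reset_times, trip_interval, horizon):
    """Return the time (s) at which the watchdog trips, or None if it never does.

    reset_times   -- sorted times at which the timer was reset
    trip_interval -- seconds allowed between resets before the switch trips
    horizon       -- end of the observed period
    """
    deadline = trip_interval  # timer is armed at t = 0
    for t in reset_times:
        if t > deadline:      # the reset came too late, the switch already tripped
            return deadline
        deadline = t + trip_interval
    # No further resets: the switch trips if the deadline falls inside the period.
    return deadline if deadline < horizon else None


# Resets every second until t = 10, then a 25 second stall, then resets resume.
stall = list(range(1, 11)) + list(range(35, 41))
# Resets every second until t = 10, then the node hangs for good.
hang = list(range(1, 11))

if __name__ == "__main__":
    for trip in (20, 30):
        print(f"trip interval {trip:2d}s: "
              f"stall -> {first_trip(stall, trip, 40)}, "
              f"hang -> {first_trip(hang, trip, 100)}")
```

In this model a 20 second trip interval trips during the temporary stall (an unnecessary takeover), while 30 seconds rides it out. The price is that a real hang is noticed 10 seconds later.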