Linux heartbeat on Red Hat 4: node dead

amrita_garg · April 23, 2009, 4:54am

Hi,
I have started heartbeat on two Red Hat servers, using eth0.
Before I start heartbeat, the two servers can ping each other.
Once I start heartbeat, both servers become active, because each one logs a warning that the other node is dead.
The servers also can no longer ping each other. After I stop heartbeat, ping works again.

My configuration files are as follows.
For server1 (192.168.10.43; its router is 192.168.10.5):
haresources
-----------
ems1 192.168.20.163/24/etho0 netc
ha.cf
------
# /etc/ha.d/ha.cf

# File to write debug messages to
debugfile /var/log/ha-debug

# File to write log messages to
logfile /var/log/ha-log

# Facility to use for syslog()/logger
logfacility local0

# Heartbeat timers
# Refer to the heartbeat FAQ for how to use these timers.
keepalive 2
deadtime 30
warntime 10
initdead 30

# UDP port used for bcast/ucast communication!
udpport 694

# Interface to broadcast heartbeats over
bcast eth0

# Specify the eth1's IP Address of the other machine EMS server
ucast eth0 192.168.20.162

# Enable automatic failback.
auto_failback on

# Node name in the cluster. Node name must match uname -n
node ems1
node ems2

# Enter a reliable IP address, for example the IP address of the router.
ping 192.168.10.5

# Less common options.
apiauth ipfail uid=hacluster
apiauth ccm uid=hacluster
apiauth ping gid=haclient uid=root
apiauth default gid=haclient
msgfmt netstring
For server2 (192.168.20.162; its router is 192.168.20.1):
haresources
-----------
ems1 192.168.20.163/24/eth0 netc
ha.cf
------
# /etc/ha.d/ha.cf

# File to write debug messages to
debugfile /var/log/ha-debug

# File to write log messages to
logfile /var/log/ha-log

# Facility to use for syslog()/logger
logfacility local0

# Heartbeat timers
# Refer to the heartbeat FAQ for how to use these timers.
keepalive 2
deadtime 30
warntime 10
initdead 30

# UDP port used for bcast/ucast communication!
udpport 694

# Interface to broadcast heartbeats over
bcast eth0

# Specify the eth1's IP Address of the other machine EMS server
ucast eth0 192.168.10.43

# Enable automatic failback.
auto_failback on

# Node name in the cluster. Node name must match uname -n
node ems1
node ems2

# Enter a reliable IP address, for example the IP address of the router.
ping 192.168.20.1

# Less common options.
apiauth ipfail uid=hacluster
apiauth ccm uid=hacluster
apiauth ping gid=haclient uid=root
apiauth default gid=haclient
msgfmt netstring
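
One thing that may be worth checking in the two configurations above is the addressing. The sketch below uses Python's standard ipaddress module to compare the addresses quoted in this post. The /24 prefix on the two physical interfaces is an assumption (the post only gives /24 for the virtual IP 192.168.20.163), and the helper names same_subnet and covers are made up for illustration. If that assumption holds, the two servers sit on different subnets, and the virtual IP's network also contains server2's address; whether that explains the lost pings is not confirmed here.

```python
import ipaddress

# Addresses copied from the post. The /24 prefix on the physical
# interfaces is an ASSUMPTION; only the virtual IP is given as /24.
server1 = ipaddress.ip_interface("192.168.10.43/24")
server2 = ipaddress.ip_interface("192.168.20.162/24")
virtual_ip = ipaddress.ip_interface("192.168.20.163/24")


def same_subnet(a, b):
    """Return True if both interfaces belong to the same IP network."""
    return a.network == b.network


def covers(iface, address):
    """Return True if `address` falls inside the network of `iface`."""
    return ipaddress.ip_address(address) in iface.network


print("server1 and server2 share a subnet:", same_subnet(server1, server2))
print("virtual IP shares server1's subnet:", same_subnet(server1, virtual_ip))
print("virtual IP shares server2's subnet:", same_subnet(server2, virtual_ip))
print("virtual IP network covers server2:", covers(virtual_ip, "192.168.20.162"))
```

If the real netmasks differ from /24, change the three ip_interface strings and rerun the checks.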

And here are the logs for server1:
heartbeat: 2009/04/23_04:52:28 info: Configuration validated. Starting heartbeat 1.2.3.cvs.20050404
heartbeat: 2009/04/23_04:52:28 info: heartbeat: version 1.2.3.cvs.20050404
heartbeat: 2009/04/23_04:52:28 info: Heartbeat generation: 16
heartbeat: 2009/04/23_04:52:28 info: UDP Broadcast heartbeat started on port 694 (694) interface eth0
heartbeat: 2009/04/23_04:52:28 info: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
heartbeat: 2009/04/23_04:52:28 info: ucast: bound send socket to device: eth0
heartbeat: 2009/04/23_04:52:28 info: ucast: bound receive socket to device: eth0
heartbeat: 2009/04/23_04:52:28 info: ucast: started on port 694 interface eth0 to 192.168.20.162
heartbeat: 2009/04/23_04:52:28 info: ping heartbeat started.
heartbeat: 2009/04/23_04:52:28 info: pid 18143 locked in memory.
heartbeat: 2009/04/23_04:52:28 info: Local status now set to: 'up'
heartbeat: 2009/04/23_04:52:29 info: pid 18146 locked in memory.
heartbeat: 2009/04/23_04:52:29 info: pid 18152 locked in memory.
heartbeat: 2009/04/23_04:52:29 info: pid 18148 locked in memory.
heartbeat: 2009/04/23_04:52:29 info: pid 18151 locked in memory.
heartbeat: 2009/04/23_04:52:29 info: pid 18150 locked in memory.
heartbeat: 2009/04/23_04:52:29 info: pid 18147 locked in memory.
heartbeat: 2009/04/23_04:52:29 info: Link ems1:eth0 up.
heartbeat: 2009/04/23_04:52:29 info: pid 18149 locked in memory.
heartbeat: 2009/04/23_04:52:31 info: Link ems2:eth0 up.
heartbeat: 2009/04/23_04:52:31 info: Status update for node ems2: status up
heartbeat: 2009/04/23_04:52:31 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2009/04/23_04:53:00 WARN: node 192.168.10.5: is dead
heartbeat: 2009/04/23_04:53:00 info: Local status now set to: 'active'
heartbeat: 2009/04/23_04:53:00 info: Status update for node ems2: status active
heartbeat: 2009/04/23_04:53:00 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2009/04/23_04:53:00 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2009/04/23_04:53:12 info: local resource transition completed.
heartbeat: 2009/04/23_04:53:12 info: Initial resource acquisition complete (T_RESOURCES(us))
heartbeat: 2009/04/23_04:53:12 info: remote resource transition completed.
heartbeat: 2009/04/23_04:53:12 info: Local Resource acquisition completed.
heartbeat: 2009/04/23_04:53:12 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
heartbeat: 2009/04/23_04:53:12 received ip-request-resp 192.168.20.163/24/eth0/192.168.20.255 OK yes
heartbeat: 2009/04/23_04:53:12 info: Acquiring resource group: ems1 192.168.20.163/24/eth0/192.168.20.255 netc
heartbeat: 2009/04/23_04:53:13 info: Running /etc/ha.d/resource.d/IPaddr 192.168.20.163/24/eth0/192.168.20.255 start
heartbeat: 2009/04/23_04:53:13 info: /sbin/ifconfig eth0:0 192.168.20.163 netmask 255.255.255.0 broadcast 192.168.20.255
heartbeat: 2009/04/23_04:53:13 info: Sending Gratuitous Arp for 192.168.20.163 on eth0:0 [eth0]
heartbeat: 2009/04/23_04:53:13 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-192.168.20.163 eth0 192.168.20.163 auto 192.168.20.163 ffffffffffff
heartbeat: 2009/04/23_04:53:13 info: Running /etc/init.d/netc start
heartbeat: 2009/04/23_04:53:46 ERROR: Both machines own our resources!
heartbeat: 2009/04/23_04:53:47 ERROR: Both machines own our resources!

I also don't understand why I am getting these errors at the end: ERROR: Both machines own our resources!

Thanks,
Amrita

amrita_garg · April 23, 2009, 4:57am

Hi,
I have also added the logs for server2:
heartbeat: 2009/04/23_03:54:47 info: Configuration validated. Starting heartbeat 1.2.3.cvs.20050404
heartbeat: 2009/04/23_03:54:47 info: heartbeat: version 1.2.3.cvs.20050404
heartbeat: 2009/04/23_03:54:47 info: Heartbeat generation: 14
heartbeat: 2009/04/23_03:54:47 info: UDP Broadcast heartbeat started on port 694 (694) interface eth0
heartbeat: 2009/04/23_03:54:47 info: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
heartbeat: 2009/04/23_03:54:47 info: ucast: bound send socket to device: eth0
heartbeat: 2009/04/23_03:54:47 info: ucast: bound receive socket to device: eth0
heartbeat: 2009/04/23_03:54:47 info: ucast: started on port 694 interface eth0 to 192.168.10.43
heartbeat: 2009/04/23_03:54:47 info: ping heartbeat started.
heartbeat: 2009/04/23_03:54:47 info: pid 13477 locked in memory.
heartbeat: 2009/04/23_03:54:47 info: Local status now set to: 'up'
heartbeat: 2009/04/23_03:54:48 info: pid 13480 locked in memory.
heartbeat: 2009/04/23_03:54:48 info: pid 13483 locked in memory.
heartbeat: 2009/04/23_03:54:48 info: pid 13482 locked in memory.
heartbeat: 2009/04/23_03:54:48 info: pid 13484 locked in memory.
heartbeat: 2009/04/23_03:54:48 info: pid 13481 locked in memory.
heartbeat: 2009/04/23_03:54:48 info: Link ems1:eth0 up.
heartbeat: 2009/04/23_03:54:48 info: Status update for node ems1: status up
heartbeat: 2009/04/23_03:54:48 info: Link ems2:eth0 up.
heartbeat: 2009/04/23_03:54:48 info: pid 13485 locked in memory.
heartbeat: 2009/04/23_03:54:48 info: pid 13486 locked in memory.
heartbeat: 2009/04/23_03:54:48 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2009/04/23_03:55:17 WARN: node 192.168.20.1: is dead
heartbeat: 2009/04/23_03:55:17 info: Local status now set to: 'active'
heartbeat: 2009/04/23_03:55:17 info: Status update for node ems1: status active
heartbeat: 2009/04/23_03:55:17 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2009/04/23_03:55:17 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2009/04/23_03:55:29 info: local resource transition completed.
heartbeat: 2009/04/23_03:55:29 info: Initial resource acquisition complete (T_RESOURCES(us))
heartbeat: 2009/04/23_03:55:29 info: remote resource transition completed.
heartbeat: 2009/04/23_03:55:29 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys ems2] to acquire.
heartbeat: 2009/04/23_03:56:03 WARN: node ems1: is dead
heartbeat: 2009/04/23_03:56:03 WARN: No STONITH device configured.
heartbeat: 2009/04/23_03:56:03 WARN: Shared disks are not protected.
heartbeat: 2009/04/23_03:56:03 info: Resources being acquired from ems1.
heartbeat: 2009/04/23_03:56:03 info: Link ems1:eth0 dead.
heartbeat: 2009/04/23_03:56:03 info: Running /etc/ha.d/rc.d/status status
heartbeat: 2009/04/23_03:56:03 info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys ems2] to acquire.
heartbeat: 2009/04/23_03:56:03 info: Taking over resource group 192.168.20.163/24/eth0
heartbeat: 2009/04/23_03:56:03 info: Acquiring resource group: ems1 192.168.20.163/24/eth0 netc
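
To line the warnings up against the timers in ha.cf (deadtime 30, initdead 30), here is a minimal sketch that pulls the WARN and ERROR lines out of a heartbeat log and reports how many seconds after the first log line each one appeared. The regular expression is inferred from the log lines pasted in this thread, not taken from heartbeat's documentation, and the function names parse_line and problems are hypothetical. The sample holds three lines copied from the server2 log.

```python
import re
from datetime import datetime

# Pattern inferred from the log lines in this thread (an assumption):
# "heartbeat: YYYY/MM/DD_HH:MM:SS [info|WARN|ERROR: ]message"
LINE_RE = re.compile(
    r"^heartbeat: (?P<ts>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) "
    r"(?:(?P<level>info|WARN|ERROR): )?(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y/%m/%d_%H:%M:%S")
    return ts, m.group("level"), m.group("msg")


def problems(lines):
    """Return (seconds since first parsed line, level, message) for WARN/ERROR lines."""
    start = None
    found = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, level, msg = parsed
        if start is None:
            start = ts
        if level in ("WARN", "ERROR"):
            found.append((int((ts - start).total_seconds()), level, msg))
    return found


sample = [
    "heartbeat: 2009/04/23_03:54:47 info: Local status now set to: 'up'",
    "heartbeat: 2009/04/23_03:55:17 WARN: node 192.168.20.1: is dead",
    "heartbeat: 2009/04/23_03:56:03 WARN: node ems1: is dead",
]

for offset, level, msg in problems(sample):
    print(f"+{offset}s {level}: {msg}")
```

For a full log, something like problems(open("/var/log/ha-log")) should work, since that is the logfile path set in ha.cf above.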

So both servers are taking over the resources.
Can someone please help me understand why the servers stop seeing each other after heartbeat starts?
Thanks,
Amrita