Broken CentOS Cluster

Hi Guys,

Hopefully this is just a quick one - but you never know.

I have/had a CentOS cluster running a NetBackup server - I've had an outage and we seem to have lost a node. As a consequence I'm in a bit of a quandary, as I'm not very familiar with this software either.

The server is a Dell PowerEdge 1950 running CentOS 5.4 with kernel 2.6.18-164.11.1.el5PAE #1 SMP and, wait for it, a back-ported GFS for compatibility.

I've managed to get the system back and the GFS disk mounted by hacking the /etc/cluster/cluster.conf file as follows - the original file first:

<?xml version="1.0"?>
<cluster alias="scsymbak00" config_version="93" name="scsymbak00">
        <fence_daemon clean_start="0" post_fail_delay="1" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="scsymbak01.xxx.com" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="scsymbak01_drac"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="scsymbak02.xxx.com" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="scsymbak02_drac"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_drac" ipaddr="192.168.0.201" login="root" name="scsymbak01_drac" passwd="drut"/>
                <fencedevice agent="fence_drac" ipaddr="192.168.0.202" login="root" name="scsymbak02_drac" passwd="drut"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="scsymbak_fd" ordered="1" restricted="1">
                                <failoverdomainnode name="scsymbak01.xxx.com" priority="2"/>
                                <failoverdomainnode name="scsymbak02.xxx.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.143.252.200" monitor_link="1"/>
                        <script file="/etc/init.d/nbclient" name="nbclient_init"/>
                        <script file="/etc/init.d/netbackup" name="netbackup_init"/>
                        <clusterfs device="/dev/mapper/VolGroup10-DATA" force_unmount="1" fsid="41517" fstype="gfs2" mountpoint="/data" name="symbak_GFS"/>
                        <lvm lv_name="DATA" name="VolGroup10_DATA_CLVM2" vg_name="VolGroup10"/>
                        <script file="/etc/init.d/xinetd" name="xinetd_init"/>
                        <script file="/etc/init.d/vxpbx_exchanged" name="vxpbx_init"/>
                        <ip address="10.143.224.200" monitor_link="1"/>
                        <ip address="10.143.226.200" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="scsymbak_fd" exclusive="0" name="netbackup_srv" recovery="restart">
                        <ip ref="10.143.224.200"/>
                        <ip ref="10.143.226.200"/>
                        <ip ref="10.143.252.200"/>
                        <script ref="vxpbx_init"/>
                        <script ref="xinetd_init"/>
                </service>
        </rm>
</cluster>

This was changed to:

<?xml version="1.0"?>
<cluster alias="scsymbak00" config_version="93" name="scsymbak00">
        <fence_daemon clean_start="0" post_fail_delay="1" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="scsymbak02.xxx.com" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="scsymbak02_drac"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="0"/>
        <fencedevices>
                <fencedevice agent="fence_drac" ipaddr="192.168.0.202" login="root" name="scsymbak02_drac" passwd="drut"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="scsymbak_fd" ordered="1" restricted="1">
                                <failoverdomainnode name="scsymbak02.xxx.com" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="10.143.252.200" monitor_link="1"/>
                        <script file="/etc/init.d/nbclient" name="nbclient_init"/>
                        <script file="/etc/init.d/netbackup" name="netbackup_init"/>
                        <clusterfs device="/dev/mapper/VolGroup10-DATA" force_unmount="1" fsid="41517" fstype="gfs2" mountpoint="/data" name="symbak_GFS"/>
                        <lvm lv_name="DATA" name="VolGroup10_DATA_CLVM2" vg_name="VolGroup10"/>
                        <script file="/etc/init.d/xinetd" name="xinetd_init"/>
                        <script file="/etc/init.d/vxpbx_exchanged" name="vxpbx_init"/>
                        <ip address="10.143.224.200" monitor_link="1"/>
                        <ip address="10.143.226.200" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="scsymbak_fd" exclusive="0" name="netbackup_srv" recovery="restart">
                        <ip ref="10.143.224.200"/>
                        <ip ref="10.143.226.200"/>
                        <ip ref="10.143.252.200"/>
                        <script ref="vxpbx_init"/>
                        <script ref="xinetd_init"/>
                </service>
        </rm>
</cluster>
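
One thing I wasn't sure about: I left config_version at 93 even though the file has changed. My reading (possibly wrong) is that bumping it and pushing the update only matters when you're propagating a new config to a running multi-node cluster, along the lines of the commands below, and that a plain restart of cman on a single node just re-reads the file anyway - is that right?

# Presumably only needed when pushing a new config to a running cluster,
# not for a single-node restart - happy to be corrected.
vi /etc/cluster/cluster.conf                  # bump config_version to 94
ccs_tool update /etc/cluster/cluster.conf     # push the updated config via ccsd
cman_tool version -r 94                       # tell cman about the new version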

When I run clustat I see the following:

Cluster Status for scsymbak00 @ Mon Nov 17 16:55:28 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 scsymbak02.xxx.com                                            1 Online, Local, rgmanager

 Service Name                                                     Owner (Last)                                                     State
 ------- ----                                                     ----- ------                                                     -----
 service:netbackup_srv                                            (none)                                                           stopped

Although the disks have come back, the cluster doesn't seem to be fully up - is there anything else that I should be looking at? The networking hasn't started properly either, as I'm not seeing the clustered IPs, so here is the output of ifconfig:

[root@scsymbak02 cluster]# ifconfig -a
bond0     Link encap:Ethernet  HWaddr 00:1E:C9:AB:BB:11
          inet addr:10.143.252.202  Bcast:10.143.253.255  Mask:255.255.254.0
          inet6 addr: fe80::21e:c9ff:feab:bb11/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:41863 errors:0 dropped:22325 overruns:0 frame:0
          TX packets:47278 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3704818 (3.5 MiB)  TX bytes:32203049 (30.7 MiB)

bond0:1   Link encap:Ethernet  HWaddr 00:1E:C9:AB:BB:11
          inet addr:192.168.0.102  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bond1     Link encap:Ethernet  HWaddr 00:1B:21:18:29:68
          inet addr:10.143.224.202  Bcast:10.143.225.255  Mask:255.255.254.0
          inet6 addr: fe80::21b:21ff:fe18:2968/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:9768 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14251 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:694596 (678.3 KiB)  TX bytes:847554 (827.6 KiB)

bond2     Link encap:Ethernet  HWaddr 00:1B:21:18:29:69
          inet addr:10.143.226.202  Bcast:10.143.227.255  Mask:255.255.254.0
          UP BROADCAST MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

eth0      Link encap:Ethernet  HWaddr 00:1E:C9:AB:BB:11
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:30568 errors:0 dropped:11059 overruns:0 frame:0
          TX packets:21803 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2808757 (2.6 MiB)  TX bytes:11663865 (11.1 MiB)
          Interrupt:177 Memory:f8000000-f8012800

eth1      Link encap:Ethernet  HWaddr 00:1E:C9:AB:BB:13
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:11295 errors:0 dropped:11266 overruns:0 frame:0
          TX packets:25475 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:896061 (875.0 KiB)  TX bytes:20539184 (19.5 MiB)
          Interrupt:169 Memory:f4000000-f4012800

eth2      Link encap:Ethernet  HWaddr 00:1B:21:18:29:68
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:4758 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7181 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:339738 (331.7 KiB)  TX bytes:426270 (416.2 KiB)
          Memory:fd2e0000-fd300000

eth3      Link encap:Ethernet  HWaddr 00:1B:21:18:29:69
          UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:fd2a0000-fd2c0000

eth4      Link encap:Ethernet  HWaddr 00:1B:21:18:29:6C
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:5010 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7070 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:354858 (346.5 KiB)  TX bytes:421284 (411.4 KiB)
          Memory:fcce0000-fcd00000

eth5      Link encap:Ethernet  HWaddr 00:1B:21:18:29:6D
          UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:fcca0000-fccc0000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:7247 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7247 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:971488 (948.7 KiB)  TX bytes:971488 (948.7 KiB)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
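
In the meantime I was going to gather a bit more state with the commands below - I'm assuming these are the right places to look (taken from the cman/rgmanager man pages, so apologies if any of them are off the mark):

cman_tool status              # quorum, votes and expected votes
cman_tool services            # state of the fence, dlm and gfs groups
clustat                       # rgmanager's view of members and services
tail -f /var/log/messages     # watch for fenced / rgmanager errors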

So I guess the question is: what should I start with? If I want to boot this cluster as a single node, how should I go about it? Are there any other changes that I should make to the cluster.conf file, or any other files that I should be changing as well? Any help here would be really appreciated.
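
For what it's worth, this is roughly the single-node start sequence I had in mind - the service names are just the stock CentOS 5 Cluster Suite init scripts, so please shout if the order (or anything else) is wrong:

# Rough single-node start order I was planning to try - unverified.
service cman start            # membership/quorum (expected_votes=1, two_node=0 now)
service clvmd start           # activate the clustered VG (skip if it's plain HA-LVM?)
service gfs2 start            # mount the GFS2 filesystem on /data
service rgmanager start       # resource manager
clusvcadm -e netbackup_srv    # then try to enable the service so the VIPs come up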

Unfortunately I have a dentist's appointment, but I'll be back online a little later, although out of the office. If there are any other files I have to look at or change, I'll be doing that first thing in the morning.

Regards

Dave