Clstat not working in a HACMP 7.1.3 cluster

bakunin · October 3, 2014, 11:47am

I am having trouble getting clstat to work. All the "usual suspects" have been covered, but still no luck. The topology is a two-node active/passive cluster with only one network interface (it is a test setup). The application running is SAP with DB2 as its database. We do not use Smart Assists or other gadgets.

Here are the OS and HACMP-versions:

# oslevel -s
7100-03-02-1412

# lslpp -L "cluster*"
  Fileset                      Level  State  Type  Description (Uninstaller)
  ----------------------------------------------------------------------------
  cluster.adt.es.client.include
                             7.1.3.1    C     F    PowerHA SystemMirror Client
                                                   Include Files
  cluster.adt.es.client.samples.clinfo
                             7.1.3.0    C     F    PowerHA SystemMirror Client
                                                   CLINFO Samples 
  cluster.es.client.clcomd   7.1.3.1    C     F    Cluster Communication
                                                   Infrastructure
  cluster.es.client.lib      7.1.3.1    C     F    PowerHA SystemMirror Client
                                                   Libraries
  cluster.es.client.rte      7.1.3.1    C     F    PowerHA SystemMirror Client
                                                   Runtime
  cluster.es.client.utils    7.1.3.0    C     F    PowerHA SystemMirror Client
                                                   Utilities 
  cluster.es.cspoc.cmds      7.1.3.1    C     F    CSPOC Commands
  cluster.es.cspoc.rte       7.1.3.1    C     F    CSPOC Runtime Commands
  cluster.es.migcheck        7.1.3.0    C     F    PowerHA SystemMirror Migration
                                                   support 
  cluster.es.nfs.rte         7.1.3.0    C     F    NFS Support 
  cluster.es.server.diag     7.1.3.1    C     F    Server Diags
  cluster.es.server.events   7.1.3.1    C     F    Server Events
  cluster.es.server.rte      7.1.3.1    C     F    Base Server Runtime
  cluster.es.server.testtool
                             7.1.3.0    C     F    Cluster Test Tool 
  cluster.es.server.utils    7.1.3.1    C     F    Server Utilities
  cluster.license            7.1.3.0    C     F    PowerHA SystemMirror
                                                   Electronic License 
  cluster.man.en_US.es.data  7.1.3.1    C     F    Man Pages - U.S. English

cldump works, and all other cluster services are working as expected too. Alas, calling clstat fails:

# clstat -a
Failed retrieving cluster information.

There are a number of possible causes:
clinfoES or snmpd subsystems are not active.
snmp is unresponsive.
snmp is not configured correctly.
Cluster services are not active on any nodes.

Refer to the HACMP Administration Guide for more information.

I followed this procedure and double-checked everything mentioned there:

# tail -3 /etc/snmpdv3.conf
smux            1.3.6.1.4.1.2.3.1.2.1.2         gated_password
VACM_VIEW defaultView        1.3.6.1.4.1.2.3.1.2.1.5    - included -
smux     1.3.6.1.4.1.2.3.1.2.1.5      clsmuxpd_password ::1 128
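
The two HACMP-related entries shown above can be checked mechanically. This is only a minimal sketch: it assumes that the VACM_VIEW and smux lines for OID 1.3.6.1.4.1.2.3.1.2.1.5 (as in the tail output) are the entries the procedure asks for, and check_snmpdv3_conf is a made-up helper name, not part of PowerHA.

```shell
# Hypothetical helper: report whether a snmpdv3.conf-style file contains
# an active VACM_VIEW line and an active smux line for the HACMP OID.
# Returns 0 if both are present, 1 otherwise.
check_snmpdv3_conf() {
    awk -v oid="1.3.6.1.4.1.2.3.1.2.1.5" '
        # Commented-out lines never match, because their first field starts with "#".
        $1 == "VACM_VIEW" && $3 == oid { view = 1 }
        $1 == "smux"      && $2 == oid { smux = 1 }
        END {
            if (!view) print "missing VACM_VIEW entry for " oid
            if (!smux) print "missing smux entry for " oid
            exit !(view && smux)
        }
    ' "$1"
}
```

It would be called as check_snmpdv3_conf /etc/snmpdv3.conf; no output and exit status 0 mean both entries were found.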

# snmpinfo -m dump -v -o /usr/es/sbin/cluster/hacmp.defs cluster  
clusterId.0 = 1560242040
clusterName.0 = "<mycluster>"
clusterConfiguration.0 = ""
clusterState.0 = 2
clusterPrimary.0 = 1
clusterLastChange.0 = 1412260986
clusterGmtOffset.0 = -3600
clusterSubState.0 = 32
clusterNodeName.0 = "<my-node-name-a>"
clusterPrimaryNodeName.0 = "<my-node-name-a>"
clusterNumNodes.0 = 2
clusterNodeId.0 = 1
clusterNumSites.0 = 0
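
For scripted checks, a single value can be pulled out of such a dump. A small sketch, assuming the "name = value" layout shown above; mib_value is a hypothetical helper name.

```shell
# Hypothetical helper: read "snmpinfo -m dump"-style output on stdin and
# print the value of one object, with surrounding double quotes removed.
mib_value() {
    awk -v key="$1" '
        $1 == key {
            sub(/^[^=]*= */, "")   # drop everything up to and including "= "
            gsub(/^"|"$/, "")      # strip enclosing quotes from string values
            print
            exit
        }
    '
}
```

Usage would look like: snmpinfo -m dump -v -o /usr/es/sbin/cluster/hacmp.defs cluster | mib_value clusterNumNodes.0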

I also made sure the services are up and snmpd is the correct one:

# lssrc -g cluster
Subsystem         Group            PID          Status 
 clstrmgrES       cluster          10027094     active
 clinfoES         cluster          18743412     active

# lssrc -a
 aixmibd          tcpip            27263194     active
 snmpmibd         tcpip            5046514      active
 hostmibd         tcpip            30802078     active
[...]
 snmpd            tcpip            24772704     active

# ls -l /usr/sbin/snmpd
lrwxrwxrwx    1 root     system            9 Feb  5 2014  /usr/sbin/snmpd -> snmpdv3ne
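
The subsystem check above can be scripted as well. A minimal sketch, assuming the lssrc layout shown in the listings (subsystem name in the first column, status in the last); inactive_subsystems is a hypothetical helper name.

```shell
# Hypothetical helper: read "lssrc -a"-style output on stdin and print
# every subsystem named on the command line that is not listed as active.
# Prints nothing when all of them are active.
inactive_subsystems() {
    awk -v want="$*" '
        BEGIN { n = split(want, names, " ") }
        $NF == "active" { up[$1] = 1 }
        END {
            for (i = 1; i <= n; i++)
                if (!(names[i] in up)) print names[i]
        }
    '
}
```

Usage would look like: lssrc -a | inactive_subsystems snmpd clinfoES clstrmgrES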

The IPv6 loopback addresses are present in /etc/hosts:

# head -2 /etc/hosts
127.0.0.1               loopback localhost      # loopback (lo0) name/address
::1                     loopback localhost      # IPv6 loopback (lo0) name/address

The cited document mentions removing the comments from /etc/snmpdv3.conf as a last-ditch effort, which I did. The services were restarted as described there, and finally the whole system was rebooted. I also ran a cluster verification and synchronisation (several times, in fact, before and after the reboot).

To be honest, I am out of ideas about what else I could do.

bakunin

jim_mcnamara · October 3, 2014, 2:06pm

I know nothing about AIX, but if the implementation of the SNMP protocol is anything like elsewhere (so there may be some huge faults in my understanding), consider:

Are there required MIB lists missing as a startup parameter for snmpd?

'Failed to retrieve' can alternatively be read as 'do not know how'; MIB lists provide the know-how. Or it can mean 'permission denied'. So I assume the permission strings have not been changed from the defaults. And are your UDP stack and ports all up correctly?

bakunin · October 3, 2014, 4:51pm

Thank you, Jim.

In fact, all the MIB settings are in place (this is basically what the mentioned entries in /etc/snmpdv3.conf do), and the quoted snmpinfo command shows that SNMP is up and working as expected. I could have started snmpwalk instead (and in fact I have), and it shows the whole MIB tree for HACMP being in place. The listing is quite long, so I didn't post it, but it is there.

In addition, if SNMP were not configured correctly with respect to HACMP, then cldump should not work either, but it does. This is why I believe that SNMP is not the problem here. It is, however, the common culprit when clstat is not working, so I posted the respective info up front.

bakunin

igalvarez · October 27, 2014, 10:32am

Hi, did you solve the issue?

If not, which level of AIX do you have?

It's a 'forever' old issue on HACMP/PowerHA.

Did you check on the support site whether there is an efix?
http://www14.software.ibm.com/webapp/set2/psearch/search?domain=power&new=y&os=aix

I remember we solved this issue with an APAR.

bakunin · October 27, 2014, 3:59pm

As it is, no. This is a test cluster for the latest AIX/PowerHA version and its integration with SAP.

See post #1, the output of the "oslevel" command.

Not to my knowledge. I have about 50 other clusters in my environment (mostly HACMP 6.x and 5.x, but also a few on 7.x; OS versions range from 6.1 to 7.1.3), and "clstat" is working on all of them. I usually check cluster status with "cldump", so I do not commonly use clstat, but I would like to understand why it is not working, just out of curiosity.

I would do so, but right now I do not even understand where the problem is. If I could point to a certain fileset as the culprit, I would try to get an update/efix/whatever or open a PMR, but I am not sure whether there is anything left I could do before that. There is no point in opening a software call only to learn "just do this, that and that to make it work as expected".

I would be thankful if you could tell me what the issue was, because right now I don't even understand where the problem is.

bakunin

igalvarez · October 29, 2014, 8:11am

Hi bakunin, sorry for the delay.

We have seen this error from time to time on our old AIX 6.1 (PowerHA 6.1 GLVM) clusters. Indeed, last week we had to upgrade nodes from AIX 6.1 TL6 to TL9 because of a problem with clstat/cldump. But this is not your problem.

The steps we use here for all PowerHA 6.1 clusters, which are surely the same as in your link above, are:

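#!/bin/ksh
#
# Stop the SNMP sub-agents first, then snmpd itself.
stopsrc -s hostmibd
stopsrc -s snmpmibd
stopsrc -s aixmibd
stopsrc -s snmpd
sleep 8
# Start snmpd first, then the sub-agents in reverse order.
startsrc -s snmpd
startsrc -s aixmibd
startsrc -s snmpmibd
startsrc -s hostmibd
# Give SNMP time to settle before restarting clinfoES.
sleep 120
stopsrc -s clinfoES
startsrc -s clinfoES
# Wait again so clinfoES has time to come up.
sleep 120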
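
As a variation on the fixed sleeps in the script above, one could poll for a condition with a timeout. This is only a sketch: wait_for is a hypothetical helper, and whether polling is sufficient here (rather than the full 120-second pause) is an assumption that has not been verified on a cluster.

```shell
# Hypothetical helper: run a command up to <tries> times, pausing <delay>
# seconds between attempts. Returns 0 as soon as the command succeeds,
# 1 if it never does.
wait_for() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        # Only sleep if another attempt is still to come.
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 1
}
```

For example, wait_for 24 5 sh -c 'lssrc -s snmpd | grep active >/dev/null' would wait up to roughly two minutes for snmpd to be reported as active.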

Really sorry I can not help in this case...

bakunin · December 3, 2014, 10:27am

Finally I found a "solution" to my problem: install an even newer version. It seems the version I used was somewhat differently abled, as I believe the politically correct euphemism for "buggy" is. (A big THANK YOU goes to IBM for letting me do the beta-testing of software I thought I had purchased. I only bought cluster software but got a built-in adventure game at no cost.)

Here is what I did: first, install the latest AIX release (AIX 7.1, TL3 SP4):

# lslpp -l bos.rte
  Fileset                      Level  State      Description         
  ----------------------------------------------------------------------------
Path: /usr/lib/objrepos
  bos.rte                   7.1.3.30  COMMITTED  Base Operating System Runtime

Path: /etc/objrepos
  bos.rte                   7.1.3.30  COMMITTED  Base Operating System Runtime

# oslevel -s
7100-03-04-1441

# instfix -i | grep SP
    All filesets for 71-00-011037_SP were found.
    All filesets for 71-00-021041_SP were found.
    All filesets for 71-00-031115_SP were found.
    All filesets for 71-01-011141_SP were found.
    All filesets for 71-00-041140_SP were found.
    All filesets for 71-01-021150_SP were found.
    All filesets for 71-01-031207_SP were found.
    All filesets for 71-00-051207_SP were found.
    All filesets for 71-01-041216_SP were found.
    All filesets for 71-00-061216_SP were found.
    All filesets for 71-01-051228_SP were found.
    All filesets for 71-00-071228_SP were found.
    All filesets for 71-02-011245_SP were found.
    All filesets for 71-00-081241_SP were found.
    All filesets for 71-01-061241_SP were found.
    All filesets for 71-02-021316_SP were found.
    All filesets for 71-00-091316_SP were found.
    All filesets for 71-01-071316_SP were found.
    All filesets for 71-00-101334_SP were found.
    All filesets for 71-01-081334_SP were found.
    All filesets for 71-02-031334_SP were found.
    All filesets for 71-01-091341_SP were found.
    All filesets for 71-02-041341_SP were found.
    All filesets for 71-03-011341_SP were found.
    All filesets for 71-03-021412_SP were found.
    All filesets for 71-01-101415_SP were found.
    All filesets for 71-02-051415_SP were found.
    All filesets for 71-03-031415_SP were found.
    All filesets for 71-02-061441_SP were found.
    All filesets for 71-03-041441_SP were found.

I did this on both nodes. I am not sure whether it was necessary, but together with the other changes (see below) it did the job. Next was to update the cluster software itself:

# lslpp -l | grep -i cluster
  bos.cluster.rte           7.1.3.30  COMMITTED  Cluster Aware AIX
  bos.cluster.solid         7.1.1.15  COMMITTED  POWER HA Business Resiliency
  cluster.adt.es.client.include
  cluster.adt.es.client.samples.clinfo
  cluster.es.client.clcomd   7.1.3.2  COMMITTED  Cluster Communication
  cluster.es.client.lib      7.1.3.2  COMMITTED  PowerHA SystemMirror Client
  cluster.es.client.rte      7.1.3.2  COMMITTED  PowerHA SystemMirror Client
  cluster.es.client.utils    7.1.3.1  COMMITTED  PowerHA SystemMirror Client
  cluster.es.cspoc.cmds      7.1.3.2  COMMITTED  CSPOC Commands
  cluster.es.cspoc.rte       7.1.3.2  COMMITTED  CSPOC Runtime Commands
  cluster.es.migcheck        7.1.3.0  COMMITTED  PowerHA SystemMirror Migration
  cluster.es.nfs.rte         7.1.3.1  COMMITTED  NFS Support
  cluster.es.server.diag     7.1.3.2  COMMITTED  Server Diags
  cluster.es.server.events   7.1.3.2  COMMITTED  Server Events
  cluster.es.server.rte      7.1.3.2  COMMITTED  Base Server Runtime
  cluster.es.server.testtool
                             7.1.3.0  COMMITTED  Cluster Test Tool 
  cluster.es.server.utils    7.1.3.2  COMMITTED  Server Utilities
  cluster.license            7.1.3.0  COMMITTED  PowerHA SystemMirror
  mcr.rte                   7.1.3.30  COMMITTED  Metacluster Checkpoint and
  bos.cluster.rte           7.1.3.30  COMMITTED  Cluster Aware AIX
  bos.cluster.solid          7.1.1.0  COMMITTED  Cluster Aware AIX SolidDB 
  cluster.es.client.clcomd   7.1.3.0  COMMITTED  Cluster Communication
  cluster.es.client.lib      7.1.3.2  COMMITTED  PowerHA SystemMirror Client
  cluster.es.client.rte      7.1.3.2  COMMITTED  PowerHA SystemMirror Client
  cluster.es.cspoc.rte       7.1.3.0  COMMITTED  CSPOC Runtime Commands 
  cluster.es.migcheck        7.1.3.0  COMMITTED  PowerHA SystemMirror Migration
  cluster.es.nfs.rte         7.1.3.1  COMMITTED  NFS Support
  cluster.es.server.diag     7.1.3.0  COMMITTED  Server Diags 
  cluster.es.server.events   7.1.3.0  COMMITTED  Server Events 
  cluster.es.server.rte      7.1.3.2  COMMITTED  Base Server Runtime
  cluster.es.server.utils    7.1.3.2  COMMITTED  Server Utilities
  mcr.rte                   7.1.3.30  COMMITTED  Metacluster Checkpoint and
  cluster.man.en_US.es.data  7.1.3.2  COMMITTED  Man Pages - U.S. English

After this (and, of course, a reboot) I did a final cluster verification, then started the cluster without any problems. SNMP (and, as far as I can see, everything else) was working as expected. All in all it took me about 25 minutes per node; most of it can be done in parallel if the cluster can be stopped. Plan about 30-60 minutes for the whole update if you have the resources ready on the NIM server and everything else working.

I hope this helps.

bakunin

pierrick · January 12, 2015, 12:33pm

Hi bakunin,

Thank you for this interesting thread.
The same problem occurs on AIX 7.1 TL3 and also AIX 6.1 TL9.
These 2 updates use PowerHA 7.1.3.

Could you please give a reference to the IBM support service request?
Or say clearly which packages you updated?

best regards,
Pierrick

bakunin · January 12, 2015, 2:09pm

Yes, actually the system was at 7.1 TL3, SP2 at the start (see post #1). The only package missing was infocenter, which is why TL2 was shown in the oslevel output.

As stated above, I have not opened any PMR; I just wanted to understand the problem. The packages I updated were the updates necessary to get from 1412 to 1441 (which, in IBM versioning, means from the version issued in week 12 of 2014 to the one issued in week 41 of 2014). Compare posts #1 and #4 for that purpose.
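
The reading of the oslevel string described here can be written down as a small sketch. decode_oslevel is a hypothetical helper name, and prefixing the two-digit year with "20" is an assumption.

```shell
# Hypothetical helper: split an "oslevel -s" string such as
# 7100-03-02-1412 into release, technology level, service pack and
# build date (two-digit year followed by week number).
decode_oslevel() {
    level=$1
    release=${level%%-*}      # text before the first dash
    rest=${level#*-}
    tl=${rest%%-*}
    rest=${rest#*-}
    sp=${rest%%-*}
    build=${rest#*-}          # YYWW
    echo "release=$release TL=$tl SP=$sp year=20${build%??} week=${build#??}"
}

decode_oslevel 7100-03-02-1412   # prints: release=7100 TL=03 SP=02 year=2014 week=12
decode_oslevel 7100-03-04-1441   # prints: release=7100 TL=03 SP=04 year=2014 week=41
```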

As I used an lpp_source from my NIM server to do the update, I didn't keep notes about every single package. I might be able to pull it out of the lslpp history if your interest is not satisfied yet.
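
If listings are saved before and after an update, the changed filesets can be derived afterwards instead of being noted by hand. A rough sketch, assuming colon-separated listings (for example from lslpp -Lc) with the fileset name in the second field and the level in the third; that field layout and the helper name changed_filesets are assumptions.

```shell
# Hypothetical helper: compare two saved colon-separated fileset listings
# and print every fileset whose level differs or that is new in the second.
# Assumes field 2 = fileset name, field 3 = level; header lines start with "#".
changed_filesets() {
    awk -F: '
        /^#/ { next }
        NR == FNR { before[$2] = $3; next }
        !($2 in before) { print $2, "(new)", $3; next }
        before[$2] != $3 { print $2, before[$2], "->", $3 }
    ' "$1" "$2"
}
```

Usage would look like: changed_filesets before.txt after.txt, where both files were captured with the same listing command on the same node.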

I hope this helps.

bakunin