Sun Cluster resource group can't fail over

I have recently set up a 4-node cluster running Sun Cluster 3.2

and I have installed 4 zones on each node. When installing the zones I had to install each zone on all nodes, then on the last node do a zlogin -C <zonename>.

This worked OK.

Then I tried to switch the zone to node A, which worked fine. Switching to node B also works fine, but when I try to switch to C or D it does not work.
It seems to get to a certain stage and then falls over when trying to start the zone-rs. I have set up HAStoragePlus to handle mounting the file system, and this works fine for A and B, but it cannot start the zone-rs. Does anyone have any clues?

Open up two sessions. Perform the failover in one session; in the other, tail the cluster log file and show me what you're getting.

Can you confirm where the cluster logs are kept? I assume you are not talking about /var/adm/messages.

log files - /var/cluster/logs, /var/adm/messages
sccheck logs - /var/cluster/sccheck/report.<date>
CCR files - /etc/cluster/ccr
Cluster Infra file - /etc/cluster/ccr/infrastructure
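
A minimal sketch of the two-session check, using the standard 3.2 locations above (the resource group and node names are the ones from this thread):

```shell
# Standard Sun Cluster 3.2 log locations (see the list above).
MSGS=/var/adm/messages
CLUSTERLOGS=/var/cluster/logs

# Terminal 1: follow the logs while the switch runs.
#   tail -f $MSGS $CLUSTERLOGS/commandlog
# Terminal 2: drive the failover.
#   clrg switch -n C2SRV2 proxy2-rg
```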

Hi,

Thanks for getting back in contact. I have attached a copy of the messages file from trying to fail over the proxy2-rg resource group.
I will send additional info from the files you have requested:

Jul 9 11:54:51 C2SRV2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_prenet_start> for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, timeout <1800> seconds
Jul 9 11:54:51 C2SRV2 Cluster.RGM.rgmd: [ID 252072 daemon.notice] 50 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hastorageplus/hastorageplus_prenet_start>:tag=<proxy2-rg.proxy2-HAS-rs.10>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
Jul 9 11:54:51 C2SRV2 Cluster.RGM.rgmd: [ID 285716 daemon.notice] 20 fe_rpc_command: cmd_type(enum):<2>:cmd=<null>:tag=<proxy2-rg.proxy2-HAS-rs.10>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<0>, ...)
Jul 9 11:54:51 C2SRV2 Cluster.RGM.rgmd: [ID 316625 daemon.notice] Timeout monitoring on method tag <proxy2-rg.proxy2-HAS-rs.10> has been suspended.
Jul 9 11:54:54 C2SRV2 Cluster.Framework: [ID 801593 daemon.notice] stdout: becoming primary for proxy2-dg
Jul 9 11:54:56 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 11:54:56 C2SRV2 /scsi_vhci/ssd@g600a0b800029d28e000005ff48649a5c (ssd24): path /pci@780/SUNW,qlc@0/fp@0,0 (fp1) target address 200a00a0b829d290,b is now STANDBY because of an externally initiated failover
Jul 9 11:55:01 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 11:55:01 C2SRV2 Initiating failover for device ssd (GUID 600a0b800029d28e000005ff48649a5c)
Jul 9 11:55:03 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 11:55:03 C2SRV2 Failover operation completed successfully for device ssd (GUID 600a0b800029d28e000005ff48649a5c): failed over from <none> to primary
Jul 9 11:55:03 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 11:55:03 C2SRV2 /scsi_vhci/ssd@g600a0b800029d2160000057148649e21 (ssd25): path /pci@780/SUNW,qlc@0/fp@0,0 (fp1) target address 200a00a0b829d290,c is now STANDBY because of an externally initiated failover
Jul 9 11:55:08 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 11:55:08 C2SRV2 Initiating failover for device ssd (GUID 600a0b800029d2160000057148649e21)
Jul 9 11:55:09 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 11:55:09 C2SRV2 Failover operation completed successfully for device ssd (GUID 600a0b800029d2160000057148649e21): failed over from <none> to secondary
Jul 9 11:55:10 C2SRV2 Cluster.RGM.rgmd: [ID 285716 daemon.notice] 20 fe_rpc_command: cmd_type(enum):<3>:cmd=<null>:tag=<proxy2-rg.proxy2-HAS-rs.10>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<0>, ...)
Jul 9 11:55:10 C2SRV2 Cluster.RGM.rgmd: [ID 316625 daemon.notice] Timeout monitoring on method tag <proxy2-rg.proxy2-HAS-rs.10> has been resumed.
Jul 9 11:55:12 C2SRV2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hastorageplus_prenet_start> completed successfully for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, time used: 1% of timeout <1800 seconds>
Jul 9 11:55:12 C2SRV2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_monitor_start> for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, timeout <90> seconds
Jul 9 11:55:12 C2SRV2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <gds_svc_start> for resource <proxy2-zone-rs>, resource group <proxy2-rg>, node <C2SRV2>, timeout <300> seconds
Jul 9 11:55:12 C2SRV2 Cluster.RGM.rgmd: [ID 333393 daemon.notice] 49 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hastorageplus/hastorageplus_monitor_start>:tag=<proxy2-rg.proxy2-HAS-rs.7>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
Jul 9 11:55:12 C2SRV2 Cluster.RGM.rgmd: [ID 252072 daemon.notice] 50 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWscgds/bin/gds_svc_start>:tag=<proxy2-rg.proxy2-zone-rs.0>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
Jul 9 11:55:12 C2SRV2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hastorageplus_monitor_start> completed successfully for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, time used: 0% of timeout <90 seconds>
Jul 9 11:55:13 C2SRV2 genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@1 (zcons1) online

Jul 9 12:00:16 C2SRV2 Cluster.RGM.rgmd: [ID 764140 daemon.error] Method <gds_svc_start> on resource <proxy2-zone-rs>, resource group <proxy2-rg>, node <C2SRV2>: Timeout.
Jul 9 12:00:16 C2SRV2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_monitor_stop> for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, timeout <90> seconds
Jul 9 12:00:16 C2SRV2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <gds_svc_stop> for resource <proxy2-zone-rs>, resource group <proxy2-rg>, node <C2SRV2>, timeout <300> seconds
Jul 9 12:00:16 C2SRV2 Cluster.RGM.rgmd: [ID 333393 daemon.notice] 49 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hastorageplus/hastorageplus_monitor_stop>:tag=<proxy2-rg.proxy2-HAS-rs.8>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
Jul 9 12:00:16 C2SRV2 Cluster.RGM.rgmd: [ID 252072 daemon.notice] 50 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWscgds/bin/gds_svc_stop>:tag=<proxy2-rg.proxy2-zone-rs.1>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
Jul 9 12:00:16 C2SRV2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hastorageplus_monitor_stop> completed successfully for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, time used: 0% of timeout <90 seconds>

Hi,

I have attached information from /etc/cluster/ccr/infrastructure:

bash-3.00# cat infrastructure
ccr_gennum 37
ccr_checksum CA2D84527351512B9D561320FA5CB117
cluster.name scberl
cluster.state enabled
cluster.properties.cluster_id 0x48450FEC
cluster.properties.installmode disabled
cluster.properties.private_net_number 172.16.0.0
cluster.properties.private_netmask 255.255.248.0
cluster.properties.private_subnet_netmask 255.255.255.128
cluster.properties.private_user_net_number 172.16.4.0
cluster.properties.private_user_netmask 255.255.254.0
cluster.properties.private_maxnodes 64
cluster.properties.private_maxprivnets 10
cluster.properties.auth_joinlist_type sys
cluster.properties.auth_joinlist_hostslist C2SRV1,C2SRV2
cluster.properties.transport_heartbeat_timeout 10000
cluster.properties.transport_heartbeat_quantum 1000
cluster.properties.udp_session_timeout 480
cluster.properties.cmm_version 1
cluster.nodes.1.name C2SRV3
cluster.nodes.1.state enabled
cluster.nodes.1.properties.private_hostname clusternode1-priv
cluster.nodes.1.properties.quorum_vote 1
cluster.nodes.1.properties.quorum_resv_key 0x48450FEC00000001
cluster.nodes.1.adapters.1.name bge2
cluster.nodes.1.adapters.1.state enabled
cluster.nodes.1.adapters.1.properties.device_name bge
cluster.nodes.1.adapters.1.properties.device_instance 2
cluster.nodes.1.adapters.1.properties.transport_type dlpi
cluster.nodes.1.adapters.1.properties.lazy_free 1
cluster.nodes.1.adapters.1.properties.dlpi_heartbeat_timeout 10000
cluster.nodes.1.adapters.1.properties.dlpi_heartbeat_quantum 1000
cluster.nodes.1.adapters.1.properties.nw_bandwidth 80
cluster.nodes.1.adapters.1.properties.bandwidth 70
cluster.nodes.1.adapters.1.properties.ip_address 172.16.0.129
cluster.nodes.1.adapters.1.properties.netmask 255.255.255.128
cluster.nodes.1.adapters.1.ports.1.name 0
cluster.nodes.1.adapters.1.ports.1.state enabled
cluster.nodes.1.adapters.2.name bge21001
cluster.nodes.1.adapters.2.state enabled
cluster.nodes.1.adapters.2.properties.device_name bge
cluster.nodes.1.adapters.2.properties.device_instance 1
cluster.nodes.1.adapters.2.properties.transport_type dlpi
cluster.nodes.1.adapters.2.properties.lazy_free 1
cluster.nodes.1.adapters.2.properties.dlpi_heartbeat_timeout 10000
cluster.nodes.1.adapters.2.properties.dlpi_heartbeat_quantum 1000
cluster.nodes.1.adapters.2.properties.nw_bandwidth 80
cluster.nodes.1.adapters.2.properties.bandwidth 70
cluster.nodes.1.adapters.2.properties.vlan_id 21
cluster.nodes.1.adapters.2.properties.ip_address 172.16.1.1
cluster.nodes.1.adapters.2.properties.netmask 255.255.255.128
cluster.nodes.1.adapters.2.ports.1.name 0
cluster.nodes.1.adapters.2.ports.1.state enabled
cluster.nodes.1.cmm_version 1
cluster.nodes.2.name C2SRV4
cluster.nodes.2.state enabled
cluster.nodes.2.properties.quorum_vote 1
cluster.nodes.2.properties.quorum_resv_key 0x48450FEC00000002
cluster.nodes.2.properties.private_hostname clusternode2-priv
cluster.nodes.2.adapters.1.name bge2
cluster.nodes.2.adapters.1.properties.device_name bge
cluster.nodes.2.adapters.1.properties.device_instance 2
cluster.nodes.2.adapters.1.properties.transport_type dlpi
cluster.nodes.2.adapters.1.properties.lazy_free 1
cluster.nodes.2.adapters.1.properties.dlpi_heartbeat_timeout 10000
cluster.nodes.2.adapters.1.properties.dlpi_heartbeat_quantum 1000
cluster.nodes.2.adapters.1.properties.nw_bandwidth 80
cluster.nodes.2.adapters.1.properties.bandwidth 70
cluster.nodes.2.adapters.1.properties.ip_address 172.16.0.130
cluster.nodes.2.adapters.1.properties.netmask 255.255.255.128
cluster.nodes.2.adapters.1.state enabled
cluster.nodes.2.adapters.1.ports.1.name 0
cluster.nodes.2.adapters.1.ports.1.state enabled
cluster.nodes.2.adapters.2.name bge21001
cluster.nodes.2.adapters.2.properties.device_name bge
cluster.nodes.2.adapters.2.properties.device_instance 1
cluster.nodes.2.adapters.2.properties.vlan_id 21
cluster.nodes.2.adapters.2.properties.transport_type dlpi
cluster.nodes.2.adapters.2.properties.lazy_free 1
cluster.nodes.2.adapters.2.properties.dlpi_heartbeat_timeout 10000
cluster.nodes.2.adapters.2.properties.dlpi_heartbeat_quantum 1000
cluster.nodes.2.adapters.2.properties.nw_bandwidth 80
cluster.nodes.2.adapters.2.properties.bandwidth 70
cluster.nodes.2.adapters.2.properties.ip_address 172.16.1.2
cluster.nodes.2.adapters.2.properties.netmask 255.255.255.128
cluster.nodes.2.adapters.2.state enabled
cluster.nodes.2.adapters.2.ports.1.name 0
cluster.nodes.2.adapters.2.ports.1.state enabled
cluster.nodes.2.cmm_version 1
cluster.nodes.3.name C2SRV1
cluster.nodes.3.state enabled
cluster.nodes.3.properties.quorum_vote 1
cluster.nodes.3.properties.quorum_resv_key 0x48450FEC00000003
cluster.nodes.3.properties.private_hostname clusternode3-priv
cluster.nodes.3.adapters.1.name bge2
cluster.nodes.3.adapters.1.properties.device_name bge
cluster.nodes.3.adapters.1.properties.device_instance 2
cluster.nodes.3.adapters.1.properties.transport_type dlpi
cluster.nodes.3.adapters.1.properties.lazy_free 1
cluster.nodes.3.adapters.1.properties.dlpi_heartbeat_timeout 10000
cluster.nodes.3.adapters.1.properties.dlpi_heartbeat_quantum 1000
cluster.nodes.3.adapters.1.properties.nw_bandwidth 80
cluster.nodes.3.adapters.1.properties.bandwidth 70
cluster.nodes.3.adapters.1.properties.ip_address 172.16.0.131
cluster.nodes.3.adapters.1.properties.netmask 255.255.255.128
cluster.nodes.3.adapters.1.state enabled
cluster.nodes.3.adapters.1.ports.1.name 0
cluster.nodes.3.adapters.1.ports.1.state enabled
cluster.nodes.3.adapters.2.name bge21001
cluster.nodes.3.adapters.2.properties.device_name bge
cluster.nodes.3.adapters.2.properties.device_instance 1
cluster.nodes.3.adapters.2.properties.vlan_id 21
cluster.nodes.3.adapters.2.properties.transport_type dlpi
cluster.nodes.3.adapters.2.properties.lazy_free 1
cluster.nodes.3.adapters.2.properties.dlpi_heartbeat_timeout 10000
cluster.nodes.3.adapters.2.properties.dlpi_heartbeat_quantum 1000
cluster.nodes.3.adapters.2.properties.nw_bandwidth 80
cluster.nodes.3.adapters.2.properties.bandwidth 70
cluster.nodes.3.adapters.2.properties.ip_address 172.16.1.3
cluster.nodes.3.adapters.2.properties.netmask 255.255.255.128
cluster.nodes.3.adapters.2.state enabled
cluster.nodes.3.adapters.2.ports.1.name 0
cluster.nodes.3.adapters.2.ports.1.state enabled
cluster.nodes.4.name C2SRV2
cluster.nodes.4.state enabled
cluster.nodes.4.properties.quorum_vote 1
cluster.nodes.4.properties.quorum_resv_key 0x48450FEC00000004
cluster.nodes.4.properties.private_hostname clusternode4-priv
cluster.nodes.4.adapters.1.name bge2
cluster.nodes.4.adapters.1.properties.device_name bge
cluster.nodes.4.adapters.1.properties.device_instance 2
cluster.nodes.4.adapters.1.properties.transport_type dlpi
cluster.nodes.4.adapters.1.properties.lazy_free 1
cluster.nodes.4.adapters.1.properties.dlpi_heartbeat_timeout 10000
cluster.nodes.4.adapters.1.properties.dlpi_heartbeat_quantum 1000
cluster.nodes.4.adapters.1.properties.nw_bandwidth 80
cluster.nodes.4.adapters.1.properties.bandwidth 70
cluster.nodes.4.adapters.1.properties.ip_address 172.16.0.132
cluster.nodes.4.adapters.1.properties.netmask 255.255.255.128
cluster.nodes.4.adapters.1.state enabled
cluster.nodes.4.adapters.1.ports.1.name 0
cluster.nodes.4.adapters.1.ports.1.state enabled
cluster.nodes.4.adapters.2.name bge21001
cluster.nodes.4.adapters.2.properties.device_name bge
cluster.nodes.4.adapters.2.properties.device_instance 1
cluster.nodes.4.adapters.2.properties.vlan_id 21
cluster.nodes.4.adapters.2.properties.transport_type dlpi
cluster.nodes.4.adapters.2.properties.lazy_free 1
cluster.nodes.4.adapters.2.properties.dlpi_heartbeat_timeout 10000
cluster.nodes.4.adapters.2.properties.dlpi_heartbeat_quantum 1000
cluster.nodes.4.adapters.2.properties.nw_bandwidth 80
cluster.nodes.4.adapters.2.properties.bandwidth 70
cluster.nodes.4.adapters.2.properties.ip_address 172.16.1.4
cluster.nodes.4.adapters.2.properties.netmask 255.255.255.128
cluster.nodes.4.adapters.2.state enabled
cluster.nodes.4.adapters.2.ports.1.name 0
cluster.nodes.4.adapters.2.ports.1.state enabled
cluster.blackboxes.1.name switch1
cluster.blackboxes.1.state enabled
cluster.blackboxes.1.properties.type switch
cluster.blackboxes.1.ports.1.name 1
cluster.blackboxes.1.ports.1.state enabled
cluster.blackboxes.1.ports.2.name 2
cluster.blackboxes.1.ports.2.state enabled
cluster.blackboxes.1.ports.3.name 3
cluster.blackboxes.1.ports.3.state enabled
cluster.blackboxes.1.ports.4.name 4
cluster.blackboxes.1.ports.4.state enabled
cluster.blackboxes.2.name switch2
cluster.blackboxes.2.state enabled
cluster.blackboxes.2.properties.type switch
cluster.blackboxes.2.ports.1.name 1
cluster.blackboxes.2.ports.1.state enabled
cluster.blackboxes.2.ports.2.name 2
cluster.blackboxes.2.ports.2.state enabled
cluster.blackboxes.2.ports.3.name 3
cluster.blackboxes.2.ports.3.state enabled
cluster.blackboxes.2.ports.4.name 4
cluster.blackboxes.2.ports.4.state enabled
cluster.cables.1.properties.end1 cluster.nodes.1.adapters.1.ports.1
cluster.cables.1.properties.end2 cluster.blackboxes.1.ports.1
cluster.cables.1.state enabled
cluster.cables.2.properties.end1 cluster.nodes.1.adapters.2.ports.1
cluster.cables.2.properties.end2 cluster.blackboxes.2.ports.1
cluster.cables.2.state enabled
cluster.cables.3.properties.end1 cluster.nodes.2.adapters.1.ports.1
cluster.cables.3.properties.end2 cluster.blackboxes.1.ports.2
cluster.cables.3.state enabled
cluster.cables.4.properties.end1 cluster.nodes.2.adapters.2.ports.1
cluster.cables.4.properties.end2 cluster.blackboxes.2.ports.2
cluster.cables.4.state enabled
cluster.cables.5.properties.end1 cluster.nodes.3.adapters.1.ports.1
cluster.cables.5.properties.end2 cluster.blackboxes.1.ports.3
cluster.cables.5.state enabled
cluster.cables.6.properties.end1 cluster.nodes.3.adapters.2.ports.1
cluster.cables.6.properties.end2 cluster.blackboxes.2.ports.3
cluster.cables.6.state enabled
cluster.cables.7.properties.end1 cluster.nodes.4.adapters.1.ports.1
cluster.cables.7.properties.end2 cluster.blackboxes.1.ports.4
cluster.cables.7.state enabled
cluster.cables.8.properties.end1 cluster.nodes.4.adapters.2.ports.1
cluster.cables.8.properties.end2 cluster.blackboxes.2.ports.4
cluster.cables.8.state enabled
cluster.quorum_devices.2.name d15
cluster.quorum_devices.2.state enabled
cluster.quorum_devices.2.properties.votecount 1
cluster.quorum_devices.2.properties.gdevname /dev/did/rdsk/d15s2
cluster.quorum_devices.2.properties.path_1 enabled
cluster.quorum_devices.2.properties.path_2 enabled
cluster.quorum_devices.2.properties.access_mode scsi2
cluster.quorum_devices.2.properties.type scsi2

Hi,

I have also attached a copy of the /etc/cluster/ccr/rgm_rg_proxy2-rg file:

bash-3.00# cat rgm_rg_proxy2-rg
ccr_gennum 6
ccr_checksum 53FF13F4E152CAB05ED6D524C74B089C
Unmanaged FALSE
Nodelist 1,2,3,4
Maximum_primaries 1
Desired_primaries 1
Failback FALSE
RG_System FALSE
Resource_list proxy2-HAS-rs,proxy2-zone-rs
RG_dependencies
Global_resources_used *
RG_mode Failover
Implicit_network_dependencies TRUE
Pathprefix
RG_description
Pingpong_interval 3600
RG_project_name
RG_SLM_type manual
RG_SLM_pset_type default
RG_SLM_CPU_SHARES 1
RG_SLM_PSET_MIN 0
RG_affinities
Auto_start_on_new_cluster TRUE
Suspend_automatic_recovery FALSE
Ok_To_Start
RS_proxy2-HAS-rs Type=SUNW.HAStoragePlus:6;Type_version=6;R_description=;On_off_switch=1,2,3,4;Monitored_switch=1,2,3,4;Resource_project_name=;Resource_dependencies=;Resource_dependencies_weak=;Resource_dependencies_restart=;Resource_dependencies_offline_restart=;Extension;FilesystemMountPoints=/opt/zones/mail/proxy2.mail.internal,/opt/zones/mail/proxy2.mail.internal/mounts/var
RS_proxy2-zone-rs Type=SUNW.gds:6;Type_version=6;R_description=;On_off_switch=1,2,3,4;Monitored_switch=1,2,3,4;Resource_project_name=;Resource_dependencies=proxy2-HAS-rs;Resource_dependencies_weak=;Resource_dependencies_restart=;Resource_dependencies_offline_restart=;Extension;Start_command=/opt/SUNWsczone/sczbt/bin/start_sczbt -R proxy2-zone-rs -G proxy2-rg -P /opt/ParameterFile;Stop_command=/opt/SUNWsczone/sczbt/bin/stop_sczbt -R proxy2-zone-rs -G proxy2-rg -P /opt/ParameterFile;Probe_command=/opt/SUNWsczone/sczbt/bin/probe_sczbt -R proxy2-zone-rs -G proxy2-rg -P /opt/ParameterFile;Network_aware=FALSE;Stop_signal=9
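
Note that the CCR files under /etc/cluster/ccr are best treated as read-only; the supported way to view the same configuration is through the cluster CLI. A sketch (group name from this thread):

```shell
RG=proxy2-rg

# Show the resource group and every resource in it:
#   clrg show -v $RG
#   clrs show -v -g $RG
```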

Failover operation completed successfully for device ssd (GUID 600a0b800029d2160000057148649e21): failed over from <none> to secondary

What command did you issue to test the failover? Can you run it once more and capture exactly what you get during that time?

Are you aware of this?
#200892: Sun Cluster 3.x Servers With Certain Qlogic HBA Drivers Attached to EMC Arrays may Encounter System Panics and Failed Service Failover

Hi,

The command I used was clrg switch -n C2SRV2 proxy2-rg

I ran tail -f on the messages file and have attached the output:

Jul 9 14:35:06 C2SRV2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_prenet_start> for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, timeout <1800> seconds
Jul 9 14:35:06 C2SRV2 Cluster.RGM.rgmd: [ID 252072 daemon.notice] 50 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hastorageplus/hastorageplus_prenet_start>:tag=<proxy2-rg.proxy2-HAS-rs.10>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
Jul 9 14:35:06 C2SRV2 Cluster.RGM.rgmd: [ID 285716 daemon.notice] 20 fe_rpc_command: cmd_type(enum):<2>:cmd=<null>:tag=<proxy2-rg.proxy2-HAS-rs.10>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<0>, ...)
Jul 9 14:35:06 C2SRV2 Cluster.RGM.rgmd: [ID 316625 daemon.notice] Timeout monitoring on method tag <proxy2-rg.proxy2-HAS-rs.10> has been suspended.
Jul 9 14:35:09 C2SRV2 Cluster.Framework: [ID 801593 daemon.notice] stdout: becoming primary for proxy2-dg
Jul 9 14:35:11 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 14:35:11 C2SRV2 /scsi_vhci/ssd@g600a0b800029d28e000005ff48649a5c (ssd24): path /pci@780/SUNW,qlc@0/fp@0,0 (fp1) target address 200a00a0b829d290,b is now STANDBY because of an externally initiated failover
Jul 9 14:35:16 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 14:35:16 C2SRV2 Initiating failover for device ssd (GUID 600a0b800029d28e000005ff48649a5c)
Jul 9 14:35:18 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 14:35:18 C2SRV2 Failover operation completed successfully for device ssd (GUID 600a0b800029d28e000005ff48649a5c): failed over from <none> to primary
Jul 9 14:35:18 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 14:35:18 C2SRV2 /scsi_vhci/ssd@g600a0b800029d2160000057148649e21 (ssd25): path /pci@780/SUNW,qlc@0/fp@0,0 (fp1) target address 200a00a0b829d290,c is now STANDBY because of an externally initiated failover
Jul 9 14:35:23 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 14:35:23 C2SRV2 Initiating failover for device ssd (GUID 600a0b800029d2160000057148649e21)
Jul 9 14:35:25 C2SRV2 scsi: [ID 243001 kern.info] /scsi_vhci (scsi_vhci0):
Jul 9 14:35:25 C2SRV2 Failover operation completed successfully for device ssd (GUID 600a0b800029d2160000057148649e21): failed over from <none> to secondary
Jul 9 14:35:25 C2SRV2 Cluster.RGM.rgmd: [ID 285716 daemon.notice] 20 fe_rpc_command: cmd_type(enum):<3>:cmd=<null>:tag=<proxy2-rg.proxy2-HAS-rs.10>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<0>, ...)
Jul 9 14:35:25 C2SRV2 Cluster.RGM.rgmd: [ID 316625 daemon.notice] Timeout monitoring on method tag <proxy2-rg.proxy2-HAS-rs.10> has been resumed.
Jul 9 14:35:27 C2SRV2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hastorageplus_prenet_start> completed successfully for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, time used: 1% of timeout <1800 seconds>
Jul 9 14:35:27 C2SRV2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_monitor_start> for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, timeout <90> seconds
Jul 9 14:35:27 C2SRV2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <gds_svc_start> for resource <proxy2-zone-rs>, resource group <proxy2-rg>, node <C2SRV2>, timeout <300> seconds
Jul 9 14:35:27 C2SRV2 Cluster.RGM.rgmd: [ID 333393 daemon.notice] 49 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hastorageplus/hastorageplus_monitor_start>:tag=<proxy2-rg.proxy2-HAS-rs.7>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
Jul 9 14:35:27 C2SRV2 Cluster.RGM.rgmd: [ID 252072 daemon.notice] 50 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWscgds/bin/gds_svc_start>:tag=<proxy2-rg.proxy2-zone-rs.0>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
Jul 9 14:35:27 C2SRV2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hastorageplus_monitor_start> completed successfully for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, time used: 0% of timeout <90 seconds>
Jul 9 14:35:28 C2SRV2 genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@1 (zcons1) online
Jul 9 14:40:33 C2SRV2 Cluster.RGM.rgmd: [ID 764140 daemon.error] Method <gds_svc_start> on resource <proxy2-zone-rs>, resource group <proxy2-rg>, node <C2SRV2>: Timeout.
Jul 9 14:40:33 C2SRV2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_monitor_stop> for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, timeout <90> seconds
Jul 9 14:40:33 C2SRV2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <gds_svc_stop> for resource <proxy2-zone-rs>, resource group <proxy2-rg>, node <C2SRV2>, timeout <300> seconds
Jul 9 14:40:33 C2SRV2 Cluster.RGM.rgmd: [ID 333393 daemon.notice] 49 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hastorageplus/hastorageplus_monitor_stop>:tag=<proxy2-rg.proxy2-HAS-rs.8>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
Jul 9 14:40:33 C2SRV2 Cluster.RGM.rgmd: [ID 252072 daemon.notice] 50 fe_rpc_command: cmd_type(enum):<1>:cmd=</opt/SUNWscgds/bin/gds_svc_stop>:tag=<proxy2-rg.proxy2-zone-rs.1>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
Jul 9 14:40:33 C2SRV2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hastorageplus_monitor_stop> completed successfully for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, time used: 0% of timeout <90 seconds>
Jul 9 14:43:35 C2SRV2 Cluster.RGM.fed: [ID 605976 daemon.notice] SCSLM zone <proxy2.mail.internal> down
Jul 9 14:43:35 C2SRV2 SC[SUNWsczone.stop_sczbt]proxy2-rg proxy2-zone-rs: [ID 567783 daemon.notice] stop_command rc<0> - Shutdown started. Wed Jul 9 13:40:33 BST 2008
Jul 9 14:43:35 C2SRV2 SC[SUNWsczone.stop_sczbt]proxy2-rg proxy2-zone-rs: [ID 567783 daemon.notice] stop_command rc<0> - Changing to init state 0 - please wait
Jul 9 14:43:35 C2SRV2 SC[SUNWsczone.stop_sczbt]proxy2-rg proxy2-zone-rs: [ID 567783 daemon.notice] stop_command rc<0> - showmount: proxy2.mail.internal: RPC: Program not registered
Jul 9 14:43:35 C2SRV2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <gds_svc_stop> completed successfully for resource <proxy2-zone-rs>, resource group <proxy2-rg>, node <C2SRV2>, time used: 60% of timeout <300 seconds>
Jul 9 14:43:35 C2SRV2 Cluster.RGM.rgmd: [ID 224900 daemon.notice] launching method <hastorageplus_postnet_stop> for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, timeout <1800> seconds
Jul 9 14:43:35 C2SRV2 Cluster.RGM.rgmd: [ID 252072 daemon.notice] 50 fe_rpc_command: cmd_type(enum):<1>:cmd=</usr/cluster/lib/rgm/rt/hastorageplus/hastorageplus_postnet_stop>:tag=<proxy2-rg.proxy2-HAS-rs.11>: Calling security_clnt_connect(..., host=<C2SRV2>, sec_type {0:WEAK, 1:STRONG, 2:DES} =<1>, ...)
Jul 9 14:43:36 C2SRV2 Cluster.RGM.rgmd: [ID 515159 daemon.notice] method <hastorageplus_postnet_stop> completed successfully for resource <proxy2-HAS-rs>, resource group <proxy2-rg>, node <C2SRV2>, time used: 0% of timeout <1800 seconds>
Jul 9 14:43:36 C2SRV2 Cluster.Framework: [ID 801593 daemon.notice] stdout: no longer primary for proxy2-dg

When a Solaris Zone is managed by the Sun Cluster HA for Solaris Containers data service, the Solaris Zone becomes a failover Solaris Zone, or multiple-masters Solaris Zone, across the Sun Cluster nodes. The failover is managed by the Sun Cluster HA for Solaris Containers data service, which runs only within the global zone.
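
Since it is <gds_svc_start> that times out, it is worth watching the zone itself on the target node while the method runs. A hedged sketch (zone name from this thread; run from the global zone):

```shell
ZONE=proxy2.mail.internal

# Does the zone get stuck in 'ready', or reach 'running'?
#   zoneadm -z $ZONE list -v
# Watch the zone console for a boot hang:
#   zlogin -C $ZONE
# Inside the zone, list SMF services blocking the milestone:
#   svcs -x
```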

Hi,

In relation to the SAN: I am using a Sun StorageTek 6140.

Thanks

Perform the following step for each resource group you want to return to the original node:
# clrg switch -n nodename resourcegroup
If your cluster is 3.2, you should no longer use Network_resources_used; just place your logical host in the dependency list.

From the messages I see two probable root causes:

  1. The master server is installed on shared storage.
  2. The master server resource does not depend on the necessary HASP resource.

The problem arises from a probable misconfiguration.

It is almost certain that the dependency from the master resource to the underlying HAStoragePlus resource is missing. The symptoms are classic: if the dependency is missing, the RGM calls validation on the second node, where there is no shared storage, so the agent behaves exactly as designed and the start fails. The problem is fixed once the necessary dependency is added.
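
If the dependency does turn out to be missing, it can be checked and added from the global zone; a sketch using the resource names from this thread:

```shell
RS=proxy2-zone-rs
DEP=proxy2-HAS-rs

# Check the current dependency list:
#   clrs show -p Resource_dependencies $RS
# Add the HAStoragePlus dependency if it is absent:
#   clrs set -p Resource_dependencies=$DEP $RS
```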

I have the data service installed on all nodes in the cluster.

There are files at:
/opt/SUNWsczone/sczbt/util/proxy2-sczbt_config
/opt/SUNWsczone/sczbt/util/sczbt_register
/opt/ParameterFile/sczbt_proxy2-zone-rs

These are used to create the proxy2-zone-rs resource, and this resource does not run on all the servers. When you try to fail over the resource group proxy2-rg, it fails over all the resources apart from the last one, proxy2-zone-rs.

I created the resource by editing the proxy2-sczbt_config file, then registered the config file, i.e.:
/opt/SUNWsczone/sczbt/util/sczbt_register -f /opt/SUNWsczone/sczbt/util/proxy2-sczbt_config

This created /opt/ParameterFile/sczbt_proxy2-zone-rs, which I then copied to all other nodes in the cluster, so they should be identical.
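
One quick way to confirm the copies really are identical is to compare checksums across the nodes; a sketch (node names from this thread, ssh between nodes assumed; digest is the Solaris 10 checksum tool):

```shell
F=/opt/ParameterFile/sczbt_proxy2-zone-rs

# Helper: do two checksum strings match?
same_sum() { [ "$1" = "$2" ]; }

# Collect one checksum per node and compare them:
#   for n in C2SRV1 C2SRV2 C2SRV3 C2SRV4; do
#     echo "$n $(ssh $n /usr/bin/digest -a md5 $F)"
#   done
```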

I have attached the config/parameter files:

bash-3.00# cat proxy2-sczbt_config
#
# Copyright 2007 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
# ident "@(#)sczbt_config 1.4 07/09/14 SMI"
#
# This file will be sourced in by sczbt_register and the parameters
# listed below will be used.
#
# These parameters can be customized in (key=value) form
#
# RS - Name of the resource
# RG - Name of the resource group containing RS
# PARAMETERDIR - Name of the parameter file directory
# SC_NETWORK - Identifies if SUNW.LogicalHostname will be used
# true = zone will use SUNW.LogicalHostname
# false = zone will use its own configuration
#
# NOTE: If the ip-type keyword for the non-global zone is set
# to "exclusive", only "false" is allowed for SC_NETWORK
#
# The configuration of a zone's network addresses depends on
# whether you require IPMP protection or protection against
# the failure of all physical interfaces.
#
# If you require only IPMP protection, configure the zone's
# addresses by using the zonecfg utility and then place the
# zone's address in an IPMP group.
#
# To configure this option set
# SC_NETWORK=false
# SC_LH=
#
# If IPMP protection is not required, just configure the
# zone's addresses by using the zonecfg utility.
#
# To configure this option set
# SC_NETWORK=false
# SC_LH=
#
# If you require protection against the failure of all physical
# interfaces, choose one option from the following list.
#
# - If you want the SUNW.LogicalHostName resource type to manage
# the zone's addresses, configure a SUNW.LogicalHostName
# resource with at least one of the zone's addresses.
#
# To configure this option set
# SC_NETWORK=true
# SC_LH=<Name of the SC Logical Hostname resource>
#
# - Otherwise, configure the zone's addresses by using the
# zonecfg utility and configure a redundant IP address
# for use by a SUNW.LogicalHostName resource.
#
# To configure this option set
# SC_NETWORK=false
# SC_LH=<Name of the SC Logical Hostname resource>
#
# Whichever option is chosen, multiple zone addresses can be
# used either in the zone's configuration or using several
# SUNW.LogicalHostname resources.
#
# e.g. SC_NETWORK=true
# SC_LH=zone1-lh1,zone1-lh2
#
# SC_LH - Name of the SC Logical Hostname resource
# FAILOVER - Identifies if the zone's zone path is on a
# highly available local file system
#
# e.g. FAILOVER=true - highly available local file system
# FAILOVER=false - local file system
#
# HAS_RS - Name of the HAStoragePlus SC resource
#

RS=proxy2-zone-rs
RG=proxy2-rg
PARAMETERDIR=/opt/ParameterFile
SC_NETWORK=false
SC_LH=
FAILOVER=true
HAS_RS=proxy2-HAS-rs

#
# The following variable will be placed in the parameter file
#
# Parameters for sczbt (Zone Boot)
#
# Zonename Name of the zone
# Zonebrand Brand of the zone. Current supported options are
# "native" (default), "lx" or "solaris8"
# Zonebootopt Zone boot options ("-s" requires that Milestone=single-user)
# Milestone SMF Milestone which needs to be online before the zone is
# considered booted. This option is only used for the
# "native" Zonebrand.
# LXrunlevel Runlevel which needs to get reached before the zone is
# considered booted. This option is only used for the "lx"
# Zonebrand.
# SLrunlevel Solaris legacy runlevel which needs to get reached before the
# zone is considered booted. This option is only used for the
# "solaris8" Zonebrand.
# Mounts Mounts is a list of directories and their mount options,
# which are loopback mounted from the global zone into the
# newly booted zone. The mountpoint in the local zone can
# be different to the mountpoint from the global zone.
#
# The Mounts parameter format is as follows,
#
# Mounts="/<global zone directory>:/<local zone directory>:<mount options>"
#
# The following are valid examples for the "Mounts" variable
#
# Mounts="/globalzone-dir1:/localzone-dir1:rw"
# Mounts="/globalzone-dir1:/localzone-dir1:rw /globalzone-dir2:rw"
#
# The only required entry is the /<global zone directory>, the
# /<local zone directory> and <mount options> can be omitted.
#
# Omitting /<local zone directory> will make the local zone
# mountpoint the same as the global zone directory.
#
# Omitting <mount options> will not provide any mount options
# except the default options from the mount command.
#
# Note: You must manually create any local zone mountpoint
# directories that will be used within the Mounts variable,
# before registering this resource within Sun Cluster.
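Per the note above, every <local zone directory> named in Mounts must already exist under the zone's root before the resource is registered. A minimal sketch, assuming a hypothetical zonepath of /zones/proxy2 and the example directory name from the comments:

```shell
# Create the loopback-mount target inside the zone's root filesystem
# so the Mounts= entry has somewhere to land when the zone boots.
# (zonepath /zones/proxy2 is an assumption; substitute your own.)
mkdir -p /zones/proxy2/root/localzone-dir1
```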
#

Zonename="proxy2.mail.internal"
Zonebrand="native"
Zonebootopt=""
Milestone="multi-user-server"
LXrunlevel="3"
SLrunlevel="3"
Mounts=""
########################
Parameter file:

bash-3.00# cat sczbt_proxy2-zone-rs
#!/usr/bin/ksh
#
# Copyright 2007 Sun Microsystems, Inc. All rights reserved.
# Use is subject to license terms.
#
#
# Parameters for sczbt (Zone Boot)
#
# Zonename Name of the zone
# Zonebrand Brand of the zone. Current supported options are
# "native" (default), "lx" or "solaris8"
# Zonebootopt Zone boot options ("-s" requires that Milestone=single-user)
# Milestone SMF Milestone which needs to be online before the zone is
# considered as booted. This option is only used for the
# "native" Zonebrand.
# LXrunlevel Runlevel which needs to get reached before the zone is
# considered booted. This option is only used for the "lx"
# Zonebrand.
# SLrunlevel Solaris legacy runlevel which needs to get reached before the
# zone is considered booted. This option is only used for the
# "solaris8" Zonebrand.
# Mounts Mounts is a list of directories and their mount options,
# which are loopback mounted from the global zone into the
# newly booted zone. The mountpoint in the local zone can
# be different to the mountpoint from the global zone.
#
# The Mounts parameter format is as follows,
#
# Mounts="/<global zone directory>:/<local zone directory>:<mount options>"
#
# The following are valid examples for the "Mounts" variable
#
# Mounts="/globalzone-dir1:/localzone-dir1:rw"
# Mounts="/globalzone-dir1:/localzone-dir1:rw /globalzone-dir2:rw"
# The only required entry is the /<global zone directory>, the
# /<local zone directory> and <mount options> can be omitted.
#
# Omitting /<local zone directory> will make the local zone
# mountpoint the same as the global zone directory.
#
# Omitting <mount options> will not provide any mount options
# except the default options from the mount command.
#
# Note: You must manually create any local zone mountpoint
# directories that will be used within the Mounts variable,
# before registering this resource within Sun Cluster.
#

Zonename="proxy2.mail.internal"
Zonebrand="native"
Zonebootopt=""
Milestone="multi-user-server"
LXrunlevel="3"
SLrunlevel="3"
Mounts=""
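To capture why the switchover to nodes c and d fails, the two-session approach suggested earlier in the thread can be run with the 3.2 CLI. A sketch, where C2SRV3 stands in for the failing node; substitute the real node name:

```shell
# Session 1: attempt the switchover to the failing node.
clresourcegroup switch -n C2SRV3 proxy2-rg

# Session 2: follow the RGM method output while the methods run.
tail -f /var/adm/messages
```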

Hi,

Can you please confirm what exactly it is I need to do, as I am not sure?

How is this done?
"If your cluster is 3.2 you should not use Network_resources_used any more; just place your logical host in the dependency list."
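The advice above means dropping the deprecated Network_resources_used property and expressing the logical-host relationship through the resource's Resource_dependencies list instead. A sketch, assuming a logical-hostname resource named proxy2-lh-rs (hypothetical; substitute your actual resource name):

```shell
# Stop using the deprecated property on the zone resource...
clresource set -p Network_resources_used="" proxy2-zone-rs

# ...and declare the logical host in the dependency list instead.
clresource set -p Resource_dependencies=proxy2-lh-rs proxy2-zone-rs
```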