sembii
February 28, 2015, 4:31pm
1
Hi all,
A few hours ago I made some changes on our Solaris cluster servers. These are the changes I made:
Installed the latest Solaris 10 patchset from Oracle.
Enabled the BSM log module. Entered single-user mode and rebooted. After the reboot, changed to multi-user mode and rebooted again.
Now the cluster service stops automatically after about 10 minutes. Also, 2 of the 5 cluster resources are not online:
dbcon-rs - offline
svr-rs - starting
lsnr-rs - online
hasp-rs - online
rs - online
The quorum device and shared disks are all online. Please help me; this is an enterprise production system and it is not working now.
achenle
February 28, 2015, 5:59pm
2
How did you fix it on your test cluster, where you test these kinds of things before trying them on critical production systems?
sembii
March 1, 2015, 2:36am
3
There is no test cluster environment, so that is why I am seeing the problem here.
Missing a test environment is the problem you need to fix in the first place.
In the meantime, please show us more information. All you've told us is your cluster is broken.
The best thing you can do is back out all your changes and go back to the way you were set up before. You do have a way to do that, don't you?
If you don't, read this:
sembii
March 1, 2015, 11:54am
6
Hi all, we have now stopped the cluster services, and the system is running on the first node without clustering.
I'm trying to find the cause of the failure. Below is some information from when the cluster was not working.
bash-3.2# /usr/cluster/bin/clresource status
=== Cluster Resources ===

Resource Name       Node Name    State       Status Message
-------------       ---------    -----       --------------
fepprod-dbcon-rs    fep1prod     Offline     Offline
                    fep2prod     Offline     Offline
fepprod-svr-rs      fep1prod     Offline     Offline
                    fep2prod     Starting    Unknown
fepprod-lsnr-rs     fep1prod     Offline     Offline
                    fep2prod     Offline     Offline
fepprod-hasp-rs     fep1prod     Offline     Offline
                    fep2prod     Online      Online
fepprod-rs          fep1prod     Offline     Offline - LogicalHostname offline.
                    fep2prod     Online      Online - LogicalHostname online.
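With longer status listings it can help to pull out only the rows that are stuck in a transitional or failed state. The following is a minimal sketch, not part of the cluster tooling: the function name stuck_resources is made up for illustration, and the list of state names in it is an assumption that may not be exhaustive for every cluster release.

```shell
# Reads `clresource status` output on stdin and prints "resource node state"
# for every row whose state is neither Online nor Offline.
# ASSUMPTION: the state names below are a plausible subset, not an official list.
# On Solaris 10 the default /usr/bin/awk may be too old for this; nawk or
# /usr/xpg4/bin/awk is likely needed there.
stuck_resources() {
  awk '
    BEGIN {
      states = "^(Online|Offline|Starting|Stopping|Start_failed|Stop_failed|Monitor_failed|Online_not_monitored)$"
    }
    # Skip the banner, the column headers, the separator row and blank lines.
    /^===/ || /^Resource Name/ || /^-/ || NF == 0 { next }
    {
      if ($2 ~ states)      { node = $1; state = $2 }            # continuation row: node, state
      else if ($3 ~ states) { rs = $1; node = $2; state = $3 }   # row that names the resource
      else next                                                  # unrecognised row, ignore it
      if (state != "Online" && state != "Offline") print rs, node, state
    }
  '
}

# Usage (on a cluster node):
#   /usr/cluster/bin/clresource status | stuck_resources
```

Run against the listing above, this should leave only the fepprod-svr-rs row on fep2prod, which is the resource that never finished starting.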
Also, when I try to switch the active node, the following error occurs:
resource group is undergoing a reconfiguration, try again later
Now node1 has the patchset installed and is working without clustering. I'm going to install the patchset on node2 and switch the active node to node2, hoping to find something helpful after the patchset installation on node2.
I found one interesting thing: there is one failed device among the cluster devices.
cldev status -v
/dev/did/rdsk/d8 fep1prod Fail
Maybe it caused the problem?
You can try using EaseUS Partition Master to check the clusters.
sembii
March 31, 2015, 2:11am
8
Hooray, found the reason for the cluster failure. It was an Oracle DB user issue: the Oracle DB server resource timed out because its Oracle DB user was locked.
We solved the problem with the following steps, in this order:
Unlocked the Oracle DB user that the Sun Cluster Oracle resource uses to connect to the Oracle DB.
Took the resource group offline.
Disabled the Oracle DB resources and application resources.
Brought the resource group online on the active node.
Started the Oracle DB manually.
After about 10 minutes, enabled the Oracle DB resources and application resources.
After that, everything worked fine.
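For anyone who lands here with the same symptom, the cluster side of the steps above can be sketched roughly as follows. This is only a dry-run outline under stated assumptions: the resource group name fepprod-rg is hypothetical (the group name was never posted), the choice of which resources count as "Oracle DB and application resources" is a guess based on the status listing, and the node is a placeholder. By default the script only prints the commands it would run.

```shell
# Dry-run sketch of the recovery order described above.
# ASSUMPTIONS: RG is a hypothetical group name; RESOURCES and NODE are guesses.
RG=${RG:-fepprod-rg}
NODE=${NODE:-fep1prod}
RESOURCES=${RESOURCES:-"fepprod-svr-rs fepprod-lsnr-rs fepprod-dbcon-rs"}
DRY_RUN=${DRY_RUN:-1}          # 1 = only print the commands, anything else = execute

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

recover_plan() {
  # Before any of this: unlock the locked Oracle account from SQL*Plus,
  # e.g. ALTER USER <monitoring_user> ACCOUNT UNLOCK;  (user name not given in the thread)
  run /usr/cluster/bin/clresourcegroup offline "$RG"
  for rs in $RESOURCES; do
    run /usr/cluster/bin/clresource disable "$rs"
  done
  run /usr/cluster/bin/clresourcegroup online -n "$NODE" "$RG"
  # Manual step here: start the Oracle DB by hand and give it time to settle
  # (about 10 minutes in this case) before re-enabling the resources.
  for rs in $RESOURCES; do
    run /usr/cluster/bin/clresource enable "$rs"
  done
}

# Usage:
#   recover_plan              # prints the plan
#   DRY_RUN=0 recover_plan    # executes it on a cluster node
```

Keeping the default at dry-run makes it possible to review the exact command order, and to substitute the real group, node and resource names, before touching a production cluster.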