Solaris cluster critical issue

Hi all,

A few hours ago I made some changes on our Solaris cluster servers. These are the changes I made:

  1. Installed the latest Solaris 10 patchset from Oracle.
  2. Enabled the BSM audit module (roughly the commands sketched below): booted into single-user mode and rebooted, then after the reboot changed back to multi-user mode and rebooted again.
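
For reference, on Solaris 10 the BSM audit module is normally turned on with the bsmconv script. This is roughly what I ran (from memory, so treat it as a sketch):

bash-3.2# /etc/security/bsmconv    # enables BSM auditing (c2audit module + auditd)
bash-3.2# init 6                   # reboot so auditing takes effect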

Now the cluster service stops automatically after about 10 minutes, and 2 of the 5 cluster resources are not online:

dbcon-rs - offline
svr-rs - starting 
lsnr-rs - online
hasp-rs - online
rs - online

The quorum device and shared disks are all online. Please help me, this is an enterprise production system and right now it's not working.

How did you fix it on your test cluster, where you test these kinds of things before trying them on critical production systems?

There is no test cluster environment. So here I am with this problem.

Not having a test environment is the first problem you need to fix.

In the meantime, please show us more information. So far, all you've told us is that your cluster is broken.
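
For a start, the output of the standard status commands plus the system log would help. Something like:

bash-3.2# /usr/cluster/bin/cluster status              # overall cluster health
bash-3.2# /usr/cluster/bin/clresourcegroup status      # resource group states
bash-3.2# /usr/cluster/bin/clresource status           # individual resource states
bash-3.2# tail -100 /var/adm/messages                  # recent errors on each node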

The best thing you can do is back out all your changes and go back to the way you were set up before. You do have a way to do that, don't you?
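
If the patchset went on via Live Upgrade, backing out is just activating the old boot environment; if it was applied in place, each patch has to be removed one by one. A rough sketch (the boot environment name and patch ID below are examples, not yours):

bash-3.2# lustatus                   # list boot environments, if you used Live Upgrade
bash-3.2# luactivate old_BE          # fall back to the pre-patch BE
bash-3.2# init 6                     # reboot into it
bash-3.2# patchrm 123456-78          # otherwise, back out patches individually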

If you don't, read this:

Hi all, we have now stopped the cluster services and the system is running on the first node without the cluster.
I'm trying to find the cause of the failure. Below is some info from when the cluster was not working.

bash-3.2# /usr/cluster/bin/clresource status
=== Cluster Resources ===

Resource Name      Node Name   State                  Status Message
-------------      ---------   -----                  --------------
fepprod-dbcon-rs   fep1prod    Offline                Offline
                   fep2prod    Offline                Offline

fepprod-svr-rs     fep1prod    Offline                Offline
                   fep2prod    Starting               Unknown

fepprod-lsnr-rs    fep1prod    Offline                Offline
                   fep2prod    Offline                Offline

fepprod-hasp-rs    fep1prod    Offline                Offline
                   fep2prod    Online                 Online

fepprod-rs         fep1prod    Offline                Offline - LogicalHostname offline.
                   fep2prod    Online                 Online - LogicalHostname online.

Also, when I tried to switch the active node, the error below occurred:

resource group is undergoing a reconfiguration, try again later 
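
The switch command I used was along these lines (fepprod-rg here stands in for our real resource group name):

bash-3.2# /usr/cluster/bin/clresourcegroup switch -n fep2prod fepprod-rg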

Right now node1 has the patchset installed and is working without clustering. I'm going to install the patchset on node2 and switch the active node over to node2, hoping I can find something helpful after the patchset installation on node2.
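
The node2 install will be the same run as on node1. Roughly (the Recommended patchset ships its own installer script; the exact name depends on the bundle version):

bash-3.2# cd 10_Recommended
bash-3.2# ./installpatchset --s10patchset    # apply the whole patchset
bash-3.2# init 6                             # reboot into the patched system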

One interesting thing I found: there is one failed device among the cluster devices.

cldev status -v

/dev/did/rdsk/d8             fep1prod             Fail

Maybe that raised the problem?
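
I'm going to look at the failed DID device with the standard device commands, something like this (d8 is from the output above):

bash-3.2# cldev show d8            # which physical path d8 maps to
bash-3.2# cldev status -s Fail     # list only the failed devices
bash-3.2# cldev repair d8          # refresh the DID info once the disk is fixed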

You can try using EaseUS Partition Master to check clusters.

Hooray, I found the reason for the cluster failure. It was an Oracle DB user issue: the Oracle DB server resource timed out because of a locked Oracle DB user account.
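
For anyone hitting the same thing, a locked account is visible in the data dictionary. Something like this from sqlplus (the filter is just illustrative):

bash-3.2# sqlplus / as sysdba
SQL> SELECT username, account_status FROM dba_users
  2  WHERE account_status LIKE '%LOCKED%';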

So we solved the problem with the steps below, in this order (the matching commands are sketched after the list):

  • Unlocked the Oracle DB user that the Sun Cluster Oracle resource uses to connect to the Oracle DB.
  • Took the resource group offline.
  • Disabled the Oracle DB resources and the application resources.
  • Brought the resource group online on the active node.
  • Started the Oracle DB manually.
  • After about 10 minutes, re-enabled the Oracle DB resources and the application resources.
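
Roughly, the commands behind those steps looked like the following (fepprod-rg and the dbconuser account are placeholders for our real names):

SQL> ALTER USER dbconuser ACCOUNT UNLOCK;
bash-3.2# /usr/cluster/bin/clresourcegroup offline fepprod-rg
bash-3.2# /usr/cluster/bin/clresource disable fepprod-dbcon-rs fepprod-svr-rs
bash-3.2# /usr/cluster/bin/clresourcegroup online -n fep2prod fepprod-rg
bash-3.2# sqlplus / as sysdba      # then STARTUP to bring the DB up manually
bash-3.2# /usr/cluster/bin/clresource enable fepprod-dbcon-rs fepprod-svr-rs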

After that, everything worked fine.