SunOS 8.5 random shutdown

fah_kinright · July 31, 2015, 1:24pm

hope this is the correct spot.

we have 4 units running SunOS 8.5: 2 in offices and 2 running equipment in the lab.
they communicate over our network to run/monitor jobs on the equipment.
they have randomly shut off recently: twice last thursday and once this friday.

the vendor of the equipment says it's a network problem. i asked my network department if any strange "flickers" happened (i am still waiting for our network people to answer me from last week, FML).

today when it happened, the user managed to get a report from one of the machines. i will paste it below. the report does mention "communication out of sync", but is it really a network error or is hardware failing?

when i run the iostat -E command it says nothing and i don't know what else to try.

has anyone else come across this and/or fixed it? any help would be appreciated.

thanks

07/31/2015  10:09:33  Machine:5213  (Rel:8.9.0.c, BCCB [28108], CNFDFilDes.c, 450)
ERROR: CN-0000
Inconsistent CN protocols active (1 and 790644820) or communication out of sync. (trying to recover)

07/31/2015  10:09:33  Machine:5213  (Rel:8.9.0.c, BCCB [28108], CNXA_server.c, 222)
ERROR: CN-0000 (linked to CN-0000)
Error occurred in `CNXArecRqMsg'

07/31/2015  10:09:33  Machine:5213  (Rel:8.9.0.c, BCCB [28108], BCCBcom.c, 191)
ERROR: BC-2802 (linked to CN-0000)
Communication Error

07/31/2015  10:09:33  Machine:5213  (Rel:8.9.0.c, BCCB [28108], BCCB.c, 467)
ERROR: BC-2202 (linked to BC-2802)
Error handling start batch event. Seq nr 0

07/31/2015  10:09:33  Machine:5213  (Rel:8.9.0.c, BCCB [28108], BCCB.c, 4223)
ERROR: BC-2210 (linked to BC-2202)
Error handling event START_EVENT for batch 0

07/31/2015  10:09:33  Machine:5213  (Rel:8.9.0.c, BCCB [28108], BCCB.c, 4953)
OK (linked to BC-2210)
Error handling message, continue anyway 

07/31/2015  10:09:36  Machine:5213  (Rel:8.9.0.c, PMCH [6134], CNFDFilDes.c, 450)
ERROR: CN-0000
Inconsistent CN protocols active (1 and 790644820) or communication out of sync. (trying to recover)

07/31/2015  10:09:36  Machine:5213  (Rel:8.9.0.c, PMCH [6134], CNXA_server.c, 222)
ERROR: CN-0000 (linked to CN-0000)
Error occurred in `CNXArecRqMsg'

07/31/2015  10:09:36  Machine:5213  (Rel:8.9.0.c, PMCH [6134], PMCHdl.c, 354)
ERROR: PM-2004 (linked to CN-0000)
CH request message receive failure

07/31/2015  10:09:36  Machine:5213  (Rel:8.9.0.c, PMCH [6134], PMCHdl.c, 536)
ERROR: PM-2001 (linked to PM-2004)
PMCHdl Task Aborted

07/31/2015  10:09:36  Machine:5213  (Rel:8.9.0.c, PMWI [6108], PMWIndows.c, 923)
ERROR: PM-0105 (linked to PM-0400)
Notifier stopped

07/31/2015  10:09:36  Machine:5213  (Rel:8.9.0.c, PMWI [6108], PMWIndows.c, 2153)
ERROR: PM-0106 (linked to PM-0105)
Error during PMWI main loop

07/31/2015  10:09:36  Machine:5213  (Rel:8.9.0.c, PMWI [6108], PMWIndows.c, 2223)
ERROR: PM-1003 (linked to PM-0106)
PMWIndows Task Aborted

07/31/2015  10:09:36  Machine:5213  (Rel:8.9.0.c, PMWI [6108], PMWIndows.c, 2226)
EVENT: PM-1002
PMWIndows Task Stopped

07/31/2015  10:09:40  Machine:5213  (Rel:8.9.0.c, BCCB [28108], CNFDFilDes.c, 702)
ERROR: CN-0000
ioctl() failed: Connection reset by peer

07/31/2015  10:09:55  Machine:5213  (Rel:8.9.0.c, BCCB [28108], BCCBcom.c, 191)
ERROR: BC-2802 (linked to CN-0001)
Communication Error

07/31/2015  10:09:55  Machine:5213  (Rel:8.9.0.c, BCCB [28108], BCCB.c, 467)
ERROR: BC-2202 (linked to BC-2802)
Error handling start batch event. Seq nr 0

07/31/2015  10:09:55  Machine:5213  (Rel:8.9.0.c, BCCB [28108], BCCB.c, 4223)
ERROR: BC-2210 (linked to BC-2202)
Error handling event START_EVENT for batch 0

07/31/2015  10:09:55  Machine:5213  (Rel:8.9.0.c, BCCB [28108], BCCB.c, 4953)
OK (linked to BC-2210)
Error handling message, continue anyway

hicksd8 · July 31, 2015, 4:59pm

Firstly, when you say Solaris 8.5.2, I think you really mean SunOS 5.8.x, which is what we all know as Solaris 8.

You call it "random shutdown", which implies the operating system is shutting down (and restarting), but is it?

After one of these incidents, log in as root and issue the command

uptime

which asks the O/S how long it's been up for. If the O/S actually did crash the output will show the O/S has only been up for a minute or two. If the output says the O/S has been up for hours then it is indeed a network crash that you are seeing.
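
As a rough sketch of that check, the helper below reads one line of uptime output and says whether the box has only been up for minutes. The function name recent_reboot is made up for illustration, and the parsing assumes the usual layout where the word "up" is followed by a number and a unit such as "min(s)" or "day(s)"; treat it as a starting point, not a definitive test.

```shell
# Hypothetical helper: classify one line of `uptime` output read from stdin.
# Assumption: the line looks like "10:15am  up 2 min(s),  1 user, ..." or
# "1:24pm  up 4 day(s),  3:10,  2 users, ...".
recent_reboot() {
    awk '{
        unit = ""; found = 0
        # Find the word "up"; the unit (if any) is two fields after it.
        for (i = 1; i < NF; i++)
            if ($i == "up") { unit = $(i + 2); found = 1; break }
        if (!found)             print "could not parse uptime output"
        else if (unit ~ /^min/) print "recent reboot: up for minutes only"
        else                    print "no recent reboot: up for an hour or more"
    }'
}
```

It would be used as: uptime | recent_reboot. The commands who -b and last reboot also report when the system was last booted, which helps if nobody gets to the machine until long after the incident.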

---------- Post updated at 09:59 PM ---------- Previous update was at 09:45 PM ----------

Is this box running Open WebMail? If so, I think the errors you are posting may be coming from that application.
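
To see which application processes are raising the errors, and which entries start each chain, a small awk filter over the pasted report can help. This is only a sketch: the function name root_errors is hypothetical, and it assumes the layout seen in the report above (a header line containing "Machine:", then an "ERROR:" line, then the message) and that entries without "(linked to ...)" are the first error in a chain.

```shell
# Hypothetical helper: print time, process, code and message for every
# ERROR entry that is not "linked to" an earlier one.
# Assumption: input follows the three-line layout of the pasted report.
root_errors() {
    awk '
        # Header line: field 2 is the time, field 5 the process name.
        /Machine:/ { when = $2; proc = $5; next }
        # An ERROR line with no "linked to" is treated as a chain start.
        /ERROR:/ && !/linked to/ { code = $2; want = 1; next }
        # The line right after such an ERROR line is its message text.
        want { print when, proc, code, $0; want = 0 }
    '
}
```

With the report saved to a file (report.txt is just an example name), it would be run as: root_errors < report.txt. Each output line names one process and its first error, which makes it easier to tell whether everything traces back to the same communication failure.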

fah_kinright · August 4, 2015, 10:54am

thanks for the reply.

yes, i must have made a typo. it's SunOS 5.8.

I will have to try this command after the next random shut down. I am under the impression it's a total shut down.
I will have to confirm if the systems are running Webmail, and report back.

thanks again

---------- Post updated at 10:54 AM ---------- Previous update was at 08:24 AM ----------

just to add info.
the client believes there is a mail client on one unit (in the lab) that was used to communicate with another unit, but it was blocked or turned off many years ago, so way before this problem occurred. :confused:

Don_Cragun · August 4, 2015, 2:55pm

I'm sure this isn't your problem... But about 30 years ago while I was working at Sun, I would start a long running job just before I left work for the evening and it died and the machine rebooted every weekday night sometime between 9:45pm and 10pm. I could see from the logs that it rebooted five to nine times every weekday night and then ran perfectly until the next night (with no reboots on Saturday or Sunday).

I finally stayed late one night to find out what the problem was... I had made the mistake of having one plug unused on the power strip under my desk and having that open plug visible and easily accessible to the cleaning crew. They plugged in a vacuum cleaner and used it as their power source as they swept the hallway and about 6 offices around mine. Every time they turned on the vacuum cleaner, the system detected the power surge and rebooted.

The cleaning company got strongly worded, new instructions from upper management the next day and the problem magically disappeared. :rolleyes:

fah_kinright · August 4, 2015, 2:59pm

interesting... i don't think it's the cleaning crew in this instance, but we might need to check the lab. perhaps something new was plugged into the same wall outlet or power bar. thanks for the info.