Solaris stuck during boot after reconfigure boot

Hello,

I have a problem with my machine that won't boot properly.

The story is that I installed a piece of software called apcupsd, which is a control application for my APC battery UPS. I have used version 3.14.10 before, but as part of restoring my previously crashed OS hard drive I wanted to install it again, and this time I took the latest version, 3.14.14.
Anyway, after compile, make and make install, the installation script instructed me to do a reconfigure boot using reboot -- -r before starting it for the first time.
(I should add that the UPS has been connected with a USB cable for a few weeks; I just hadn't installed the software until now.)

Without perhaps fully understanding what a reconfigure boot entails, I happily entered reboot -- -r. The computer started shutting down and the ssh connection was lost as part of that.
As it is a headless server, I could not really see what was happening at the time, but when I still could not contact the machine 10 minutes later I got worried and connected a monitor to it. It had frozen on a single line of text (SunOS etc.) with a non-blinking cursor. I waited a little while and then power-cycled the machine. After that it booted past GRUB to the Oracle/Solaris splash screen with the spinning circle and was stuck there for a few hours. With the circle still spinning, I figured that it had got stuck on something and that another power cycle would make it leave reconfiguring and boot as normal.
Again it got past GRUB to the spinning circle, and there it has stayed overnight and the whole day today.

I am really lost and need some help on where to begin resolving this. Can I make it boot normally somehow, or what else can I do?

Which Solaris version are you using? Try to boot to single user mode and check the log files for possible errors.

x86: How to Boot a System to Run Level S (Single-User Level) (System Administration Guide: Basic Administration)

Of course I forgot: my system is x86 Solaris 11.3.

I tried booting to single-user mode using the instructions at "How to Boot a System to a Single-User State (Run Level S)" in Booting and Shutting Down Oracle(R) Solaris 11.3 Systems, but it does not seem to get through, and if I press a key at the splash screen it just says

SunOS Release 5.11 Version 11.3 64-bit
Copyright (c) 1983, 2015 Oracle and/or its affiliates. All rights reserved.
Booting to milestone "milestone/single-user:default".

So I guess that it tries to boot to single user mode, but gets stuck on something. A bit frustrating.

---------- Post updated at 07:12 PM ---------- Previous update was at 06:45 PM ----------

I found that you can get verbose output by adding -v to the kernel line.
Now, with some more output, it stops after

dump device is /dev/zvol/dsk/rpool/dump size 3 GB (3967 MB)

Can this be a clue?

No, that is just the dump device. Try reading here:

SMF Best Practices and Troubleshooting - Managing System Services in Oracle(R) Solaris 11.3

Can you boot your system to milestone=none?

I can boot with milestone=none. So it is not completely screwed at least.

After setting SMF to milestone:default, I have a number of offline services and three unavailable services. At first sight this looks like a bit of an overwhelming task for my current skill set, so I will need some help breaking it down. Going through every service log file using vi with a keyboard layout that does not match my keyboard will take days.
How can I narrow it down to the service that is stopping the boot? Is it as simple as the first offline service in the list?
Looking at the log of one of the unavailable services (the RPC server), it doesn't seem to reveal anything.

I recall that there are known issues with APC PowerChute on Solaris 10 and Solaris 11.

Question is, how did you install it? Did you follow any special instructions (from the web) or did you just execute the program?

I have a note made to myself which says to search the web for APC Kbase FA276327. Perhaps try that first and post back the result especially if that gets you nowhere.

---------- Post updated at 10:08 AM ---------- Previous update was at 10:01 AM ----------

Also, there are Solaris install notes here:

ftp://ftp.apc.com/apc/public/software/windows/2k/pcns/221/install.htm

Although not specific to Solaris 11, they might be useful.

I read through the readmes, which didn't say anything about Solaris in particular, and installed with just compile, make and make install. That was it. I can't remember if I did something different when I installed it a couple of years ago.

Will read through the links and see if I get somewhere and report back.

Thanks for the patient help and support btw. :slight_smile:

I have read through the links and they refer to a different program. The one I have installed is community-created software. The links refer to APC's own PowerChute application.

Anyone have any suggestions on how to figure out which service the boot up (or loading to milestone state default) is getting stuck on?

I have started thinking about trying to roll back the rpool to a previous state, if possible, following http://docs.oracle.com/cd/E19253-01/819-5461/ghzvk/index.html
If I do try a rollback, is there anything that I should consider beforehand, so I don't mess this up any more?

I am very thankful for any input in this matter. :slight_smile:

So has anything changed since your original post? Does it still get endlessly stuck on a normal boot?

This is Solaris 11.3 x86?

I am at work now, but no, nothing has really changed. I have not been able to pinpoint the cause and hence have not been able to change anything.
I am a bit busy with work at the moment too, so progress is admittedly a little slow right now.
Yes this is Solaris 11.3 x86.

So when you boot 'verbose' does the output just stop completely and get stuck? What is the last thing it outputs?

Also, do note that there is another level above 'verbose' which is 'debug'

boot -m debug

or

reboot -- -m debug

which really gives it text diarrhea. Sometimes it's useful and often not. However, it would be interesting to know where that gets stuck. It might tell us something. You might need to use x-off (ctrl-s) and x-on (ctrl-q) to stop and start the screen as output is pretty fast. It reports everything it does!!!
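
If the debug output scrolls past too quickly, another angle is to ask SMF directly which instance is still mid-transition. As far as I know, svcs marks such an instance with a trailing asterisk in the STATE column (offline*, for example), so filtering on that usually points at the start method that never returns. A minimal sketch, assuming svcs -a style input; the sample listing and the function name transitioning are made up for illustration:

```shell
#!/bin/sh
# Print the FMRI of every instance whose state ends in "*", meaning a
# method is currently running and has not finished.
# Expects "svcs -a" style columns on stdin: STATE  STIME  FMRI
transitioning() {
    awk 'NR > 1 && $1 ~ /\*$/ { print $3 }'
}

# Hypothetical sample listing, not taken from the real machine.
sample='STATE          STIME    FMRI
online         10:18:01 svc:/system/filesystem/root:default
offline*       10:18:05 svc:/network/physical:default
offline        10:18:05 svc:/network/ssh:default'

printf '%s\n' "$sample" | transitioning
```

On the real machine you would run svcs -a | transitioning from a milestone=none console after issuing svcadm milestone all, then look at that service's log under /var/svc/log.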

After booting with verbose output, it seemed to stop on the service network/physical:default.
So I booted with milestone=none, turned network/physical:default off with svcadm, and then did a reboot -- -r, and it booted all the way to the graphical desktop!
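
A rough sketch of that recovery sequence, under assumptions: the exact commands used were not posted, svcadm milestone all is swapped in here as a way to continue the boot without a second reboot, and the run and recover helpers are hypothetical names. By default the script only prints what it would do:

```shell
#!/bin/sh
# Dry-run sketch: prints the commands by default and executes them only
# when DRY_RUN=0 is set in the environment.
FMRI="svc:/network/physical:default"

run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

recover() {
    # Assumes the system was booted with "-m milestone=none" and you are
    # logged in as root on the console.
    run svcadm disable "$FMRI"   # keep the hanging service from starting
    run svcadm milestone all     # let SMF bring up everything else
    run svcs -x                  # show whatever is still not running
}

recover
```

Running it with DRY_RUN=0 on the server would execute the three commands for real, so it is worth reading the printed plan first.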

So the boot problem is identified and kind of solved, but the source problem still remains.

Once in a properly booted desktop, I tried to enable network/physical:default, but it just won't start properly.

Output from svcs and log below.

root@server:~# svcs -x
svc:/network/physical:default (physical network interface configuration)
State: offline* transitioning to online since November 26, 2016 12:39:11 PM CET
Reason: Start method is running.
See: http://support.oracle.com/msg/SMF-8000-C4
See: dladm(1M)
See: ipadm(1M)
See: nwam(5)
See: /var/svc/log/network-physical:default.log
Impact: 13 dependent services are not running. (Use -v for list.)

root@server:~# cat /var/svc/log/network-physical\:default.log
[ Nov 16 20:38:49 Executing start method ("/lib/svc/method/net-physical start"). ]
[ Nov 16 20:38:49 Timeout override by svc.startd. Using infinite timeout. ]
[ Nov 16 20:39:03 Method "start" exited with status 0. ]

(output shortened a bit; typed by hand)

[ Nov 26 10:18:05 Executing start method ("/lib/svc/method/net-physical start"). ]
[ Nov 26 10:18:05 Timeout override by svc.startd. Using infinite timeout. ]
[ Nov 26 10:22:50 Executing start method ("/lib/svc/method/net-physical start"). ]
[ Nov 26 10:22:50 Timeout override by svc.startd. Using infinite timeout. ]
[ Nov 26 12:39:11 Enabled. ]
[ Nov 26 12:39:11 Executing start method ("/lib/svc/method/net-physical start"). ]
[ Nov 26 12:39:11 Timeout override by svc.startd. Using infinite timeout. ]
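
Reading a log like this by eye gets tedious, so here is a small sketch that lists every start attempt with no matching exit line after it. It assumes the log format shown above and uses the hand-typed excerpt as sample input; the function names unfinished_starts and sample_log are made up:

```shell
#!/bin/sh
# List start attempts that have no later "exited" line, meaning the start
# method never finished. Reads an SMF service log on stdin.
unfinished_starts() {
    awk '
        /Executing start method/ {
            if (pending != "") print pending   # previous start never exited
            pending = $2 " " $3 " " $4         # for example: Nov 26 10:18:05
        }
        /Method "start" exited/ { pending = "" }
        END { if (pending != "") print pending }
    '
}

# Sample input: the shortened, hand-typed excerpt from this post.
sample_log() {
    cat <<'EOF'
[ Nov 16 20:38:49 Executing start method ("/lib/svc/method/net-physical start"). ]
[ Nov 16 20:38:49 Timeout override by svc.startd. Using infinite timeout. ]
[ Nov 16 20:39:03 Method "start" exited with status 0. ]
[ Nov 26 10:18:05 Executing start method ("/lib/svc/method/net-physical start"). ]
[ Nov 26 10:18:05 Timeout override by svc.startd. Using infinite timeout. ]
[ Nov 26 10:22:50 Executing start method ("/lib/svc/method/net-physical start"). ]
[ Nov 26 10:22:50 Timeout override by svc.startd. Using infinite timeout. ]
[ Nov 26 12:39:11 Enabled. ]
[ Nov 26 12:39:11 Executing start method ("/lib/svc/method/net-physical start"). ]
[ Nov 26 12:39:11 Timeout override by svc.startd. Using infinite timeout. ]
EOF
}

sample_log | unfinished_starts
```

On the server you would feed it the real file instead: unfinished_starts < /var/svc/log/network-physical:default.log. On the excerpt it reports the three attempts from Nov 26, which fits a start method that hangs rather than one that fails.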

Various outputs below

root@server:~# dladm
LINK                  CLASS   MTU   STATE    OVER
net0                  phys    1500  unknown  --
vboxnet0              phys    1500  up       --
transferzone/net0     vnic    1500  up       vboxnet0

root@server:~# ipadm
NAME          CLASS/TYPE STATE        UNDER     ADDR
lo0           loopback   ok           --        --
   lo0/v4     static     ok           --        127.0.0.1/8
   lo0/v6     static     ok           --        ::1/128
net0          ip         disabled     --        --
   net0/v4    dhcp       disabled     --        ?
   net0/v6    addrconf   disabled     --        ::

root@server:~# ls /etc/ | grep host
hostid
hostname.vboxnet0
hosts

root@server:~# cat /etc/hosts
#
#Copyright 2009 Sun Microsystems, Inc. All rights reserved.
#Use is subject to license terms.
#
#Internet host table
#
::1 server localhost
127.0.0.1 server localhost loghost

root@server:~# cat /etc/hostid
# DO NOT EDIT
"_I__45heac"

root@server:~# cat /etc/hostname.vboxnet0
192.168.56.1

I am not really sure what to look for here. The only thing I noticed is that in /etc/ there is no "hostname.net0" file corresponding to hostname.vboxnet0, but I am not sure whether there should be such a file or not.
/var/adm/messages does not really say anything either.
Running /lib/svc/method/net-physical start directly from the prompt does not produce any output; it just sits there without going anywhere until I press Ctrl+C.

Any suggestions?
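
One observation on the listings above, offered as a starting point rather than a diagnosis: net0 is the only link that dladm does not report as up. A small sketch that pulls such links out of dladm output; the function name not_up is made up, and the sample is the dladm listing from this post:

```shell
#!/bin/sh
# Print every link whose STATE column is anything other than "up".
# Expects dladm link listing columns on stdin: LINK CLASS MTU STATE OVER
not_up() {
    awk 'NR > 1 && $4 != "up" { print $1 " (" $2 ", state " $4 ")" }'
}

# Sample input copied from the dladm output in this post.
sample='LINK                  CLASS   MTU   STATE    OVER
net0                  phys    1500  unknown  --
vboxnet0              phys    1500  up       --
transferzone/net0     vnic    1500  up       vboxnet0'

printf '%s\n' "$sample" | not_up
```

On the server the equivalent would be dladm show-link | not_up.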

I think you've hit the same problem that I started this thread about on 20-04-2016 so join the club!

I never got an answer to this and I haven't been able to solve it myself.

Do you think that it's the same problem?

I put the thread up in the hope that some big Solaris guns who are members on here (and who are Oracle employees) would push this inside Oracle but, so far, no luck.

So what I'm saying is that I don't think this problem is a result of something you've done. I still say that it's a bug.

---------- Post updated at 07:29 PM ---------- Previous update was at 06:53 PM ----------

Additional comment:

The obvious conclusion to draw is that the required network interface driver is not available (i.e. not on the media). But if that were the case, how come the network interface always works fine when running from the 'live' DVD but won't start after installation and booting from the HD? I still don't get it, I'm afraid. It just doesn't make sense. Even then, the problem occurs randomly and not on every boot. Sometimes it works just fine. It seems to have something to do with a DHCP server (or router) going away (rebooting) and then coming back: subsequent boots fail to start the network for some reason and screw the OS.

If you install Solaris 11.1 x86 and then do an over-the-web upgrade to Solaris 11.3 this problem never occurs. Weird.

I'm not sure if it is the same issue, but considering that the tips in your thread did not help me, I am starting to fear that it is.
Does that mean that my only way out is to do a fresh install of Solaris 11.1 instead?
And I who was so happy with getting everything working together so nicely, zones and all. :frowning:

Nothing I can try to get it working?

Have you tried to remove the software you installed before everything went south? Also have a look if this helps:

https://docs.oracle.com/cd/E19253-01/817-1985/ecdps/index.html

I did try to uninstall using make uninstall as instructed in the documentation of the software, but that unfortunately did not work.
Trying to troubleshoot the service according to your link did not work either.

Since the snapshots of my rpool predate my configuration of the zones and services, and given the possibility of an existing bug that makes my machine unreachable in 11.3, I have decided to do a clean install of 11.1 and set everything up from the beginning while my memory of it is fairly fresh. I made some config scripts this time when I installed everything, which will help me set things up fairly quickly as well.

Even if I manage to fix this with the help of you guys, I won't fully trust it not to happen again. And not least, I need this machine up and running and troubleshooting it takes too long.

Lesson learned: turn on Time Slider as the first thing you do, and be generous with snapshots before changing things.
It's just easy to forget those things when 11.1 was rock solid for several years. :stuck_out_tongue:
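
A minimal sketch of the "snapshot first" habit, assuming Solaris 11's beadm and zfs commands are available and that the root pool is called rpool as in this thread. The be_name helper is a made-up name, and the script only prints the commands so that nothing runs by accident:

```shell
#!/bin/sh
# Build a timestamped name such as "pre-apcupsd-20161126-123911".
be_name() {
    printf 'pre-%s-%s\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

name=$(be_name apcupsd)

# Only print the commands; run them by hand as root before the risky change.
echo "beadm create $name"           # clone the current boot environment
echo "zfs snapshot -r rpool@$name"  # recursive snapshot of the root pool
```

If the change goes wrong, beadm activate with the printed name followed by a reboot should bring back the earlier boot environment.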

Thanks anyways for trying to help me as much as you have.

@Zorken: Out of curiosity, can you please post the make/model/chipset of your network card/interface? This being a network issue, I've often wondered whether this problem might be chipset specific. Thanks.

For the record, I have the most problems with a built-in Broadcom NetXtreme BCM5721 Gigabit PCI Express chipset. If I want to avoid trouble, I avoid the current Solaris 11.3 Live DVD install media.

My network adapter is a built-in Realtek 8111 Gigabit LAN controller, residing on an ASUS M4A88T-M motherboard.

I installed Solaris from a USB stick using sol-11_3-live-x86.usb downloaded from oracle.com.

Thanks, that's useful to know. So it's not chipset specific then.