Bouncing Unix Servers

Hi there,

I was wondering if any of you good people out there could answer these questions:

A - why are Unix servers bounced once in a while in commercial environments?

B - in what circumstances are Unix servers bounced?

Many thanks for your time.

Kind regards
MH

Where I work now we have a scheduled monthly maintenance window. On a particular Saturday we get to do whatever we need to the boxes. Everybody that uses the systems knows this and nobody works that Saturday.

Our policy is any system which hasn't been rebooted for 90 days is rebooted. Oracle servers and some Veritas Cluster Server clusters are rebooted each month instead. There are a few production boxes exempted from the policy because they support manufacturing plants which have shifts working on Saturdays.
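A check like the 90-day rule above could be sketched roughly as follows. This is only an illustration: the function names and the MAX_DAYS variable are made up for the example, and the parsing assumes `uptime` prints a "N day(s)" or "N days" field, which varies between platforms.

```shell
#!/bin/sh
# Sketch: decide whether a host is due for a reboot under a 90-day policy.
# MAX_DAYS, uptime_days and reboot_due are illustrative names, not from any tool.

MAX_DAYS=90

# Read one line of `uptime` output on stdin and print the whole days of uptime.
# Prints 0 when there is no "day" field (host has been up less than 24 hours).
uptime_days() {
    awk '{ d = 0
           for (i = 1; i < NF; i++) if ($(i + 1) ~ /^day/) d = $i
           print d + 0 }'
}

# Succeeds (exit status 0) when the given number of days reaches the threshold.
reboot_due() {
    [ "$1" -ge "$MAX_DAYS" ]
}

# Example use on a live host:
#   days=$(uptime | uptime_days)
#   reboot_due "$days" && echo "$(hostname): up $days days, schedule a reboot"
```

Run from cron against a host list, something like this would only produce the list of candidates; the actual reboot would still happen by hand in the maintenance window.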

As for the reason why - it is basically just for cleanup purposes. Stale NFS handles, zombie processes, small memory leaks from applications, etc. are all cleaned up by the reboot. Also it is a way of testing to make sure everything runs smoothly with the startup and shutdown scripts so we don't find problems with them when there is an unscheduled reboot.
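To get a feel for how much of this debris has piled up before a reboot, a small counter for zombie processes might look like the sketch below. It assumes input lines of the form "state pid command" with Z marking a defunct process; the exact `ps` options that produce such a listing differ between Unix flavours, so the `ps` call in the comment is an assumption to check against your own man page.

```shell
#!/bin/sh
# Sketch: count defunct (zombie) processes from a process listing on stdin.
# Expects the process state in the first column, with Z meaning zombie.
# count_zombies is an illustrative name.

count_zombies() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Example use (column names are an assumption, they vary by platform):
#   ps -eo s,pid,comm | count_zombies
```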

I've worked other places that never rebooted machines unless it was for a specific purpose - hardware upgrades/repairs, software installs which require it, etc. It all depends on the environment and what the system admins prefer whether a scheduled reboot is appropriate or not.

Hi rhfrommn,

Many many thanks for your response.

Best regards

MH

Hi rhfrommn,

Hope all is well with you.

In your response to my question "Bouncing Servers", you mentioned that you also bounce the Oracle servers and some Veritas Cluster Server clusters.
May I ask what the command is for shutting down and starting up an Oracle server, let's say Oracle 8i, and whether you would use the same command for Oracle 11i and above?

Once again many thanks for your time.

Kind regards

MH

Have a look at an Oracle SQLPLUS manual for the shutdown command.
There are different guises of Oracle shutdowns depending on additional parameters.
But I am no Oracle Admin.
In cluster start/stop scripts that I wrote I used the
"shutdown immediate" command when logged on as SYSDBA to the instance.
But this was after getting confirmation from Oracle DBAs.
They should know better what their servers can cope with.
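For illustration, a stop helper along those lines might look like the sketch below. It assumes ORACLE_HOME and ORACLE_SID are already set for the instance and that the calling OS user is allowed to connect "/ as sysdba"; the function names are made up for the example. Which shutdown mode is safe for a given database is a question for the DBAs, and the SQL*Plus manual for your release remains the authority.

```shell
#!/bin/sh
# Sketch of a database stop helper for a cluster or rc script.
# Assumes ORACLE_HOME/ORACLE_SID are set and OS authentication as SYSDBA works.
# shutdown_sql and stop_oracle are illustrative names.

# Print the SQL*Plus commands for the requested shutdown mode
# (defaults to "immediate"); rejects anything that is not a known mode.
shutdown_sql() {
    mode=${1:-immediate}
    case "$mode" in
        normal|transactional|immediate|abort) ;;
        *) echo "unknown shutdown mode: $mode" >&2; return 1 ;;
    esac
    printf 'connect / as sysdba\nshutdown %s\nexit\n' "$mode"
}

# Feed those commands to SQL*Plus non-interactively.
stop_oracle() {
    sql=$(shutdown_sql "$1") || return 1
    printf '%s\n' "$sql" | "$ORACLE_HOME/bin/sqlplus" -s /nolog
}

# Example use:
#   stop_oracle immediate
```

Keeping the SQL generation separate from the SQL*Plus call makes it possible to print and review the commands before anything touches the instance.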

Hello buffoonix,

Thank you very much for your response.

Soon after posting my thread "Bouncing Oracle Servers" I had a look at the Oracle SQL*PLUS manual (Oracle 9i) on the web and found the answer for this particular version.

As you said, there are a number of ways to start up and shut down an Oracle server:
1- using SQL*PLUS
2- using RMAN
3- using Oracle Enterprise Manager.

Once again many thanks for your time.

Best regards

MH

Sorry for the delayed response, I was out of the office a couple days.

The way we do it is that the Oracle DBAs have written scripts placed in /etc/rc2.d on our Sun boxes to start up and shut down their databases. When I reboot the box during maintenance I just issue the command "init 6" to cause a reboot. That command runs all the shutdown scripts before the reboot occurs, so the DBA's script takes care of their databases. That way I don't have to log into Oracle at all. One of the DBA team will be in the office for maintenance as well so if there is a problem (for example one of the servers hangs on the shutdown script or the databases don't restart after the reboot) they can check it out.

Hello rhfrommn,

This is very kind of you.

As mentioned in my reply to buffoonix, I found some info about my question, but your response shed some more light on the issue.

If I understood this correctly, changing the run level of a Sun box that runs an Oracle database to init 0 shuts down the database, and changing the run level to init 6 to bring the box back up executes the script placed in /etc/rc2.d, which starts the database again. I hope I got it right!

On a different note, I was wondering if you would be happy to let me know your e-mail address, as I am having difficulty logging into my HP-UX (C200) workstation running the HP-UX 11.x operating system.

Well, once again many thanks for your time and I look forward to hearing from you.

Best regards

MH

Yes, when you change runlevels it runs the scripts in /etc/rcX.d as they go through that level. Scripts with a K in front of the name are run with a "stop" flag, and scripts with an S are run with a "start" flag. So the oracle DBAs will put a script in that handles their shutdown and startup tasks.
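A bare-bones version of such a script might look like the sketch below. Everything specific in it is a placeholder: the DB_START and DB_STOP commands stand in for whatever the DBAs really run, and the script would be linked under the /etc/rcX.d directories with an S name for startup and a K name for shutdown.

```shell
#!/bin/sh
# Sketch of an rc script that init calls with "start" or "stop".
# DB_START and DB_STOP are hypothetical placeholders for the DBAs' real commands.

DB_START=${DB_START:-"echo starting database"}
DB_STOP=${DB_STOP:-"echo stopping database"}

# Dispatch on the argument that init passes in.
rc_action() {
    case "$1" in
        start) $DB_START ;;
        stop)  $DB_STOP ;;
        *)     echo "usage: $0 {start|stop}" >&2; return 1 ;;
    esac
}

# In a real rc script the last line would simply be:
#   rc_action "$1"
```

The same file serves both purposes; only the name of the link (S or K prefix) decides which argument it receives.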

Using init 6 changes it from runlevel 3 (the normal running level) to 0 then back to 3 again. Runlevel 6 is a special "reboot" runlevel which the machine actually never stays at - telling it to go to 6 just makes it reboot by going to 0 then immediately to 3.

All the above info is for Solaris. I think most of it applies to HP-UX (at least the general ideas) but the specifics may be different. Unfortunately my only exposure to HP-UX was a 3 month contract with a few HP machines over 2 years ago, so I don't know much at all about HP-UX. Sorry I can't be of more specific help there.

Hello rhfrommn,

It is very good of you to explain it so clearly. As mentioned in my last post, I am having difficulty interacting with the IPL by pressing the "Esc" key soon after powering up my HP-UX (C200) workstation, which is running HP-UX 11.x. It is a shame that your HP-UX is not as good as your Solaris. I do not know how to deal with it as I am new to the Unix operating system!

Well, once again many thanks for all your kind responses.

Best wishes

MH

As far as the HP-UX boot process is concerned
I only have some experience with HP enterprise servers
but not HP workstations.
The server HW is usually equipped with a so-called
Guardian Service Processor (GSP), similar I guess to SUN's NVRAM.
Although I doubt that the PDC for an HP workstation incorporates the
same functionality as for HP servers, they might however behave similarly.
Through an attached terminal at the console port you can access the GSP
by pressing ^B (i.e. Ctrl+b key).
Usually, if Autoboot is enabled, an HP box will after the POST phase
(which can take up to a quarter of an hour, depending on how many gigs of RAM
are installed) display for 10 secs a screen where it asks you to press *any* key
to bypass the autoboot and possibly interact with ISL.
Having pressed any key you can e.g. search boot paths of attached bootable media.
Having entered "bo" you will be asked if you want to interact with ISL.
Just type "y" here and you will get into ISL.
If you go to docs.hp.com you can search for your HW and you will find docs
that describe the features of your box as well as docs that describe the HP
boot phases.
There should also exist manpages for hpux, pdc, isl on your box.

Hi buffoonix

Many thanks for your response. This is my dilemma. To start, I am not a system administrator but am teaching myself HP-UX admin tasks.

To get to the point, I have an HP-UX (C200) workstation running the HP-UX 11.x operating system. A couple of weeks ago, when practicing file access permissions, I set the permission on a directory (perhaps a file, I am not sure) to 0544, did some other practice (vi), and then turned the machine off (shutdown -yh 0).

When I powered up the machine the next day and tried to log in as root through CDE, I got this message:

"account locked in the commercial security database"

I tried to interrupt the boot process (to interact with ISL ) by hitting the Esc key soon after turning the power ON, but I could not interrupt the boot process.

I turn power ON, kernel gets loaded, CDE appears ready to login and when I login as root, I get the above message. I do not know what to do.

You and rhfrommn have been very kind in responding to my posts, and on that note I wondered if you could help me resolve this problem.

Many many thanks for your time.

MH

Hello,

I'm pretty much at the end of what I know here, but I have one last suggestion. I know in the FAQ section on this site they have instructions for how you can recover a system where you lost root access due to forgetting the password or the account getting corrupted or whatever. If you check the instructions for HP-UX maybe you could use those to boot into some kind of maintenance mode and fix the root account that way. Unfortunately, like I mentioned before, my HP specific knowledge is way too small to know HOW to do that part, but I would think there has to be a way to fix it once you access the machine in maintenance mode.

Good luck.
Ralph

Oh, one other thing. When I did have that short contract working on some HP-UX I found that HP had an absolutely fantastic support forum. By far the best vendor-specific one I've ever seen, way better than Sun's bigadmin site for example.

Here is the link to the start page. I think you can sign up for access even without a valid HP support contract.

http://www1.itrc.hp.com/service/index.do?admit=-682735245+1142960908614+28353475

I agree with what Ralph said.
The HP ITRC is a fantastic forum.
Even if you have an HP software support contract
you often get quicker responses from fellow sysadmins whose main platform
is HP-UX, most of the time containing a solution or hints that at least will
help you further.
But back to your problem.
I am afraid I have no experience with HP workstations.
Btw, have you looked here for a manual of your workstation model?
It should at least drop a line how you can access maintenance mode.
http://h20000.www2.hp.com/bizsupport/TechSupport/Product.jsp?prodTypeId=12454&prodCatId=296720&locale=en_US&contentType=SupportManual&docIndexId=179111
That's where you need to get in order to fix your locked root account.
Btw, a locked root account or lost root password is such a common issue
that I am convinced you will find a thread treating it in the aforementioned
HP ITRC forum.
The ITRC also has a great knowledge base that you can query with regard
to your problem (but I fear this is only accessible to support contract holders).
If all else fails you should at least be able to boot from a Core OS CD
which starts up an ASCII menu from where you can enter a root shell.
I also don't know the Commercial Security Database.
Somewhere in it must be a field that has a lock set for your root account.
Maybe you had entered too many times a wrong password or similar?
But I am convinced that the lock flag can be removed,
maybe even by moving the whole DB out of place or by providing an interim
empty one.
Please search docs.hp.com for your case.
Most of the HP documents are downloadable.
E.g. here are some manpages that may be relevant to you
http://docs.hp.com/en/B2355-60127/isl.1M.html
http://docs.hp.com/en/B2355-60127/hpux.1M.html
http://docs.hp.com/en/B2355-60127/boot.1M.html

Once you have regained access to your workstation, and if you have a streamer available,
I would strongly advise you to create a disaster recovery tape
with the make_tape_recovery command.
This is part of the freely available Ignite utility.
Search HP site for download and documentation.
Creating an Ignite tape is as easy as issuing one short command.
After successful creation you have a bootable recovery medium
where you either could reinstall the whole OS within half an hour
or where you could access a root shell should your root disks get broken.
HTH

Hi buffoonix & Ralph

Thanks again for the reply. Sorry for not getting back to you earlier. I will be looking on the HP ITRC and the links you provided me to see what I can find.

I did look into the manual for this model, but not much joy. To access the maintenance mode, one needs to be logged in, get to the prompt and then change the run level to the desired mode (init 0 and so on). The problem I am facing is the initial interaction with ISL, and to my mind it was caused by playing around with file permissions.

To give a better picture, this machine has 2 disks, vg00 (root disk) and vg01 (a practice disk), plus a floppy and a CD drive. I bought it on the Net from a "used HP" dealer 400 - 500 miles away from where I live, with no after-sale support. Once I have regained access I will look into how to create a disaster recovery tape with Ignite so that I am able to reinstall the whole OS. I do not know the term "streamer". What is it?

I appreciate that both of you are more at home with Sun than HP, and at the same time thanks again for all the hints and help. Let's hope I will soon be in a position to give you the good news that I am sorted.

Kind regards

MH

good god, UNIX=NO REBOOTS, what kind of data center is this....

"Stale NFS handles, zombie processes, small memory leaks from applications, etc. are all cleaned up by the reboot"

the root cause of these should be found, not band-aided by a reboot...

you should never ever reboot unix servers unless you change the kernel or specific upgrades require it...

Please let me know where to send my resume. I'd love to work at your datacenter where nothing ever breaks. :)

Unfortunately, I'm not in that situation. We have 7 admins responsible for a couple hundred servers and several hundred more workstations. We support well over 1000 users and have automounter maps that allow them to connect to several hundred project directories on a SAN with over 100 TB of storage. And we're responsible for EVERYTHING in the Unix and Storage environment from password resets to desktop linux support to system architecture to filling out purchase orders for new equipment.

In a perfect world I agree with you we'd be able to keep machines up constantly by fixing each problem as it happened. But with the thousands of mounts and unmounts that happen every day we get some stale file handles for example. There are plenty of other little problems that come up which really don't need to be solved immediately that the monthly or 90-day reboot clears up. There is absolutely no way we could spend the time having a system administrator track each of them down individually without double the people. And there is no need for us to do it - for over a decade the monthly maintenance policy has been in place and the business units and users we support agree with it. So we let the minor stuff I mentioned go and clean it up during maintenance by rebooting.

Also, we are in the medical industry so there are very strict regulations about reliability and disaster recovery. Many of our machines are required to be rebooted on a schedule to prove that they are configured properly and will come up correctly after an unplanned outage. For example, the Veritas clusters I mentioned rebooting monthly. Our DR policy requires that to prove the clusters are able to function properly in a failover situation where one system crashes. We actually have to sign and file documents verifying the status of each system after it comes back up. Thus it doesn't matter if we think they need it for a technical reason or not, a lot of those reboots are going to happen to satisfy the policies put on us by the regulatory department.

So I'd finish by pointing out my last paragraph of the original message. It all depends on the environment. Just as you said is the right way to do it, most places I've worked did not have scheduled reboots. However, due to specific factors in the environment I work in now we have to do it. You need to know your users, machines and environment well enough to know what reboot policy is best for your situation.