Administrator responsibilities, in case of power outage?

TECK · January 24, 2011, 1:27pm

Hi guys,

I was wondering if you could share some of your knowledge, in the event of a power outage.
Let's presume you are on duty and you get a call at midnight because half of your cabinets have no power, the air conditioning is down, and you are dealing with a ton of 500 error messages on your boxes.

What would you do in this situation? From my very limited experience, I would do this:
Make sure all vital boxes with sensitive data get hooked up to a UPS source ASAP, so they can be shut down gracefully. Once the power supply is restored, I would check each system for errors and restore any corrupted data from backup.

I would appreciate it if you could give me an example of how you would deal with this situation in a more appropriate manner. My goal is to find out what you would do before the power issues are solved. Thanks for sharing your experience.

citaylor · January 24, 2011, 2:31pm

UPSes are nearly always essential - even small ones can make the difference between a system shutting down gracefully and just turning off. (I've found in the past that if you calculate the downtime of the system and the cost of re-installing, including your own time spent doing that, then you tend to justify a UPS on nearly all equipment.)
Transactional filesystems can improve things when hardware has an abrupt power failure, but you can't rely on that. I have also found that network equipment is often forgotten when spec'ing up UPSes - if the network has just been turned off, systems that depend on services such as DNS and network shared filesystems can fail to shut down in a timely manner. Make sure that systems with databases have large UPSes, as they can take a while to sync their disks and stop. I found that Active Directory and Exchange servers can take ages and ages to stop, so they can need long-running UPSes. With machines that host virtual machines, you can often get the virtual machines to "suspend" instead of shutting down - this can make the overall shutdown of the host system quicker. My last tip is to get the UPSes to check their batteries regularly - I've too often found UPSes with batteries that have degraded to the point that they are useless.
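As a rough illustration of sizing a UPS against shutdown times, here is a minimal Python sketch. The host names, wattages, shutdown times and the 0.8 efficiency factor are all made-up assumptions, and the estimate ignores battery ageing, which is exactly why regular battery checks matter. Note that the switch is counted in the load, since it has to stay up while the other boxes stop.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    load_watts: float        # typical power draw (assumed to be measured)
    shutdown_seconds: float  # worst clean-shutdown time you have observed


def runtime_seconds(battery_wh, total_load_watts, efficiency=0.8):
    """Very rough runtime: usable watt-hours divided by the load.

    The efficiency factor is an assumption; real runtime also drops
    as the batteries age.
    """
    if total_load_watts <= 0:
        raise ValueError("load must be positive")
    return battery_wh * efficiency / total_load_watts * 3600


def shutdown_margin(hosts, battery_wh, efficiency=0.8):
    """Seconds of runtime left after the slowest host has stopped.

    A negative result means the UPS is too small for a graceful shutdown.
    """
    total_load = sum(h.load_watts for h in hosts)
    slowest = max(h.shutdown_seconds for h in hosts)
    return runtime_seconds(battery_wh, total_load, efficiency) - slowest


if __name__ == "__main__":
    # Hypothetical cabinet: a database server plus the switch it depends on.
    cabinet = [
        Host("db1", load_watts=400, shutdown_seconds=600),
        Host("switch1", load_watts=50, shutdown_seconds=0),
    ]
    print(f"margin: {shutdown_margin(cabinet, battery_wh=500):.0f} s")
```

If the margin comes out small or negative, that is the argument for a bigger UPS or for suspending virtual machines instead of shutting them down.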
I generally feel that if I am at the point of restoring a system image, then I have failed in my emergency measures, so although that is obviously the most important backup measure, I would try to make sure you never have to use it.

I hope some of these points help in your UPS decisions...

Corona688 · January 24, 2011, 2:34pm

UPSes with degraded batteries can be worse than useless; they might forget their state and stay off after an extended power outage is fixed! I had to drive 250km to swap one stupid box over that once...

TECK · January 24, 2011, 4:47pm

Once the electricity issues are dealt with, what would you do next? Suppose you reboot several boxes and their services simply refuse to start properly, hit deadlocks, etc. I'm also trying to find out how I should deal with a situation where several essential boxes cannot be started for whatever reason.

I presume I could investigate why the services don't start, beginning with a disk check and ending with data integrity (i.e. service reinstall, database restore, etc.)?

mark54g · January 24, 2011, 4:49pm

I would use this as a reason, for management's awareness, to get every vital system on a UPS and a regimented backup and recovery process.

TECK · January 24, 2011, 4:50pm

But you will still be stuck fixing the issues at midnight... while your boss is sleeping like a baby.

citaylor · January 24, 2011, 5:04pm

Well, I guess that electrical issues would be the biggest. If the boxes shut down gracefully, there is no obvious reason why they wouldn't then boot again gracefully.

First I guess you'd have to identify which boxes are down (boxes on UPS may have survived, or some may be down depending on the UPS battery time). I would recommend keeping a list of boxes to check whether they are up or down. An SNMP system may be useful to check on hosts, network appliances and services.
Then, once you have identified which services need to be started, you need to identify in what order (network switches, DNS, DHCP, SAN/NAS boxes, Active Directory, file servers, etc.). Make sure they booted OK before you move on to secondary services. Create a document detailing how you would test these services to make sure they are working, and the definitive order in which to boot them. Once they are up, list the secondary services you would need to reboot and how to test that they are working.
On UNIX hosts check /var/log/messages (or the appropriate syslog entries); on Windows check Event Viewer to confirm that everything is running OK. To be honest, you can't really second-guess why services may be down, so it is hard to preempt that. You should make sure you have all the necessary documentation, including error messages, for all the services you are trying to run, so that in an emergency you can find it quickly.
You could build a plan for what you would do in the event that a piece (or multiple pieces) of hardware has failed, e.g. spare hardware, restore documentation, etc. Keep a telephone list of people that may be called upon to fix hardware or software services in an emergency. Keep a list of hardware serial numbers, contracts, SLAs and emergency callout telephone numbers for hardware and software vendors, so that you can call them in an emergency to get things fixed.
Virtual machines are very useful, as you can have 2 or more host machines with standby virtual images containing up-to-date backups that can be started in the event that a given piece of hardware has died. VMware, for example, allows you to create pools of virtual machine hosts that can take over functionality easily and quickly should one fail.
Otherwise I would get a book on the subject or search the web for the subject as a whole, as I'm sure there are major areas I've missed. I hope this helps...
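The "check the primary services in order before moving on" idea could be sketched roughly like this in Python. The host names, ports and tier layout are hypothetical placeholders, and a plain TCP connect is only a crude stand-in for a real service test (or for SNMP monitoring).

```python
import socket

# Hypothetical boot tiers, in the order they must come up.
# Replace the names and ports with your own documented list.
BOOT_TIERS = [
    ("network", [("core-switch.example.internal", 22)]),
    ("dns", [("ns1.example.internal", 53)]),
    ("storage", [("nas1.example.internal", 2049)]),
    ("directory", [("dc1.example.internal", 389)]),
]


def is_up(host, port, timeout=2.0):
    """Crude check: can we open a TCP connection to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_blocked_tier(tiers, probe=is_up):
    """Return (tier_name, down_services) for the first tier that has
    anything down, or None if every tier answers.

    Later tiers are not checked, since they depend on the earlier ones.
    """
    for name, services in tiers:
        down = [(host, port) for host, port in services if not probe(host, port)]
        if down:
            return name, down
    return None


if __name__ == "__main__":
    blocked = first_blocked_tier(BOOT_TIERS)
    if blocked is None:
        print("all primary tiers answer; move on to secondary services")
    else:
        tier, down = blocked
        print(f"fix tier '{tier}' first: {down}")
```

Stopping at the first broken tier keeps you from chasing errors on file servers that are only failing because DNS or the switch is still down.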

Corona688 · January 24, 2011, 5:07pm

Essentially, you're trying to predict the future; it's difficult to be comprehensive about what might go wrong.

Maybe you had hardware failures, maybe you didn't. Maybe some system services left clutter that needs to be cleaned out before they start, maybe they didn't. Maybe someone put in settings that didn't get saved into any configuration file, maybe they didn't. Maybe you had disk corruption as a result of unclean shutdown, maybe you didn't. Maybe the RAIDs can't find all their disks for some reason and won't go. Or not. We know nothing about your systems or the software and data therein. At this point all we can do is guess, and know that whatever we say is 99% likely to be wrong.

citaylor · January 24, 2011, 5:07pm

Ah, also you need an escalation plan in case you fail to meet your customers' Service Level Agreements, or someone fails to meet yours. For example, who do you phone if a hardware vendor doesn't come and fix the hardware, or if you fail to get a service running... your boss, his boss? A list of the consequences and priorities of the services that hosts provide can help you plan what to do first, and what to escalate.
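An escalation plan can be as simple as a table of "after this many minutes, phone this person". A minimal Python sketch, with invented contacts and thresholds that you would replace with your own SLA terms:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # outage duration at which this contact is phoned
    contact: str


# Hypothetical plan for one service; the times are illustrative only.
MAIL_SERVICE_PLAN = [
    EscalationStep(0, "on-call admin"),
    EscalationStep(30, "hardware vendor callout"),
    EscalationStep(60, "your boss"),
    EscalationStep(120, "his boss"),
]


def contacts_due(steps, minutes_down):
    """Everyone who should have been phoned by now, earliest step first."""
    ordered = sorted(steps, key=lambda s: s.after_minutes)
    return [s.contact for s in ordered if minutes_down >= s.after_minutes]


if __name__ == "__main__":
    print(contacts_due(MAIL_SERVICE_PLAN, minutes_down=45))
```

Keeping one such table per service, ordered by priority, gives you something to follow at midnight without having to think.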

mark54g · January 25, 2011, 9:57am

Another thing to consider during a power outage is that you may lose cooling, even if you are on UPS. You should also consider having a thermal sensor that can send a graceful shutdown to your systems.
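A rough Python sketch of the thermal-sensor idea: only trigger a shutdown once several readings in a row are over the limit, so a single bad sample does not take the room down. The 35 degree limit and the count of 3 are arbitrary assumptions, how you read the sensor depends on your hardware, and the shutdown command shown assumes a Linux host and root privileges.

```python
import subprocess


def should_shut_down(readings_c, limit_c=35.0, consecutive=3):
    """True if the last `consecutive` readings are all at or above limit_c.

    readings_c is a list of temperatures in Celsius, oldest first.
    """
    recent = readings_c[-consecutive:]
    return len(recent) == consecutive and all(t >= limit_c for t in recent)


def trigger_shutdown():
    """Ask the local machine to halt in one minute.

    Assumes a Linux host with the standard `shutdown` command and
    root privileges; not called automatically in this sketch.
    """
    subprocess.run(["shutdown", "-h", "+1"], check=True)


if __name__ == "__main__":
    # Hypothetical readings from a room sensor, oldest first.
    history = [27.5, 31.0, 35.5, 36.0, 37.2]
    if should_shut_down(history):
        print("room too hot: would call trigger_shutdown() here")
```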

Depending on the needs of your organization and its expected SLA and uptime requirements, you should consider redundant UPS systems on different power phases, so that losing an entire phase of power does not take the system down.
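If you keep an inventory of which phase feeds each power supply, a check like the following Python sketch can flag boxes that would go dark when a single phase is lost. The host names and phase labels are invented for illustration.

```python
def single_phase_hosts(feeds):
    """Return the hosts whose power supplies all sit on the same phase.

    feeds maps a host name to the list of phase labels feeding its
    power supplies, e.g. {"db1": ["L1", "L2"]}.
    """
    return sorted(host for host, phases in feeds.items() if len(set(phases)) < 2)


if __name__ == "__main__":
    # Hypothetical inventory.
    inventory = {
        "db1": ["L1", "L2"],   # two supplies on different phases
        "web1": ["L1", "L1"],  # two supplies, same phase
        "nas1": ["L3"],        # single supply
    }
    print(single_phase_hosts(inventory))
```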

Proper backups should be taken (and to make them proper, a copy of the data, in a recoverable format, should be kept off site).