Opinion on auto-restart of failed apps/services

I'm becoming a dying breed where I work. More and more sys admins are advocating automatically restarting failed services such as tomcat, jboss, etc. I've always been against doing this except with buggy apps that can't be fixed or avoided.

My main argument is that I feel it's a trick used by lazy sys admins who don't want to troubleshoot their apps. Almost everything we have that is customer-facing is behind a load balancer (we have a lot of customers). If the LB is properly configured, it will pull a node out of the rotation if it fails a health check. If the pool is sized properly it will have at least n+1 servers running and should be able to handle the load if one node dies or is removed. I feel we should let the app fail, alert on it, remove it from the pool, and troubleshoot it to find out why. Turn up a new node to take its place if necessary. If the bad app is auto-restarted and it is indeed bad, we will continue to route customers to it, and that could negatively affect them.
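To make the setup concrete, here is a rough sketch of the health-check side of it (purely illustrative; the /health path, port 8081, and the checks themselves are made up, not our real config). The point is that the LB probes each node, and anything other than a 200 pulls that node from rotation with no restart involved:

```python
# Minimal health-check endpoint a load balancer could probe (illustrative only).
from http.server import BaseHTTPRequestHandler, HTTPServer

def app_is_healthy():
    """Placeholder for real checks: DB connectivity, thread pool, disk, etc."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and app_is_healthy():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK\n")
        else:
            # Anything but 200: the LB marks this node down and routes around it.
            self.send_response(503)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8081), HealthHandler).serve_forever()
```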

They argue that "apps just fail" and that we should restart them asap to keep them up and servicing customers.

I'm starting to feel like the old geezer of the group and these damn kids won't get off my lawn. If you wouldn't mind, please let me know your take on this. I'm not looking for everyone to agree with me and I'm not against changing my views. They just haven't provided a good argument.

Thanks,

MG

Let me put it this way: What function does leaving them down perform for you? Does restarting them prevent you from debugging them?

Sysadmins push for maximal uptime. This is what they are paid for:

- system availability
- data security

Well, like I said, everything is in a pool behind a load balancer. An app that fails is a broken app. Maybe it was just a glitch and a restart would remedy it, but maybe it wasn't. Maybe the app on that host is actually broken. If you restart it, the load balancer will continue to send traffic to it and those customers will be affected. To me that's a huge negative.

jim mcnamara said "Sysadmins push for maximal uptime. This is what they are paid for". Where I work it's all about SLAs. If I'm sending customers to a malfunctioning node (or worse, sending the SLA monitor to a malfunctioning node), we take a hit on the SLA. Big no-no here. So, yes, uptime is what we're paid for, but uptime for the service, not for the individual service instances running behind it. Plus, what if you had an app that crashes once a day, but you auto-restart it quickly enough to still show 99.99% uptime (99.99% allows roughly 8.6 seconds of downtime per day, so a fast restart easily hides a daily crash)? Would you seriously consider that a success?

MG

I don't know why you quoted me; you didn't answer either question.

The hard bit can be detecting that the application has actually failed. Relying on the output of a single "ps" command is not safe, because a busy system may return a blank or incomplete response to it.
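A sketch of what I mean, assuming a tomcat-style service listening on 8080 (the process pattern, port, and retry counts are only examples): take several spaced samples and check the listening port as well, rather than acting on one "ps" snapshot.

```python
# Don't declare an app dead from a single "ps" sample; confirm over several checks.
import socket
import subprocess
import time

def process_present(pattern="tomcat"):
    # pgrep -f exits 0 if any process command line matches the pattern.
    return subprocess.run(["pgrep", "-f", pattern],
                          stdout=subprocess.DEVNULL).returncode == 0

def port_answers(host="127.0.0.1", port=8080, timeout=2):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def looks_failed(samples=3, pause=5):
    """Report failure only if several spaced samples all agree the app is gone."""
    for _ in range(samples):
        if process_present() and port_answers():
            return False
        time.sleep(pause)
    return True
```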

To paraphrase Corona688, there is no harm in installing a workaround while you find and repair the root cause ... or determine that the root cause cannot be repaired.

If, for example, you have a client-server application running on an unreliable network (like the Internet), there is a good case for configuring a client retry mechanism, backed by a carefully designed automatic client restart and a matching dead-session cleanup in the server.
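For instance, the client side of that could look roughly like this (the connect() callable, attempt limits, and delays are assumptions, not anything from a real product):

```python
# Client retry with exponential backoff and jitter for a flaky network.
import random
import time

def call_with_retry(connect, attempts=5, base_delay=1.0, max_delay=30.0):
    """Call connect(); on network errors, back off and retry before giving up."""
    for attempt in range(attempts):
        try:
            return connect()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries; the automatic restart / cleanup takes over
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))
```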

Uptime is gold.

I am a strong advocate of always having a watchdog process in place to watch all critical services and restart them if they go down.
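Something along these lines, as a bare-bones illustration (the service name, check command, and init script path are examples only; a real watchdog would also log and rate-limit restarts):

```python
# Toy watchdog loop: check each critical service and restart it if it is down.
import subprocess
import time

SERVICES = {
    # name: (check command, restart command)
    "tomcat": (["pgrep", "-f", "org.apache.catalina"],
               ["/etc/init.d/tomcat", "restart"]),
}

def watchdog(interval=30):
    while True:
        for name, (check, restart) in SERVICES.items():
            if subprocess.run(check, stdout=subprocess.DEVNULL).returncode != 0:
                print(f"{name} appears down, restarting")  # log before acting
                subprocess.run(restart)
        time.sleep(interval)

if __name__ == "__main__":
    watchdog()
```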

Debugging why processes fail is another topic and certainly should not be used as an excuse to shave uptime down.

I think I did answer the first one, but in a roundabout way. It's not that leaving it down performs some function; it's that leaving it down keeps it out of the pool. So if the app is bad, I don't start sending customers to it again.

The second question just got wrapped in with the first, but my answer applies here as well. No, it doesn't prevent me from debugging them. It just prevents me from returning a potentially bad server to the pool.

I agree with you on this. Where I work, however, "uptime" is measured as an SLA. So if I'm hosting a web service, we're not judged by the individual uptime of the hosts and services that make up a pool; we're judged by the availability of the service. So "Does the service respond with the proper data in < 1 sec" is far more important than "Did all the servers in the web service pool stay up this year".

If I restart apache (as an example) automatically, and there is something wrong with that instance that causes it to respond in 5 sec instead of <1, I'll have to answer for that.
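Which is why the check I care about looks more like this sketch than a process check (the URL is hypothetical; the 1-second budget is the SLA figure I mentioned): an instance that answers slowly fails it even though the process is "up".

```python
# SLA-style check: the node must answer correctly AND within the time budget.
import time
import urllib.request

def meets_sla(url="http://127.0.0.1:8080/", budget=1.0):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=budget) as resp:
            ok = resp.status == 200
    except OSError:  # connection refused, timeout, HTTP error, etc.
        return False
    return ok and (time.monotonic() - start) <= budget
```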

So my general philosophy is that you provide high availability with hot spares, load balancing, or some other type of redundancy. Not with auto-restarts. But it sounds like I may be alone in this.

Are most people being measured by uptime on individual hosts/apps as opposed to service availability?

If a server going down doesn't break anything, then I guess it's not quite as important. You never hinted at that until now, though; many things aren't in pools like that. Presumably some servers are more important than others, too -- you can't avoid having some sort of physical storage somewhere...

Most of the servers I administer are behind a load balancer just like yours, and they come out of the pool when they are acting up. However, what I tend to do (for tomcat applications, for example) is get a thread dump of what the application is doing at the time, grab all the logs, and record what processes are running on the system along with their memory/CPU usage. I then use all of this to debug it. If the host is a single point of failure, I go for the quick restart after collecting as much data as possible.
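Roughly this kind of grab-everything step before the restart (the paths, the PID argument, and the jstack/ps invocations are illustrative; adjust for your own layout):

```python
# Collect a JVM thread dump, a process snapshot, and the logs before restarting.
import subprocess
import time

def collect_evidence(pid, outdir="/var/tmp"):
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Thread dump of the running JVM (jstack must run as the same user as the JVM).
    with open(f"{outdir}/threaddump-{stamp}.txt", "w") as f:
        subprocess.run(["jstack", str(pid)], stdout=f)
    # Process table with memory/CPU usage at the moment of failure (Linux procps).
    with open(f"{outdir}/ps-{stamp}.txt", "w") as f:
        subprocess.run(["ps", "aux", "--sort=-%mem"], stdout=f)
    # Copy the application logs before a restart rotates or truncates anything.
    subprocess.run(["cp", "-a", "/opt/tomcat/logs", f"{outdir}/logs-{stamp}"])
```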

Our uptime is measured by service availability, but our SLAs also cover individual hosts and the services on those hosts. So I am required to respond to them even if the overall service is still functioning properly.