Unix crontab or Jenkins pipeline scheduling?

Hi, could you please help me decide which is the better option for scheduling a script on around 50 servers? The script runs every 5 minutes on all 50 servers and checks whether CPU usage is higher than a threshold (say 80%); if so, it then checks the status of the applications hosted on those servers. On average each server hosts around 5 applications. If any application has gone down, the script will try to restart it.
I would also like to schedule a similar script for high memory usage on all 50 servers.
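
To make the cron option concrete, here is a rough sketch of the kind of script I have in mind (the service names, paths, and the exact way of sampling CPU are just placeholders, not my real setup):

```shell
#!/bin/sh
# health_check.sh - hypothetical sketch of the check described above.
# Would be installed on each server and run from cron every 5 minutes:
#   */5 * * * * /usr/local/bin/health_check.sh --run >> /var/log/health_check.log 2>&1

THRESHOLD=80

# True (exit 0) when the given CPU usage percentage exceeds the threshold.
cpu_over_threshold() {
    [ "$1" -gt "$THRESHOLD" ]
}

# One way to sample CPU usage: 100 minus the idle column of vmstat's
# second (1-second) sample. Other tools (mpstat, sar) would also work.
current_cpu_usage() {
    idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')
    echo $((100 - idle))
}

# Placeholder service names; in practice each server has its own list.
check_and_restart_apps() {
    for svc in app1 app2 app3 app4 app5; do
        if ! systemctl is-active --quiet "$svc"; then
            echo "$(date): $svc is down, attempting restart"
            systemctl restart "$svc"
        fi
    done
}

# Only do real work when invoked with --run, so the functions above
# can be sourced or tested without touching any services.
if [ "${1:-}" = "--run" ]; then
    cpu=$(current_cpu_usage)
    if cpu_over_threshold "$cpu"; then
        echo "$(date): CPU at ${cpu}% (threshold ${THRESHOLD}%), checking applications"
        check_and_restart_apps
    fi
fi
```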

Which of the following tools is a good option for scheduling such a script?

  1. Unix Crontab.
    or
  2. Jenkins pipeline.

Looking forward to your suggestions. Thanks in advance.

Kind regards.

Hi,

I'm actually going to suggest something different here, and say that this sounds like it might be time for you to consider a monitoring system to track such things. There's a wide variety to choose from these days. Personally I have experience with Nagios, which can have event handlers attached to alerts so that it takes action when a particular event occurs (most classically, this is used to attempt to restart a service that is detected to have failed).

It sounds like this sort of thing is really what you're after: a system to detect when something has gone wrong, alert you when it does, and also attempt to take corrective action itself to deal with the alert. Again with regard to Nagios, you can set thresholds for things like memory usage, CPU usage, system load, and a variety of other things besides.

If there is some way you can track the availability of each application (e.g. an HTTP/HTTPS request to a given IP or URL to see if you get a 200 or a 500; or an external script or plugin you run to check if the application is working), then you could configure Nagios to directly monitor each application, and configure it to do whatever you wanted it to do if it detected a critical failure of that application.
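
To illustrate, a Nagios event handler is just an ordinary script that Nagios calls with the state macros; a minimal sketch (with a hypothetical service name `myapp`) might look like this:

```shell
#!/bin/sh
# Hypothetical sketch of a Nagios event handler that restarts a failed app.
# Nagios passes the current state macros on the command line, e.g. in the
# command definition:  restart-app.sh $SERVICESTATE$ $SERVICESTATETYPE$

# The actual restart command; "myapp" is a hypothetical service name.
RESTART_CMD=${RESTART_CMD:-"systemctl restart myapp"}

handle_event() {
    state=$1
    statetype=$2
    case "$state" in
    OK|WARNING)
        : # nothing to do while the service is up or merely degraded
        ;;
    CRITICAL)
        # Only act on a HARD state (retry checks exhausted) so a single
        # failed check doesn't trigger a restart storm.
        if [ "$statetype" = "HARD" ]; then
            echo "CRITICAL/HARD: attempting restart"
            $RESTART_CMD
        fi
        ;;
    esac
}

handle_event "${1:-OK}" "${2:-SOFT}"
```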

Just something to consider. Very often, people who go down the road of writing a script to handle one thing end up writing another script to handle another thing, and then another. Before you know it, they've basically re-invented the wheel and created their own monitoring system out of an ever-more-complex and fragile set of interlocking scripts. If they'd gone with a proper monitoring system right at the start, they could probably have avoided a lot of pain and ended up with the same result, or better.

Hope this helps!


+1 to everything that @drysdalk says.

I've been working with Xymon at ${DAY_JOB} and have it doing lots of monitoring and testing things, then alerting me when things are out of spec.

I wasn't aware that Nagios could actually take action on target systems to try to remediate things. -- @drysdalk, will you please confirm or correct me on Nagios's ability to do that?

@girish1428 one of the things that I think is surprisingly difficult to determine is what the threshold should be. Pure CPU is one thing, but you may never get anywhere near it if you're resource constrained elsewhere: memory, network, disk, etc. Also, determining the actual alert threshold for whatever you're checking can be somewhat odd. Where are the low water mark and the high water mark? Some things start to snowball and get bad faster and faster (super-linear load growth). Some things simply increase monotonically (linear load growth). Then there's the question of whether using 95% of available CPU is actually a bad thing, or whether it's fine that you're using most of, but not all of, the resources dedicated to the task at hand.

The venerable load average comes to mind, but that can be somewhat nebulous and behaves differently on different platforms.
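
As a rough illustration, one common rule of thumb is to judge the load average relative to the number of CPU cores rather than as an absolute number. A sketch of that on Linux, using /proc/loadavg and nproc (and the one-per-core rule is itself debatable, for all the reasons above):

```shell
#!/bin/sh
# Sketch: compare the 1-minute load average against the CPU core count.
# On Linux, /proc/loadavg holds the 1/5/15-minute averages; nproc gives cores.

# Pure comparison logic, separated out so it is easy to test. Load averages
# are fractional, so the comparison is done in awk rather than shell arithmetic.
# Returns 0 (i.e. "alert") when load >= cores.
load_exceeds() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load >= cores) }'
}

load=$(cut -d' ' -f1 /proc/loadavg)
cores=$(nproc)

if load_exceeds "$load" "$cores"; then
    echo "1-minute load $load is at or above core count $cores"
fi
```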

You seem to have a conditional test that I think I would not have. I would want to always check the status of the applications, independent of the load on the systems. -- Maybe it's a wording issue, but it sounds like you only check the applications if the load is over 80%.

Edit: I've inherited / taken over an Xymon instance at work which we use to monitor things like this. It easily checks many basic things once every five minutes and reports the status. Xymon also includes many additional tests that can be enabled (e.g. checking a web server URL and the reply status code). You can even create custom test extensions that can check anything you can check at a command line. We've got an extension that checks that there has been an Oracle RMAN backup for each database within the last day. Lots and LOTS of options. -- One thing that I've not yet messed with is having Xymon take any corrective action, but I suspect a custom test could perform some sort of corrective action if necessary.
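
For anyone curious, a Xymon custom test extension is essentially a script that picks a colour and reports a status line back to the server. A rough sketch, with a hypothetical `myapp` process check, and assuming the usual XYMON / XYMSRV / MACHINE environment variables the Xymon client sets (with fallbacks so the sketch can be run outside Xymon too):

```shell
#!/bin/sh
# Hypothetical sketch of a Xymon client extension. Xymon runs extension
# scripts with $XYMON, $XYMSRV and $MACHINE in the environment; the script
# decides on a colour and reports a status line for its own column.
XYMON=${XYMON:-echo}
XYMSRV=${XYMSRV:-localhost}
MACHINE=${MACHINE:-$(hostname)}

# Decide the status for a hypothetical process check: print the colour
# first, then a human-readable message.
app_status() {
    if pgrep -x "$1" >/dev/null 2>&1; then
        echo "green $1 is running"
    else
        echo "red $1 is NOT running"
        # A corrective action could be attempted here, e.g.:
        # systemctl restart "$1"
    fi
}

status=$(app_status myapp)
# Report to Xymon: "status <host>.<column> <colour> <date> <message>"
$XYMON "$XYMSRV" "status ${MACHINE}.myapp ${status%% *} $(date) ${status#* }"
```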


This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.