Timeout procedure for using too much memory or CPU

How hard is it to create some kind of timeout procedure for using too much memory or CPU on a Linux/Unix server? What would you have to do to set this up?

I don't know what you mean by a "timeout procedure". One can use ulimit to constrain memory and CPU usage per process or per user, but those limits don't set a timeout: they kill processes that exceed the established CPU limit, and they make attempts to grow a process (for example via fork() or malloc()) fail once the user or process reaches its memory allocation limit.
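For example, from the shell that will start the job (a rough sketch; the numbers are placeholders, and on Linux the value for ulimit -v is in kilobytes):

# at most one hour of CPU time per process started from this shell
ulimit -t 3600

# at most about 2 GB of virtual memory per process (value in KB)
ulimit -v 2097152

A process that exceeds the CPU limit is sent SIGXCPU and eventually killed; one that hits the memory limit just sees its allocations fail.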

This happens WAY too often. I understand that you may need full resources if you are doing something resource-heavy, but leaving something running for over 3 days is ABSOLUTELY ridiculous. If you can't get what you're doing done in an hour or so, then you should be running it on a personal server. Hundreds of people need to use that server, and as you can tell from the screenshot it is an old server with limited resources. Can you think of a way to stop this after about an hour? In this case the person was using 99.9% of the server.

And if no one else is using the system, why shouldn't one user get to use 99.9% of it? If other users should get higher priority, then nice the long-running processes: they can still run all day, but other processes will get preferential treatment whenever they need the CPU.
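For example (the script name and PID are placeholders):

# start the long job at the lowest scheduling priority
nice -n 19 ./long_running_job.sh &

# or lower the priority of a job that is already running (here PID 12345)
renice -n 19 -p 12345

A niced job still gets the whole machine when it is otherwise idle, but yields as soon as interactive users need the CPU.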

Or, if you think these long-running processes should be run on a different server, then run them on a different server.

With virtual machines, you could set up resource limits for each virtual machine running on your physical hardware, but you probably aren't going to install virtualization on old hardware.

No one else was able to log on to the server for three days.

Having a process run for more than 3 days is perfectly normal.

Being unable to log in for 3 days is a completely different issue, and is not even close to normal.

What's your OS?
If you have a recent Linux kernel with cgroups you can try

echo '1' > /proc/sys/kernel/sched_autogroup_enabled

Permanent entry in /etc/sysctl.conf

kernel.sched_autogroup_enabled = 1

This should better balance the CPU usage between different users. (Though I have no practical experience with it.)
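If autogrouping is not enough, you could also put the long-running jobs into their own cgroup with a reduced CPU weight. A minimal sketch, assuming cgroups v1 with the cpu controller mounted at /sys/fs/cgroup/cpu (the group name and PID are placeholders):

mkdir /sys/fs/cgroup/cpu/longjobs
# a quarter of the default weight of 1024
echo 256 > /sys/fs/cgroup/cpu/longjobs/cpu.shares
# move the offending process (PID 12345) into the group
echo 12345 > /sys/fs/cgroup/cpu/longjobs/tasks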
--
You can write a script that reads a config file with "procname:cputime" tuples, for example

x:30
y:1440

That means process "x" may run 30 CPU minutes and process "y" may run 1 CPU day.
The script can check every 10 minutes if any of the running processes exceed these limits, warn the user, and finally kill them (the processes :D).
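A rough, untested sketch of such a script, assuming a config file /etc/cpulimits.conf with "procname:cpuminutes" lines and a procps ps that supports the cputimes output field (consumed CPU time in seconds); for brevity it kills right away instead of warning first:

#!/bin/sh
CONF=/etc/cpulimits.conf

while IFS=: read -r name limit_min
do
    [ -z "$name" ] && continue
    limit_sec=$((limit_min * 60))
    # every running instance of $name: PID, owner, consumed CPU seconds
    ps -C "$name" -o pid=,user=,cputimes= |
    while read -r pid user cpusec
    do
        if [ "$cpusec" -gt "$limit_sec" ]
        then
            echo "$name (pid $pid) exceeded $limit_min CPU minutes, killing it" | write "$user"
            kill "$pid"
        fi
    done
done < "$CONF"

Run it from root's crontab every 10 minutes.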

This is an interesting idea, but couldn't that easily be circumvented by renaming the process?

How about this: you create a queueing system which starts the processes. Set the ulimits for users to values low enough that they have to use the queueing system and cannot start their jobs directly. The queueing system can be configured by a definition file like the one MadeInGermany mentioned: you set parameters like RAM usage, CPU usage, etc., on which the queueing system decides whether a job has to be canceled or not.
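The ulimit side could be done via /etc/security/limits.conf, assuming pam_limits is applied at login (the group @users and the queue account qbatch are placeholders):

# interactive users: hard limit of 60 CPU minutes per process
@users    hard    cpu    60
# the account the queueing system runs under is not restricted
qbatch    hard    cpu    unlimited

Anything bigger than an hour of CPU then has to go through the queue.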

A similar idea was the start of the VQS (Vienna Queueing System) back in the late eighties. It was designed to run very big, massively parallel jobs on a large cluster of IBM RS/6000 systems running AIX. The idea was to make big jobs possible but to abort them after a relatively short time so that the system was free for other jobs. The short time the jobs ran was enough to test and refine them, so that you needed the long-running job classes only for the final run.

I hope this helps.

bakunin

Other examples of queueing systems, in order of declining complexity/price:
LSF, Sun/Oracle GridEngine, the batch command.
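The batch command (part of the at package) is the simplest of these: it holds jobs until the load average drops below a threshold. For example (the path is a placeholder):

echo "/home/alice/bigjob.sh" | batch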
But my point of view is that users are, at heart, friendly. They exhaust the system by mistake, because they don't know better. Unless someone teaches them.