Determine threshold for CPU

I'm writing an application that should display whether a system is running "fine" (normal activity) or has reached a critical level, and indicate this through a graphical interface using a green-yellow-red color scheme. The server machines in question run AIX (it shouldn't differ much across UNIX systems, though it's important to note they use POWER). The solution will be applied both to single server machines with 100% (CPU) capacity and to clusters that allow utilization of more than 100%.

I'm well aware that thresholds like these are most commonly determined through a lot of trial & error and testing, but I would like to come to a conclusion as to which threshold would be most appropriate, with some facts to back it up.

Which leads me to the following question: how do I set these thresholds in a theoretical way? By thresholds I mean, for example, "should it turn red and alert with a critical warning at 90%? If so, why?", "Why not 85%?".
There are also possible spikes in CPU usage, so should it only indicate critical after, say, 2 minutes of usage above 85%?
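A sketch of the "sustained above threshold" idea I have in mind: only go critical after N consecutive samples above the limit. The sample values, the 85% limit, and the 4-sample requirement below are all placeholders, not measured data.

```shell
# Debounce sketch: require NEEDED consecutive over-limit samples before
# switching to RED. The cpu values are fake samples standing in for
# real readings (e.g. from vmstat every 30 seconds).
LIMIT=85        # percent CPU considered critical
NEEDED=4        # consecutive samples required (e.g. 4 x 30s = 2 minutes)

count=0
state=GREEN
for cpu in 40 90 92 88 91 95 60; do
    if [ "$cpu" -gt "$LIMIT" ]; then
        count=$((count + 1))
    else
        count=0
    fi
    if [ "$count" -ge "$NEEDED" ]; then
        state=RED    # once tripped, stay RED until an operator acknowledges
    fi
done
echo "$state"
```

A single spike never trips the alarm; only a sustained run of over-limit samples does.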

My main question is: are there any algorithms or past works that have done something similar? Any research papers or books that you know of? I've tried to research this a bit without much success; most of what I could find related to the x86 architecture rather than POWER. Even though the two architectures differ a bit, there are also many similarities, so some methods may work for both.

The truth is, only the admin responsible for the machine could tell you. It will depend on your knowledge of what runs on the machine and how, so the thresholds will necessarily differ from one machine to another. Furthermore, the contention will differ too: one box may be a memory hog while another is a CPU-intensive consumer because of laborious calculation algorithms.
A generalist threshold will only be valid on a generalist box.

There are other considerations too, such as whether your CPU allocation is fixed or variable. It might sound odd, but an LPAR can (as a configuration choice) use more CPU if it is available on the whole server and other LPARs are not fully using theirs (or some capacity is unallocated). You also need to know whether you have a share of processors or whole CPUs allocated. That can really skew the figures too.

You would need to better clarify what you have.

What output do you get from something like vmstat 5 3?

Robin

vmstat measures over a certain interval and then reports the average CPU usage for that interval.
That means your check must wait until the interval has finished.
For example

vmstat 5 2

The second value line is the average over the 5-second interval.
(The first value line is the average since the system was booted - not very useful.)
"Normal" thresholds for usr%/sys%/iowait% are 75/55/30 for warning and 90/70/40 for critical.
Another measurement is the load average (loadavg); this is the run-queue length. The run queue gets longer when the scheduler is too busy to run tasks on schedule.
The advantage of the loadavg is that the system provides the measurement intervals; there are even 3 of them: 1 minute, 5 minutes, 15 minutes.
The command-line tool for this is uptime.
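A sketch of pulling the load averages out of an uptime line and comparing the 5-minute value to a per-CPU limit. The sample line, the CPU count, and the limit of 4 runnable tasks per CPU are all assumptions for illustration.

```shell
# Extract the 1/5/15-minute load averages from an uptime-style line and
# rate the 5-minute value. Canned sample line; replace with $(uptime).
uptime_line="  02:15PM   up 12 days,  3:02,  2 users,  load average: 1.20, 3.50, 2.10"
ncpu=2
load5=$(echo "$uptime_line" | awk -F'load average: ' '{print $2}' | awk -F', ' '{print $2}')
# the shell cannot compare floats, so do the comparison in awk
state=$(awk -v l="$load5" -v n="$ncpu" 'BEGIN { print ((l / n > 4) ? "CRITICAL" : "OK") }')
echo "$load5 -> $state"
```

Dividing by the CPU count matters: a load of 3.50 is alarming on one CPU but unremarkable on eight.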
In the "infrastructure monitoring" sub-forum I have provided some Nagios-plugin-scripts that work on many platforms. Even if you do not have Nagios, you can see the commands in the code. Actually the check_load5.sh uses uptime and the check_cpu_stats.sh uses vmstat .

What is a machine? In "Openstack" terms - is the machine the host, or the virtual machine?

100% of what? On POWER virtualization - 100% of a processor, or of entitlement (which can get as high as 2000% - yes 2000! although 1000 is the more typical ridiculous number.)

Or are you looking at lcpu percentage: 25% lcpu could mean 100% of all the virtual processors when operating in single-threaded 'scheduling'.

The other thing to be aware of is that AIX stats are PURR-based (processor utilization resource register) - these are processor (hardware) counters, not time-based metrics. A program like vmstat might say 95% user plus 5% system, but that might represent only 1% of a physical processor (i.e. the physical usage was 1%, and of that 1%, 95% was user time).
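The arithmetic above can be made concrete. The physc (physical processors consumed) and ent (entitlement) figures below are made-up lparstat-style numbers, purely to show how a large user% can correspond to a tiny physical utilization.

```shell
# PURR-style reading: user% describes the share of *consumed* processor
# time, so scale it by physc/ent to get utilization of entitlement.
user_pct=95      # user% as reported by vmstat
physc=0.01       # physical processors actually consumed (made up)
ent=1.00         # entitled processor capacity (made up)
result=$(awk -v u="$user_pct" -v p="$physc" -v e="$ent" 'BEGIN {
    printf "physical user utilization: %.2f%% of entitlement", u * p / e
}')
echo "$result"
```

So a "95% user" reading here means less than 1% of the entitled capacity was spent on user work - which is why raw vmstat percentages can mislead on virtualized POWER.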

So, data alone can be very difficult to interpret. You will need advice from someone who knows the expected workload and the reasons behind the "virtual" sizing decisions.

A great ambition - but it is difficult to define the meaning of the variables. As in all things performance, there is a sauce called "it depends" that flavors the numbers you observe.