Tricky situation with process cpu usage - AIX

OS: AIX

so we frequently receive a lot of CPU-related alerts. all kinds of checks have been created to keep an eye on the CPU, but many of them make too much noise because the CPU is always seen as high. the system and application owners say there's no issue with the CPU.

so now i'm thinking of adding a new condition to the existing check. this condition will add up the CPU usage of all processes found in the process table. if the total is less than 50%, the check will not alert.

can someone please let me know if there's anything wrong with this thinking?

here's the command i'm using:

ps -e -o pcpu -o pid -o user -o args | sort -k 1 | tail -r 

if the total CPU usage of all processes is less than 50, does that mean the system is ok?

Good job. I don't see any problem in the design or the code either.

that doesn't seem right. the output from the ps command does not top

I expect the sum of all PCPU is related to the us column in

vmstat 2 2

(last line)
I say "related", because I don't know how AIX scales the PCPU (total? per CPU core? per logical processor?).
I also do not see how your command should sum up all PCPU: it only sorts and reverses the list, and nothing adds the numbers (a plain tail without -r would even show only the last 10 lines).
Then, for post-processing it is better to omit the header line

ps -e -o pcpu= -o pid= -o user= -o args=

And the floating point column is better numerically sorted with

sort -k1n
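
Putting the two suggestions together, the sum itself could be computed along these lines. This is only a sketch: sum_pcpu is a made-up helper name, and it assumes the first column of every input line is the pcpu value, as produced by the header-less ps call above.

```shell
# sum_pcpu: add up the first column (pcpu) of the lines read from stdin.
# Hypothetical helper, not part of any standard tool.
sum_pcpu() {
    awk '{ total += $1 } END { printf "%.1f\n", total }'
}

# Intended use on a live system (not run here):
#   ps -e -o pcpu= -o pid= -o user= -o args= | sum_pcpu
```

The result could then be compared against the 50 threshold from the first post, keeping in mind the open question of how AIX scales PCPU.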

man ps on AIX:

       C
            (-f, l, and -l flags) CPU utilization of process or thread, incremented each time the system clock ticks and the process or thread is found to be running. 
            The value is decayed by the scheduler by dividing it by 2 once per second. For the sched_other policy, CPU utilization is used in determining process 
            scheduling priority. Large values indicate a CPU intensive process and result in lower process priority whereas small values indicate an
            I/O intensive process and result in a more favorable priority.

man ps on AIX:

       %CPU
            (u and v flags) The percentage of time the process has used the CPU since the process started. The value is computed by dividing the 
            time the process uses the CPU by the elapsed time of the process. In a multi-processor environment, the value is further divided by the 
            number of available CPUs because several threads in the same process can run on different CPUs at the same time. (Because the 
            time base over which this data is computed varies, the sum of all %CPU fields can exceed 100%.)

I'm not sure if the above means anything to the experienced AIX users on here, but the second definition seems to suggest that the sum of the CPU usage of all processes can be used somehow.
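
The second definition can be made concrete with invented numbers: a process that has used 120 seconds of CPU time over 600 seconds of elapsed time on a 4-CPU system would show 120 / 600 / 4 = 5%. The helper name pct_cpu below is hypothetical, and the figures are made up for illustration.

```shell
# pct_cpu CPU_SECONDS ELAPSED_SECONDS NCPU
# Lifetime-average %CPU as described in the man page excerpt:
# CPU time divided by elapsed time, further divided by the number of CPUs.
pct_cpu() {
    awk -v cpu="$1" -v elapsed="$2" -v ncpu="$3" \
        'BEGIN { printf "%.1f\n", 100 * cpu / elapsed / ncpu }'
}

pct_cpu 120 600 4    # prints 5.0
```

Because the value is an average since the process started, a sum of these figures describes past usage over differing time spans rather than what the CPUs are doing right now.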

Ok, it is scaled, but they simply say CPUs.
Old man pages often say CPUs because the authors could not imagine that a CPU might have more than one core, not to mention hyper-threading...
A typical measurement is the vmstat, where (100 - id) gives the used CPU%.
Another typical measurement is the system load, as given by the uptime command. It is often r + b + w from the vmstat command, averaged over 1 minute, 5 minutes, and 15 minutes.
What do you currently measure?
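
The (100 - id) measurement could be scripted roughly as follows. This is a sketch under assumptions: cpu_busy is a hypothetical helper name, and it looks up the position of the id column in the header line instead of hard-coding it, because the vmstat column layout is not the same on every system.

```shell
# cpu_busy: read vmstat output on stdin and print 100 - id for the last sample.
# Hypothetical helper; assumes a header line with a column named "id".
cpu_busy() {
    awk '
        {
            # Remember the column whose heading is exactly "id".
            for (i = 1; i <= NF; i++)
                if ($i == "id") idcol = i
            # Remember the most recent line that starts with a number (a sample).
            if ($1 ~ /^[0-9]+$/) last = $0
        }
        END {
            if (!idcol || last == "") exit 1
            split(last, f, " ")
            print 100 - f[idcol]
        }'
}

# Intended use on a live system (not run here):
#   vmstat 2 2 | cpu_busy
```

Only the last sample is used, since the first line of vmstat output reports averages since boot.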

load is not being monitored, which is what i'm leaning towards. i just don't know how to figure out what load number signifies that a host is being stressed.

most of the AIX hosts i have here have multiple CPUs (at least 2). so does anyone have a proven calculation for determining whether a load number is too high for a particular host?
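
One common rule of thumb, offered here as an assumption and not as a proven calculation, is to divide the load average by the number of logical CPUs and to treat a sustained value well above 1 as a sign that work is queuing. In the sketch below load_per_cpu is a hypothetical helper, and the CPU count is passed in by hand because the best way to obtain it differs between systems.

```shell
# load_per_cpu NCPU: read one line of uptime output on stdin and print the
# 1-minute load average divided by NCPU (the number of logical CPUs).
# Hypothetical helper; assumes the usual "load average: a, b, c" wording.
load_per_cpu() {
    sed -n 's/.*load average[s]*: *\([0-9.]*\).*/\1/p' |
        awk -v ncpu="$1" '{ printf "%.2f\n", $1 / ncpu }'
}

# Intended use on a live system (not run here), for a host with 4 logical CPUs:
#   uptime | load_per_cpu 4
```

Any alert threshold on this ratio would still need to be tuned per host against what the application owners consider normal.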