What happens if CPU utilization gets close to 100% in AIX 6.1?

Hi all,
We have a setup where our application runs on 2 AIX servers (AIX 6.1, 16 CPUs, P5 570 boxes). These boxes act as disaster recovery servers for each other, i.e. if one box fails, the whole load will run off the other box.
Average CPU utilization on each box is between 30% and 40%, with a maximum CPU utilization of around 65% on each box.
A question has been raised as to whether one box has enough processing capacity to run the whole load, and what the impact would be if CPU utilization reached close to 100% (assuming memory and all other parameters are not a problem).
What is the best way to estimate the average and worst-case impact on application performance if the whole load is run off one box?
What we have done so far:

  • Taken CPU utilization at one-minute intervals from both servers.
  • Arithmetically added the two utilizations to arrive at a theoretical CPU utilization that can go beyond 100%.
  • Estimated how long CPU utilization would stay above 100% in a day.
  • Arrived at a performance-hit estimate based on the above two factors.

I would like to check with the experts here: is this the right approach, and what other factors should be taken into consideration?
    Thanks
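The summation approach described in the list above can be sketched in a few lines. The sample figures below are invented for illustration; in practice the per-minute numbers would come from your `sar` or `vmstat` logs.

```python
# Sketch of the estimation approach described above: sum per-minute CPU
# utilization from both servers and see how often the combined load would
# exceed one box's capacity. The sample data here is hypothetical.

box_a = [35, 40, 62, 55, 30]   # % CPU per minute, server A (made-up numbers)
box_b = [38, 45, 50, 41, 33]   # % CPU per minute, server B (made-up numbers)

combined = [a + b for a, b in zip(box_a, box_b)]   # theoretical load on one box
over = [u for u in combined if u > 100]            # minutes above capacity

print(combined)
print(len(over), "minutes over 100%")
print(max(combined), "% worst case")
```

This gives the two factors mentioned above (time spent over 100% and the worst-case peak), but as the replies below point out, it says nothing yet about what that overload does to response times.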

So, is this IBM HA configured as a two-node Active-Active cluster?

All that is likely to happen from the OS point of view is that you will be CPU bound and processing will slow a little. The end-user symptoms you see, though, may be worse. Does your application have a time-out in it? If the processing does not complete quite as quickly, will that be a problem? Is this a time-critical application, e.g. financial transactions where the sequence matters, or real-time trading where milliseconds count?

The worry is a time-out occurring while a background query is still processing. What you tend to get then is users re-submitting. Eventually you will be flooded with user requests and the CPU will never go idle for the rest of the day.

You need to answer these questions for yourself to see if you have adequate provision.

Robin

Thanks Robin for your response.
Yes, the 2 boxes are configured as a GOVLAN cluster. Application time-outs are not an issue for this scenario. Is there a way to quantify how slow the processing would be?

There is more to running an application than simply counting CPU ticks. I don't know a "GOVLAN" cluster (to be honest, I have never heard of this cluster product), but the reason one uses a cluster is usually not load-sharing:

If you have a 2-way cluster, you get additional availability. Even if one system (or some component of one system) breaks, the whole still works. Shutting down one system creates the risk of the service (i.e. the application) not being available for some time. Assess this risk in terms of cost: how much will it cost to have the application unavailable for, say, 24 hours (a normal response time for IBM)? If that is too much, how much will the premium service cost to make IBM respond within less time? Calculate all these numbers carefully (or have someone calculate them) and you have something to weigh against the cost of the additional hardware.

CPU utilisation is the least problematic of these, performance-wise. Typically performance problems come in one of three types: a system is CPU-bound, memory-bound or I/O-bound. Memory-bound systems start swapping, and this is usually a killer. I/O-bound systems slow down dramatically too, because I/O has the lowest bandwidth (compared to memory and CPU) to begin with. CPU-bound systems only get a bit slower, and this might not be a problem as long as the application is not time-critical.
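To put a rough number on "a bit slower", one common approximation (not from this thread, just a standard queueing rule of thumb) is the M/M/1 response-time stretch R = S / (1 - U): as utilization U approaches 1, response time grows without bound.

```python
# Rough M/M/1 queueing estimate of response-time stretch at a given CPU
# utilization: R = S / (1 - U), where S is the unloaded service time.
# This is a textbook approximation, not an AIX-specific formula, and it
# assumes a single queue with random arrivals.

def stretch_factor(utilization):
    """Return how many times slower work completes at this utilization."""
    if utilization >= 1.0:
        raise ValueError("at or above saturation the queue grows without bound")
    return 1.0 / (1.0 - utilization)

for u in (0.40, 0.65, 0.80, 0.95):
    print(f"{u:.0%} busy -> {stretch_factor(u):.1f}x the idle-system response time")
```

The exact numbers depend heavily on the workload, but the shape of the curve is the point: the slowdown is mild up to moderate utilization and explodes near 100%.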

A good idea is to set up some long-term monitoring to measure CPU (and some other resources) utilisation statistics. If you have such statistics for several days/weeks/months you can calculate all sorts of trends and better estimate the time when CPU saturation will happen.
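A minimal sketch of that trend calculation, using an ordinary least-squares fit over daily utilization averages (the figures below are invented for illustration):

```python
# Fit a straight line to daily average CPU utilization and extrapolate the
# day on which it would reach 100%. The sample figures are hypothetical;
# real input would come from long-term monitoring data.
from statistics import mean

days = [0, 7, 14, 21, 28]              # day index of each sample
util = [33.0, 34.5, 36.0, 37.0, 39.0]  # daily average CPU %, made up

# ordinary least-squares slope and intercept
xbar, ybar = mean(days), mean(util)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(days, util)) / \
        sum((x - xbar) ** 2 for x in days)
intercept = ybar - slope * xbar

days_to_saturation = (100.0 - intercept) / slope
print(f"growth ~{slope:.2f}%/day; 100% reached around day {days_to_saturation:.0f}")
```

A linear fit is only a first approximation; workload growth is often seasonal or step-shaped, which is exactly why collecting weeks or months of data first matters.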

Last thing: if your CPU utilization is low, why don't you take some CPUs away from the LPARs in question and assign them to other systems? You could assign them back at any time if you face CPU saturation at some point.

I hope this helps.

bakunin

cpu usage is a machine characteristic - comparable to rpm - and 100% utilization is like being at the end of the red zone. Other than telling us it is high, it tells us nothing about the performance of the vehicle - the way mpg (miles per gallon) might.

In other words, study how linear your performance is in application terms compared to machine terms, and you will have the best approximation of an answer to your question.

hope this helps.

if peak utilization on 1 box has already gone as high as 65%, your backup setup will only be fine while the total utilization from both boxes' loads on 1 server is less than 95%* ... once your total utilization reaches 100%, your users will definitely see performance hits ...

my quick rule of thumb here -- if i am not allowed to test 1 server to host both servers' daily loads and i do not have access to metrics -- would be to see what is the maximum total utilization of each server and add them together ... if below 80%, the "backup" server should be able to last long enough for the downed server to be fully recovered without the users seeing any performance issues ... if above 80%, there is a higher risk that users will see the performance hits long enough to complain that i do not know what i am doing while i am actually doing everything to recover the downed server ...

but just like everything we do, always take into account your computing environment and your users' job functions ... a performance hit on a development server is not as critical as a performance hit on an application server handling billions of dollars worth of financial transactions a day ... if systems are hyper-critical, always have a 3rd box handy and ready to go ...

*the actual threshold may be higher but i try to err on the side of caution ...
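The rule of thumb above can be written down as a tiny check. The 80% threshold is the poster's own cautious figure, and the peak values used here are the ones quoted earlier in the thread:

```python
# Rule of thumb as code: add each server's peak utilization and compare
# against a cautious threshold (80% here, per the post above).

def failover_risk(peak_a, peak_b, threshold=80.0):
    """Return the combined peak and a rough verdict on failover risk."""
    total = peak_a + peak_b
    if total <= threshold:
        return total, "low risk: one box should carry both loads"
    return total, "higher risk: expect visible performance hits during failover"

# with both boxes peaking at 65%, as described in the original question:
total, verdict = failover_risk(65.0, 65.0)
print(f"combined peak {total:.0f}% -> {verdict}")
```

For the setup in the original question (65% peak on each box), the combined peak of 130% lands well past the threshold, which matches the cautious advice above.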

Just Ice - good points.

But on a virtualized system, at least on POWER, user/sys/idle/wait are all relative to the "pc" and/or "ec" columns when using shared processors.

On a system with dedicated processor(s) (this one is idle), the values you see for CPU consumption can be used in the "normal" way.

System configuration: lcpu=1 mem=9216MB

 kthr          memory                         page                       faults           cpu    
------- --------------------- ------------------------------------ ------------------ -----------
  r   b        avm        fre    re    pi    po    fr     sr    cy    in     sy    cs us sy id wa
  6   0     773317    1494583     0     0     0     0      0     0     0    701   195  1  1 98  0
  7   0     773324    1494576     0     0     0     0      0     0     0   1004   186  0  2 98  0
  6   0     773324    1494576     0     0     0     0      0     0     0    629   197  0  2 98  0

However, when using shared processors, the percentages are relative to your (summed) entitlement, which is what vmstat is showing (use mpstat or sar for a per-logical-processor breakdown, with a total at the end). If your summed consumption is less than your entitlement, you have at least the rest of your entitlement available for additional processing.
While consumption is below entitlement AND user+sys is near 95% or higher, what this says is that WHEN active, the processor is doing "user or sys" work; the "idle" time is being given back to the hypervisor for other activities.

$ lparstat 5 2

System configuration: type=Shared mode=Uncapped smt=On lcpu=2 mem=1024MB psize=1 ent=0.20 

%user  %sys  %wait  %idle physc %entc  lbusy   app  vcsw phint
----- ----- ------ ------ ----- ----- ------   --- ----- -----
  0.2   0.8    0.0   98.9  0.00   1.9    1.0  1.00   310     0 
  0.5   0.8    0.0   98.8  0.00   2.1    0.0  1.00   329     0 
$ vmstat -w 5 2

System configuration: lcpu=2 mem=1024MB ent=0.20

 kthr          memory                         page                       faults                 cpu          
------- --------------------- ------------------------------------ ------------------ -----------------------
  r   b        avm        fre    re    pi    po    fr     sr    cy    in     sy    cs us sy id wa    pc    ec
  3   0     241787       3751     0     0     0     0      0     0    16   1123   245  5  3 92  0  0.02  10.2
  2   0     241787       3751     0     0     0     0      0     0     3     36   166  0  1 99  0  0.00   1.8

The same "user+sys" times, when above entitlement, could be a problem if the "app" number (available processors in the shared pool) is getting very small (my system only has 1 cpu, so it is always small :-) )
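The relationship between those lparstat/vmstat columns can be sketched as simple arithmetic: %entc is physical consumption (physc, the "pc" column in vmstat) expressed as a percentage of the partition's entitlement (ent), so values over 100% mean an uncapped LPAR is drawing on spare pool capacity.

```python
# How the shared-processor columns relate: %entc = physc / ent * 100.
# The first example mirrors the ent=0.20 output shown above; the second
# is a hypothetical uncapped LPAR consuming beyond its entitlement.

def entc_percent(physc, ent):
    """Physical processor consumption as a percentage of entitlement."""
    return physc / ent * 100.0

print(entc_percent(0.02, 0.20))   # ~10%, like the vmstat "ec" column above
print(entc_percent(0.30, 0.20))   # 150%: borrowing spare pool capacity
```

This is why, on shared processors, a high user+sys percentage alone is not alarming: what matters is whether physc is pushing past entitlement while the pool's "app" value is running out.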

metrics are good if you have them but -- in my experience anyway -- they do not compare to actual production loads and user reviews for testing backup scenarios. i used to work in a highly critical financial production environment, so i always stayed on the cautious side.

also, while virtualized environments are optimized for resource-sharing, the host device only has a finite set of resources and they too will eventually be exhausted should the load on the host be too much. which is why most virtualized environments i know will always have redundant setups. :-)