High CPU usage, users affected

Thala · February 3, 2014, 2:25am

Dear All,

One production server is affected by high CPU usage.
The application is slow now. Please guide me on how to solve it.
The nmon report shows full CPU usage.

Here I'm posting some server details.

bash-3.2# lparstat -i
Node Name                                  : *********
Partition Name                             : OBIEE App Server 1
Partition Number                           : 7
Type                                       : Shared-SMT-4
Mode                                       : Capped
Entitled Capacity                          : 1.30
Partition Group-ID                         : 32775
Shared Pool ID                             : 0
Online Virtual CPUs                        : 2
Maximum Virtual CPUs                       : 3
Minimum Virtual CPUs                       : 1
Online Memory                              : 21504 MB
Maximum Memory                             : 30720 MB
Minimum Memory                             : 10240 MB
Variable Capacity Weight                   : 0
Minimum Capacity                           : 1.00
Maximum Capacity                           : 3.00
Capacity Increment                         : 0.01
Maximum Physical CPUs in system            : 16
Active Physical CPUs in system             : 16
Active CPUs in Pool                        : 16
Shared Physical CPUs in system             : 16
Maximum Capacity of Pool                   : 1600
Entitled Capacity of Pool                  : 1460
Unallocated Capacity                       : 0.00
Physical CPU Percentage                    : 65.00%
Unallocated Weight                         : 0
Memory Mode                                : Dedicated
Total I/O Memory Entitlement               : -
Variable Memory Capacity Weight            : -
Memory Pool ID                             : -
Physical Memory in the Pool                : -
Hypervisor Page Size                       : -
Unallocated Variable Memory Capacity Weight: -
Unallocated I/O Memory entitlement         : -
Memory Group ID of LPAR                    : -
Desired Virtual CPUs                       : 2
Desired Memory                             : 21504 MB
Desired Variable Capacity Weight           : 0
Desired Capacity                           : 1.30
Target Memory Expansion Factor             : -
Target Memory Expansion Size               : -
Power Saving Mode                          : Disabled

Please help me out.

Can I add CPU temporarily with DLPAR? Is there any temporary fix?

Thanks,
Sharath

techy1 · February 3, 2014, 5:34pm

To be honest, there isn't any info in there =).

Can you send the output of this:

ps aux | head -1; ps aux | sort -rn -k 3 | head -10

Take the PID at the top of that list and cross-reference it with this command:

ps -ef|grep <PID>

Have you checked out

topas

This will show you which application/process is using the CPU. The lparstat output on its own cannot tell you what is causing the issue.

You may also want to send the vmstat output:

vmstat -Iwt 2
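The ps pipeline above can be wrapped in a small function so the header stays on top and the sort key is explicit. This is only a minimal sketch: top_cpu is a made-up helper name, and it assumes %CPU is the third column of the ps aux output, as in the command earlier in this post.

```shell
# top_cpu (hypothetical helper): reads `ps aux` output on stdin and prints
# the header line followed by the N rows with the highest %CPU (column 3).
top_cpu() {
    n="${1:-10}"                       # default to the top 10 processes
    IFS= read -r header || return 1    # first line is the ps header
    printf '%s\n' "$header"
    sort -rn -k 3,3 | head -n "$n"     # numeric, descending, on %CPU only
}

# Usage on the server (not executed here):
#   ps aux | top_cpu 5
```

The PID in the first data row is the one to look up with ps -ef.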

Thala · February 3, 2014, 11:21pm

Hi ,
Thanks for your reply, techy. I am not experienced in monitoring, so I'll try to post the details you asked for when the problem occurs today.

The impact occurred during business hours, when the number of users is high.
It is back to normal now. The OBIEE app is running on this server.
I checked the processes, disk I/O and network traffic at that time. I suspect that only the nqsserve (BIP owner) process was consuming a lot of CPU.
I only have this output, which I captured yesterday:

Yesterday During Business Hours
----------------------------------

bash-3.2# sar -u -P ALL 5 2

AIX PRDBIAPP1 1 6 00F7B1B64C00    02/03/14

System configuration: lcpu=8 ent=1.30 mode=Capped

11:55:43 cpu    %usr    %sys    %wio   %idle   physc   %entc
11:55:48  0       79      16       3       2    0.21    16.2
          1       66       5       2      27    0.12     9.0
          2       45       3       0      52    0.08     5.9
          3       32       4       0      65    0.07     5.1
          4       87      10       2       1    0.37    28.2
          5       42       4       0      54    0.13     9.7
          6       35       3       0      62    0.11     8.6
          7       26       3       1      71    0.10     7.8
          U        -       -       1       9    0.12     9.4
          -       56       7       2      34    1.18    90.6
11:55:53  0       65      18      12       5    0.16    12.2
          1       86       3       2      10    0.18    13.8
          2        0       3       0      96    0.05     3.8
          3        0       4       0      96    0.05     3.8
          4       86      11       0       3    0.42    32.3
          5       40       5       0      54    0.14    11.0
          6        0       2       0      98    0.09     7.0
          7        0       2       0      98    0.09     7.0
          U        -       -       1       8    0.12     9.1
          -       52       7       3      38    1.18    90.9

Average   0       73      17       7       4    0.18    14.2
          1       78       4       2      16    0.15    11.4
          2       28       3       0      69    0.06     4.8
          3       18       4       0      78    0.06     4.5
          4       86      11       1       2    0.39    30.2
          5       41       5       0      54    0.13    10.4
          6       19       2       0      78    0.10     7.8
          7       14       2       0      84    0.10     7.4
          U        -       -       1       9    0.12     9.3
          -       54       7       2      36    1.18    90.7
bash-3.2#

Today During Business Hours
-----------------------------

bash-3.2# sar -u -P ALL 5 2

AIX PRDBIAPP1 1 6 00F7B1B64C00    02/04/14

System configuration: lcpu=8 ent=1.30 mode=Capped

09:37:16 cpu    %usr    %sys    %wio   %idle   physc   %entc
09:37:21  0       70      13       8       9    0.19    14.9
          1       82       9       2       6    0.25    19.2
          2        3       3       1      93    0.06     4.7
          3        2       3       0      95    0.06     4.7
          4       83      10       4       3    0.27    20.9
          5       68       8       1      23    0.17    13.1
          6        1       3       0      96    0.06     4.8
          7        1       2       0      96    0.06     4.8
          U        -       -       1      12    0.17    12.8
          -       53       7       4      36    1.13    87.2
09:37:26  0       66      14      11       8    0.18    13.7
          1       85       7       2       6    0.24    18.7
          2       55       4       3      37    0.11     8.3
          3       55       5       0      41    0.11     8.3
          4       79       9       3      10    0.23    17.8
          5       77       7       1      15    0.21    16.1
          6       18       7       0      75    0.08     6.3
          7       29       3       0      68    0.08     6.5
          U        -       -       0       4    0.05     4.2
          -       64       7       3      26    1.25    95.8

Average   0       68      13      10       9    0.19    14.3
          1       83       8       2       6    0.25    18.9
          2       36       4       3      57    0.08     6.5
          3       36       4       0      60    0.08     6.5
          4       81      10       3       6    0.25    19.4
          5       73       8       1      19    0.19    14.6
          6       11       5       0      84    0.07     5.6
          7       17       3       0      80    0.07     5.7
          U        -       -       1       8    0.11     8.5
          -       58       7       3      31    1.19    91.5

The customer wants me to improve performance during those hours, and I am out of ideas.
The partition is mode=Capped, so is that the reason for the high CPU usage?

--Thanks.

_XrAy · February 4, 2014, 7:28am

...delete nonsense (mixed up virtual and logical cpus)

zaxxon · February 4, 2014, 7:42am

It is close to its entitled capacity (up to 95.8 %entc in the samples you posted), but it did not hit the 1.3 processing units.
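The peak can be read straight from the posted sar output: the system-wide rows are the ones with "-" in the cpu column, and the last field is %entc. A minimal sketch, assuming that column layout; peak_entc is a made-up helper name, not an AIX command.

```shell
# peak_entc (hypothetical helper): reads `sar -u -P ALL` output on stdin and
# prints the highest system-wide %entc. System-wide rows have "-" as the cpu
# column and 7 fields: cpu %usr %sys %wio %idle physc %entc.
peak_entc() {
    awk '$1 == "-" && NF == 7 { if ($7 + 0 > max) max = $7 + 0 }
         END { printf "%.1f\n", max }'
}

# Usage on the server (not executed here):
#   sar -u -P ALL 5 2 | peak_entc
```

Fed the second sample from this thread it reports 95.8, so the partition is close to its 1.30 entitlement but not pinned at 100%.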
There may also be room for tuning in the application. I found this, which you might want to look into:
https://blogs.oracle.com/pa/entry/test
It contains a link that I cannot access, since I no longer have an account there:
https://support.oracle.com/rs?type=doc&id=1333049.1

Check the document and see if your box is tuned as they advise. This document also exists for 10g.

Also consider setting up nmon to monitor your AIX LPARs. It will be easier to check when a customer calls to say it was slow 10 minutes ago and you otherwise have no history.

techy1 · February 4, 2014, 8:54am

As zaxxon said.

I've come across some Oracle servers where I/O was the problem causing CPU problems.

I would also highly suggest setting up nmon reports and monitoring them for a day or two, to get a real idea of what your system is doing. The 1.3 seems odd, and uncapped is definitely best if allowed, but keep in mind that if there is some configuration problem, uncapping this server is going to be a pain for your other LPARs.

First I'd raise the entitled CPU to maybe 1.6; personally I would go for a minimum of 2 on an Oracle server.

Ensure nmon is installed on your server and add this line to the crontab:

  /usr/bin/nmon -M -^ -f -d -T -A -s 60 -c 1435 -m <path/to/logfile>

I'd set this up and review it for I/O, CPU, Mem and ensure everything is working correctly first before uncapping.
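As a quick check on the -s 60 -c 1435 values in the nmon line above: -s is the number of seconds between snapshots and -c is the number of snapshots, so the recording length is just their product. A small sketch of that arithmetic; nmon_span is a made-up helper name.

```shell
# nmon_span (hypothetical helper): recording length for an nmon run,
# given the -s interval in seconds and the -c snapshot count.
nmon_span() {
    total=$(( $1 * $2 ))
    printf '%dh %dm %ds\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

nmon_span 60 1435   # prints: 23h 55m 0s
```

That is just under a day, so a run started once per day from cron finishes before the next one begins.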

P.S. I'm sure you're aware, but be sure not to post the nmon file, as it contains sensitive data.

rbatte1 · February 5, 2014, 6:06am

So, this is a partitioned server. It has an allocation of 1.3 CPUs. I'm assuming therefore that there are other partitions defined, and perhaps a little spare CPU on the chassis as a whole. If the partition is capped, then it will use up to 1.3 CPUs and no more. If it is uncapped and there is spare CPU then it will burst through the limit and you will see the value for entitled CPU on sar or vmstat exceeding 100%.

If you take the cap off the partition and other partitions are busy, they are still guaranteed their allocated CPU as a minimum. However, as zaxxon already pointed out, you are not quite CPU bound (about 95% of entitled capacity).
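To put numbers on the capped versus uncapped ceiling for this partition: a capped LPAR tops out at its entitlement, while an uncapped one can grow only as far as its online virtual CPUs allow, and only when the pool has spare capacity. A rough sketch under those assumptions; headroom is a made-up helper, and the inputs are the figures posted earlier (ent=1.30, 2 online virtual CPUs, peak physc 1.25).

```shell
# headroom (hypothetical helper): physical CPU still available to the
# partition above its current consumption.
#   $1 mode (Capped|Uncapped)  $2 entitled capacity
#   $3 online virtual CPUs     $4 current physc
# Assumes an uncapped partition is limited by its virtual CPU count and
# that the shared pool actually has spare capacity to give.
headroom() {
    awk -v mode="$1" -v ent="$2" -v vcpus="$3" -v physc="$4" 'BEGIN {
        ceiling = (mode == "Capped") ? ent : vcpus
        printf "%.2f\n", ceiling - physc
    }'
}

headroom Capped 1.30 2 1.25     # prints: 0.05
headroom Uncapped 1.30 2 1.25   # prints: 0.75
```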

Consider partitions:-

1.3 CPU shared
3.0 CPU shared
2.0 CPU dedicated
1.7 CPU shared

Server has 8 CPUs. Two are dedicated, so out of the reckoning. If the shared CPU partitions are all uncapped, then if the other two are idle, the busy one could get 6 CPUs. If all are busy, then they will be limited as shown. If partition 4 is idle and 1 & 2 are busy then they will compete for the spare CPU (after both have reached their entitled CPU limit) and you can weight them to show a preference.
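The weighting in that last case can be sketched as simple proportional arithmetic. This assumes spare capacity is divided in proportion to the uncapped weights of the partitions competing for it; share_spare is a made-up helper, and the weights below are illustrative values, not taken from any partition in this thread. Each partition is also still limited by its own virtual CPU count.

```shell
# share_spare (hypothetical helper): split spare pool capacity between
# competing uncapped partitions in proportion to their weights.
#   $1 spare capacity in CPUs, remaining args: one weight per partition
share_spare() {
    spare="$1"; shift
    awk -v spare="$spare" 'BEGIN {
        for (i = 1; i < ARGC; i++) total += ARGV[i]
        if (total == 0) exit 1          # no weights, nothing to share
        for (i = 1; i < ARGC; i++) printf "%.2f\n", spare * ARGV[i] / total
    }' "$@"
}

# Partition 4 (1.7 CPU) is idle; partitions 1 and 2 compete for it.
share_spare 1.7 128 128   # equal weights: 0.85 and 0.85
share_spare 1.7 128 64    # weighted 2:1:  1.13 and 0.57
```

Those shares come on top of each partition's own entitlement (1.3 and 3.0 in the example).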

You may be better just upping your CPU allocation a little then re-activating the partition (not just a reboot) else your end user will get used to having the full spare CPU available and then complain when it's in use elsewhere on the chassis. Is there another partition you could squeeze down, but take the cap off because it is rarely busy?

We have ours set at 0.1 CPU, with production uncapped and test/dev capped.

I hope that this helps.

Robin
Liverpool/Blackburn
UK