topas - computational memory 95% : Any Impact?

panchpan · February 1, 2011, 1:05am

Hello Gurus,
I am using AIX 5, and when I run the topas command I can see that computational memory is at 93.3% with paging space at 2.2% used. Could you please advise whether the growth of computational memory has any impact?

Below is the stat:

MEMORY
Real,MB   22528
% Comp     93.3
% Noncomp   6.6
% Client    6.6

PAGING SPACE
Size,MB   17376
% Used      2.2
% Free     98.8

Thank you

bakunin · February 1, 2011, 7:41am

AIX (and most other Unix systems, for that matter) tries to put as much of the available memory to good use as possible. That means: if a system doesn't need memory for programs, it will increase the size of all sorts of buffers, caches, etc. to improve performance. Once running programs demand more memory, these buffers will shrink automatically.

So in short: the high memory consumption is absolutely normal and is to be expected. The only possible reason for concern is when the output of "vmstat" shows non-zero values in the "pi" and "po" (page in / page out) columns.
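
A minimal sketch of that check, assuming the vmstat output has been captured as text. The function name paging_rows is hypothetical, and the sample rows are made-up values in the plain "vmstat" layout, not measurements from a real system. The columns are looked up by header name, so the same code should also cope with layouts that place "pi" and "po" elsewhere.

```python
def paging_rows(vmstat_text):
    """Return (pi, po) pairs for data rows where either value is non-zero."""
    columns = None
    hits = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "pi" in fields and "po" in fields:
            # Header row: remember where the pi and po columns sit.
            columns = (fields.index("pi"), fields.index("po"))
            continue
        if columns is None or len(fields) <= max(columns):
            continue  # banner, separator or blank line
        try:
            pi, po = int(fields[columns[0]]), int(fields[columns[1]])
        except ValueError:
            continue  # non-numeric row
        if pi or po:
            hits.append((pi, po))
    return hits


# Illustrative rows only (invented values).
sample = """\
 r  b     avm    fre  re  pi  po  fr   sr  cy  in   sy   cs us sy id wa
 1  0 5507597  24376   0   0   0   0    0   0 482 9042 3459  7  4 88  0
 2  0 5507600  24000   0   0  12  30  200   0 500 9000 3500  8  5 86  1
"""
print(paging_rows(sample))  # -> [(0, 12)]
```

An empty result means no paging was seen in the captured interval; any entry in the list is a row worth a closer look.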

Search this forum for "performance" and "vmstat" to find lots of threads where this issue (and related issues) is discussed in much more detail.

I hope this helps.

bakunin

zxmaus · February 1, 2011, 8:27am

Well, let me disagree here ... computational memory above 85% is never a good idea. File caching is important for almost every workload: every IO needs to be cached, and this is done with non-comp memory. If you have, for example, an Oracle DB with 90% comp memory, you can kiss performance goodbye, as the system is scanning/freeing itself to death. I would add memory to get at least below 80%.

regards
zxmaus

panchpan · February 1, 2011, 6:36pm

a) Could a memory leak in an application be making computational memory grow so fast?
b) At what percentage of computational memory could the system crash?

Thank you

zxmaus · February 1, 2011, 7:39pm

We do not know which AIX version you are running, we do not know your vmo settings, and we do not know what is running on your server, so it's a little hard to tell what is causing your computational memory utilization and whether it is a leak or normal growth. Since AIX has a dynamic kernel, memory utilization grows over time, which is normal, and many workloads preallocate memory - Websphere, Oracle, Sybase, DB2, SAS, just to name a few.

Depending on your AIX version and tunable settings you may start paging out computational content. This usually slows down the affected application, because in that case AIX pages entire processes to disk: it stops the process, moves it, and then continues using it.

When you are that high in comp memory, you have literally no room in memory for non-comp (file caching), which is also vital for many applications. If they request memory, then the system starts thrashing (excessive scanning / freeing of memory), which may cause a huge amount of CPU overhead and slow down your box - this can go so far that no other processing happens on the server.

Since you have already started to page (2% paging space used), you are at risk: once your paging space fills up, the system might at first kill processes at random to survive, and later either hang with no login possible or crash and restart itself to free up memory.

What you need to do is monitor your system very closely - in particular whether the amount of comp memory in use is growing and whether your paging space utilization is growing.

IBM recommends keeping comp memory around 75% (and from a performance perspective I can only confirm this - my boxes - DB servers - perform best when they are somewhere between 66 and 80%). Depending on the criticality of your applications, buying a few gig of memory may be cheaper than risking a crash.

Hope that helps
kind regards
zxmaus

panchpan · February 1, 2011, 7:50pm

Thank you very much zxmaus.

I am running AIX 5, and it runs applications such as Tuxedo, Websphere and Oracle 11g.

a) Can you give me the URL where IBM recommends to have comp memory around 75%?
b) At what percentage of computational memory will the applications running on it, or the server itself, crash? Or can computational memory and paging space both keep growing, with things hanging/crashing only when paging space usage is around 90%? Basically I am trying to understand the importance of computational memory.

Thanks again!

zxmaus · February 1, 2011, 9:01pm

Hi,

That comes from about 2000 PMRs which I raised in the past 15 years for performance issues. But I will see if I can find something on the pages.

I think you need to understand what computational memory is - and what computational memory is NOT - to understand why it is important to have sufficient non-computational memory in your system.

Computational memory = basically everything that does not exist on your hdisk - for example DB content changed but not yet committed, anything volatile that is work in progress - basically everything that will be gone when your system crashes.

Non computational memory is everything that has a place on disk but still needs to be working in memory - like shared libraries, executables, ... on top of this each and every disk IO needs 1 page of non-computational memory - so every read and every write.

Usually your system is hopefully tuned in such a way that computational memory is NOT paged to disk before you reach 97% (minperm=3%). Hopefully your lru_file_repage is set to 0 as well. That means as long as you do not exceed 97% computational memory, your system will page when it needs to fulfill IO or filecaching needs, or when you start a process or carry out a job (batches, backups and much more) - but it will not page out, for example, the DB processes. Nonetheless it will fill up your paging space.
If your computational memory needs exceed 97%, then the system no longer cares whether what is in memory and needs to be paged out is your database or application data or filecaching - it will just page out everything.
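
As a rough illustration of that rule of thumb only: the sketch below treats "100 minus minperm" as the ceiling described above and reports how far the topas reading is from it. The function name comp_headroom is hypothetical, minperm=3 is the assumed setting from this post (check the real value with vmo), and this is a simplification of the description here, not a model of how the VMM actually decides what to steal.

```python
def comp_headroom(comp_pct, real_mb, minperm_pct=3.0):
    """Distance between current comp memory and the (100 - minperm) ceiling.

    Returns (percentage points, MB). Negative values mean the ceiling
    described in the post has already been exceeded.
    """
    ceiling_pct = 100.0 - minperm_pct
    points = ceiling_pct - comp_pct
    return points, real_mb * points / 100.0


# Values from the topas output at the top of the thread.
points, mb = comp_headroom(comp_pct=93.3, real_mb=22528)
print(f"{points:.1f} points ({mb:.0f} MB) below the ceiling")
```

With the posted figures that leaves under 4 percentage points, or less than 1 GB, before the point where everything becomes a candidate for page-out.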
You say you are running Oracle. Each Oracle process doing work can consume 1 pp worth of (shared) memory. In our company that is about 64 MB, as we have rather small disks - but this can be 128, 256 or more MB in certain setups - and Oracle can have quite a few processes (some of our DB servers have a few thousand). Oracle is a 64bit application, so shared memory is only limited by the amount of physical memory available in the box - so this should not cause application issues - BUT on top of its processes Oracle also has its SGA - a fixed amount of memory that is allocated for Oracle only - so neither the kernel nor any other application running on your system will have any access to it. And if the SGA is not locked, it may potentially be paged out in total to paging space - if this happens, it is kind of disastrous for your DB performance.
On top of that you are running Tuxedo (weblogic) and Websphere on your server. I do not know for sure, but I believe that Tuxedo is a 32bit application - I do know for sure that Websphere is. Both preallocate predefined additional shared memory for themselves. The problem with 32bit applications is that they cannot use unlimited shared memory - AIX offers only 13 shared memory segments for their use (this can be amended with certain system settings exported in the environment, but I would not assume that has been done in the first instance). And ... 32bit applications use different shared libraries than 64bit applications, so the system needs to hold even more content of this kind in (non-computational) memory - which makes your non-comp needs even bigger. Generally it's not such a great idea to mix DBs and web applications on the same server, for various reasons - the above is one of many; another is the different needs in workload (DBs = large parallel serial threads, web applications = lots of mini-threads and random access with corresponding loads of IOs) ...

If you like, you can post the output of vmstat -Iwt 2 10 taken during a busy timeframe, plus vmstat -s output and maybe svmon -G here ... that will help me explain a little further.

It would also be interesting to see the top area of commands like svmon -U oracle, and similar for your other application users, assuming that Tuxedo and Websphere are running under their own IDs. You will be surprised how much that adds up.

You maybe want to have a look here as well.

Kind regards
zxmaus

panchpan · February 1, 2011, 10:15pm

Thank you so much. I understand it now. Here are the outputs of two commands.

# vmstat -Iwt 2 10

System configuration: lcpu=8 mem=22528MB ent=2.00

   kthr            memory                         page                       faults                 cpu             time
----------- --------------------- ------------------------------------ ------------------ ----------------------- --------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa    pc    ec hr mi se
  6   0   0    5507597      24376    11    11     0     0     0      0   482   9042  3459  7  4 88  0  0.25  12.4 14:05:16
  7   0   0    5507596      24358     1   704     0     0     0      0   783   8209  4026  8 19 72  1  0.58  28.9 14:05:18
  6   0   0    5507632      24260    25    26     0     0     0      0   760   8591  3887 19  4 77  0  0.49  24.3 14:05:20
  6   0   0    5507635      24191    24    35     0     0     0      0   709  17631  3755 13 10 76  0  0.49  24.5 14:05:22
  6   0   0    5507634      24131    19    39     0     0     0      0   726  16594  4079 15 11 74  0  0.54  27.0 14:05:24
  6   0   0    5507636      24050    40    37     0     0     0      0   833  10969  3851 22  5 73  0  0.57  28.5 14:05:26
  5   0   0    5507629      23964    53    59     0     0     0      0   879  11698  4336 17  6 77  0  0.47  23.4 14:05:28
  3   0   0    5507636      23881    29    53     0     0     0      0   764   9547  3834 20  5 76  0  0.51  25.6 14:05:30
 12   0   0    5507637      23776    46    98     0     0     0      0   635 146920  3842 26  8 67  0  0.69  34.3 14:05:32
  5   0   0    5507781      23516    67    97     0     0     0      0  1018  78393  4588 26  7 67  0  0.69  34.5 14:05:34

# vmstat -s
           2987546854 total address trans. faults
             63021994 page ins
             65838621 page outs
                15661 paging space page ins
               102513 paging space page outs
                    0 total reclaims
           1091749680 zero filled pages faults
             12316257 executable filled pages faults
            751950154 pages examined by clock
                 2213 revolutions of the clock hand
             49701994 pages freed by the clock
             29262304 backtracks
                    0 free frame waits
                    0 extend XPT waits
              3287521 pending I/O waits
            128860618 start I/Os
             74277577 iodones
           5183493194 cpu context switches
            693577055 device interrupts
            187124308 software interrupts
           1128729939 decrementer interrupts
              1336301 mpc-sent interrupts
              1336265 mpc-received interrupts
             52813406 phantom interrupts
                    0 traps
          43653242538 syscalls

zxmaus · February 2, 2011, 12:12am

I am glad it helped.

From your output I can see that you are currently not paging, which is good, and there is no scanning to free memory, which is good as well. But 22 GB avm (computational memory) when you only have 22 GB of real memory is not that good, as every new connection / DB query will probably cause paging.

From your vmstat output ...

Each of these will definitely cause your system to slow down - so you should try to avoid them. Same recommendations I almost always give: try to mount with noatime option and switch Oracle to SETALL - ideally this frees up some computational memory.

Regards
zxmaus

panchpan · February 2, 2011, 12:32am

Thanks a lot zxmaus.

What do you mean by mounting with the noatime option and switching Oracle to SETALL? What are the commands for them?

Thanks again!

zxmaus · February 2, 2011, 1:13am

Hello,

for SETALL ask your DBAs ... it's a setting within Oracle: filesystemio_options=SETALL - it is usually set to none or async - and should be set to SETALL from Oracle 9 onwards.
For noatime - smitty chfs - mountoptions noatime (and for dumps if you have such a filesystem choose noatime,rbrw)

Regards
zxmaus

Kind regards
Nicki

bakunin · February 2, 2011, 5:07am

Depending on the exact circumstances this is true in most cases, of course. I was a bit too general in my answer.

@panchpan:

Your vmstat output shows several things. First let's look at the memory situation: the "avm" and "fre" columns are displayed in pages of 4k each. "avm" shows roughly 5.5 million pages in use (5.5 million x 4k ~ 22GB), and "fre" shows that only ~24k pages (~100MB) are not used by the system at all. This might be a bit on the light side as a reserve and - as zxmaus has pointed out - you should watch and monitor the system closely to proactively find probable bottlenecks. Even if you don't have one yet, you might be close to getting one, as zxmaus has already suggested.
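
The page arithmetic can be spelled out as a small sketch. It takes the first "avm" and "fre" sample from the posted output and the "mem=22528MB" figure from the vmstat header; the helper name pages_to_mb is made up for this example, and the 4 KB page size is the one assumed above.

```python
PAGE_KB = 4  # avm and fre are counted in 4 KB pages


def pages_to_mb(pages):
    """Convert a count of 4 KB pages to megabytes."""
    return pages * PAGE_KB / 1024


real_mb = 22528                  # "mem=22528MB" from the vmstat header
avm_mb = pages_to_mb(5507597)    # first avm sample
fre_mb = pages_to_mb(24376)      # first fre sample

print(f"avm: {avm_mb:.0f} MB ({avm_mb / real_mb:.1%} of real memory)")
print(f"fre: {fre_mb:.0f} MB")
```

For this sample the active virtual memory comes to about 21514 MB, roughly 95.5% of real memory, with about 95 MB on the free list.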

On the other hand your "pi" and "po" columns (page in / page out) are constantly zero, which means there is no paging going on yet. Your "vmstat -s" output shows some paging activity, which should be investigated. Issue the same command once a day over the next few days and compare the numbers in it. If they remain constant there is nothing to worry about; if they increase, then paging is happening somewhere and it will be worth finding out what causes it.
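
One possible way to do that comparison, sketched under the assumption that each day's "vmstat -s" output is saved as text. The function names are hypothetical, the first snapshot uses the two paging-space lines posted above, and the second snapshot is invented purely to show the calculation.

```python
def parse_vmstat_s(text):
    """Map each counter label in 'vmstat -s' output to its integer value."""
    counters = {}
    for line in text.splitlines():
        value, _, label = line.strip().partition(" ")
        if value.isdigit():
            counters[label.strip()] = int(value)
    return counters


def paging_growth(earlier, later):
    """Growth of the paging-space counters between two snapshots."""
    keys = ("paging space page ins", "paging space page outs")
    return {key: later[key] - earlier[key] for key in keys}


day1 = parse_vmstat_s("""
                15661 paging space page ins
               102513 paging space page outs
""")
# Hypothetical second snapshot, invented for illustration only.
day2 = parse_vmstat_s("""
                15661 paging space page ins
               102600 paging space page outs
""")
print(paging_growth(day1, day2))
```

The counters are cumulative since boot, so a difference only means something if the system was not restarted between the two snapshots.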

You might also want to issue "vmstat -v" and watch if there are I/O-buffers lacking. (there are also several threads here discussing exactly this)

The CPU part of your vmstat output shows relatively high idle values (id). "us", "sy", "id" and "wa" are percentage values, depicting the time the CPU(s) spend working on user code (basically programs), system code, idling and waiting (for I/O). High waiting numbers mean there are I/O-bottlenecks, because there would be programs ready to do something, which they cannot do because they cannot read their data. There is no such thing in your output, which is a sign of healthiness in this regard.

Notice the "ec" column: this is also a percentage value and signifies the "entitled [CPU] capacity consumed". Your system is entitled to 2 CPUs (ent=2.00), and of these about 25% on average (0.5 CPUs) is used. If this value is constantly below 50% you might want to reduce the entitlement to 1 CPU; if it constantly nears 100% you might want to add a CPU to the LPAR configuration. But before suggesting anything in this direction, first monitor closely over a longer time. There is no sense in doing performance optimization from a single seconds-long snapshot.
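
A quick sketch of how the "about 25%" figure falls out of the posted samples. The (pc, ec) pairs are copied from the ten vmstat rows above and ent comes from the header line; the variable names are just for this example.

```python
ent = 2.00  # entitlement, from "ent=2.00" in the vmstat header

# (pc, ec) pairs copied from the ten samples posted in this thread
samples = [
    (0.25, 12.4), (0.58, 28.9), (0.49, 24.3), (0.49, 24.5), (0.54, 27.0),
    (0.57, 28.5), (0.47, 23.4), (0.51, 25.6), (0.69, 34.3), (0.69, 34.5),
]

mean_pc = sum(pc for pc, _ in samples) / len(samples)
mean_ec = sum(ec for _, ec in samples) / len(samples)

print(f"mean pc: {mean_pc:.2f} physical CPUs")
print(f"mean ec: {mean_ec:.1f}% (pc / ent gives {mean_pc / ent:.1%})")
```

Both routes land at roughly 26% of the entitlement for this 20-second window, which says nothing yet about peaks over a full business day.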

I hope these tips help.

bakunin

johnf · February 2, 2011, 7:18am

The piece by ZXMAUS is one of the best explanations of comp / non-comp memory on AIX I have ever seen.