High paging despite lots of free memory on AIX 5.3

I am new to AIX. I have a few AIX 5.3 servers, and I can see a significant difference in paging space utilization between them even though they are running the same applications.

The server below is working fine; it shows 2-5% paging space usage throughout the day:

cpu_scale_memp = 8
data_stagger_interval = 161
defps = 1
force_relalias_lite = 0
framesets = 2
htabscale = n/a
kernel_heap_psize = 4096
kernel_psize = 16777216
large_page_heap_size = 0
lgpg_regions = 0
lgpg_size = 0
low_ps_handling = 1
lru_file_repage = 0
lru_poll_interval = 10
lrubucket = 131072
maxclient% = 90
maxfree = 1088
maxperm = 7222844
maxperm% = 90
maxpin = 6766853
maxpin% = 80
mbuf_heap_psize = 65536
memory_affinity = 1
memory_frames = 8388608
mempools = 4
minfree = 960
minperm = 240760
minperm% = 3
pinnable_frames = 6739105
psm_timeout_interval = 5000
pta_balance_threshold = n/a
relalias_percentage = 0
rpgclean = 0
rpgcontrol = 2
scrub = 0
scrubclean = 0
soft_min_lgpgs_vmpool = 0
spec_dataseg_int = 512
strict_maxclient = 1
strict_maxperm = 0
v_pinshm = 0
vm_modlist_threshold = -1
vmm_fork_policy = 1
vmm_mpsize_support = 1
wlm_memlimit_nonpg = 1

We have another server on which the paging utilization keeps increasing, and we have to restart the application every day to bring it back to normal.

Here are its tunable parameters:

cpu_scale_memp = 8
data_stagger_interval = 161
defps = 1
force_relalias_lite = 0
framesets = 2
htabscale = n/a
kernel_heap_psize = 4096
kernel_psize = 16777216
large_page_heap_size = 0
lgpg_regions = 0
lgpg_size = 0
low_ps_handling = 1
lru_file_repage = 0
lru_poll_interval = 10
lrubucket = 131072
maxclient% = 50
maxfree = 1088
maxperm = 4524244
maxperm% = 50
maxpin = 7611036
maxpin% = 80
mbuf_heap_psize = 65536
memory_affinity = 1
memory_frames = 9437184
mempools = 5
minfree = 960
minperm = 1809696
minperm% = 20
pinnable_frames = 7640488
psm_timeout_interval = 20000
pta_balance_threshold = n/a
relalias_percentage = 0
rpgclean = 0
rpgcontrol = 2
scrub = 0
scrubclean = 0
soft_min_lgpgs_vmpool = 0
spec_dataseg_int = 512
strict_maxclient = 1
strict_maxperm = 0
v_pinshm = 0
vm_modlist_threshold = -1
vmm_fork_policy = 1
vmm_mpsize_support = 1
wlm_memlimit_nonpg = 1

Let me know if you require any more details to find the issue.

Bibi

Hi,

Out of the blue, I would try setting minperm% to 5 and maxperm% to 90 (maxclient% to 90 as well) and check whether the behaviour changes.
lru_file_repage is already set to 0, and with that you usually go with high values like the ones above.

What type of application(s) is running on the box, a DB, ...?

It would also be interesting to see which processes are using the most paging space:

svmon -P -O sortentity=pgsp

Can you also please post the outputs of:

vmstat -w -t 1 10          # When you notice paging
vmstat -vs                 # This one anytime you want

And the most important of all:
Use code tags for the output.

Thanks for the reply, and sorry for not using the code tags.
This server is running a Java/WebLogic application.

$ vmstat -w -t 1 10
System configuration: lcpu=24 mem=36864MB ent=5.20
 kthr          memory                         page                       faults                 cpu             time
------- --------------------- ------------------------------------ ------------------ ----------------------- --------
  r   b        avm        fre    re    pi    po    fr     sr    cy    in     sy    cs us sy id wa    pc    ec hr mi se
  9   0    8514838       9984     0     6     9   514   1866     0  2285  12154  3000 72  2 27  0  3.83  73.6 15:48:23
  3   0    8514838      11198     0     6     3  1288   4091     0   771   9390  2266 41  2 57  0  2.23  43.0 15:48:24
  9   0    8514852      10921     0    35    12    19    109     0  3209  16445  5013 77  2 20  0  4.15  79.8 15:48:25
  4   0    8514853      10525     0     8     4     9     42     0  1789  14661  3543 72  2 26  0  3.88  74.7 15:48:26
  5   1    8514853      10049     0    21     7     9     69     0  2699  28619  5436 87  3 10  0  4.70  90.5 15:48:27
  6   0    8514868      10014     0    10    10   276    851     0  2631  23858  6503 90  2  7  0  4.83  93.0 15:48:28
  3   0    8514871      11310     0    17    16  1674   3382     0  2399  23227  4980 73  3 23  0  4.00  77.0 15:48:29
  6   0    8514873      11132     0    10     0     9     43     0  2256  17267  4099 76  3 21  0  4.10  78.9 15:48:30
  3   0    8514875      10865     0    22     3     9     53     0  2190  20454  4908 76  2 22  0  4.06  78.1 15:48:31
  3   0    8514877      10337     0     6     0     0      0     0  2645  14323  4269 69  2 28  0  3.75  72.1 15:48:32
$ vmstat -vs
          26718300070 total address trans. faults
           4324247173 page ins
           3782710509 page outs
             82381547 paging space page ins
             76413162 paging space page outs
                    0 total reclaims
           8121975552 zero filled pages faults
             11657654 executable filled pages faults
          15016368991 pages examined by clock
                25189 revolutions of the clock hand
           5277096998 pages freed by the clock
             65041712 backtracks
               171316 free frame waits
                    0 extend XPT waits
            281450154 pending I/O waits
           8082022155 start I/Os
            846120866 iodones
          42694017510 cpu context switches
          16834766912 device interrupts
           1952030653 software interrupts
          19827299303 decrementer interrupts
            165373147 mpc-sent interrupts
            165373098 mpc-received interrupts
           1330553110 phantom interrupts
                    0 traps
         190748815496 syscalls
              9437184 memory pages
              9048488 lruable pages
                11461 free pages
                    5 memory pools
              1796762 pinned pages
                 80.0 maxpin percentage
                 20.0 minperm percentage
                 50.0 maxperm percentage
                 19.3 numperm percentage
              1753772 file pages
                  0.0 compressed percentage
                    0 compressed pages
                 19.3 numclient percentage
                 50.0 maxclient percentage
              1753772 client pages
                    0 remote pageouts scheduled
                 4760 pending disk I/Os blocked with no pbuf
             14981367 paging space I/Os blocked with no psbuf
                 1972 filesystem I/Os blocked with no fsbuf
                27374 client filesystem I/Os blocked with no fsbuf
               409645 external pager filesystem I/Os blocked with no fsbuf
                    0 Virtualized Partition Memory Page Faults
                 0.00 Time resolving virtualized partition memory page faults
$ svmon -P -O sortentity=pgsp
Unit: page
Total Paging Space   Percent Used
      30720MB              24%

Lots of free memory on a well warmed-up server is often just an indication of many exiting processes as part of the app. Exit frees the RAM footprint of the local parts of a process (heap and stack; code is usually shared pages), bumping up free RAM until paging-in eats it away. Sometimes this indicates too much shell programming, or bad shell programming, as processes are executed and discarded over and over; sometimes these are spun off from server functions. The Java/WebLogic process is probably not really freeing RAM to the OS, as free() just puts memory back into the free pool of the allocation arena awaiting the next malloc(), and the Java GC equivalent behaves similarly.
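One cheap way to see whether a particular process (e.g. the WebLogic JVM) is the one growing is to log its virtual size over time. A sketch, where the PID, sampling interval, and log path are placeholders you would substitute:

```shell
# Log a process's virtual size so steady growth (the classic leak
# signature) becomes visible over the day.
# PID=$$ is only so this sketch runs; substitute the WebLogic java PID.
# In practice sample every 5-10 minutes with "while true", not 3 x 1s.
PID=$$
LOG=/tmp/vsz_sample.log
: > "$LOG"                      # start a fresh log
for i in 1 2 3; do
    printf '%s %s\n' "$(date '+%H:%M:%S')" "$(ps -o vsz= -p "$PID")" >> "$LOG"
    sleep 1
done
cat "$LOG"
```

If the vsz column only ever goes up between samples while the workload is steady, that process is a leak candidate.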

Thanks for the data.

The svmon did not show much; it seems that sort option is an AIX > 5.3 thing. Try this one instead and check which process(es) have by far the highest value in the column "Pgsp":

svmon -P | awk '/Pid C/ {h=$0; next} NF == 9 && /^[0-9]/ {if($5 != 0) l[c++]=$0} END{print h; for(x=0; x<c; x++){print l[x]} }'

I would set the parameters as recommended in my former post and see if the paging stops, hopefully.
It may take some time before the effect shows, since there is still a lot out on paging space that has to be paged back in as it is needed; rebooting the box after setting the tunables would speed this up.
You can set them online with the following command:

vmo -p -o minperm%=5 -o maxclient%=90 -o maxperm%=90

The -p makes the change permanent, so it survives a reboot.

Hi,

I've seen this type of thing before, although it was quite a number of years ago; it actually turned out to be a memory leak from badly written code. It was hard to track down: we had four p695s all running the exact same application suite, and it turned out that only the one used for reporting had the problem.

So possibly looking at the usage of each of the boxes may provide a pointer.

Regards

Dave

To look for leaks, find processes with high and growing total VM using 'ps -o vsz'. (This example is from HP-UX, but ps should work similarly on AIX.):

$ (export UNIX95=true ; ps -exo 'pid,sz,vsz,args')|(line;sort -nr +2 -3)|pg
  PID   SZ     VSZ COMMAND
 8081 18275   74692 mad -u root -g bin
15976    0   52124 xterm -T a-5 -n 5 -geometry 80x25 -fn 12x24 -sb -sl 99999 -vb -bg white -fg black
 1878 12576   51328 /appl/banktools/APPQcime/jre/bin/PA_RISC/java -Dprogram.name=../tools/start . . . org.tanukisoftware.wrapper.WrapperSimpleApp com.appiq.cxws.main.HpMain
 1683 5938   24052 /opt/perf/bin/scopeux
 1856 7096   22784 ./EpicCore
 4602 6810   21632 ./EpicShadow
 .
 .
 .
 .
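On AIX itself the same idea should work; a rough equivalent, with the -eo flags assumed to behave as on HP-UX/Linux (treat this as a sketch, not a tested AIX command line):

```shell
# List the ten fattest processes by virtual size, largest first.
# Column 2 is vsz in KB; a process whose vsz keeps growing between
# samples while the workload is steady is a leak candidate.
ps -eo pid,vsz,args | sort -k2 -nr | head -10
```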
Thanks. Is there any calculation to find the optimum values for the tunable parameters? And if I tweak these values, how much will it help with my current high paging utilization issue?

$ svmon -G
               size       inuse        free         pin     virtual
memory      9437184     9424496       12688     1796740     8576772
pg space    7864320     2000371

               work        pers        clnt       other
pin         1490348           0           0      306392
in use      7671239           0     1753257

PageSize   PoolSize      inuse       pgsp        pin    virtual
s   4 KB          -    3481600     346035    1655556    2022452
m  64 KB          -     371431     103396       8824     409645
$ lsps -s
Total Paging Space   Percent Used
      30720MB              26%
$ vmstat -vs
          26994260366 total address trans. faults
           4361829552 page ins
           3820265318 page outs
             84045532 paging space page ins
             77865859 paging space page outs
                    0 total reclaims
           8236127852 zero filled pages faults
             11712277 executable filled pages faults
          15174415073 pages examined by clock
                25517 revolutions of the clock hand
           5331955580 pages freed by the clock
             65710515 backtracks
               175898 free frame waits
                    0 extend XPT waits
            284609239 pending I/O waits
           8157163408 start I/Os
            855241712 iodones
          43057036717 cpu context switches
          16995406850 device interrupts
           1964838815 software interrupts
          20009814065 decrementer interrupts
            165822097 mpc-sent interrupts
            165822048 mpc-received interrupts
           1353839635 phantom interrupts
                    0 traps
         192398322217 syscalls
              9437184 memory pages
              9048488 lruable pages
                11106 free pages
                    5 memory pools
              1796740 pinned pages
                 80.0 maxpin percentage
                 20.0 minperm percentage
                 50.0 maxperm percentage
                 19.3 numperm percentage
              1750906 file pages
                  0.0 compressed percentage
                    0 compressed pages
                 19.3 numclient percentage
                 50.0 maxclient percentage
              1750906 client pages
                    0 remote pageouts scheduled
                 4760 pending disk I/Os blocked with no pbuf
             15264944 paging space I/Os blocked with no psbuf
                 1972 filesystem I/Os blocked with no fsbuf
                27374 client filesystem I/Os blocked with no fsbuf
               411174 external pager filesystem I/Os blocked with no fsbuf
                    0 Virtualized Partition Memory Page Faults
                 0.00 Time resolving virtualized partition memory page faults

Hi Bibish,

I have had a look back here several times, and as this is beginning to intrigue me, I'd like to investigate a little further. It has been some time since I worked on AIX, but I did do a fair bit with 5.3 and 6.1 back in the day.

There are some things you should be aware of regarding paging and swap, along with the run queue and the CPU I/O wait. These parameters are heavily interdependent in AIX, quite often giving rise to apparent performance issues where there aren't any.

So, having read over a report that I prepared about 30 months ago for a group of AIX systems with around 15K users, I'd like to clarify a couple of things. These are just simple info-gathering steps!

Is there a perceivable degradation in performance when the swapping happens?

Is the problem repeatable in any shape or form?

Does the reboot always clear the problem?

Regards

Dave

Hello Dave,

We haven't noticed any performance impact, but the paging utilization keeps increasing, and eventually we have to restart the application every day to bring it back to normal.

We have been monitoring this issue every day for a couple of weeks now. Another AIX 5.3 server in the environment, running the same application and even with less physical memory, uses only 2-3% of paging space.

Yes, restarting the application or rebooting the system always clears this issue.

Bibish

There looks to be a memory leak in the application you restart. If the OS, patches, and hardware are the same on the affected server as on the ones where the issue doesn't show up, there is unlikely to be anything you can fix with kernel tuning.

Please post statistics about the leaking process, not just global ones.

Hi Bibish.

As jllagre says, this is almost certainly application related. What I think you'll have to do here is check that the applications are at the same versions. If that is the case, you'll probably have to investigate why one is holding leaked memory and the other isn't; if they are both the same, it is most likely down to the way the application is being used.

If these systems are at the same MU level, then I think you'll need application support to resolve this.

Regards

Dave

Well, basically, no matter how much 'free' memory the system has: with minperm% set to 20, the host will inevitably page when it uses about 34 GB of computational memory out of the 36 GB available, especially as long as numperm sits at about 20%.
As zaxxon said, fix the minperm setting (to 3% or 5%) and figure out whether you really need to keep cached file content in memory even after the corresponding I/O has completed. If not, mount your filesystems with the rbrw option. I'm not sure whether you mentioned if this is an app or a DB server, but if it is e.g. Oracle, you may want to consider cio and/or filesystemio_options=SETALL.
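The percentages translate into page counts roughly like this (a sketch assuming minperm% is applied to the lruable page count from vmstat -v, which matches the minperm value in the vmo listing above; 4 KB frames assumed):

```shell
# lruable pages reported by vmstat -v on the problem server
lruable=9048488

# minperm% = 20: once file pages (numperm) drop below this threshold,
# the page stealer will also steal computational pages, which is what
# drives paging-space use here.
minperm20=$(( lruable * 20 / 100 ))
echo "minperm at 20% = $minperm20 pages (~$(( minperm20 * 4 / 1024 )) MB)"

# Lowering it to 5% shrinks that threshold considerably, so
# computational pages stop being stolen long before this point:
minperm5=$(( lruable * 5 / 100 ))
echo "minperm at 5%  = $minperm5 pages (~$(( minperm5 * 4 / 1024 )) MB)"
```

The 20% figure works out to roughly 7 GB of file cache being protected at the expense of computational pages, which lines up with the minperm = 1809696 shown in the vmo output (rounding aside).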
Regards
zxmaus