fr and sr (from vmstat output) values are very high

Beginer0705 · January 30, 2011, 10:58pm

Hi AIX Expert,

The fr (pages freed by page replacement) and sr (pages scanned by the page-replacement algorithm) values in the vmstat output below are very high. I usually see these high values during the Oracle database backup. In addition, the page scan, page steal, and page fault values are also very high.

Does this mean that the server memory is maxed out?
Is there any tuning that we need to do?

Any advice is greatly appreciated. Thanks!

kthr    memory              page              faults        cpu    
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 3  0 3448445  3898   0   0   0 1927 1928   0 298 3070 3700 68  3 19 10
 4  0 3448448  4429   0   0   0 3081 3087   0 316 4751 3797 72  4 18  6
 3  0 3448447  4184   0   0   0 1542 1543   0 319 3189 3707 70  3 20  7
 5  0 3449110  4690   0   0   0 3484 3484   0 350 13162 3832 69  5 20  6
 3  0 3449188  4121   0   0   0 1824 1824   0 302 3945 3684 66  3 25  7
 4  0 3449178  4003   0   0   0 1933 1934   0 324 3851 3784 72  3 18  8

to File System 0.0 2558.5
Page Scans 1251.0
Page Steals 1219.
Page Faults 756.0

zxmaus · January 31, 2011, 1:16am

Hi,

It doesn't matter how high the sr and fr numbers are by themselves. The only important thing here is the ratio sr:fr. It should generally not be higher than 4:1, as that is about where performance issues start. If it goes to 10:1 or higher, then your system is spending more time freeing up memory than doing real work. In your case it is pretty much 1:1, which is just fine.
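As a rough illustration of that check, here is a minimal Python sketch that computes sr:fr for the six samples in the first post. The 4:1 threshold is only the rule of thumb from this reply, not an AIX constant, and the names (`samples`, `scan_to_free_ratio`, `THRESHOLD`) are made up for the example.

```python
# (fr, sr) pairs copied from the six vmstat samples in the first post.
samples = [
    (1927, 1928),
    (3081, 3087),
    (1542, 1543),
    (3484, 3484),
    (1824, 1824),
    (1933, 1934),
]

# Rule-of-thumb limit from this thread; not a value defined by AIX.
THRESHOLD = 4.0


def scan_to_free_ratio(fr, sr):
    """Return pages scanned per page freed (sr / fr)."""
    if fr == 0:
        # Nothing was freed in this interval, so the ratio is undefined.
        raise ValueError("fr is 0, ratio undefined")
    return sr / fr


ratios = [scan_to_free_ratio(fr, sr) for fr, sr in samples]
worst = max(ratios)

for (fr, sr), ratio in zip(samples, ratios):
    print(f"fr={fr:5d} sr={sr:5d} sr:fr={ratio:.3f}:1")

verdict = "fine" if worst <= THRESHOLD else "investigate"
print(f"worst ratio {worst:.3f}:1 -> {verdict}")
```

Every sample comes out at essentially 1:1, so lrud frees almost every page it scans.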

Your output above would be more helpful if we knew how much physical memory you have. Ideally avm (in 4k pages) should not exceed 75% of physical memory on an Oracle box for best performance, since Oracle is a process-based DB and every process/connection needs some memory on top of the SGA that is set within Oracle. If it goes above 85%, or if your free list drops toward 0, you are in serious trouble as well - or if you start seeing pi/po values ...

Page faults are no reason for concern at all - they only mean that your box is doing work (and using memory pages).
What I would consider a reason for concern is your very, very low free list. If this is NOT an ASM system, then I would recommend mounting your /dumps filesystem with the noatime and rbrw options, and all other Oracle-related filesystems with the noatime option as well - and having your DBAs switch Oracle to SETALL. This will give your system a lot of desperately needed memory back.

You should also consider exporting AIXTHREAD_SCOPE=S, either as a system-wide variable in /etc/environment or at least in the oracle .profile.

If you want more tips, please post the vmstat -Iwt output again, including your system resources, along with the vmstat -s and vmstat -v outputs.

Hope that helps
kind regards
zxmaus

Beginer0705 · January 31, 2011, 1:49am

It's an Oracle ASM environment. It has 15.5GB of RAM, and the SGA takes 9GB of that. The Oracle DBA suggests setting lock_sga=TRUE. There are additional settings needed on the AIX side for lock_sga=TRUE to work, but I'm not sure what they are.

You're so right. The free list is very, very low, and that caused a lot of performance problems.

Also, in Oracle the SGA_TARGET parameter manages memory inside the database. Do you know which parameter on AIX automatically manages memory for the I/O buffer cache and application cache?

I'm a beginner to AIX. Thanks for your insight!

Sam -

System configuration: lcpu=8 mem=15424MB
   kthr            memory                         page                       faults           cpu       time
----------- --------------------- ------------------------------------ ------------------ ----------- --------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa hr mi se
  1   1   0    3394322      67820    33    77     0     0   103    197   347  16926  3702  7  2 89  3 00:36:52

---------- Post updated at 01:48 AM ---------- Previous update was at 01:44 AM ----------

By the way, the system is now a little quiet. It's extremely busy, with a very small free list, when the RMAN database backup is running.

---------- Post updated at 01:49 AM ---------- Previous update was at 01:48 AM ----------

vmstat -s
           2371477560 total address trans. faults
             32145399 page ins
             72931135 page outs
                 1897 paging space page ins
                 2707 paging space page outs
                    0 total reclaims
            905517288 zero filled pages faults
             62779256 executable filled pages faults
            186955953 pages examined by clock
                  110 revolutions of the clock hand
             97559480 pages freed by the clock
             23360510 backtracks
                    0 free frame waits
                    0 extend XPT waits
              2245184 pending I/O waits
            105074996 start I/Os
              6878603 iodones
           3508959323 cpu context switches
            329345963 device interrupts
             55512985 software interrupts
           1691341351 decrementer interrupts
                46873 mpc-sent interrupts
                46873 mpc-receive interrupts
               181933 phantom interrupts
                    0 traps
          16044009781 syscalls
 
vmstat -v
              3948544 memory pages
              3743595 lruable pages
                66387 free pages
                    4 memory pools
               607209 pinned pages
                 80.0 maxpin percentage
                  3.0 minperm percentage
                 90.0 maxperm percentage
                  8.2 numperm percentage
               307386 file pages
                  0.0 compressed percentage
                    0 compressed pages
                  8.2 numclient percentage
                 90.0 maxclient percentage
               307386 client pages
                    0 remote pageouts scheduled
                    0 pending disk I/Os blocked with no pbuf
                   72 paging space I/Os blocked with no psbuf
                 2228 filesystem I/Os blocked with no fsbuf
                 3602 client filesystem I/Os blocked with no fsbuf
                 1801 external pager filesystem I/Os blocked with no fsbuf
                    0 Virtualized Partition Memory Page Faults
                 0.00 Time resolving virtualized partition memory page faults
 
System configuration: lcpu=8 mem=15424MB
   kthr            memory                         page                       faults           cpu       time  
----------- --------------------- ------------------------------------ ------------------ ----------- --------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa hr mi se
  1   1   0    3393363      66250    33    76     0     0   102    197   347  16928  3702  7  2 89  3 00:49:16

zxmaus · January 31, 2011, 3:04am

Hi,

First of all - if your Oracle SGA is 9 GB, then your system will hardly ever be happy with less than 18 GB of memory. You are paging even though your tuning is fine. That means you should physically have more memory to satisfy the needs of the box ... a DB server should never have to page.

I am not a fan of locking the SGA just because you are too low on memory. If it's a single-instance database and you are not going to use huge pages, then the better option is to add the memory the system needs and leave the memory unpinned. Pinning memory on a memory-constrained system will cause more paging of your user processes, which makes queries take longer and batches overrun. It will not benefit your backups either. And if the amount of memory you are going to pin is large relative to the total physical memory, then you additionally run the risk of a system crash when your system reaches the magical 83% threshold. AIX cannot pin more than a little over 80% in total, and the kernel pins a significant amount of memory over time depending on the workload, as it's a dynamic (learning) kernel. If your system is doing a lot of different things, this can easily be 25% after a week. Though I have never seen a kernel pin more than 35% in total no matter how long it's up, that still might lead to problems when you are pinning more than 50% to Oracle from scratch.
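To put rough numbers on that risk, here is a minimal Python sketch using the vmstat -v figures posted above (memory pages, pinned pages, maxpin percentage) and the 9 GB SGA. It assumes the whole SGA would be pinned on top of what is pinned today, which is a simplification, and the variable names are only illustrative.

```python
PAGE_KB = 4                 # vmstat counts memory in 4 KB pages
memory_pages = 3948544      # "memory pages" from vmstat -v
pinned_pages = 607209       # "pinned pages" from vmstat -v
maxpin_pct = 80.0           # "maxpin percentage" from vmstat -v
sga_gb = 9                  # SGA size quoted by the poster

# Convert the SGA size to 4 KB pages.
sga_pages = sga_gb * 1024 * 1024 // PAGE_KB

# Share of physical memory that is pinned right now.
pinned_now_pct = 100 * pinned_pages / memory_pages

# Assumption: the entire SGA is pinned in addition to today's pinned pages.
pinned_with_sga_pct = 100 * (pinned_pages + sga_pages) / memory_pages

# Room left below maxpin once the SGA is pinned.
headroom_pct = maxpin_pct - pinned_with_sga_pct

print(f"pinned now:        {pinned_now_pct:.1f}%")
print(f"pinned with SGA:   {pinned_with_sga_pct:.1f}%")
print(f"headroom to maxpin: {headroom_pct:.1f}%")
```

Under these assumptions the box would sit only a few percentage points below maxpin right after startup, leaving very little room for the kernel's own pinned memory to grow.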

I am not sure what you mean by a parameter that automatically manages memory - basically the VMM is responsible for managing all memory on AIX except what is taken away by the SGA and therefore made inaccessible to the system. It is well known that backups are big memory consumers, as each I/O obviously needs to be buffered.

If you still insist on locking the SGA: the command vmo -r -o v_pinshm=1 would allow Oracle to do the lock_sga, but as said before, it is a lot better and safer for the system to add the memory it needs and leave the SGA unlocked.

Now some good news - from the above I can see that your free list NEVER dropped to 0 - that means that lrud is doing its job scanning and freeing properly. If we now can get the paging under control by adding more memory you should be good.

I can see as well that your system would only start paging out Oracle-related processes when your computational memory (avm x 4k) exceeds 97%, which doesn't seem to be the case on your box (at least in the outputs you have pasted) - but I am quite sure that as soon as rman kicks in, it pushes you over the edge.
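For reference, the avm x 4k arithmetic can be sketched in Python as follows, using the two quiet-time avm values and the mem=15424MB header pasted above. The 75% and 85% bands are the rules of thumb quoted earlier in this thread, not limits enforced by AIX, and the function names are hypothetical.

```python
PAGE_KB = 4          # avm is reported in 4 KB pages
MEM_MB = 15424       # "mem=15424MB" from the vmstat header


def avm_percent(avm_pages, mem_mb=MEM_MB):
    """Computational memory as a percentage of physical memory."""
    avm_mb = avm_pages * PAGE_KB / 1024
    return 100 * avm_mb / mem_mb


def classify(pct):
    """Bands are rules of thumb from this thread, not AIX limits."""
    if pct <= 75:
        return "comfortable"
    if pct <= 85:
        return "tight"
    return "too high"


# avm values from the two quiet-time vmstat -Iwt samples.
for avm in (3394322, 3393363):
    pct = avm_percent(avm)
    print(f"avm={avm} -> {pct:.1f}% of RAM ({classify(pct)})")
```

Both quiet-time samples land at roughly 86% of physical memory: already above the 85% mark but still below 97%.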

Since you are running ASM, do you still use a /dumps filesystem for the backups, or does the DB write directly to tape?

I still would love to see a vmstat -Iwt 2 10 output from a timeframe when your system is really busy with normal work - and one from when rman runs ...

BTW - are you running AIX 5.3 or 6.1, and which Oracle version?

Regards
zxmaus

Beginer0705 · February 1, 2011, 10:43pm

Thanks zxmaus. We are running 5.3 TL 9 and Oracle 10.2.0.4.

Are there additional OS tuning changes we need to make?

   kthr            memory                         page                       faults           cpu       time  
----------- --------------------- ------------------------------------ ------------------ ----------- --------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa hr mi se
  0   0   4    3603088      78543     0     0     0     0     0      0  2039 162218  9485 12  7 59 22 21:36:39
  2   0   3    3603070      78561     0     0     0     0     0      0  2268  26937  8614 13  5 61 21 21:36:41
  1   0   3    3603065      78566     0     0     0     0     0      0  2730  32741  9945 12  5 60 22 21:36:43
  1   0   3    3603149      78482     0     0     0     0     0      0  2099  25234  8246 14  4 60 22 21:36:45
  1   0   3    3603838      77776     0     3     0     0     0      0  2209  31075  8496 14  6 57 22 21:36:47
  2   0   3    3603677      77916     0    16     0     0     0      0  2184  26788  8543 16  5 58 22 21:36:49
  0   0   4    3603056      78526     0     0     0     0     0      0  2249  27634  8504 15  4 61 20 21:36:51
  1   0   4    3603068      78514     0     0     0     0     0      0  1847  22577  7417 31  4 43 22 21:36:53
  1   0   4    3603055      78527     0     0     0     0     0      0  1892  23173  7613 14  4 59 23 21:36:55
  1   0   4    3603063      78519     0     0     0     0     0      0  2167  25862  8387  8  4 64 23 21:36:57

zxmaus · February 2, 2011, 12:25am

For ASM and Sybase, which both use raw devices of some kind, I usually set ioo -p -o lvm_bufcnt=16.
Apart from that, you surely could do with a few more gigs of memory, as your computational usage is really high for an Oracle DB. I also think your CPU waits (wa) are very high. This usually points to problems with the disk subsystem, which could have all kinds of reasons. Maybe your async I/O settings are too low, which is usually the case ... set the maxreqs number to 65536 (smitty aio). Check with iostat -Dl whether your disks' wait queues are running full, and check your disk response times. Is the above output from while you are running rman?

Regards
zxmaus

Beginer0705 · February 4, 2011, 5:38pm

This is when RMAN or export data pump is running. I notice that fi/fo and the sr:fr ratio are higher than 4:1.

   kthr            memory                         page                       faults           cpu       time  
----------- --------------------- ------------------------------------ ------------------ ----------- --------
  r   b   p        avm        fre    fi    fo    pi    po    fr     sr    in     sy    cs us sy id wa hr mi se
  1   0   2    3590288       4057    31  4512     0     0  4765   4765   882   5640  5127  5  6 75 14 16:30:16
  1   0  10    3590298       3985   268  3562     1     0  3740   3738  1005  15151  5520  6  6 73 15 16:30:17
 10   0   2    3590313       4509    18  7104     0     0  3352  28570  1105  81185  5870 17 10 63  9 16:30:25
  0   0   2    3590301       3952    37  6656     0     0  6058   6058   769  18433  5921 24 10 56  9 16:30:29
  1   0   2    3590288       3863     1  4544     0     0  4632   4629   854   5742  5026  5  6 78 11 16:30:30
  2   0   2    3590322       4062     0  6270     0     0  6567   6566   883   4577  5180  5  7 78 10 16:30:31
  1   0   2    3590310       3859     0  6080     0     0  5796   5795   724   4407  4769  5  7 77 11 16:30:32
  0   0   0    3590284       4192     0  5632     0     0  6193   6246   833   4654  5076  5  7 77 10 16:30:33
  1   0   2    3590277       3905     0  3648     0     0  3097   3127   595   3908  4409  4  4 81 12 16:30:34
  0   0   2    3590283       4008     0  6144     0     0  6057   6054   834   4831  5090  5  7 78 10 16:30:35

zxmaus · February 5, 2011, 11:21pm

This is a fairly normal picture during backups, as you naturally have lots of I/O (that is what the backups do).
So far the only thing I would be concerned about is the page in - but not the fi/fo. These are just your reads and writes, and no ratio applies to them.
The ratio is between sr and fr, and that is pretty much 1:1 in your output, which is OK, as rman is a very I/O-intensive process.

Regards
zxmaus