Zones and memory resource control issues

Hi all,

This is a cross-post from the Sun/Oracle forums (I would include the URL here, but the forum doesn't allow me), cos quite frankly, this forum seems to be more active...

I am maintaining an in-house Sun/Oracle x86 server (x4275) running Solaris 10 with zones for testing and development purposes. Basic specs: 2x4Core Xeon CPU & 72GB RAM. More detailed specs at the end of this mail. The global zone isn't running anything of relevance. The other zones are all running a combination of our own products together with Sybase ASE 15 servers. Our own products are a mix of C++ and Java (both stand-alone java processes as well as App Servers/JBoss). I've also got a single instance of VirtualBox running in the global zone just to play around with.

I'm looking for some assistance on properly setting up (shared) memory resource controls on this server, because we currently have issues with this.

After we installed the 5th or 6th zone, we started getting (shared) memory allocation errors. Processes fail to start with "mmap failed" errors, Java processes won't start due to "Error occurred during initialization of VM / Could not reserve enough space for object heap", and Sybase fails to start due to shared memory allocation failures. If we take another zone down, we are able to start these processes in the first zone (but then we can't start them in the zone we just took down).

I can't really figure out why though. The server itself doesn't seem to be out of memory at all. Based on vmstat output it looks to me as if there's enough memory free still, and that there's hardly any swap space used:

# vmstat
 kthr      memory            page            disk          faults      cpu
 r b w   swap    free    re  mf pi po fr de sr s1 s2 s3 s4    in    sy    cs us sy id
 0 0 0 2723604 15251616  14  51  0  0  0  0  0  0  0 10 11 13099 67426  8142  2  1 98

That looks like 15GB free still to me.

Because of the Sybase shared memory requirements, I've tried to use Solaris projects to set resource controls on the "system" project in the global zone:

# cat /etc/project
system:0::::project.max-shm-memory=(privileged,107374182400,deny)
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::

Individual zones have a default /etc/project file, except for one zone, which has group.staff:10::::project.max-shm-memory=(privileged,10737418240,deny) (all applications run under user accounts that are part of staff).

/etc/system doesn't have any memory-related entries in it (just set rlim_fd_cur=1024 and set rlim_fd_max=8192, cos I don't think you can set those with project controls yet).

Any suggestions for things to check out?

Kind regards, Maarten

==== Processor Sockets ====================================

Version Location Tag
--------------------------
Intel(R) Xeon(R) CPU E5520 @ 2.27GHz CPU 1
Intel(R) Xeon(R) CPU E5520 @ 2.27GHz CPU 2

Memory size: 73720 Megabytes (72GB)

# zpool status
pool: pool1
state: ONLINE
scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0

errors: No known data errors

pool: rpool
state: ONLINE
scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s0  ONLINE       0     0     0
            c0t1d0s0  ONLINE       0     0     0

errors: No known data errors
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
pool1 5.44T 891G 4.57T 16% ONLINE -
rpool 278G 9.18G 269G 3% ONLINE -

# zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool1 740G 3.72T 197G /pool1
pool1/zone7 51.6G 3.72T 51.6G /pool1/zones/zone7
pool1/u_boekhold 15.8G 3.72T 15.8G /pool1/home/boekhold
pool1/zone1 47.7G 3.72T 47.7G /pool1/zones/zone1
pool1/zone2 99.5G 400G 99.5G /pool1/zones/zone2
pool1/zone3 87.5G 3.72T 87.5G /pool1/zones/zone3
pool1/zone4 143G 881G 143G /pool1/zones/zone4
pool1/zone5 97.5G 3.72T 97.5G /pool1/zones/zone5
rpool 11.2G 262G 34K /rpool
rpool/ROOT 7.18G 262G 21K legacy
rpool/ROOT/s10x_u8wos_08a 7.18G 262G 7.18G /
rpool/dump 2.00G 262G 2.00G -
rpool/export 44K 262G 23K /export
rpool/export/home 21K 262G 21K /export/home
rpool/swap 2G 264G 16K -

# zoneadm list -vc
ID NAME STATUS PATH BRAND IP 
0 global running / native shared
4 zone3 running /pool1/zones/zone3 native shared
5 zone6 running /pool1/zones/zone6 native shared
7 zone7 running /pool1/zones/zone7 native shared
9 zone8 running /pool1/zones/zone8 native shared
10 zone4 running /pool1/zones/zone4 native shared
11 zone2 running /pool1/zones/zone2 native shared
12 zone1 running /pool1/zones/zone1 native shared
13 zone5 running /pool1/zones/zone5 native shared

(zones zone6 and zone8 are also supposed to be on separate nested ZFS filesystems, but whoever installed these made a mistake... will need to correct that later. Using nested ZFS filesystems so I can put a quota on each).

Any help appreciated,

Kind regards, Maarten

What does

swap -s

say?

Sorry for the delay, have been out of the office...

# swap -s
total: 38506512k bytes allocated + 11278672k reserved = 49785184k used, 3538980k available
# swap -l
swapfile             dev  swaplo blocks   free
/dev/zvol/dsk/rpool/swap 181,2       8 4194296 4194296

Not sure how to read the output of swap -s, but the output of swap -l seems to tell me that I have a swap space of (only) 2GB, and that nothing of that is used.

I don't really understand the "available" column from swap -s: that only shows 3.5G available if I count correctly. If currently 47.5GB is in use on a 72GB server, surely there should be around 20-25GB space left?

Rgds, Maarten

2 GB of swap space is likely too low for your virtual memory usage, especially on a server with 72 GB of RAM. I would suggest increasing the swap size on your system, e.g.:

swap -d /dev/zvol/dsk/rpool/swap
zfs set volsize=16g rpool/swap
swap -a /dev/zvol/dsk/rpool/swap

Make sure your swap is not currently in use (check with "swap -l") before running these commands.

For a detailed breakdown of RAM usage, to see where the remaining 20-25 GB are used (likely in ZFS-related kernel memory), you can run this command:

# echo ::memstat|mdb -k

Hi,

Just added 32G swap space. Let's see how that turns out (after the weekend though).

that memstat|mdb command gives me (after adding the swap space):

# echo ::memstat | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     834125              3258    4%
ZFS File Data             2074033              8101   11%
Anon                      9626639             37604   51%
Exec and libs              490864              1917    3%
Page cache                 281713              1100    1%
Free (cachelist)           912592              3564    5%
Free (freelist)           4650073             18164   25%

Total                    18870039             73711
Physical                 18362748             71729

How do I translate that into English? I assume that "Anon" is the memory used by my applications. The freelist seems to indicate that there's still 18G available though...

If there is still 18G free, why would swap space already be needed? Don't get me wrong, I'd be thrilled if just adding some swap solves my issue, but I'd like to explain to my colleagues why it's necessary. I was the one insisting on dumping as much memory as possible into this server, and I would feel a bit silly now if it turns out we could just have created a huge swap area as well :-)

Btw. I created the extra swap on my rpool, which is a mirror of 2 300GB SAS drives. My other pool is a raidz1 array of 6 1TB SATAII drives. Any reason why I should prefer one pool over the other for swap space?

Rgds, maarten

As I edited the commands while you were replying, just make sure you delete and re-add the swap for the change to take effect immediately.

Indeed, Anon basically covers all memory allocated by processes at run time (heap, stack and the like).

Because they are different concepts. RAM is what you physically install in your box, but (virtual) memory is what processes live in. Virtual memory is the sum of a part of RAM and the whole swap area(s). If that sum is too small for all the memory reservations to fit, you'll have serious trouble. That doesn't mean your swap space will be used at all in terms of disk I/O; it is just that the space must be there "just in case" it is needed later.

Both are fine. Adding RAM allows your applications and your OS to run at optimal speed; adding swap allows more of your applications to run concurrently. If you don't have enough RAM, your applications' performance will likely degrade very significantly. If you don't have enough swap, your applications either won't start or will randomly crash...

As long as the RAM is large enough for the swap not to be used (swap -l), you can use the slower disks for swap. On the other hand, if your system starts thrashing because of RAM exhaustion, the faster disks would be (slightly) better...

Decided not to wait and tried it out immediately. Adding swap space seems to have solved my issue. The zone that we had to stop earlier could be started again now, with all its applications.

Strange thing is that even after starting that zone, there is no actual swap space used!


# swap -l
swapfile             dev  swaplo blocks   free
/dev/zvol/dsk/rpool/swap 181,2       8 4194296 4194296
/dev/zvol/dsk/rpool/swap1 181,3       8 67108856 67108856
#

See? blocks == free...

Is Solaris doing something like "eager, ahead-of-time allocation", in anticipation that it might need that memory in the future? Otherwise I have no explanation for this myself...

Maarten

---------- Post updated at 16:41 ---------- Previous update was at 16:22 ----------

Actually I just added a second swap device (swap -l now has 2 lines output). Seemed to be the less risky option.

The point for me is that I don't see any memory reservations that would exceed my physical RAM. Well, actually, that's not entirely true. Going back to the "memstat | mdb" output:

# echo ::memstat | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     834125              3258    4%
ZFS File Data             2074033              8101   11%
Anon                      9626639             37604   51%
Exec and libs              490864              1917    3%
Page cache                 281713              1100    1%
Free (cachelist)           912592              3564    5%
Free (freelist)           4650073             18164   25%

Total                    18870039             73711
Physical                 18362748             71729

I see that Total is larger than Physical. I just don't understand why that would be the case. Couldn't Solaris "make this fit" by reducing the freelist a tiny bit in size? Why does the above list of kernel/zfs/anon/exec/page cache/cachelist/freelist have to add up to something that exceeds the physical RAM size?

Maarten

There must be a reason, but I wouldn't spend time investigating the Total vs Physical discrepancy; it is unrelated to your issue anyway.
You are confusing RAM and virtual memory in your statement:

Allocation is done on virtual memory, not RAM. When a process asks for 2 GB of memory, no RAM is involved. Only when that memory is written to or read from does it need to be in RAM, and then not the full 2 GB but only the memory pages actually accessed.
Unlike Linux, AIX and others, Solaris doesn't overcommit memory, so it makes sure all allocations are backed by either RAM or swap at allocation time.
GNU/Linux, by contrast, doesn't do this, so in case of memory shortage it just kills otherwise healthy processes to free RAM, which might not be what people expect from their OS.

I am experiencing the same issues on SPARC systems now. I have two M5000s with 32GB RAM in each and couldn't "boot" more than 3-4 zones. Once 3 zones were running, the amount of available memory would drop to 2-4G and I could not start another zone.

I have increased the swap on one of the M5000's and am going to attempt to boot the other zone.

This thread is a good read. Thanks jlliagre for explaining the memory configurations and ZFS swap file changes.

Have you enabled rcapd - resource capping?

I tried to configure that before and fubar'd the global zone server and had to rebuild it. :o