Entire server unresponsive

frum · February 12, 2015, 1:11pm

Hi guys,

I have a SUN M5000 server running several Solaris zones (whole root). All the zones run SAP systems. Recently, one of the SAP systems hung; I suppose it was a memory issue. I was not able to log into the zone at all. In fact, I was not able to log onto the server (global zone) either. I started halting the zones one by one, and at some stage I was able to log onto the global zone.

Is it possible for one particular zone to hang the entire server? What can be done to avoid this?
What commands other than prstat -Z will help identify the issue, symptoms, etc.?

Of course, I'm also looking at the SAP side in terms of memory tuning to prevent this from happening again.

regards.

vbe · February 12, 2015, 2:25pm

Client-server? Between what?
You should have reacted earlier if a zone created such a situation, because afterwards we can only guess at a few reasons:
1) It can happen if badly designed...
2) I can't remember

But more importantly, what did you find in your logs? What caused the hang on the system side, not the application side? Overload? etc.

If I were asked for a reason at first glance, and this is a client-server box with, let's say, many hundreds of concurrent accesses from PCs, I would say look with netstat for FIN_WAIT and related states, because I would think that, if badly tuned, you run out of sockets, which would explain why you cannot open new connections.
I will let others give you a better explanation than I can at the moment.
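A minimal sketch of that netstat check, assuming the TCP state is the last field on each connection line (as in `netstat -an -P tcp` on Solaris). The function name `tcp_state_counts` and the `AWK` override are made up for illustration; on Solaris you would likely point `AWK` at `nawk`.

```shell
# Count TCP connections per state from netstat-style output on stdin.
# Header and separator lines are skipped because their last field
# is not a known TCP state name.
tcp_state_counts() {
    ${AWK:-awk} '
        $NF ~ /^(ESTABLISHED|LISTEN|IDLE|BOUND|SYN_SENT|SYN_RCVD|FIN_WAIT_1|FIN_WAIT_2|CLOSE_WAIT|CLOSING|LAST_ACK|TIME_WAIT)$/ {
            count[$NF]++
        }
        END {
            for (s in count) printf "%s %d\n", s, count[s]
        }
    ' | sort
}

# Usage on a live system:
#   netstat -an -P tcp | tcp_state_counts
```

A large and growing count of FIN_WAIT_2, CLOSE_WAIT or TIME_WAIT entries would support the "running out of sockets" theory; a short list would point elsewhere.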

Good Luck in your investigation

jim_mcnamara · February 12, 2015, 2:39pm

You have to use zone resource management to prevent that problem. This is dummied-up output from prctl -i zone [zonename]:

zone.max-swap
        system          16.0EB    max   deny                                 -
zone.max-locked-memory
        system          16.0EB    max   deny                                 -
zone.max-shm-memory
        system          20.0GB    max   deny                                 -
zone.max-shm-ids
        system            1.8M     max   deny                                 -
zone.max-sem-ids
        system          16.8M     max   deny                                 -
zone.max-msg-ids
        system          16.8M     max   deny                                 -
zone.max-lwps
        system            8.4K     max   deny                                 -
zone.cpu-cap
        privileged        200       -   deny                                 -
        system          4.29G     inf   deny                                 -
zone.cpu-shares
        privileged          1       -   none                                 -
        system          65.5K     max   none

You can control these settings with zonecfg, or dynamically with prctl.
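A rough sketch of both routes, written as a dry run that only prints the commands so they can be reviewed before anyone runs them as root in the global zone. The function name `zone_memcap_cmds`, the zone name `sapzone1` and the 64g/80g sizes are made up; the `capped-memory` resource and the `zone.max-swap` control are the documented ones, but check zonecfg(1M) and prctl(1) on your release for the exact option syntax.

```shell
# Print (do not run) commands that cap a zone's memory.
# $1 = zone name, $2 = physical memory cap, $3 = swap cap (e.g. 64g, 80g).
zone_memcap_cmds() {
    zone=$1 phys=$2 swap=$3
    # Persistent: stored in the zone configuration, applied when the zone boots.
    echo "zonecfg -z $zone 'add capped-memory; set physical=$phys; set swap=$swap; end'"
    # Dynamic: affects the running zone only and is lost when the zone reboots.
    # Assumption: no privileged zone.max-swap value exists yet, so prctl's
    # implied "set" action is used; prctl(1) documents -r for replacing one.
    echo "prctl -n zone.max-swap -v $swap -t privileged -e deny -i zone $zone"
}

# Hypothetical zone name and sizes, for illustration only.
zone_memcap_cmds sapzone1 64g 80g
```

Printing first is deliberate: a cap set too low on a SAP zone can cause allocation failures inside the zone, so the values deserve a second look before they are applied.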

Examining the running system requires using iostat, prstat, fsstat, netstat -s, and
echo '::memstat' | mdb -k # from global zone

to get a BASIC idea. Advanced probing usually requires dtrace.
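To make the ::memstat output quicker to read, here is a small sketch that lists the memory categories by size in MB, largest first. It assumes the usual layout where each category row ends in a percentage; the function name `memstat_by_size` and the `AWK` override are made up for illustration.

```shell
# Read `::memstat` output on stdin and print "MB category", largest first.
# Only rows whose last field looks like "55%" are category rows; the header,
# separator, Total and Physical lines are skipped.
memstat_by_size() {
    ${AWK:-awk} '
        $NF ~ /^[0-9]+%$/ {
            mb = $(NF - 1)                 # MB column sits just before %Tot
            name = $1                      # category name may be several words
            for (i = 2; i <= NF - 3; i++) name = name " " $i
            printf "%d %s\n", mb, name
        }
    ' | sort -rn
}

# Usage (global zone, as root):
#   echo '::memstat' | mdb -k | memstat_by_size
```

Whatever ends up at the top of the list (kernel, ZFS file data, or anonymous memory used by applications) tells you where to look next.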

frum · February 13, 2015, 3:00am

Thanks vbe and Jim for your replies.

I only detected the problem a bit late.

Jim, can you please briefly interpret the output of the following (what do I need to look for):
1)

prctl -i zone <zonename>

2) In my case, output of

echo '::memstat' | mdb -k # from global zone

is:

Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    1536666             12005    5%
ZFS File Data            18056512            141066   55%
Anon                     11687507             91308   35%
Exec and libs              559302              4369    2%
Page cache                  76990               601    0%
Free (cachelist)            47223               368    0%
Free (freelist)           1010477              7894    3%

Total                    32974677            257614
Physical                 32952795            257443

How to interpret this?

regards.

Peasant · February 13, 2015, 5:26am

I would start by limiting the ZFS ARC cache maximum in the global zone, as well as in kernel zones, to some sane value, depending on the workload.

Depending on what you run in the zones, you might want to limit the ZFS ARC to a couple of GB at most (leaving everything else to the service in question).

This will, of course, limit the read performance of the host (the cache is smaller, so fewer cache hits and more physical reads).
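On Solaris 10 and 11 the ARC ceiling is commonly set with the zfs_arc_max tunable in /etc/system in the global zone. As a small sketch, the helper below converts a size in GB to bytes and prints the matching line; the function name `arc_max_line` and the 8 GB figure are made up, and you should check the tunable parameters documentation for your release before relying on it.

```shell
# Print an /etc/system line that caps the ZFS ARC at the given number of GB.
# Assumptions: the zfs_arc_max tunable applies to your Solaris release, and
# the system is rebooted afterwards so /etc/system is re-read.
arc_max_line() {
    gb=$1
    bytes=$((gb * 1024 * 1024 * 1024))
    echo "set zfs:zfs_arc_max = $bytes"
}

# 8 GB is an illustrative value, not a recommendation.
arc_max_line 8
```

The printed line would be appended to /etc/system by hand after review, rather than written automatically, since a typo in that file can affect the next boot.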

Do you run everything on ZFS filesystems (applications, databases) or some other combination?

hicksd8 · February 13, 2015, 6:38am

Yes, it is possible for one zone to eat enough resources to grossly affect other zones and global.

The tools are there to cap the memory usage of this zone in the zone configuration (zonecfg), if its consumption of physical memory is definitely the problem.

Oracle document the options here:
http://docs.oracle.com/cd/E19253-01/817-1592/z.config.ov-1/index.html

Of course, users of this zone may experience new limitations. If that's a problem, consider increasing the overall RAM in the system (again assuming your diagnosis is correct about the problem being memory).


Sorry - just realized jim_mcnamara has already said this (but I'll leave this post now anyway).

jim_mcnamara · February 13, 2015, 9:11am

Peasant is spot on: ZFS cache and databases do not play well together, so limit the ARC cache size.

The link hicksd8 gave you explains those resource limits shown by prctl, I believe.

Be sure FSS (fair share scheduling) is enabled. dispadmin does that, run from the global zone.
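To illustrate what FSS shares mean in practice: when every zone is busy, a zone's CPU entitlement is its shares divided by the total shares of the competing zones. The sketch below does that arithmetic; the function name `fss_entitlement_pct`, the zone name and the share values are made up, and the commands in the comments are the usual ones but should be checked against dispadmin(1M) and zonecfg(1M) on your release.

```shell
# Commands typically used (run as root in the global zone, shown here only
# as comments):
#   dispadmin -d FSS                          # make FSS the default scheduling class
#   zonecfg -z sapzone1 "set cpu-shares=20"   # hypothetical zone name and value

# Integer percentage of CPU a zone is entitled to when all zones are busy.
# Args: this zone's shares first, then the shares of every other busy zone.
fss_entitlement_pct() {
    mine=$1
    total=0
    for s in "$@"; do
        total=$((total + s))
    done
    echo $((100 * mine / total))
}

# This zone has 20 shares, two other busy zones have 10 each.
fss_entitlement_pct 20 10 10
```

Shares only matter under contention: an idle system still lets a single zone use all the CPU it can get, which is why zone.cpu-cap exists as a separate hard limit.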