Solaris 10 VMs hang after about a week

Hi all,

We've been having issues with quite a few Solaris 10 VMs hanging after about a week of uptime. These VMs run on VMware ESXi 4.1 U1 hosts, and the issue is not tied to any specific host. We also run CentOS VMs on the same hosts and are not seeing any issues with those. The Solaris VMs that are affected are at a few different patch levels; I've seen it occur on 147441-15, 144489-17, and 142910-17.

These VMs run Tomcat (5.5.33, and some run 6.0.18) and PostgreSQL 9.0.3. The Tomcat apps use JDK 1.6.0_26. When the hang occurs, all network services stop responding, and the console echoes what I type but never responds or gives me a prompt.

I booted one of the affected VMs with the -k parameter to enable the kernel debugger. When the hang occurred, I followed the instructions in "x86: How to Force a Crash Dump and Reboot of the System" (System Administration Guide: Basic Administration) to invoke a system dump.
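
For reference, the sequence I followed was roughly this (these are x86 VMs, so the console break sequence is F1-A):

# load kmdb at boot by appending -k to the kernel line in GRUB
# (on a running system it can be loaded with: mdb -K, then :c to continue)

# when the hang occurs, break into the debugger from the console with F1-A
# and force a panic plus crash dump:
$<systemdump

# after the reboot, savecore writes the dump into the directory configured
# with dumpadm (running dumpadm with no arguments shows the current settings)
dumpadm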

I analyzed the dump with the Solaris Crash Analysis Tool (SCAT) and this was the output:

I'm thinking the hangs are memory related, based on the output from SCAT. These VMs have 2 GB of memory. Would a lack of memory cause Solaris to completely hang? Shouldn't it be reserving some for the kernel? There is no useful information in /var/adm/messages when the hang occurs.
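
For what it's worth, on the VMs that are still up I plan to keep an eye on memory with something like the following, to see whether the kernel side is eating into the 2 GB before a hang:

# overall memory breakdown (kernel, anon, page cache, free), run as root
echo ::memstat | mdb -k

# virtual swap accounting
swap -s

# watch the free column and the sr (page scan rate) column over time
vmstat 5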

Thanks for any help you can provide.

Derek

Hello, your assumption is potentially a good one:

  freemem 1754 7184384 (6.85M)

Also, your system is doing a lot of swapping, which is not good for performance.

WARNING: needfree is 80 pages
WARNING: freemem_wait is 80 (threads)
WARNING: page_create() throttled (freemem < throttlefree)
WARNING: hard swapping (avefree < minfree)
NOTE: nscan is 44505
NOTE: push_list_size is 256
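
If you want to compare those numbers against the paging thresholds on one of the systems that is still running, something like this should show them (values are in pages):

# free memory and the paging thresholds, in pages
kstat -p unix:0:system_pages:freemem
kstat -p unix:0:system_pages:lotsfree
kstat -p unix:0:system_pages:desfree
kstat -p unix:0:system_pages:minfree

# throttlefree can be read from the kernel directly
echo "throttlefree/D" | mdb -k

# page size in bytes (4096 on x86), to convert pages to bytes
pagesize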

Can you provide output from:

SCAT> thread summary 
SCAT> dev busy

Hi vmcore,

Here is the information you requested. Thanks for your help with this!

CAT(/var/crash/unknown/vmcore.3/10X)> thread summary
        reference clock = panic_lbolt: 0x4b7e438, panic_hrtime: 0x2d11bb0cecb6b
   27   threads ran since 1 second before current tick (19 user, 8 kernel)
  104   threads ran since 1 minute before current tick (91 user, 13 kernel)

   12   TS_RUN threads (7 user, 5 kernel)
    0   TS_STOPPED threads
    8   TS_FREE threads (0 user, 8 kernel)
   54*  !TS_LOAD (swapped) threads (54 user, 0 kernel)
    3*  !TS_LOAD (swapped) but TS_RUN threads (3 user, 0 kernel)

    4*  threads trying to get a mutex (3 user, 1 kernel)
          longest sleeping 7 minutes 15.56 seconds earlier
    0   threads trying to get an rwlock
  505   threads waiting for a condition variable (320 user, 185 kernel)
    1   threads sleeping on a semaphore (0 user, 1 kernel)
          longest sleeping 9 days 3 hours 53 minutes 23.76 seconds earlier
   53   threads sleeping on a user-level sobj (53 user, 0 kernel)
   39   threads sleeping on a shuttle (door) (39 user, 0 kernel)

    0   threads in biowait()
    1*  threads in zio_wait() (0 user, 1 kernel)

    9   threads in dispatch queues (4 user, 5 kernel)
    1*  interrupt threads running (0 user, 1 kernel)

  631   total threads in allthreads list (428 user, 203 kernel)
    1   thread_reapcnt
    1   lwp_reapcnt
  633   nthread

CAT(/var/crash/unknown/vmcore.3/10X)> dev busy

Scanning for busy devices:
No busy/hanging devices found
Scanning for threads in biowait:

   no threads in biowait() found.

Scanning for procs with aio:

Derek

It is for sure an issue with the amount of memory this system has.
The bottom line is a lack of memory, but if you want to dig in further, a few more outputs will help.

CAT> tlist findcall zio_wait
CAT> tlist sobj mutex
CAT> swapinfo
CAT> tlist findcall pageout

The reason for these last few outputs is to confirm or rule out the following bug: 6898318 "ZFS root system can hang swapping to zvol".


Thanks for your help! Here are the outputs of those commands:

CAT(/var/crash/unknown/vmcore.3/10X)> tlist findcall zio_wait
==== kernel thread: 0xfffffe80007f9c60  PID: 0 ====
cmd: sched
t_wchan: 0xffffffffb4959fc8  sobj: condition var (from zfs:zio_wait+0x53)  
t_procp: 0xfffffffffbc276e0(proc_sched)
  p_as: 0xfffffffffbc293c0(kas)
  zone: global
t_stk: 0xfffffe80007f9c60  sp: 0xfffffe80007f99e0  t_stkbase: 0xfffffe80007f2000
t_pri: 60(SYS)  pctcpu: 0.000000
t_lwp: 0x0  psrset: 0  last CPU: 0  
idle: 43557 ticks (7 minutes 15.57 seconds)
start: Sat Apr 21 16:11:11 2012
age: 792212 seconds (9 days 4 hours 3 minutes 32 seconds)
tstate: TS_SLEEP - awaiting an event
tflg:   T_TALLOCSTK - thread structure allocated from stk
tpflg:  none set
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
pflag:  SSYS - system resident process

pc:      unix:_resume_from_idle+0xfb resume_return:  addq   $0x8,%rsp
startpc: zfs:txg_sync_thread+0x0:  pushq  %rbp

unix:_resume_from_idle+0xfb resume_return()
unix:swtch+0x135()
genunix:cv_wait+0x68()
zfs:zio_wait+0x53()
zfs:dsl_pool_sync+0xba()
zfs:spa_sync+0x2d7()
zfs:txg_sync_thread+0x1d2()
unix:thread_start+0x8()
-- end of kernel thread's stack --


   1 thread with that call found.

CAT(/var/crash/unknown/vmcore.3/10X)> tlist sobj mutex
  thread             pri pctcpu           idle   PID              wchan command
  0xfffffe800054ac60  99  0.044       7m15.56s     5 0xffffffff800a2250 zpool-rpool
  0xfffffe8000544c60  99  0.014       7m15.56s     5 0xffffffff800a2250 zpool-rpool
  0xfffffe8000532c60  99  0.062       7m15.56s     5 0xffffffff800a2250 zpool-rpool
  0xfffffe8000339c60  60  0.237       7m15.34s     3 0xffffffff800a2250 fsflush

   4 threads with that sobj found.

top mutex/rwlock owners:
count   thread
    4   0xfffffe8000538c60  state: run   wchan: 0x0                 sobj: undefined

CAT(/var/crash/unknown/vmcore.3/10X)> swapinfo
swap device: /dev/zvol/dsk/rpool/swap
  vp: 0xffffffff849849c0 (181(zfs),1)  si_soff: 0x1000  si_eoff: 0x80000000  si_allocs: 122
  flags: 0x0  pages: 524287 (1.99G)  free pages: 458362 (1.74G)
  map size: 65536  si_swapslots: 0xffffffff88b3a000
CAT(/var/crash/unknown/vmcore.3/10X)> tlist findcall pageout
==== kernel thread: 0xfffffe8000351c60  PID: 2  on CPU: 0 ====
cmd: pageout
t_procp: 0xffffffff802a3998(proc_pageout)
  p_as: 0xfffffffffbc293c0(kas)
  zone: global
t_stk: 0xfffffe8000351b70  sp: 0xfffffe80003519d0  t_stkbase: 0xfffffe800034d000
t_pri: 97(SYS)  t_tid: 2  pctcpu: 25.949932
t_lwp: 0xffffffff800b1080  lwp_regs: 0xfffffe8000351b70
  mstate: LMS_SYSTEM  ms_prev: LMS_SYSTEM
  ms_state_start: 17.263364083 seconds earlier
  ms_start: 9 days 4 hours 4 minutes 35.349767316 seconds earlier
psrset: 0  last CPU: 0  
idle: 28 ticks (0.28 seconds)
start: Sat Apr 21 16:11:14 2012
age: 792209 seconds (9 days 4 hours 3 minutes 29 seconds)
tstate: TS_ONPROC - thread is being run on a processor
tflg:   T_TALLOCSTK - thread structure allocated from stk
tpflg:  TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
        TS_SIGNALLED - thread was awakened by cv_signal()
pflag:  SSYS - system resident process
        SNOWAIT - children never become zombies

pc:      unix:_resume_from_idle+0xfb resume_return:  addq   $0x8,%rsp

unix:_resume_from_idle+0xfb resume_return()
genunix:pageout_scanner+0x26d()
unix:thread_start+0x8()
-- end of kernel thread's stack --

==== kernel thread: 0xfffffe8000333c60  PID: 2 ====
cmd: pageout
t_wchan: 0xffffffff82e2fb8e  sobj: condition var (from zfs:txg_wait_open+0x73)  
t_procp: 0xffffffff802a3998(proc_pageout)
  p_as: 0xfffffffffbc293c0(kas)
  zone: global
t_stk: 0xfffffe8000333b70  sp: 0xfffffe8000333790  t_stkbase: 0xfffffe800032f000
t_pri: 98(SYS)  t_tid: 1  pctcpu: 1.389232
t_lwp: 0xffffffff800b1e00  lwp_regs: 0xfffffe8000333b70
  mstate: LMS_SLEEP  ms_prev: LMS_SYSTEM
  ms_state_start: 7 minutes 31.321261393 seconds earlier
  ms_start: 9 days 4 hours 4 minutes 35.350472035 seconds earlier
psrset: 0  last CPU: 0  
idle: 43414 ticks (7 minutes 14.14 seconds)
start: Sat Apr 21 16:11:14 2012
age: 792209 seconds (9 days 4 hours 3 minutes 29 seconds)
tstate: TS_SLEEP - awaiting an event
tflg:   T_TALLOCSTK - thread structure allocated from stk
tpflg:  TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
        TS_DONT_SWAP - thread/LWP should not be swapped
pflag:  SSYS - system resident process
        SNOWAIT - children never become zombies

pc:      unix:_resume_from_idle+0xfb resume_return:  addq   $0x8,%rsp

unix:_resume_from_idle+0xfb resume_return()
unix:swtch+0x135()
genunix:cv_wait+0x68()
zfs:txg_wait_open+0x73()
zfs:dmu_tx_wait+0xc4()
zfs:dmu_tx_assign+0x38()
zfs:zvol_strategy+0x267()
genunix:bdev_strategy+0x54()
specfs:spec_startio+0x81()
specfs:spec_pageio+0x29()
genunix:fop_pageio+0x28()
genunix:swap_putapage+0x1ed()
genunix:swap_putpage+0x26c()
genunix:fop_putpage+0x28()
genunix:pageout+0x281()
unix:thread_start+0x8()
-- end of kernel thread's stack --


   2 threads with that call found.

This specific VM is on patch level 147441-15, so I would assume it is not affected by that bug, unless a regression was introduced.
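
For anyone checking their own systems, the patch level can be confirmed on the VM itself with something like:

# kernel patch level of the running VM, e.g. Generic_147441-15
uname -v

# confirm whether a given patch is installed
showrev -p | grep 147441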

Thanks,
Derek

---------- Post updated at 01:30 PM ---------- Previous update was at 12:53 PM ----------

I see that they also suggest setting the primarycache ZFS property for the swap dataset to metadata. I just checked and mine is already set to that :(.

Actually, you should still try the workaround for the above CR, as it has not been fixed yet (in your kernel version).
Official:

There is no resolution to this issue at this time. The Solaris 10 official fix for CR# 6898318 "ZFS root system can hang swapping to zvol" went into patch 147440-04 and was later backed out due to CR# 7108029 "with 147440-04 installed on vxvm root system panics in swapify()". CR# 6898318 is now being tracked by CR# 7106883 "swap zvol preallocation does not work on Solaris 10" for resolution through a patch.

Workaround:

  • Use a raw swap partition instead of swapping to a zvol, or
  • zfs set primarycache=metadata {poolname}/{swapvol}
    This workaround prevents the issue from occurring. To make sure it is properly applied:
    1. Execute swap -l to get the names of the zpools and volumes involved.
    2. View the primarycache property with:
    zfs get primarycache {poolname}/{swapvol}
    and change it with:
    zfs set primarycache=metadata {poolname}/{swapvol}
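
For this particular VM, with the rpool/swap zvol shown in the swapinfo output above, the same steps would look roughly like this:

# confirm what is actually being used for swap
swap -l

# view the current setting on the swap zvol
zfs get primarycache rpool/swap

# and set it to metadata if it is not already
zfs set primarycache=metadata rpool/swap

# alternative: move swap off the zvol onto a raw slice
# (the slice name below is only an example -- use whatever slice is free on your system)
swap -d /dev/zvol/dsk/rpool/swap
swap -a /dev/dsk/c1t0d0s1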

Also, this is only from experience, but I would not run ZFS on a system with such limited resources; 2 GB of RAM is simply not enough for acceptable performance.
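
If these VMs have to stay at 2 GB, one general mitigation worth considering (this is not part of the official workaround for the CR, just common practice on small-memory ZFS systems) is to cap the ARC so it cannot squeeze out the applications and the swap path. In /etc/system, for example:

* limit the ZFS ARC to 512 MB (0x20000000 bytes); size it to suit the workload
set zfs:zfs_arc_max = 0x20000000

A reboot is required for /etc/system changes to take effect.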
