Unusual system bog down

Solaris 10 10/09 s10s_u8wos_08a SPARC, 16 CPUs, 128GB RAM, uptime 150+ days,
2 DB zones (Oracle 9 & 10), 3 application zones.

This is from a system that was crawling: 60 seconds to execute a
single command. I had to reboot to clear it. The data below is from runs of
prstat, top, and iostat. The system has been fine since the reboot.

Most of the waits were for Oracle remote user processes in a
single DB zone.

I ran dtrace and mdb looking for CPU issues and file locks and found very few.
We had lost a SAN controller (for a Windows fileserver SAN that is not
attached to this box at all), and this slowdown started several hours
later.
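
For what it's worth, the kind of quick checks that can be run here from the global zone (a sketch only, not the exact scripts I used):

  # dump kernel stacks of all threads to see where the waiting threads are parked
  echo "::walk thread | ::findstack" | mdb -k

  # top kernel lock-contention events over a 5-second window
  lockstat -C -D 10 sleep 5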

Note: the CPUs are not actually busy, but the load averages
are absurd. Context switches were low, less than 100/sec, per dtrace.
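
Roughly the kind of DTrace one-liner that number came from (an illustrative sketch, not the exact script):

  # count context switches per second via the sysinfo provider
  dtrace -n 'sysinfo:::pswitch { @ = count(); } tick-1sec { printa("cs/sec: %@d\n", @); trunc(@); }'

mpstat 5 (the csw/icsw columns) shows the same thing without DTrace.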

iostat shows two disks with excessively high svc_t times, but not much
data being transferred.

Low-priority processes are often in waits; this is normal.
I have historical sar data; sarcheck does not see any problems other than
excessive waits on ssd18 and ssd27.

I had to reboot, so this is all I have to work with now....

Any ideas? What would cause this:

PRSTAT
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 20125 oracle   3772M 3769M wait    59    0   0:00:32 0.1% oracle/1
 18435 oracle   3762M 3759M wait    59    0   0:13:35 0.1% oracle/1
 18430 appworx    50M   47M sleep   59    0   0:06:27 0.1% uzpplpl/1
  7264 oracle   3781M 3764M wait     1    0   0:07:45 0.1% oracle/11
 12839 oracle   2551M 2535M wait    47    0   0:03:52 0.1% oracle/11
 16458 root     7688K 4864K cpu10   59    0   0:00:00 0.0% prstat/1
 18337 oracle   3762M 3759M sleep    1    0   0:04:54 0.0% oracle/1
 25080 vssrt     170M  157M sleep   59    2   0:00:44 0.0% MrepApp/1
 13886 oracle   2566M 2535M wait    38    0   0:00:06 0.0% oracle/1
 25011 oracle   3772M 3769M wait     1    0   0:00:30 0.0% oracle/1
 18334 appworx    15M   12M sleep   59    0   0:03:44 0.0% uapsogn/1
  7480 oracle   2584M 2554M wait    59    0   0:07:52 0.0% oracle/11
  7470 oracle   2584M 2556M wait    59    0   0:07:51 0.0% oracle/11
  5488 oracle   3772M 3769M wait    55    0   0:00:22 0.0% oracle/1
  8591 oracle   3762M 3759M wait    59    0   0:00:00 0.0% oracle/1
 23924 vssrt     206M  193M wait     1    2   0:04:24 0.0% DrepApp/1
 25129 oracle   3768M 3765M wait    59    0   0:00:02 0.0% oracle/1
 12857 oracle   2551M 2534M wait     1    0   0:03:53 0.0% oracle/11
  3803 oracle   3777M 3773M wait     1    0   0:00:11 0.0% oracle/15
  3751 oracle   3772M 3769M wait     1    0   0:00:28 0.0% oracle/1
 26066 oracle   2550M 2534M wait    21    0   0:06:54 0.0% oracle/1
 20904 oracle   3768M 3765M wait     1    0   0:00:05 0.0% oracle/1
  7464 oracle   2549M 2532M wait     1    0   0:06:42 0.0% oracle/1
  7266 oracle   3781M 3764M wait     1    0   0:04:45 0.0% oracle/11
  7256 oracle   3769M 3752M wait     1    0   0:06:39 0.0% oracle/1
 23930 oracle   2554M 2538M wait    59    0   0:03:07 0.0% oracle/11
 19553 oracle   3772M 3769M wait    59    0   0:00:10 0.0% oracle/1
  4058 oracle   3768M 3765M wait    60    0   0:00:14 0.0% oracle/1
 14899 oracle   3768M 3765M wait    59    0   0:00:05 0.0% oracle/1
  8670 oracle   2554M 2537M wait    58    0   0:01:35 0.0% oracle/11
 25086 oracle   2553M 2537M wait    59    0   0:00:29 0.0% oracle/11
 15891 oracle   3762M 3758M wait    57    0   0:00:00 0.0% oracle/1
 17399 oracle   3772M 3769M wait    59    0   0:00:19 0.0% oracle/1
 18260 oracle   3772M 3769M wait    59    0   0:02:05 0.0% oracle/1
  4805 oracle   3772M 3769M wait    60    0   0:00:04 0.0% oracle/1
 23116 oracle   3772M 3769M wait     1    0   0:00:14 0.0% oracle/1
 15228 oracle   3765M 3749M cpu11   59    0   0:04:44 0.0% oracle/1
  4946 oracle   3772M 3769M sleep    1    0   0:00:34 0.0% oracle/1
 29429 oracle   3772M 3769M sleep   55    0   0:00:11 0.0% oracle/1
 12875 oracle   2551M 2534M sleep   59    0   0:04:21 0.0% oracle/11
 12632 oracle   2552M 2535M sleep    1    0   0:02:30 0.0% oracle/14
 12594 oracle   2549M 2532M sleep   59    0   0:02:11 0.0% oracle/1
 11515 vssrt     196M  180M wait     1    0   0:01:57 0.0% TbApp/1
 21481 vssrt      76M   62M wait     1    2   0:01:37 0.0% BmanApp/1
 24837 vssrt     178M  165M sleep   59    2   0:01:13 0.0% MrepApp/1
 20360 oracle   3772M 3769M wait     1    0   0:00:22 0.0% oracle/1
 21726 oracle   3777M 3773M wait    57    0   0:00:34 0.0% oracle/11
Total: 1425 processes, 8621 lwps, load averages: 142.80, 134.91, 144.84

top
last pid: 18794;  load avg: 144.64,  133.78,  144.80;  up 154+00:35:28 12:16:18
1425 processes: 601 waiting, 801 sleeping, 3 on cpu                                                                          
CPU states: 95.6% idle,  3.0% user,  1.4% kernel,  0.0% iowait,  0.0% swap
Memory: 128G phys mem, 78G free mem, 32G total swap, 32G free swap

   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 25326 oracle     1  59    0 3768M 3765M wait     0:10  0.42% oracle
 12632 oracle    14  59    0 2552M 2535M wait     2:31  0.25% oracle
 18435 oracle     1  59    0 3762M 3759M wait     3:47  0.15% oracle
 23924 vssrt      1   1    2  206M  193M wait     4:26  0.12% DrepApp
 18260 oracle     1  59    0 3772M 3769M wait     2:06  0.12% oracle
  7264 oracle    11   1    0 3781M 3764M wait     7:47  0.11% oracle
 18337 oracle     1   1    0 3762M 3759M wait     4:56  0.10% oracle
  8670 oracle    11  58    0 2554M 2537M wait     1:35  0.09% oracle
 25011 oracle     1   1    0 3772M 3769M wait     0:31  0.08% oracle
 23930 oracle    11  59    0 2554M 2538M wait     3:09  0.08% oracle
  8674 oracle    11  51    0 3770M 3753M wait     1:16  0.08% oracle
 13886 oracle     1  38    0 2564M 2535M wait     0:08  0.08% oracle
 18783 oracle     1  59    0 3762M 3758M wait     0:00  0.08% oracle
  7262 oracle     1   1    0 3960M 3943M wait     4:23  0.08% oracle
 18430 appworx    1  59    0   50M   47M sleep    6:30  0.08% uzpplpl
 

The ssdNN devices are SAN LUNs.

 iostat -xm
 device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
 sd0       0.4    0.5   19.1    2.0  0.0  0.0   25.2   0   0 
 sd1       0.4    0.7   19.1    2.1  0.0  0.0   24.0   0   1 
 sd2       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
 ssd0      0.9    0.3   36.5    1.5  0.0  0.0    3.8   0   0 
 ssd1      1.0    0.3   38.9    1.6  0.0  0.0    4.0   0   0 
 ssd2      1.3    0.3   44.6    5.8  0.0  0.0    3.4   0   0 
 ssd3      0.9    0.3   37.6    2.3  0.0  0.0    3.7   0   0 
 ssd5     88.0   27.0 3181.2  311.0  0.0  0.3    2.7   0   8 
 ssd7      0.0    0.0    0.0    0.0  0.0  0.0    0.9   0   0 
 ssd8      0.1    0.0    0.5    0.0  0.0  0.0    2.1   0   0 
 ssd9      0.1    0.0    0.6    0.0  0.0  0.0    2.1   0   0 
 ssd10     0.5    1.2   14.2   49.1  0.0  0.0    2.6   0   0 
 ssd11     0.1    0.0    0.8    0.0  0.0  0.0    2.0   0   0 
 ssd12     0.3    0.0    5.8    0.1  0.0  0.0    3.5   0   0 
 ssd13     5.1    2.5  395.8  270.8  0.0  0.1    8.7   0   1 
 ssd14     2.4   23.7   46.2  121.9  0.0  0.0    1.4   0   2 
 ssd15     0.0    0.0    0.0    0.0  0.0  0.0    0.6   0   0 
 ssd16     0.1    0.0    0.2    0.0  0.0  0.0    1.9   0   0 
 ssd17     0.0    0.0    0.0    0.0  0.0  0.0    1.1   0   0 
 ssd18    73.5   12.0 13469.7  132.1  0.0  1.5   17.1   0  10
 ssd19     2.0    1.7  133.5   18.9  0.0  0.0    4.7   0   0 
 ssd23     0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
 ssd24     0.0    0.0    0.0    0.0  0.0  0.0    1.1   0   0 
 ssd25     0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
 ssd26     0.0    0.0    0.0    0.0  0.0  0.0    0.8   0   0 
 ssd27   594.9   65.9 12204.8  669.7  0.0  4.3   86.6   0  74
 ssd28     0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
 ssd29     0.1    0.2    3.1    0.4  0.0  0.0    2.5   0   0 
 ssd30     0.1    0.0    1.8    0.0  0.0  0.0    2.2   0   0 
 ssd31   140.6   25.2 11266.5  315.0  2.9  5.4   60.3   2  15
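
For completeness, roughly how one can map the ssd instances to c#t#d# names and check the SAN paths (a sketch; the exact tools depend on the FC stack and whether MPxIO is in use):

  # -n maps ssd instance names to c#t#d# device names, -z suppresses idle devices
  iostat -xnz 5

  # if MPxIO is in use, confirm both paths to the slow LUNs are still online
  mpathadm list lu

  # HBA link state and error counters (fcinfo/Leadville stack)
  fcinfo hba-port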
 

Thanks for any comments.

What does "prstat -Z" say from the global zone?

Well, in situations like this (reboot already performed), one can only offer suggestions from experience.

With uptime at 150+ days, multiple zones, and multiple Oracle instances, I would be looking at two things.

  1. Check the content of the /tmp directories in all zones to see if one of them has five million files in it. If so, do we know why? Cleaning them up often clears the issue. If this is the problem (an O/S problem) then I would expect it to recur in the short term. (A quick way to count them from the global zone is sketched after item 2.)

  2. What is the setting of the parameter "pg_contig_disable" in the /etc/system file? After a long uptime with Oracle instances running, memory can become very fragmented, and if the Oracle DB requests contiguous memory the system virtually hangs whilst working sets are shuffled to give Oracle what it wants. The cure is either to increase memory size or to allow Oracle to use non-contiguous memory. If this is the problem (an Oracle problem) then I would expect it not to recur in the short term. (A check/set sketch follows below.)
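
A quick way to check both of the above from the global zone; the zonepath field position and the kernel symbol are what I'd expect on Solaris 10, but verify on your box:

  # rough count of entries in each non-global zone's /tmp
  # (field 4 of 'zoneadm list -p' should be the zonepath)
  zoneadm list -p | awk -F: '$2 != "global" {print $2, $4}' | while read zn zp; do
      echo "$zn: $(ls "$zp/root/tmp" 2>/dev/null | wc -l) entries"
  done

  # current live value of pg_contig_disable (assumes the symbol exists on this kernel)
  echo "pg_contig_disable/D" | mdb -k

  # to change it persistently, add this line to /etc/system in the global zone and reboot
  set pg_contig_disable=1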

This really isn't very helpful, I know; just thinking aloud.

Thanks!
@jlliagre - the system was rebooted and the problem cleared. Before the reboot, prstat -Z did not show any one zone using CPU resources; nobody had CPU. As you saw, sys % time was low too, so the kernel was not thrashing AFAIK.

/tmp gets cleaned up monthly, so maybe 200 files were out there.

@hicksd8 - pg_contig_disable was 0. I think this may have precipitated the problem. OTN has some similar information; we knew about it but had decided against setting it. We rebooted, and it is now set to 1. We also forced management to acquiesce to periodic off-hours reboots; we are now allowed to reboot on weekends. The whole thing is political: no technical person is allowed input into decisions like this until something goes south.