Intermittent "slowdowns" on Solaris 2.6

We have a Sun Enterprise 450 running Solaris 2.6 that is giving us intermittent performance problems. They happen several times a day and usually last a few minutes each time. sar shows that when this occurs, CPU idle time is at or near 0%, with sys and wio both high. iostat shows the disks 100% busy during these "slowdowns", but it also shows a drop in writes during this time (reads stay about the same). My question is: what is the disk doing to make it 100% busy, if writes are actually down and reads are unchanged? I suspect some sort of hardware problem, but Sun has investigated this for us and does not believe it's a hardware fault.
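
For reference, this is roughly what I'm running while a slowdown is in progress (the exact intervals aren't critical):

    sar -u 5 60        # CPU breakdown: %usr / %sys / %wio / %idle
    iostat -x 5        # per-disk r/s, w/s, Kr/s, Kw/s, plus svc_t and %b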

Any ideas?

Thanks

Sean

More information would be required - and even then it's hard to completely diagnose your problem.

  1. What application(s) are running on the server?
  2. How are the filesystems laid out?
  3. How much memory (physical) in the server?
  4. How much swap space?
  5. What (if any) disk management software is being used?
  6. Do you have top installed? What does it show as the top processes during the 'problem'?
  7. Has the problem always occurred?
  8. What changed recently - more users, more data, more cron jobs, more applications, ....

It comes down to knowing your system - something that may be hard to do if you've never looked before. If you don't know how it ran before, you may not know what caused the change. Another thing - is this temporary slowdown actually seen as a problem by the users? If they don't complain, don't fix it. Just start keeping a history of how the machine runs. Then, when they do complain, you can show how the workload has outgrown the server.
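
For the history, the stock sar collection is usually enough. The sa1/sa2 entries ship commented out in the 'sys' crontab; uncomment them (crontab -e sys, as root) and you get binary snapshots in /var/adm/sa that you can replay later with sar -f. They look something like this (intervals vary by site):

    0 * * * 0-6 /usr/lib/sa/sa1
    20,40 8-17 * * 1-5 /usr/lib/sa/sa1
    5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A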

Remember, adding memory can solve one problem and cause another - the same goes for adding CPU and disk space. Rearranging how your data is laid out may help without buying anything.

  1. What application(s) are running on the server?

We have around 100-150 users at any one time running an in-house app written in C that accesses various C-ISAM database files ranging in size from 100KB to 70MB.

  2. How are the filesystems laid out?

The various database files are spread between three file systems. We use Solstice DiskSuite to stripe and mirror the file systems.

  3. How much memory (physical) in the server?

2 GB

  4. How much swap space?

4 GB

  5. What (if any) disk management software is being used?

SDS.

  6. Do you have top installed? What does it show as the top processes during the 'problem'?

top doesn't show any process using more than 1 or 2 percent CPU during the "slowdown". Also, the %usr figures from sar during these times are very low, so it doesn't seem to be related to a user process.
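
One caveat I'm aware of: top only samples, so a burst of short-lived processes can hide from it. sar's run-queue and fork/exec counters should still catch that sort of thing, so those seem worth watching too:

    sar -q 5 60        # runq-sz / %runocc - is anything queued for the CPU?
    sar -c 5 60        # system call rates, including fork/s and exec/s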

  7. Has the problem always occurred?

The problem has been occurring for a few months now.

  8. What changed recently - more users, more data, more cron jobs, more applications, ....

There have been some changes to the applications, and the data has grown over the last six months or so. However, my feeling is that these changes have not caused the problem. We have another identical machine that has had all the same changes but is not experiencing any performance issues.

The users experience extremely slow response times when the performance drops. Basically, the system is almost unusable for them for a few minutes.

The main thing I can't understand is why the disks show 100% busy while the performance problem is occurring, yet less than the normal amount of data is being written during these times.
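
As I understand it, iostat's %b only means the device had at least one request outstanding during the interval, so if service times blow out, a disk can report 100% busy while moving very little data. That's why I've been watching the extended output as well:

    iostat -x 5        # svc_t = average service time (ms), actv = requests outstanding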

Sorry, more questions!

The 'identical' machine - how many users does it have on it?

(back to the original server) Are you running NFS on it sharing out drives?
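
If you are, the server-side NFS counters are cheap to check - snapshot them during a good period and again during a slowdown and compare:

    share              # what's actually being exported
    nfsstat -s         # server-side RPC/NFS call counts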

What does swap look like during normal times versus slow times?
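
A quick way to compare, assuming you can catch one of the slow periods as it happens:

    swap -s            # swap allocated / reserved / available
    vmstat 5           # a sustained non-zero 'sr' (scan rate) means the page daemon is working hard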

Do the users on the normal server have the same paths as on the 'second' server? Do the two servers reside on the same network?

Are there any backup processes running? (Both our Sybase and Oracle DBAs run hourly backups. We found that the Sybase one ran gzip across an NFS mount, which was killing our server for about three minutes each hour - a combination of the program running from an NFS mount, doing the compression, and the data living on the same drive. top didn't show it except once in a while - it was more guessing than knowing that it was the problem, and we backed it off to every two hours during the normal work day.)
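
It's worth five minutes to audit everything that's scheduled - for every user, not just root - with something like:

    # dump every crontab on the box (run as root)
    for u in /var/spool/cron/crontabs/*
    do
        echo "==== $u"
        cat "$u"
    done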

Is /etc/system the same on both servers?
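
Even a straight diff against a copy pulled from the other box will settle it. This assumes rsh/rcp trust between the two machines, and 'otherbox' stands in for the second server's hostname:

    rcp otherbox:/etc/system /tmp/system.other
    diff /etc/system /tmp/system.other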

Do you have any 3rd-party monitoring software (such as BMC Patrol or Landmark) that shows anything?

Are you the only admin who can change things? Are you SURE no one added something or changed when things run?