Poor disk performance but no sign of failure

Hello guys,
I have two servers performing the same disk operations. I believe one server has a disk approaching failure, but I have no hard evidence to prove it. This is a pair of Netra 210s, each with two drives in a hardware RAID mirror (LSI RAID controller). During intensive reads and writes, the suspect system now gets backed up with data that it previously handled without issue.

raidctl -l shows the array is still in optimal condition, but iostat shows wsvc_t and asvc_t are much higher on the server with the potential problem. iostat -Exn shows only 2 soft errors, 0 hard errors, and 0 transport errors.
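
For reference, I'm watching the service times with something like this on both boxes (5-second intervals; -z just hides idle devices):

    iostat -xnz 5

and comparing the wsvc_t/asvc_t columns for the mirror volume side by side.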

The load average ranges between 1.25 and 2.0, CPU utilization stays below 40%, and the server is not under memory pressure at this time either.

What else can I look at to help identify this problem? Thanks for looking.

Plenty to look at here, though there's no easy answer:-

  • Using vmstat (see your man page for what your output shows you), is your server paging a lot? Consider the placement of the page volumes/files. If both servers have matching memory, there may be a process consuming lots of memory on the slow one; have a look with something like ps el|sort -n +9, though that is based on the AIX version of ps, so you will need to read your man page carefully for the Solaris equivalent (see the sketch after this list). Take care to check whether you want the flags with or without the leading hyphen.
  • Is there a process you don't expect running disk syncs all the time? We have users of SQL tools forgetting where they are and initiating /usr/bin/update by mistake, and that cripples us sometimes (again, see the sketch below for a quick check).
  • Are the disks actually comparable?
  • Are you the only user of both servers or is something else skewing your results?
  • Have you recently replaced a disk, and is one server still resynchronising its mirror? Are the RAID controller status displays showing that you are fully operational?
  • Is anything else hitting your network card and causing the server to spend some time responding to that?
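
As promised above, here's a rough Solaris translation of those first few checks. This is from memory, so do verify the flags against your own man pages before trusting them:

    vmstat 5                  # watch the sr (scan rate) column for paging pressure
    prstat -s rss             # processes sorted by resident memory, biggest first
    ps -ef | grep -i update   # anything unexpectedly running sync/update all the time?
    raidctl -l c1t0d0         # detailed mirror status (substitute your volume name)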

Sorry to be soooo vague, but tracing what's going on and looking for contention is one of the less fun things you have to do as a system manager (more than just an administrator). It can prove a costly time investment.

I hope that this helps somewhere, but I'm sure there will be other suggestions to trawl through too.

Robin
Liverpool/Blackburn
UK

Thanks for the suggestions. I have checked most of that to no avail. I found that format has some disk-analysis tools built in, but I have questions about whether they will damage data on the system. These are the options I am interested in. Now, how is it that one option doesn't harm SunOS while the other two don't harm data? Has anyone performed these tests?

format> analyze

ANALYZE MENU:
read - read only test (doesn't harm SunOS)
refresh - read then write (doesn't harm data)
test - pattern testing (doesn't harm data)

:eek: Eeek! Time to be careful. :eek:

On Solaris/SunOS, format is mainly the disk slicing tool. You can destroy the system pretty easily with it, as I know to my cost :o

The analyze tests you have found should be okay to run; they will look for dodgy disk blocks and perhaps will flush something out, but they will really hurt performance whilst they run. I'm not sure if you have to have all the filesystems on those disks unmounted first. It's been such a long time :rolleyes:
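
From memory (so treat this as a sketch rather than gospel), the sequence for the non-destructive pass is along these lines, picking the suspect disk when format lists them:

    # format
    format> analyze
    analyze> read

The refresh and test passes do write to the disk, even if they put the original data back, which is why I'd want to be sure about that unmounting question before running them.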

Just because I'm paranoid: make sure you get a good backup before you start, then read the manual pages several times to be sure. :b:
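
For the backup itself, assuming UFS filesystems and a local tape drive at /dev/rmt/0 (adjust both to your setup), a level-0 dump per slice is the old faithful:

    # ufsdump 0uf /dev/rmt/0 /dev/rdsk/c1t0d0s0

Then ufsrestore tf /dev/rmt/0 to list the tape contents and prove the dump is readable before you let format loose.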

Robin
Liverpool/Blackburn
UK

Thanks Robin,
I was able to run the read test against our server in the lab without any problems. To run the other commands, the disk has to be unmounted.
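
For anyone who finds this thread later: since the write tests need the filesystems offline, the plan for the boot disk itself would be to boot single-user from install media at the OBP prompt and run format from there (this is for SPARC, and I haven't actually run the write tests this way yet):

    ok boot cdrom -s
    # format

That way nothing on the mirror is mounted while refresh or test runs.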