iostat output vs TPC output (array layer)

Hi Guys,

I've been having some arguments with my colleagues about one thing. My thought has always been that, as far as disk performance is concerned, by looking at the output of the iostat command (AIX) you can identify whether you have a hot disk, and then, by moving some files off that disk or by making sure the same disk is not shared at the array level with another busy application, you would be in good shape; that approach has worked for me for quite a few years. According to IBM (and I say IBM because we use IBM storage), if a disk shows more than 35% time active it could be a sign of performance degradation. So, assuming I can shift some files around, I might be able to spread the I/O across multiple disks.

If so, do I still need to go to the array level (RAID) and check the performance stats (in our case TPC), or would the output of iostat be enough? Basically, I would like to know whether the output of iostat is accurate enough to determine if we are suffering an I/O bottleneck, or whether I still need to check the statistics/performance reports at the array/RAID level to be sure. Thanks in advance for your comments.
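
For what it's worth, a minimal sketch of how the 35% rule of thumb mentioned above could be checked automatically. It assumes the usual AIX `iostat -d` column layout (Disks, % tm_act, Kbps, tps, Kb_read, Kb_wrtn) and hdisk device names; adjust the parsing for your level and naming.

```python
#!/usr/bin/env python3
"""Flag "hot" disks from AIX `iostat -d` output (illustrative sketch)."""
import subprocess

TM_ACT_THRESHOLD = 35.0   # percent time active considered "hot" (rule of thumb)

def hot_disks(interval=5, count=2):
    # Take two samples; on AIX the first report is the history since boot,
    # so the interval samples are the interesting ones.
    out = subprocess.run(
        ["iostat", "-d", str(interval), str(count)],
        capture_output=True, text=True, check=True,
    ).stdout
    hot = {}
    for line in out.splitlines():
        fields = line.split()
        # Data lines look roughly like: hdisk3  42.0  1024.0  96.0  2048  3072
        if len(fields) == 6 and fields[0].startswith("hdisk"):
            try:
                tm_act = float(fields[1])
            except ValueError:
                continue  # header or malformed line
            if tm_act > TM_ACT_THRESHOLD:
                hot[fields[0]] = max(tm_act, hot.get(fields[0], 0.0))
    return hot

if __name__ == "__main__":
    for disk, tm_act in sorted(hot_disks().items()):
        print(f"{disk}: {tm_act:.1f}% tm_act (above {TM_ACT_THRESHOLD}%)")
```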

Well, it is a complex world, with safety and speed in opposition. One oddity of expanding disk sizes is that one new big disk may be overwhelmed with the level of I/O that used to be handled by 8 disks, so size attracting query and churn is a negative! Striping allows the bandwidth of many drives to be applied to the combined storage, which supports more buffering with faster buffer fills, if things are sequential often enough. If everything were sequential and failure were no worry, you could stripe everything together for maximum bandwidth, but you might do better with 2 or more virtual volumes so copying, database joins and such can be sequential on each virtual device. So, there are sometimes ways to force smart parallelism, the ability to join huge sets without seeks. However, RAM and 64-bit VM have made buffering so ample that it may dilute that sort of approach.

RAID has not entirely freed us from failure worry, since with all the layers of software, hardware and vendors, it seems RAID errors often never get heard until they are 2 devices down. Rebuild time is not inconsequential, either. So, your approach should go beyond hot spots to maximizing the bandwidth of a manageable number of virtual volumes. Along the way, look at the pathways and how they figure into the redundancy and striping. If a controller handles both sides of a mirror, and goes wonky . . . . If striping runs across all controllers and SCSI cables, then any controller or cable bottleneck is diluted. Intelligent use of a simple mirror for high churn and RAID-N for low churn is nice, too!

Sometimes, this discussion can be extended down into the app, as DB2 append tables with insert but never update or delete are churn free except at the end. Disk is cheap and 100% history is wise. Churn-free data might even migrate to some hierarchical read-only store like DVD arrays. Assuming control of chaos is someone else's job can be a luxury.
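
To put rough numbers on the "one big disk doing the work of 8" point: random I/O capacity scales with spindle count, not capacity. A back-of-the-envelope sketch with made-up but typical figures (not measurements from any particular array):

```python
# Illustrative only: spindle counts and per-spindle IOPS are assumptions.
SPINDLE_IOPS = 150          # rough random IOPS for one fast spindle

old_layout = {"spindles": 8, "capacity_gb": 8 * 146}   # eight small disks
new_layout = {"spindles": 1, "capacity_gb": 1200}      # one big disk

old_iops = old_layout["spindles"] * SPINDLE_IOPS       # ~1200 random IOPS
new_iops = new_layout["spindles"] * SPINDLE_IOPS       # ~150 random IOPS

print(f"old: {old_layout['capacity_gb']} GB, ~{old_iops} IOPS")
print(f"new: {new_layout['capacity_gb']} GB, ~{new_iops} IOPS")
# Roughly the same usable space with about 1/8 of the random I/O capacity,
# which is why consolidating hot data onto one big disk can hurt.
```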

But, yeah, it seems like iostat is still good, just maybe not sufficient on its own, and an approach that is sufficient (array-level stats) might make it unnecessary.


Thanks for the lines. Day by day it gets trickier. We are moving from IBM DS800 storage to new Hitachi storage and it looks like performance is a lot better. I suppose a bigger cache at the storage subsystem makes a big difference, but then you may face other bottlenecks like network latency or Fibre Channel congestion. Also, in the near future we will be migrating to a new Power7 server, where most of these bottlenecks are supposed to disappear. Thanks again.

The SAN adds latency, which can be a problem for things like RDBMS commit where files are sync'd to media. Sometimes they just pretend it is on oxide and rely on battery backup, which might be OK, but be aware.
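
For a concrete sense of where that sync cost lands, a hedged sketch in Python: os.fsync() is the call that asks the OS to push the data to the device, and it is that round trip that SAN latency (or a cache that merely "pretends it is on oxide") affects. The log path and record are made up for illustration, not anything from a real RDBMS.

```python
import os
import time

# Hypothetical file standing in for a commit/redo log; the point is only
# to show where fsync() sits in a commit path, not to benchmark anything.
LOG_PATH = "/tmp/commit.log"

def commit(fd, record: bytes) -> float:
    """Append a record and force it to media, returning the sync time."""
    os.write(fd, record)
    start = time.perf_counter()
    os.fsync(fd)              # this is the call that pays the SAN latency
    return time.perf_counter() - start

fd = os.open(LOG_PATH, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
try:
    elapsed = commit(fd, b"COMMIT txn 42\n")
    print(f"fsync took {elapsed * 1000:.2f} ms")
finally:
    os.close(fd)
```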

You can flow a lot of data per second in and out, but random access might be a problem unless it is RAM cached. Some apps get more RAM caching by using mmap()/mmap64() to map files into VM -- no swap hit, just RAM or a page fault to disk -- which is how most OSes handle dynamic libraries, and some OSes have started doing this for in-kernel caching and inside standard library calls for all I/O. Huge random sets traversed once still run slow and waste RAM in the bargain, never mind SAN cache. If you have such latency problems, you may want to put those files on some sort of more local volume, even SSM. One way to still have SAN backup is a hierarchical layer like the Sun ClientFS, which backs all modified pages on the local disk to the SAN, using the local disk as an intermediate level cache. This handles larger volumes than most RAM budgets with less worry about right-sizing, since you cannot run out of space on the local disk. The SAN load of backing it up can be tuned, so you do not have to write the same page twice if it gets two modifications close together.
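
A minimal sketch of the mmap() idea in Python (the C calls named above work the same way): the mapped file's pages live in the OS page cache and are faulted in on first touch, so repeated random reads hit RAM instead of issuing fresh read() I/O. The file path and demo data are hypothetical.

```python
import mmap
import os

# Hypothetical data file; any large read-mostly file behaves the same way.
PATH = "/tmp/lookup.dat"

# Create a small demo file so the example is self-contained.
if not os.path.exists(PATH):
    with open(PATH, "wb") as f:
        f.write(os.urandom(1 << 20))   # 1 MiB of filler data

with open(PATH, "rb") as f:
    # Map the whole file read-only into our address space.  Pages are
    # faulted in from disk on first access and then served from RAM,
    # which is the caching effect described above.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header = mm[:16]               # random access, no read() call
        tail = mm[-16:]
        print(len(mm), header.hex(), tail.hex())
```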

However, while this solves random woes, for RDBMS commit sync, you may want either:

  • a solid (not cache) local allocation (mirrored across two controllers, not RAID-N) on striped fast disk or SSM for speed, or
  • to take the hit of syncing to the SAN, for low-churn, very random sets.

Luckily, many OS and RDBMS utilities, turned on or applied frequently, can optimize things so they are more often serial in a fast way, like range scans in an index or table scans of smaller tables. Be careful to apply them in the right order, as one might undo the other. Usually, RDBMS are on raw partitions, so there is no conflict; you just need to run the OS utilities for the non-raw filesystems and the RDBMS utilities for their files/partitions. As I said, tricks like the APPEND table, or similar physical keying, ensure no holes and no churn of old pages, either immediately or once defragmented.