IBM AIX I/O Performance Tuning

I have an IBM Power9 server attached to an NVMe Storwize V7000 Gen3 storage system. I am running some benchmarks and noticing that single-thread I/O (80% read / 20% write, a common OLTP I/O profile) seems slow.

./xdisk -R0 -r80 -b 8k -M 1 -f /usr1/testing -t60 -OD -V
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
    8K    1   0    80 R     -D    7177   56.1   0.090    2.58   0.118   0.116   0.001    2.97   0.216   0.212

Are there parameters in AIX we can tune to push the IO/s and MB/s higher?


I made sure that the V7000, which presents as an IBMSVC device, is using the recommended AIX_AAPCM driver (a quick way to set this is sketched after the listing below). I have a 1 TB volume (hdisk2) mapped to the host with a JFS2 file system on it.

# manage_disk_drivers -l
Device              Present Driver        Driver Options
2810XIV             AIX_AAPCM             AIX_AAPCM,AIX_non_MPIO
DS4100              AIX_APPCM             AIX_APPCM
DS4200              AIX_APPCM             AIX_APPCM
DS4300              AIX_APPCM             AIX_APPCM
DS4500              AIX_APPCM             AIX_APPCM
DS4700              AIX_APPCM             AIX_APPCM
DS4800              AIX_APPCM             AIX_APPCM
DS3950              AIX_APPCM             AIX_APPCM
DS5020              AIX_APPCM             AIX_APPCM
DCS3700             AIX_APPCM             AIX_APPCM
DCS3860             AIX_APPCM             AIX_APPCM
DS5100/DS5300       AIX_APPCM             AIX_APPCM
DS3500              AIX_APPCM             AIX_APPCM
XIVCTRL             MPIO_XIVCTRL          MPIO_XIVCTRL,nonMPIO_XIVCTRL
2107DS8K            NO_OVERRIDE           NO_OVERRIDE,AIX_AAPCM,AIX_non_MPIO
IBMFlash            NO_OVERRIDE           NO_OVERRIDE,AIX_AAPCM,AIX_non_MPIO
IBMSVC              AIX_AAPCM             NO_OVERRIDE,AIX_AAPCM,AIX_non_MPIO
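
For reference, this is roughly how the IBMSVC entry can be switched to AIX_AAPCM if it is still on NO_OVERRIDE (a sketch; the change only takes effect after a reboot):

manage_disk_drivers -d IBMSVC -o AIX_AAPCM     # select the native AIX PCM for SVC/Storwize LUNs
manage_disk_drivers -l | grep IBMSVC           # check the selected driver
shutdown -Fr                                   # reboot for the driver change to take effect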

# lsdev -Cc disk
hdisk0 Available 01-00    NVMe 4K Flash Disk
hdisk1 Available 02-00    NVMe 4K Flash Disk
hdisk2 Available 05-00-01 MPIO IBM 2076 FC Disk

# lsdev | grep "fw"
sfwcomm0   Available 05-00-01-FF Fibre Channel Storage Framework Comm
sfwcomm1   Available 05-01-01-FF Fibre Channel Storage Framework Comm
sfwcomm2   Available 07-00-01-FF Fibre Channel Storage Framework Comm
sfwcomm3   Available 07-01-01-FF Fibre Channel Storage Framework Comm
sfwcomm4   Available 0A-00-01-FF Fibre Channel Storage Framework Comm
sfwcomm5   Available 0A-01-01-FF Fibre Channel Storage Framework Comm

# lsdev | grep "fcs"
fcs0       Available 05-00       PCIe3 2-Port 16Gb FC Adapter (df1000e21410f103)
fcs1       Available 05-01       PCIe3 2-Port 16Gb FC Adapter (df1000e21410f103)
fcs2       Available 07-00       PCIe2 8Gb 2-Port FC Adapter (77103225141004f3)		(not used)
fcs3       Available 07-01       PCIe2 8Gb 2-Port FC Adapter (77103225141004f3)		(not used)
fcs4       Available 0A-00       PCIe3 2-Port 16Gb FC Adapter (df1000e21410f103)
fcs5       Available 0A-01       PCIe3 2-Port 16Gb FC Adapter (df1000e21410f103)

# lsattr -l fcs0 -E
DIF_enabled   no         DIF (T10 protection) enabled                       True
bus_mem_addr  0x80108000 Bus memory address                                 False
init_link     auto       INIT Link flags                                    False
intr_msi_1    46         Bus interrupt level                                False
intr_priority 3          Interrupt priority                                 False
io_dma        256        IO_DMA                                             True
lg_term_dma   0x800000   Long term DMA                                      True
max_xfer_size 0x100000   Maximum Transfer Size                              True
msi_type      msix       MSI Interrupt type                                 False
num_cmd_elems 1024       Maximum number of COMMANDS to queue to the adapter True
num_io_queues 8          Desired number of IO queues                        True

# lsattr -El hdisk2
PCM             PCM/friend/fcpother                                 Path Control Module              False
PR_key_value    none                                                Persistant Reserve Key Value     True+
algorithm       fail_over                                           Algorithm                        True+
clr_q           no                                                  Device CLEARS its Queue on error True
dist_err_pcnt   0                                                   Distributed Error Percentage     True
dist_tw_width   50                                                  Distributed Error Sample Time    True
hcheck_cmd      test_unit_rdy                                       Health Check Command             True+
hcheck_interval 60                                                  Health Check Interval            True+
hcheck_mode     nonactive                                           Health Check Mode                True+
location                                                            Location Label                   True+
lun_id          0x0                                                 Logical Unit Number ID           False
lun_reset_spt   yes                                                 LUN Reset Supported              True
max_coalesce    0x40000                                             Maximum Coalesce Size            True
max_retry_delay 60                                                  Maximum Quiesce Time             True
max_transfer    0x80000                                             Maximum TRANSFER Size            True
node_name       0x5005076810000912                                  FC Node Name                     False
pvid            00c2f8708ab7845e0000000000000000                    Physical volume identifier       False
q_err           yes                                                 Use QERR bit                     True
q_type          simple                                              Queuing TYPE                     True
queue_depth     20                                                  Queue DEPTH                      True+
reassign_to     120                                                 REASSIGN time out value          True
reserve_policy  single_path                                         Reserve Policy                   True+
rw_timeout      30                                                  READ/WRITE time out value        True
scsi_id         0x20101                                             SCSI ID                          False
start_timeout   60                                                  START unit time out value        True
timeout_policy  fail_path                                           Timeout Policy                   True+
unique_id       332136005076810818048900000000000001A04214503IBMfcp Unique device identifier         False
ww_name         0x5005076810180912                                  FC World Wide Name               False

# lspath -l hdisk2
Enabled hdisk2 fscsi0
Enabled hdisk2 fscsi1
Enabled hdisk2 fscsi4
Enabled hdisk2 fscsi5

# fcstat -D fcs1

FIBRE CHANNEL STATISTICS REPORT: fcs1

Device Type: PCIe3 2-Port 16Gb FC Adapter (df1000e21410f103) (adapter/pciex/df1000e21410f10)
Serial Number: 1A8270057B

ZA: 11.4.415.10
World Wide Node Name: 0x200000109B4CE35E
World Wide Port Name: 0x100000109B4CE35E

FC-4 TYPES:
  Supported: 0x0000010000000000000000000000000000000000000000000000000000000000
  Active:    0x0000010000000000000000000000000000000000000000000000000000000000

FC-4 TYPES (ULP mappings):
  Supported ULPs:
        Small Computer System Interface (SCSI) Fibre Channel Protocol (FCP)
  Active ULPs:
        Small Computer System Interface (SCSI) Fibre Channel Protocol (FCP)
Class of Service: 3
Port Speed (supported): 16 GBIT
Port Speed (running):   16 GBIT
Port FC ID: 0x020200
Port Type: Fabric
Attention Type:   Link Up
Topology:  Point to Point or Fabric

Seconds Since Last Reset: 446027

        Transmit Statistics     Receive Statistics
        -------------------     ------------------
Frames: 681823195               395468348
Words:  298416592384            152800398336

LIP Count: 0
NOS Count: 0
Error Frames:  0
Dumped Frames: 0
Link Failure Count: 1
Loss of Sync Count: 6
Loss of Signal: 3
Primitive Seq Protocol Error Count: 0
Invalid Tx Word Count: 118
Invalid CRC Count: 0
AL_PA Address Granted:   0
Loop Source Physical Address:   0
LIP Type:   L_Port Initializing
Link Down N_Port State: Active AC
Link Down N_Port Transmitter State: Reset
Link Down N_Port Receiver State: Reset
Link Down Link Speed:   0 GBIT
Link Down Transmitter Fault:   0
Link Down Unusable:   0
Current N_Port State: Active AC
Current N_Port Transmitter State: Working
Current N_Port Receiver State: Synchronization Acquired
Current Link Speed:   0 GBIT
Current Link Transmitter Fault:   0
Current Link Unusable:   0
Elastic buffer overrun count:   0

Driver Statistics
  Number of interrupts:   35576060
  Number of spurious interrupts:   0
  Long term DMA pool size:   0x800000
  I/O DMA pool size:  0

  FC SCSI Adapter Driver Queue Statistics
    Number of active commands:   0
    High water mark  of active commands:   20
    Number of pending commands:   0
    High water mark of pending commands:   20
    Number of commands in the Adapter Driver Held off queue:  0
    High water mark of number of commands in the Adapter Driver Held off queue:  0

  FC SCSI Protocol Driver Queue Statistics
    Number of active commands:   0
    High water mark  of active commands:   20
    Number of pending commands:   0
    High water mark of pending commands:   1

FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 0
  No Command Resource Count: 0

FC SCSI Traffic Statistics
  Input Requests:   32627778
  Output Requests:  20804443
  Control Requests: 2490
  Input Bytes:  605283225091
  Output Bytes: 1191956455792

Adapter Effective max transfer value:   0x100000

I am using XDISK 8.6 for AIX 7.2 from here, with the -OD parameter to open the file with O_DIRECT so OS caching is bypassed and the storage itself is benchmarked.

Additional runs with different block-size and thread-count settings:

### 8K Block, 1 Thread, Random I/O Test
./xdisk -R0 -r80 -b 8k -M 1 -f /usr1/testing -t60 -OD -V
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
    8K    1   0    80 R     -D    7177   56.1   0.090    2.58   0.118   0.116   0.001    2.97   0.216   0.212

### 8K Block, 1 Thread, Sequential I/O Test
./xdisk -S0 -r80 -b 8k -M 1 -f /usr1/testing -t60 -OD -V
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
    8K    1   0    80 S     -D    6461   50.5   0.001    12.1   0.133   0.116   0.001    9.88   0.238   0.213
	
### 16K Block, 1 Thread, Random I/O Test
./xdisk -R0 -r80 -b 16k -M 1 -f /usr1/testing -t60 -OD -V
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
   16K    1   0    80 R     -D    6796  106.2   0.001    2.63   0.126   0.124   0.179    2.89   0.223   0.219

### 16M Block, 1 Thread, Random I/O Test
./xdisk -R0 -r80 -b 16M -M 1 -f /usr1/testing -t60 -OD -V
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
   16M    1   0    80 R     -D      70   1120    12.9    34.1    14.0    14.2    12.9    15.6    13.2    13.5

### 32M Block, 1 Thread, Random I/O Test
./xdisk -R0 -r80 -b 32M -M 1 -f /usr1/testing -t60 -OD -V
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
   32M    1   0    80 R     -D      39   1248    23.9    65.0    25.0    24.7    23.8    26.0    24.1    24.3

### 64M Block, 1 Thread, Random I/O Test
./xdisk -R0 -r80 -b 64M -M 1 -f /usr1/testing -t60 -OD -V
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
   64M    1   0    80 R     -D      20   1280    46.4     128    47.7    47.5    46.5    49.3    47.6    47.5

### 8K Block, 2 Thread, Random I/O Test
./xdisk -R0 -r80 -b 8k -M 2 -f /usr1/testing -t60 -OD -V
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
    8K    2   0    80 R     -D   10059   78.6   0.001    3.36   0.172   0.130   0.001    3.35   0.298   0.260

### 8K Block, 4 Thread, Random I/O Test
./xdisk -R0 -r80 -b 8k -M 4 -f /usr1/testing -t60 -OD -V
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
    8K    4   0    80 R     -D   11914   93.1   0.001    4.22   0.295   0.182   0.001    3.60   0.487   0.431

### 8K Block, 8 Thread, Random I/O Test
./xdisk -R0 -r80 -b 8k -M 8 -f /usr1/testing -t60 -OD -V
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
    8K    8   0    80 R     -D   13081  102.2   0.001    4.76   0.568   0.478   0.001    4.18   0.775   0.898

### 8K Block, 16 Thread, Random I/O Test
./xdisk -R0 -r80 -b 8k -M 16 -f /usr1/testing -t60 -OD -V
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
    8K   16   0    80 R     -D   13302  103.9   0.001    6.57    1.15    1.29   0.001    5.10    1.42    1.45

Can you change the hdisk max_transfer size from 0x80000 to 0x40000 and re-test? You will have to unmount the file system, vary off the VG, and then run the chdev command.
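
Something like this, as a sketch (usr1vg is an assumed name for whatever volume group /usr1 lives in; substitute your own):

umount /usr1                                  # free the JFS2 file system on hdisk2
varyoffvg usr1vg                              # assumed VG name; check with lsvg -o and lsvgfs
chdev -l hdisk2 -a max_transfer=0x40000       # set the per-hdisk maximum transfer size
varyonvg usr1vg
mount /usr1
lsattr -El hdisk2 -a max_transfer             # confirm the new value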

As an alternative, can you use the ndisk64 tool (part of the nstress tools from Nigel Griffiths)?

Do a web search for "nstress" and download the tar file from the IBM wiki site.

In the /usr1 filesystem, create ten 1 GB files:

cd /usr1
for f in 0 1 2 3 4 5 6 7 8 9
do
echo "Creating file: f${f}"
dd if=/dev/zero of=f${f} bs=1m count=1024 >/dev/null 2>&1
done

Run(){
ndisk64 -f f1 -C -r 100 -R -b 1m -t 20 -M 4 |grep TOTALS|awk '{print $2,$3,$5}'
ndisk64 -f f1 -C -R -b 1m -t 20 -M 4|grep TOTALS|awk '{print $2,$3,$5}'
ndisk64 -f f1 -C -S -b 1m -t 20 -M 4|grep TOTALS|awk '{print $2,$3,$5}'
ndisk64 -f f1 -C -R -r 0 -b 1m -t 20 -M 4|grep TOTALS|awk '{print $2,$3,$5}'
ndisk64 -f f1 -C -S -r 0 -b 1m -t 20 -M 4|grep TOTALS|awk '{print $2,$3,$5}'
ndisk64 -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -C -r 100 -R -b 1m -t 20 -M 4|grep TOTALS|awk '{print $2,$3,$5}'
ndisk64 -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -C -r 100 -R -b 1m -t 20 -M 10|grep TOTALS|awk '{print $2,$3,$5}'
ndisk64 -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -C -r 100 -S -b 1m -t 20 -M 10|grep TOTALS|awk '{print $2,$3,$5}'
ndisk64 -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -C -S -b 1m -t 20 -M 10|grep TOTALS|awk '{print $2,$3,$5}'
ndisk64 -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -C -R -b 1m -t 20 -M 10|grep TOTALS|awk '{print $2,$3,$5}'
ndisk64 -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -C -R -r 0 -b 1m -t 20 -M 10|grep TOTALS|awk '{print $2,$3,$5}'
ndisk64 -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -C -S -r 0 -b 1m -t 20 -M 10|grep TOTALS|awk '{print $2,$3,$5}'
}
Run

Please post your results

I rewrote my ndisk64 script to use xdisk, so after creating the 10 files you can run this:

Run(){
xdisk -f f1 -OC -r 100 -R0 -b 1m -t 20 -M 4 -V
xdisk -f f1 -OC -R0 -b 1m -t 20 -M 4 -V|tail -1
xdisk -f f1 -OC -S0 -b 1m -t 20 -M 4 -V|tail -1
xdisk -f f1 -OC -R0 -r 0 -b 1m -t 20 -M 4 -V|tail -1
xdisk -f f1 -OC -S0 -r 0 -b 1m -t 20 -M 4 -V|tail -1
xdisk -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -OC -r 100 -R0 -b 1m -t 20 -M 4 -V|tail -1
xdisk -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -OC -r 100 -R0 -b 1m -t 20 -M 10 -V|tail -1
xdisk -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -OC -r 100 -S0 -b 1m -t 20 -M 10 -V|tail -1
xdisk -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -OC -S0 -b 1m -t 20 -M 10 -V|tail -1
xdisk -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -OC -R0 -b 1m -t 20 -M 10 -V|tail -1
xdisk -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -OC -R0 -r 0 -b 1m -t 20 -M 10 -V|tail -1
xdisk -f f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 -OC -S0 -r 0 -b 1m -t 20 -M 10 -V|tail -1
}

Run

Please post your results

There are two things I noticed which might affect performance negatively:

The first is the FC adapter's max_xfer_size (0x100000 in your lsattr output above). You can increase this, which helps especially with larger transfers. Use the -R switch of lsattr to see the legal values you can use.
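
For example, against the devices shown above (a sketch):

lsattr -Rl fcs0 -a max_xfer_size      # legal values for the adapter attribute
lsattr -Rl hdisk2 -a max_transfer     # and for the corresponding hdisk attribute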

These two, algorithm (fail_over) and reserve_policy (single_path), are also not optimal.

Basically, the multipath driver can use multiple paths (FC connections from the LUN to the system) at once. These paths serve two purposes. The first is redundancy: if one connection fails, another is used, and temporary connection failures happen rather frequently on FC links, for reasons I don't fully understand. The other purpose is performance: using several paths in parallel speeds things up. Which behaviour you get is controlled by the "algorithm" property. I have no test system at hand to tell you the exact value to use, but you want one of the load-balancing values rather than fail_over; again, use the lsattr -R switch to list all legal values for the property.

The reserve_policy should be "no_reserve", but this matters mostly in clusters where disks are accessed from several systems at once.
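
A minimal sketch of changing both attributes (shortest_queue is one of the load-balancing values the AIX PCM offers, confirm with lsattr -R; the VG must be varied off, or use chdev -P and reboot):

lsattr -Rl hdisk2 -a algorithm                                   # list legal algorithm values
chdev -l hdisk2 -a algorithm=shortest_queue -a reserve_policy=no_reserve
lsattr -El hdisk2 -a algorithm -a reserve_policy                 # verify the change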

I hope this helps.

bakunin

Sorry for the delayed response. I have run the xdisk benchmark and the output is below.

root@xxxxx:./bench.sh
    BS Proc AIO read% IO  Flag    IO/s   MB/s rMin-ms rMax-ms rAvg-ms   WrAvg wMin-ms wMax-ms wAvg-ms   WwAvg
    1M    4   0   100 R     -C    3071   3071   0.857    8.41    1.30    1.27     0.0     0.0     0.0     0.0
    1M    4   0    80 R     -C    2780   2780   0.846    3.94    1.39    1.30    1.16    3.71    1.56    1.51
    1M    4   0    80 S     -C    2753   2753   0.837    6.12    1.40    1.30    1.15    3.75    1.57    1.51
    1M    4   0     0 R     -C    1878   1878     0.0     0.0     0.0     0.0    1.19    4.43    2.06    2.03
    1M    4   0     0 S     -C    1768   1768     0.0     0.0     0.0     0.0    1.15    6.00    2.19    2.21
    1M    4   0   100 R     -C    3017   3017   0.816    13.9    1.32    1.27     0.0     0.0     0.0     0.0
    1M   10   0   100 R     -C    3164   3164   0.806    22.5    3.15    3.17     0.0     0.0     0.0     0.0
    1M   10   0   100 S     -C    3105   3105   0.001    17.4    3.21    3.18     0.0     0.0     0.0     0.0
    1M   10   0    80 S     -C    3426   3426   0.001    16.0    3.03    2.96    1.32    6.13    2.38    2.22
    1M   10   0    80 R     -C    3423   3423   0.971    19.1    3.02    2.96    1.32    7.93    2.43    2.24
    1M   10   0     0 R     -C    1890   1890     0.0     0.0     0.0     0.0    2.42    12.5    5.20    5.20
    1M   10   0     0 S     -C    1659   1659     0.0     0.0     0.0     0.0    2.54    11.0    5.94    5.89

The settings on hdisk2 were updated to use the shortest_queue algorithm with the no_reserve policy.

root@xxxxx:lsattr -El hdisk2
algorithm       shortest_queue                                      Algorithm                        True+
reserve_policy  no_reserve                                          Reserve Policy                   True+

I did not modify the hdisk max_transfer size during this test, as there were suggestions to change both max_transfer (an hdisk attribute) and max_xfer_size (an FC HBA attribute). Do I change both?

Below are current values and possible values.

root@xxxxx:lsattr -El hdisk2 | grep "max_transfer"
max_transfer    0x80000

root@xxxxx:lsattr -Rl hdisk2 -a max_transfer
0x20000
0x40000
0x80000
0x100000
0x200000
0x400000
0x800000
0x1000000

root@xxxxx:lsattr -El fcs2 | grep "max_xfer_size"
max_xfer_size 0x100000   Maximum Transfer Size

root@xxxxx:lsattr -Rl fcs2 -a max_xfer_size
0x100000
0x200000
0x400000
0x800000
0x1000000

Thanks for your help.

I would not (at least not without imminent pressure to do so) change the "max_transfer" property of the hdisk. If you use disks (and that includes virtual disks), AIX, or rather the driver it uses, recognizes them and sets this up fairly well already. You should change the HBA's property "max_xfer_size", though: the optimal value depends on the "environment" of the other components involved, so the driver cannot estimate it.

I have successfully used a value of 0x400000 in a similar environment, so in your case I'd probably start with a value of 0x800000, test that thoroughly, and fall back to 0x400000 if it doesn't work out.

I hope this helps.

bakunin

To change max_xfer_size, does one need to reboot AIX? I was reading this blog, which mentions the command to run followed by a reboot; is there any way around the reboot?

Hi - as you can see from your results, xdisk with a 1 MB block size drove the V7000 up to about 3.4 GB/s.
The reason I mentioned using 0x40000 for the hdisk max_transfer is that this is what the SDDPCM driver used by default, and I have seen better V7000 performance with that value compared to the 0x80000 default that AIX MPIO uses.
You can leave the FC adapter at 0x100000, or you could change it to 0x200000. This tunable is independent of the hdisk one.

If you do reduce the hdisk max_transfer from 0x80000 to 0x40000, you can then compare the new xdisk results against the run you have already done.

Thanks
Dean

--- Post updated at 09:05 PM ---

Regarding your question on tuning max_xfer_size: you can do it dynamically, assuming your LUNs have multiple paths (run lspath to confirm).

rmdev -l fscsiX -R                        # unconfigure fscsiX and its child paths (they go to Defined)
chdev -l fcsX -a max_xfer_size=0x200000   # set the new maximum transfer size on the adapter
cfgmgr                                    # rediscover the adapter's devices and paths

Do this one adapter at a time so the remaining paths keep serving I/O, then repeat for the next fcsX/fscsiX pair.
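
To confirm the change afterwards (a sketch, using fcs0 and hdisk2 as examples):

lsattr -El fcs0 -a max_xfer_size                    # attribute now set on the adapter
fcstat -D fcs0 | grep "Effective max transfer"      # effective value the adapter is running with
lspath -l hdisk2                                    # all paths should be back to Enabled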