Reboot VIO - OK?

We've got two datacenters and two VIOs in each datacenter.
The VIOs manage the I/O of the LPARs. So: is it possible to reboot a VIO without shutting down the LPARs?

Yes, it is possible, but what you get afterwards depends on your LPAR configuration.

Hi agent.kgb

Thanks for your answer!

What do you mean by that?

How can I check the config of the LPAR to determine if a reboot would work without any trouble?

Let's say you have a managed system with two VIOs (vio1 and vio2) and a lot of fully-virtualized LPARs (lpar1, lpar2, ...).

If an LPAR has a VSCSI connection, it must be connected to both VIOs:

$ lsdev -l vscsi*
vscsi2 Available  Virtual SCSI Client Adapter
vscsi3 Available  Virtual SCSI Client Adapter
$ lsattr -El vscsi2
rw_timeout      0         Virtual SCSI Read/Write Command Timeout True
vscsi_err_recov fast_fail N/A                                     True
vscsi_path_to   30        Virtual SCSI Path Timeout               True
$ lsattr -El vscsi3
rw_timeout      0         Virtual SCSI Read/Write Command Timeout True
vscsi_err_recov fast_fail N/A                                     True
vscsi_path_to   30        Virtual SCSI Path Timeout               True
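
If vscsi_err_recov is not set to fast_fail on your client adapters, you can set it yourself; a minimal sketch, assuming the adapters from above (the client adapters are usually busy, so -P stages the change until the next reboot of the LPAR):

$ chdev -l vscsi2 -a vscsi_err_recov=fast_fail -P   # takes effect at next LPAR reboot
$ chdev -l vscsi3 -a vscsi_err_recov=fast_fail -P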

VIO mappings:

$ echo "cvai" | kdb
NAME       STATE    CMDS_ACTIVE  ACTIVE_QUEUE       HOST
vscsi2     0x000007 0x0000000000 0x0                vio1->vhost1
vscsi3     0x000007 0x0000000000 0x0                vio2->vhost1

If you have VSCSI disks:

$ lsdev -Cc disk -s vscsi
hdisk6 Available  Virtual SCSI Disk Drive

check their health check parameters. hcheck_interval must not be 0, because with health checking disabled a path that fails during the VIO reboot stays Failed and never recovers on its own:

$ lsattr -El hdisk6
PCM             PCM/friend/vscsi                 Path Control Module        False
algorithm       fail_over                        Algorithm                  True
hcheck_cmd      test_unit_rdy                    Health Check Command       True+
hcheck_interval 60                               Health Check Interval      True+
hcheck_mode     nonactive                        Health Check Mode          True+
max_transfer    0x40000                          Maximum TRANSFER Size      True
pvid            00000000000000000000000000000000 Physical volume identifier False
queue_depth     32                               Queue DEPTH                True
reserve_policy  no_reserve                       Reserve Policy             True+
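
If health checking is disabled on a disk (hcheck_interval 0), here is a sketch of how to enable it, assuming hdisk6 as above. On newer AIX levels the attribute can be changed while the disk is in use (that is what the "True+" means); otherwise stage the change with -P and reboot:

$ chdev -l hdisk6 -a hcheck_interval=60 -U   # online change, where supported
$ chdev -l hdisk6 -a hcheck_interval=60 -P   # otherwise: takes effect at next reboot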

Also check that the disks are available through both vscsi adapters:

$ lspath -l hdisk6
Enabled hdisk6 vscsi2
Enabled hdisk6 vscsi3

Check the priorities of the paths. The path with the lowest priority value is the preferred one; with the fail_over algorithm, all I/O goes through that path as long as it is enabled:

$ lspath -AHE -l hdisk6 -p vscsi2
attribute value description user_settable

priority  2     Priority    True
$ lspath -AHE -l hdisk6 -p vscsi3
attribute value description user_settable

priority  1     Priority    True

If you use VFC (NPIV), check that you have at least 2 VFC adapters, one from each VIO:

$ lsdev -Cc adapter -t IBM,vfc-client
fcs2 Available C4-T1 Virtual Fibre Channel Client Adapter
fcs3 Available C5-T1 Virtual Fibre Channel Client Adapter

Check their mappings:

$ echo vfcs | kdb
NAME      ADDRESS             STATE   HOST   HOST_ADAP   OPENED  NUM_ACTIVE
fcs2      0xF1000A00001E8000  0x0008  vio1   vfchost14   0x01    0x0000
fcs3      0xF1000A00001EA000  0x0008  vio2   vfchost14   0x01    0x0000

Check that dyntrk is set to yes and fc_err_recov is set to fast_fail on the fscsi devices:

$ lsattr -El fscsi2
attach       switch    How this adapter is CONNECTED         False
dyntrk       yes       Dynamic Tracking of FC Devices        True
fc_err_recov fast_fail FC Fabric Event Error RECOVERY Policy True
scsi_id      0x340a01  Adapter SCSI ID                       False
sw_fc_class  3         FC Class for Fabric                   True
$ lsattr -El fscsi3
attach       switch    How this adapter is CONNECTED         False
dyntrk       yes       Dynamic Tracking of FC Devices        True
fc_err_recov fast_fail FC Fabric Event Error RECOVERY Policy True
scsi_id      0x330b01  Adapter SCSI ID                       False
sw_fc_class  3         FC Class for Fabric                   True
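
If they are not set, you can usually change them yourself; a minimal sketch (the fscsi devices are normally busy, so -P stages the change until the next reboot of the LPAR):

$ chdev -l fscsi2 -a dyntrk=yes -a fc_err_recov=fast_fail -P
$ chdev -l fscsi3 -a dyntrk=yes -a fc_err_recov=fast_fail -P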

Then check that every disk device is available through both adapters. How you check this depends on your multipathing driver: with AIX MPIO you can list the paths with the lspath command; for EMC PowerPath use powermt, for Hitachi HDLM use dlnkmgr.
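
With AIX MPIO it looks just like the VSCSI case, only with the fscsi adapters as parents (hdisk10 here is just a made-up example of an NPIV disk):

$ lspath -l hdisk10
Enabled hdisk10 fscsi2
Enabled hdisk10 fscsi3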

Then, when you are sure that all of your disks are available through both VIOs, you must check the network connection.

$ lsdev -Cc adapter -t IBM,l-lan -s vdevice
ent3 Available  Virtual I/O Ethernet Adapter (l-lan)

You have to log on to your VIO servers and check there whether you have a SEA (Shared Ethernet Adapter) or some other type of configuration. With a SEA failover configuration, you have to check which VIO server is active:

vio1$ lsdev -type sea
name             status      description
ent20            Available   Shared Ethernet Adapter
ent21            Available   Shared Ethernet Adapter
ent22            Available   Shared Ethernet Adapter
ent23            Available   Shared Ethernet Adapter
vio1$ entstat -all entSEA | grep Active
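
(Replace entSEA with the SEA in question, e.g. ent20.) As a rough sketch of what to expect in a SEA failover setup with the usual trunk priorities 1 and 2 (the exact output layout varies with the VIOS level), the trunk adapter shows as active on the primary VIO and inactive on the backup:

vio1$ entstat -all ent20 | grep Active
Priority: 1  Active: True
vio2$ entstat -all ent20 | grep Active
Priority: 2  Active: False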

After you've checked everything, you can either rely on there being no known bugs in your AIX and VIO versions that could prevent switching from one VIO to the other, or you can switch everything manually: change the priorities of every disk, or switch off the devices that use vio1 resources (make them "Defined"), and then reboot vio1. After it comes up, check everything again. If you changed priorities or removed some devices, restore the configuration before rebooting vio2. See the command sketch below for the manual variant.
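
For illustration only, a manual switch away from vio1 on a client LPAR might look like this, using hdisk6 and vscsi2 from the examples above (which of the two approaches you prefer depends on your setup):

$ chpath -l hdisk6 -p vscsi2 -a priority=3   # make sure the path via vio1 is not the preferred one
$ rmdev -R -l vscsi2                         # or: put the vio1 adapter and its paths into Defined
# ... reboot vio1 ...
$ cfgmgr                                     # bring the adapter and paths back to Available
$ lspath -l hdisk6                           # verify both paths are Enabled again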

I hope I didn't forget something very critical.

A very concise description by agent.kgb, thank you for that!

@thread-OP: note that what agent.kgb described is in fact "best practice"; even if you don't have the need to reboot the VIOS right now, you should set things up like agent.kgb described anyway!

If you find out that this is not the case, it would be a perfect occasion to urgently get some downtime and correct it! Otherwise these problems will come back to haunt you and most probably will raise their ugly heads at the most inconvenient moment possible.

I hope this helps.

bakunin
