[ASK] - AIX Fibre Channel behavior

Hello all,

Let me introduce the context and my environment.
We have an AIX 6.1 system with 4 virtual FC adapters:

[root@xxx] / > lsdev -Cc adapter | grep fcs
fcs0 Available 23-T1 Virtual Fibre Channel Client Adapter
fcs1 Available 23-T1 Virtual Fibre Channel Client Adapter
fcs2 Available 23-T1 Virtual Fibre Channel Client Adapter
fcs3 Available 23-T1 Virtual Fibre Channel Client Adapter
  • 2 virtual FCs (fcs0, fcs2) come from VIOS_A --> both mapped to a single physical FC
  • 2 virtual FCs (fcs1, fcs3) come from VIOS_B --> both mapped to a single physical FC

--> So we can say we have 2 physical FC paths.

On one occasion I rebooted the machine and it could not boot up. It said that the boot partition was not found. In SMS mode, I checked and found that fcs2 had failed and fcs3 was only partially working:

WorldWidePortName: c050760941350104
 1.  202700a0b86e87a4,0                 0 MB Disk drive - reserved
 2.  202700a0b86e87a4,1000000000000     107 GB Disk drive
 3.  202700a0b86e87a4,2000000000000     0 MB Disk drive - reserved
 4.  202700a0b86e87a4,3000000000000     0 MB Disk drive - reserved
 5.  202700a0b86e87a4,4000000000000     0 MB Disk drive - reserved
 6.  202700a0b86e87a4,5000000000000     0 MB Disk drive - reserved
 7.  202700a0b86e87a4,6000000000000     0 MB Disk drive - reserved
 8.  202700a0b86e87a4,7000000000000     0 MB Disk drive - reserved
 9.  202700a0b86e87a4,8000000000000     107 GB Disk drive
10.  202700a0b86e87a4,9000000000000     107 GB Disk drive
11.  202700a0b86e87a4,a000000000000     107 GB Disk drive
12.  202700a0b86e87a4,b000000000000     0 MB Disk drive - reserved

First action: I asked the storage guy to remove the fcs3 WWPN from the mapping and tried to detect the boot device, then asked him to remove the fcs2 WWPN as well. Neither step helped.

Second action: I asked the storage guy to map the fcs2 & fcs3 WWPNs back to the machine, tried the detection again, and got a positive result. Now fcs3 can see all the LUNs and detects the boot device.

Select Attached Device
  Pathname: /vdevice/vfc-client@300001a7
  WorldWidePortName: c050760941350104
 1.  202700a0b86e87a4,0                 107 GB Disk drive - bootable
 2.  202700a0b86e87a4,1000000000000     107 GB Disk drive
 3.  202700a0b86e87a4,2000000000000     107 GB Disk drive
 4.  202700a0b86e87a4,3000000000000     107 GB Disk drive
 5.  202700a0b86e87a4,4000000000000     107 GB Disk drive
 6.  202700a0b86e87a4,5000000000000     107 GB Disk drive
 7.  202700a0b86e87a4,6000000000000     107 GB Disk drive
 8.  202700a0b86e87a4,7000000000000     107 GB Disk drive
 9.  202700a0b86e87a4,8000000000000     107 GB Disk drive
10.  202700a0b86e87a4,9000000000000     107 GB Disk drive
11.  202700a0b86e87a4,a000000000000     107 GB Disk drive
12.  202700a0b86e87a4,b000000000000     107 GB Disk drive

In the end I could boot the AIX machine back to normal.

Checking the multipath status further to verify fcs2, I found that the LUNs are missing on fcs2. This matches fcs2 having failed from the beginning.

Enabled hdisk7  fscsi1
Enabled hdisk8  fscsi1
Enabled hdisk9  fscsi1
Enabled hdisk10 fscsi1
Enabled hdisk11 fscsi1
Enabled hdisk12 fscsi1
Missing hdisk2  fscsi2
Missing hdisk3  fscsi2
Missing hdisk4  fscsi2
Missing hdisk5  fscsi2
Missing hdisk6  fscsi2
Missing hdisk7  fscsi2
Missing hdisk8  fscsi2
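
A listing like the one above can be summarised per adapter with a few lines of script. This is only a minimal sketch, assuming the three-column "status, hdisk, parent" layout shown here; the function names `paths_by_status` and `adapters_without_enabled_paths` are made up for illustration, and the sample is a shortened copy of the listing.

```python
from collections import defaultdict

# Shortened sample in the same layout as the listing above:
# status, hdisk, parent adapter.
SAMPLE = """\
Enabled hdisk7  fscsi1
Enabled hdisk8  fscsi1
Missing hdisk7  fscsi2
Missing hdisk8  fscsi2
"""


def paths_by_status(text):
    """Group path lines as {parent: {status: [hdisk, ...]}}."""
    result = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        status, disk, parent = fields
        result[parent][status].append(disk)
    return result


def adapters_without_enabled_paths(text):
    """Return the parents that have no Enabled path at all, sorted by name."""
    grouped = paths_by_status(text)
    return sorted(p for p, by_status in grouped.items()
                  if not by_status.get("Enabled"))


print(adapters_without_enabled_paths(SAMPLE))  # ['fscsi2']
```

An adapter that shows up in this result has lost every path, which is a different situation from a single hdisk missing one path.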

So my concern is the following. To repeat the setup:

- 2 virtual FCs (fcs0, fcs2) come from VIOS_A  --> both mapped to a single physical FC
- 2 virtual FCs (fcs1, fcs3) come from VIOS_B  --> both mapped to a single physical FC

With the first action, fcs2 & fcs3 were removed. We still had fcs0 & fcs1 (mapped to 2 different physical FCs), which could see the LUNs but not the bootable partition.

With the second action, fcs2 & fcs3 were re-added. This refreshed fcs3, which then saw the LUNs including the bootable partition.

Why were the LUNs & boot partition not detected after the first action? We still had full visibility of the LUNs.
Why could we see the LUNs and the boot partition after the second action?

As far as I know, an FC card has 2 ports. If 1 port fails, the other can continue to work. Please correct me if I'm wrong.

In reality, we have 2 physical FCs with 1 failed port each, and the server still would not boot until one of the failed ports came up again.

Please advise.

First, a few general remarks: if you want to analyse aspects of a system configuration in a (virtualised) AIX environment, what you posted is not helpful at all. You need to look elsewhere, especially:

1) The LPAR's profile on the HMC, either via the Web GUI or the commandline ( lssyscfg and lshwres )

2) VIOS profile on the HMC

3) From the VIOS commandline, the various aspects of virtualised resources ( lsdev , lsvdev , lsmap , ...)

What you posted simply won't tell you (or us) anything useful (that is: useful on its own) about the system. It's as if i asked you how to repair my car and, when you asked back "which car?", i said "a yellow one".

First: what do you mean by "mapping"?? Do you mean "zoning"?

Second: which storage do you use?

Third: how is your system connected to the storage? I mean: physically connected. What does the FC cabling layout look like? I.e. are both ports on the physical adapters connected? And, if yes, do both work?

You are wrong: not wrong in that the card may have two ports but wrong in the assumption that if it has two ports both have to work. Maybe only one is connected, i don't know - but you don't know either, so i suggest you find out. See above - you (seem to) don't know some pretty relevant details about your environment.

This can have all sorts of reasons and then some: the zoning was wrong before and correct afterwards, the zones were there but not correctly activated, there was a short outage of the FC connection - this happens frequently, which is why one uses multipath drivers. (This is also the reason why i feel more comfortable not booting off a NPIV device.) And, and, ....

Sorry to have no better answer for you but you will have to learn how FC works, how zoning works, etc. - not to forget how IBM virtualisation works - to understand your environment. I am glad to explain some detail to you but i cannot teach you the job over the internet. And i am glad to help you but i cannot troubleshoot your system over the internet either.

I hope this helps.

bakunin

Hi Bakunin,

You are right. I don't understand well how FC, zoning, and IBM virtualization work.
I just want to know how the FC ports and multipath (redundancy) work.

Let's talk about another aspect. For example, say you have 2 FC cards, each with 2 ports, so 4 ports are connected to the LUN. Assume 3 ports fail and only 1 port is left. Can we still see the LUN? And if 2 ports fail, can we see the LUN? Regarding redundancy: with 4 lines connected to the LUN, the system should only go down if all 4 lines are down/broken; as long as 1 line is still available, the system keeps running. Please correct me if this is wrong.

What surprises me is that in my environment I could see the LUNs on the 2 still-working FC cards, but not the boot device. After 1 more FC came up again, I could see it. So I am confused about this.

In principle: yes, you can. It depends on how your "zones" are configured. So, here is a short introduction to zoning:

When you plug a network card into a network you immediately have an "any to any" connection. For instance, you plug a network card (and an accompanying computer) in and you start an ssh session to some other computer on this network. The connection itself is immediately possible and only the remote computer will decide if you are allowed to proceed - that is, by asking for your password or whatever. But on the network level, as in exchanging packets, the connection is immediate.

In an FC network this is not the case. When you plug your FC adapter in it is NOT allowed to contact anybody. On the other hand there is no further authentication: when you can access something you can immediately use it. You need to create "zones" to allow (on a per-case-basis) access to other entities on the network.

Now, what is a "zone"? Every item on a FC network - FC adapters, switch ports, but also LUNs - has a "WWPN", which serves about the same role as a MAC address in a normal network. It is a unique identifier. A zone is a rule stating which WWPN is allowed to contact/access which other WWPN. You can have more than one zone for an item, i.e. you may want a certain adapter to work with two disks, so you create a zone stating adapter X is allowed to access disk A and another zone allowing adapter X to access disk B. You may also have several zones for the same disk, meaning that several adapters (and therefore maybe different systems?) are allowed to access it. This is dangerous because you want to avoid two systems writing to the same disk, but on the other hand you need that in clusters. The cluster software will in this case make sure that only one system at a time can write to the disk.
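
To make the "default deny, allow per zone" idea concrete, here is a toy model in Python. It is only a sketch: the zone names and WWPNs are invented, and a real switch stores and activates zones in its own configuration, not like this.

```python
# A toy model of FC zoning: a zone is a named set of WWPNs that are
# allowed to talk to each other. Anything not zoned together is denied.
# All zone names and WWPNs here are invented for illustration.
ADAPTER_X = "c0:50:76:00:00:00:00:01"
ADAPTER_Y = "c0:50:76:00:00:00:00:02"
DISK_A = "20:27:00:00:00:00:00:0a"
DISK_B = "20:27:00:00:00:00:00:0b"

ZONES = {
    "adapterX_diskA": {ADAPTER_X, DISK_A},
    "adapterX_diskB": {ADAPTER_X, DISK_B},
    "adapterY_diskA": {ADAPTER_Y, DISK_A},  # second adapter on the same disk
}


def can_access(zones, initiator, target):
    """True if at least one zone contains both WWPNs (default is deny)."""
    return any(initiator in members and target in members
               for members in zones.values())


print(can_access(ZONES, ADAPTER_X, DISK_B))  # True
print(can_access(ZONES, ADAPTER_Y, DISK_B))  # False
```

Adapter Y is plugged into the same fabric as adapter X, yet it cannot reach disk B, simply because no zone contains both of them.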

So, depending on how your environment is set up (ask your storage guy - he probably knows more about zones than i do) you may (or may not) have multiple paths to access a disk because the zoning is set up this way.

Also, a multipath driver will be able to recognise that if you see a disk (=LUN) via such multiple paths it is still one and the same disk. Like if you have 4 pictures of the same house from different directions you understand that there is one house, not four of them. In case of the driver that means that you may have different device entries for each path but there is a pseudo-device "above" these, which you use on the LVM level. Depending on the driver used this is done differently but the principle is always the same: you have several devices (often, but not always, "hdisk"s) which represent the different views (paths) to a single LUN. Then you have a pseudo-device, which represents the LUN itself and the driver will, when you address this pseudo-device, use just one available path (or even several of them concurrently) to address it.
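
The principle can be sketched as a small Python class. This is a toy model under simple assumptions (first usable path wins, no load balancing, no health checks); the class name `MultipathDevice` and the path names are only illustrative, not how any real driver is implemented.

```python
class MultipathDevice:
    """Pseudo-device sitting above several paths to one LUN (toy model)."""

    def __init__(self, lun_id, paths):
        self.lun_id = lun_id
        # Every path starts out usable; insertion order is preserved.
        self.state = {path: True for path in paths}

    def fail(self, path):
        self.state[path] = False

    def restore(self, path):
        self.state[path] = True

    def pick_path(self):
        """Return the first usable path, or None if all paths are down."""
        for path, usable in self.state.items():
            if usable:
                return path
        return None


lun = MultipathDevice("lun0", ["fscsi0", "fscsi1", "fscsi2", "fscsi3"])
for dead in ("fscsi0", "fscsi1", "fscsi2"):
    lun.fail(dead)
print(lun.pick_path())  # fscsi3: one surviving path is enough
lun.fail("fscsi3")
print(lun.pick_path())  # None: only now is the LUN unreachable
```

With four paths, the pseudo-device stays usable until the fourth path is gone, provided every one of those paths was really zoned and working in the first place.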

Also notice that each adapter (physical as well as virtual) in an IBM environment has TWO WWPNs, not one! This is necessary for LPM (live partition mobility) and both these WWPNs need to be zoned.

I hope this helps.

bakunin

Thanks Bakunin for that brilliant explanation, I wish I could be so clear...
I want to point out that the very last paragraph is crucial! More than 75% of the issues I encounter are related to it, usually when migrating disk bays or servers in a hurry (time frame to respect...). You do the vital parts first to get things working fast, and when the pressure drops you have to remember that you haven't finished: to finalise, both WWPNs need to be effective. If you are unlucky you have an issue before that, and chances are you find yourself in this very same situation... In other words, when doing this sort of operation, be sure you have checked with the SAN team that everything is configured and correct before, let's say, rebooting after patching or moving the VM. It sounds silly, but busy with your own schedules and issues you are often not aware of what the other team may have done in the meantime that may have side effects...

Hi Bakunin,

Can you explain one thing to me? As you mentioned, each physical/virtual FC has 2 WWPNs. When I check on the HMC, I can see that the virtual FC assigned to my LPAR has 2 WWPNs, as in the attached screenshot.

But when I check on the LPAR, I see only 1:

[root@xxx] / > lscfg -vl fcs0 | grep -i network
        Network Address.............C050760671B10018
[root@xxx] / > lscfg -vl fcs1 | grep -i network
        Network Address.............C050760671B1001A
[root@xxx] / > lscfg -vl fcs2 | grep -i network
        Network Address.............C050760671B1001C
[root@xxx] / > lscfg -vl fcs3 | grep -i network
        Network Address.............C050760671B1001E

If it is not zoned, I guess we would still see the WWPN info on AIX.

Just one short question: you have posted the (correct) dialog from the HMC's web GUI where both addresses are shown. How about - instead of grepping the output of lscfg - looking at its unfiltered output, hmm? I mean, grep is for filtering information. To filter something you need to understand what is vital and what is not. So, instead of filtering, look at the whole first. What does e.g.

[root@xxx] / > lscfg -vl fcs0

tell you? Do you see the WWPN somewhere, which you see in the HMC's GUI? So, there is your answer.

Your guess is correct. The WWPN is - see the explanation above - a property of the adapter and the basis for zoning. So it has to be there prior to any zoning taking place.

Absolutely. But a lifelong experience as a software developer and a systems administrator has taught me one important lesson:

do it as fast as possible - but not faster!

When you start to sacrifice security or quality or reliability for getting "finished" it is time to call it quits. Step back one step, question your objectives and the objectives of the project and ask yourself how to GLOBALLY do the most good. Globally - as opposed to locally - is important here: if you finish a 3-day schedule in 2 days but in a way so that it breaks down in a short time and needs another 3-day overhaul you haven't sped up the process, but in fact slowed it down. Knowing that, i take 3 days if it takes 3 days and to hell with some unrealistic promise some non-technical dimwit made just because it looked good on his spreadsheet. If he thinks it can be done in two days - show me. Chances are he can't.

I remember well the occasion when i heard Seymour Cray - of Cray Computers fame - talking about developing. It was along the lines of ...you go by taking one step after the other. If you try to take several steps at once you just hop up and down rapidly but in fact get nowhere. To this i have nothing to add.

I hope this helps.

bakunin

Hi Bakunin,
Here are the full details. I don't see any other WWPN anywhere.

[root@xxx] / > lscfg -vl fcs0
  fcs0             U9117.MMD.06528F7-V17-C112-T1  Virtual Fibre Channel Client Adapter

        Network Address.............C050760671B10018
        ROS Level and ID............
        Device Specific.(Z0)........
        Device Specific.(Z1)........
        Device Specific.(Z2)........
        Device Specific.(Z3)........
        Device Specific.(Z4)........
        Device Specific.(Z5)........
        Device Specific.(Z6)........
        Device Specific.(Z7)........
        Device Specific.(Z8)........C050760671B10018
        Device Specific.(Z9)........
        Hardware Location Code......U9117.MMD.06528F7-V17-C112-T1

And what about this quote and the rest :) Do you want to emphasize that full details are crucial?

Absolutely. But a lifelong experience as a software developer and a systems administrator has taught me one important lesson:

It is the same for fcs1, ... and so on

[root@xxx] / > lscfg -vl fcs2
  fcs2             U9117.MMD.06528F7-V17-C312-T1  Virtual Fibre Channel Client Adapter

        Network Address.............C050760671B1001C
        ROS Level and ID............
        Device Specific.(Z0)........
        Device Specific.(Z1)........
        Device Specific.(Z2)........
        Device Specific.(Z3)........
        Device Specific.(Z4)........
        Device Specific.(Z5)........
        Device Specific.(Z6)........
        Device Specific.(Z7)........
        Device Specific.(Z8)........C050760671B1001C
        Device Specific.(Z9)........
        Hardware Location Code......U9117.MMD.06528F7-V17-C312-T1

Well, i have to apologize. Having no AIX system at hand to check, i was relying on memory and i thought the other WWPN would be displayed in the Z8 field. Seeing the whole display now, i see that it is only the first WWPN.

You may be lucky using the fcstat command, because it will always show the active WWPN, which may or may not be the first one, depending on LPM status.

A sure-fire way to get both WWPNs is to query the HMC, either by GUI as you did or by commandline:

lshwres -r virtualio --rsubtype fc --level lpar -m <Managed System> -F lpar_name,wwpns --header --filter lpar_names=<LPAR>

First off: please, when you quote other people's text, use QUOTE-tags, not CODE-tags. The formatting is different and formatting flow text as code (that is, among other things: without line breaks) is ridiculous.

What i meant is that you can speed up things only to a point: it takes 9 months to deliver a (working) baby and if you try to "optimize" or "streamline" (or whatever the newspeak of quacks with a degree in business is today) the process you are likely not to end up with a baby delivered sooner but an abomination delivered later. Corollary: managers subjugating the mother to do daily status reports won't help either.

I try to be as circumspect as possible when doing crucial things (and configuring a system that should then run for some years is crucial) and under pressure i can speed up a bit. But once this cutting of corners endangers the success of the whole operation i usually stop and just make clear that ridiculous timelines set by clueless managers based on the size of their expected bonus are not relevant for me - and are not for my work either. If someone happens to have the "idea" (for lack of a more fitting word) at 11:59 to have a server installed and configured by noon - he is just going to be gravely disappointed. And him not having the server he expected by the time he fantasized he'd have it doesn't make my coffee taste any less good or my sleep any less deep. As a technician i am paid for doing technics - if you need a wonder hire the god of your choice.

I hope this helps.

bakunin

ADDENDUM: in my introduction to FC networks i forgot to mention that new adapters need to be "logged in" to the network before they can be used and their WWPNs are recognised. If you have an LPAR without NPIV you can simply start the system and the FC adapters will be logged in automatically. But with NPIV and a boot disk from the SAN directly (and not via VIOS as a SCSI-disk) you need to take this into account when discovering the disks for the first time.

bakunin

Hi Bakunin,

From the HMC, I can list all the WWPNs of the LPAR as below:

yyy@xxx:~> lshwres -r virtualio --rsubtype fc --level lpar -m zzzzz -F lpar_name,wwpns --header --filter lpar_names=nnn
lpar_name,wwpns
xxx,"c050760671b1001e,c050760671b1001f"
xxx,"c050760671b1001c,c050760671b1001d"
xxx,"c050760671b1001a,c050760671b1001b"
xxx,"c050760671b10018,c050760671b10019"

We use VIOS and boot from the SAN disk directly. From the VIOS, I can see the virtual FC adapters are already logged in:

Name          Physloc                            ClntID ClntName       ClntOS
------------- ---------------------------------- ------ -------------- -------
vfchost2      U9117.MMD.06528F7-V200-C212            17 xxx          AIX

Status:LOGGED_IN
FC name:fcs0                    FC loc code:U5802.001.9K83854-P1-C3-T1
Ports logged in:2
Flags:a<LOGGED_IN,STRIP_MERGE>
VFC client name:fcs1            VFC client DRC:U9117.MMD.06528F7-V17-C212

Name          Physloc                            ClntID ClntName       ClntOS
------------- ---------------------------------- ------ -------------- -------
vfchost3      U9117.MMD.06528F7-V200-C412            17 xxx          AIX

Status:LOGGED_IN
FC name:fcs2                    FC loc code:U5802.001.9K84360-P1-C3-T1
Ports logged in:2
Flags:a<LOGGED_IN,STRIP_MERGE>
VFC client name:fcs3            VFC client DRC:U9117.MMD.06528F7-V17-C412

It is strange that we can see only 1 WWPN. Or is only this WWPN used?

--- Post updated at 08:56 AM ---

I suspect it uses zoning of the active-standby type.

Yes, i would suspect the zoning too. You should investigate with the storage guy, especially whether he has zoned both of the WWPNs on both of the virtual adapters. They often think that - one cable, one WWPN - one of them is enough.

Only one of the two WWPNs is used - at a time. The second is needed for LPM: while the one WWPN is still used on the source system the second one is used on the target system and once the LPM move is complete the original WWPN is not used any more until you initiate another LPM move.
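
A quick way to compare the WWPN pairs from the HMC against what the storage side says is zoned could look like the sketch below. It assumes the "lpar_name,wwpns" CSV layout that the lshwres command quoted earlier produces; the set of zoned WWPNs is a made-up example that you would have to get from your SAN team, and the function names are hypothetical.

```python
import csv
import io

# Output in the "lpar_name,wwpns" layout of the lshwres call quoted earlier.
LSHWRES_OUTPUT = '''lpar_name,wwpns
xxx,"c050760671b1001e,c050760671b1001f"
xxx,"c050760671b10018,c050760671b10019"
'''

# Hypothetical: what the SAN team reports as zoned (primary WWPNs only).
ZONED = {"c050760671b1001e", "c050760671b10018"}


def wwpn_pairs(text):
    """Return one (first, second) WWPN tuple per virtual FC adapter."""
    reader = csv.DictReader(io.StringIO(text))
    return [tuple(row["wwpns"].split(",")) for row in reader]


def unzoned(pairs, zoned):
    """List every WWPN from the pairs that is missing from the zoned set."""
    return [wwpn for pair in pairs for wwpn in pair if wwpn not in zoned]


print(unzoned(wwpn_pairs(LSHWRES_OUTPUT), ZONED))
# ['c050760671b1001f', 'c050760671b10019']
```

In this made-up case only the first WWPN of each pair is zoned, which is exactly the "one of them is enough" situation described above.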

bakunin

Hi Bakunin,

I found this:
IBM How to Initiate Login/Logout Operation for virtual Fibre Channel Client Adapters in HMC - United States

It seems that by default, when a virtual FC adapter is created, only 1 WWPN is active and the other is used for LPM.

yyy@HMCxxx:~> lsnportlogin -m xxx --filter "lpar_names=abc"
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=112,wwpn=c050760671b10018,wwpn_status=1,logged_in=client,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=112,wwpn=c050760671b10019,wwpn_status=0,logged_in=none,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=212,wwpn=c050760671b1001a,wwpn_status=1,logged_in=client,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=212,wwpn=c050760671b1001b,wwpn_status=0,logged_in=none,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=312,wwpn=c050760671b1001c,wwpn_status=1,logged_in=client,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=312,wwpn=c050760671b1001d,wwpn_status=0,logged_in=none,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=412,wwpn=c050760671b1001e,wwpn_status=1,logged_in=client,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=412,wwpn=c050760671b1001f,wwpn_status=0,logged_in=none,wwpn_status_reason=null

0: not logged in.
1: logged in or active.
2: unknown.
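
To pick out the WWPNs that are not logged in from output like the above, something along these lines should do. It is a minimal sketch, assuming the comma-separated key=value layout shown and that no value contains a comma itself; the function names are made up.

```python
# Two lines copied from the lsnportlogin output above.
SAMPLE = """\
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=112,wwpn=c050760671b10018,wwpn_status=1,logged_in=client,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=112,wwpn=c050760671b10019,wwpn_status=0,logged_in=none,wwpn_status_reason=null
"""


def parse_line(line):
    """Turn 'k1=v1,k2=v2,...' into a dict (assumes no commas inside values)."""
    return dict(item.split("=", 1) for item in line.split(","))


def not_logged_in(text):
    """Return the WWPNs whose wwpn_status is 0 (not logged in)."""
    records = (parse_line(line) for line in text.splitlines() if line.strip())
    return [rec["wwpn"] for rec in records if rec["wwpn_status"] == "0"]


print(not_logged_in(SAMPLE))  # ['c050760671b10019']
```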

Even after I log in the ports (on a test AIX), I still see the same WWPNs on the LPAR, no additional ones (the number of FC adapters has not increased, it is still 4):

xxx@HMCxxxx:~> lsnportlogin -m xxxx --filter "lpar_names=abc"
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=112,wwpn=c050760671b10018,wwpn_status=1,logged_in=client,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=112,wwpn=c050760671b10019,wwpn_status=1,logged_in=vios,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=212,wwpn=c050760671b1001a,wwpn_status=1,logged_in=client,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=212,wwpn=c050760671b1001b,wwpn_status=1,logged_in=vios,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=312,wwpn=c050760671b1001c,wwpn_status=1,logged_in=client,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=312,wwpn=c050760671b1001d,wwpn_status=1,logged_in=vios,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=412,wwpn=c050760671b1001e,wwpn_status=1,logged_in=client,wwpn_status_reason=null
lpar_name=abc,lpar_id=17,profile_name=abc,slot_num=412,wwpn=c050760671b1001f,wwpn_status=1,logged_in=vios,wwpn_status_reason=null


[root@xxx] / > lsdev -Cc adapter | grep fcs
fcs0 Available 12-T1 Virtual Fibre Channel Client Adapter
fcs1 Available 12-T1 Virtual Fibre Channel Client Adapter
fcs2 Available 12-T1 Virtual Fibre Channel Client Adapter
fcs3 Available 12-T1 Virtual Fibre Channel Client Adapter

[root@xxx] / > lscfg -vl fcs0
  fcs0             U9117.MMD.06528F7-V17-C112-T1  Virtual Fibre Channel Client Adapter

        Network Address.............C050760671B10018
        ROS Level and ID............
        Device Specific.(Z0)........
        Device Specific.(Z1)........
        Device Specific.(Z2)........
        Device Specific.(Z3)........
        Device Specific.(Z4)........
        Device Specific.(Z5)........
        Device Specific.(Z6)........
        Device Specific.(Z7)........
        Device Specific.(Z8)........C050760671B10018
        Device Specific.(Z9)........
        Hardware Location Code......U9117.MMD.06528F7-V17-C112-T1

--- Post updated at 10:29 AM ---

I guess this is the behavior of Power machines: only 1 WWPN is active and recognized on the OS side.

If we log in the inactive ports, the SAN switch will see the second WWPN for zoning, according to the article above.

I will verify this with the storage guy tomorrow and update.

--- Post updated at 10:32 AM ---

One question: is LPM used to migrate AIX from one Power host to another, like migrating a VM on VMware?

--- Post updated at 10:37 AM ---

Yes, it is still not entirely clear to me, but this is still on the checklist.

One more question: assume 1 FC card has 2 ports (port 1 & 2). I create 2 virtual FC adapters on the LPAR and map both virtual FCs to the same physical FC. What will happen? Do they both use port 1, or both port 2, or does the first virtual FC use port 1 and the second one port 2?