Receiving: 4B436A3D 0313233216 T H fscsi0 LINK ERROR

Hey All,

I'm receiving the following error from a Power5 9133-55A after I write 2-5 files to the LUN:

4B436A3D 0313233216 T H fscsi0 LINK ERROR

I can create the filesystem, volume groups, etc. All goes well until there is sustained activity to the LUN; then the above error shows up, with no messages on the target.
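For anyone reading the summary line: the second column of the errpt summary is a timestamp in MMDDhhmmYY form, so 0313233216 is 13 March 2016, 23:32. A minimal Python sketch for decoding it follows; the function name is made up for illustration, and mapping the two-digit year to 20YY is an assumption.

```python
from datetime import datetime


def decode_errpt_timestamp(stamp: str) -> datetime:
    """Decode an errpt summary timestamp written as MMDDhhmmYY."""
    if len(stamp) != 10 or not stamp.isdigit():
        raise ValueError(f"expected 10 digits, got {stamp!r}")
    month = int(stamp[0:2])
    day = int(stamp[2:4])
    hour = int(stamp[4:6])
    minute = int(stamp[6:8])
    # Assumption: two-digit years belong to the 2000s.
    year = 2000 + int(stamp[8:10])
    return datetime(year, month, day, hour, minute)


if __name__ == "__main__":
    print(decode_errpt_timestamp("0313233216"))
```

Note that the summary form only carries minute resolution; the seconds are visible only in the detailed (errpt -a) output.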

[ AIX root@mdsnim01:/htpc ] lsattr -El fcs0
bus_intr_lvl  277        Bus interrupt level                                False
bus_io_addr   0xdf800    Bus I/O address                                    False
bus_mem_addr  0xe8081000 Bus memory address                                 False
init_link     al         INIT Link flags                                    True
intr_priority 3          Interrupt priority                                 False
lg_term_dma   0x800000   Long term DMA                                      True
max_xfer_size 0x400000   Maximum Transfer Size                              True
num_cmd_elems 200        Maximum number of COMMANDS to queue to the adapter True
pref_alpa     0x1        Preferred AL_PA                                    True
sw_fc_class   2          FC Class for Fabric                                True
tme           no         Target Mode Enabled                                True
[ AIX root@mdsnim01:/htpc ] lsattr -El fscsi0
attach       al        How this adapter is CONNECTED         False
dyntrk       yes       Dynamic Tracking of FC Devices        True+
fc_err_recov fast_fail FC Fabric Event Error RECOVERY Policy True+
scsi_id      0x1       Adapter SCSI ID                       False
sw_fc_class  3         FC Class for Fabric                   True
[ AIX root@mdsnim01:/htpc ]
[ AIX root@mdsnim01:/ ] /hbainfo
Total Adapters:                 2
This Adapter Index:             0
Adapter Name:                   com.ibm-df1000fd-1
Manufacturer:                   IBM
SerialNumber:                   1B70704261
Model:                          df1000fd
Model Description:              FC Adapter
HBA WWN:                        20000000C9621B82
Node Symbolic Name:
Hardware Version:
Driver Version:                 7.1.3.0
Option ROM Version:             02C82774
Firmware Version:               271304
Vendor Specific ID:             0
Number Of Ports:                1
Driver Name:                    /usr/lib/drivers/pci/efcdd
Port Index:                     0
Node WWN:                       20000000C9621B82
Port WWN:                       10000000C9621B82
Port Fc Id:                     1
Port Type:                      Private Loop
Port State:                     Operational
Port Symbolic Name:
OS Device Name:                 fcs0
Port Supported Speed:           4 GBit/sec
Port Speed:                     4 GBit/sec
Port Max Frame Size:            2112
Fabric Name:                    0000000000000000
Number of Discovered Ports:     1
Seconds Since Last Reset:       5060
Tx Frames:                      938801
Tx Words:                       478609152
Rx Frames:                      35195
Rx Words:                       3098112
LIP Count:                      1
NOS Count:                      0
Error Frames:                   0
Dumped Frames:                  0
Link Failure Count:             0
Loss of Sync Count:             2
Loss of Signal Count:           0
Primitive Seq Protocol Err Cnt: 0
Invalid Tx Word Count:          4
Invalid CRC Count:              0
[ AIX root@mdsnim01:/ ]
[ AIX root@mdsnim01:/ ]
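As a rough sanity check on the counters above, the average frame size can be estimated from the word and frame counts. This is only a sketch: it assumes each counted word is a 4-byte FC transmission word and that the word and frame counters cover the same traffic. The helper names are hypothetical.

```python
BYTES_PER_WORD = 4  # assumption: one FC transmission word is 4 bytes


def parse_hbainfo(text: str) -> dict:
    """Turn 'Key:   value' lines into a dict, splitting on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def average_frame_bytes(words: int, frames: int) -> float:
    """Average bytes per frame, or 0.0 when no frames were counted."""
    return words * BYTES_PER_WORD / frames if frames else 0.0


# Counter lines copied from the hbainfo output above.
SAMPLE = """\
Tx Frames:                      938801
Tx Words:                       478609152
Rx Frames:                      35195
Rx Words:                       3098112
"""

info = parse_hbainfo(SAMPLE)
tx_avg = average_frame_bytes(int(info["Tx Words"]), int(info["Tx Frames"]))
rx_avg = average_frame_bytes(int(info["Rx Words"]), int(info["Rx Frames"]))

if __name__ == "__main__":
    print(f"average Tx frame: {tx_avg:.1f} bytes")
    print(f"average Rx frame: {rx_avg:.1f} bytes")
```

Under those assumptions the transmit side averages a little over 2000 bytes per frame, which is what one would expect from a write-heavy workload sending mostly full frames.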
Error log information:
          Date: Sun Mar 13 23:32:52 EDT 2016
          Sequence number: 7007
          Label: FCP_ERR4

The initiator card on the above is an LP11002 and the target card is a QLogic 2464. I've tried all sorts of things over the last 2 months, but no luck; I still get the above error. The connection breaks each time a significant amount of data (1-4 GB) is being transferred. How can I debug that card further? I'm aware of an APAR for some AIX versions that throw the above error, and I upgraded the OS as suggested, yet the error remains. Is there any other way to debug this? I tried P2P, and the cards negotiate for a few seconds, then the connection is dropped. Arbitrated loop seems to work best, but the connection fails on sustained writes.

[ AIX root@mdsnim01:/ ] oslevel -s
7100-03-00-0000
[ AIX root@mdsnim01:/ ]

Cheers,
DH

Could you please post the full output of the error, including the sense information (errpt -j 4B436A3D -a)?

The two error messages below always accompany each other, so I'll post both to get feedback from others as I work through the solutions on this page: IBM Technical support search - United States. I already tried disabling dynamic tracking as suggested on that page (first on I only, then on both I and T, where I = Initiator and T = Target in this context), but that didn't help with the issue:

26623394   0314022216 T H fscsi0         COMMUNICATION PROTOCOL ERROR
4B436A3D   0314022216 T H fscsi0         LINK ERROR
[ AIX root@mdsnim01:/ ] errpt -aDj 4B436A3D
---------------------------------------------------------------------------
LABEL:          FCP_ERR4
IDENTIFIER:     4B436A3D

Date/Time:       Mon Mar 14 19:46:05 EDT 2016
Sequence Number: 8605
Machine Id:      000D6210D600
Node Id:         mdsnim01
Class:           H
Type:            TEMP
WPAR:            Global
Resource Name:   fscsi0
Resource Class:  driver
Resource Type:   efscsi
Location:        U787B.001.DNWCB61-P1-C1-T1


Description
LINK ERROR

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0000 0010 0000 002C 0000 0000 0301 0000 0000 0000 0000 0000 0000 0000 0000 2000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0001 0000 0000 0000 0002 0000 0000 0000 0000
2101 001B 32A1 8121 2001 001B 32A1 8121 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 012F 0000 0002 0000 0100 0000 0000 0000 0000 0301 0100 0000 0000 0002 0000
0000 0000 0000 0001 0000 0000 0000 0061 0000 0412 0000 0000 0000 0000 2A58 A000
2400 0000 48E0 8B28 0000 0000 0001 0001 0000 0000 0000 0000 2022 0100 069C 0200
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
2400 0000 48E0 8B28 0000 0000 0001 0001 0000 0000 0000 0000 2022 0100 069C 0200
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
2000 0000 0000 8C28 0000 0000 1801 0000 0010 8C28 0000 0000 0000 0000 0000 2022
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
1000 0000 C962 1B82 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

Diagnostic Analysis
Diagnostic Log sequence number: 1031
Resource tested:        fscsi0
Menu Number:            2602902
Description:


Error Log Analysis has detected multiple communication
errors.  These errors can be caused by attached devices,
a switch, a hub, or a SCSI-to-FC convertor.

If connected to a switch, refer to the Storage Area
Network (SAN) problem determination procedures for
additional problem resolution.

If not connected to a switch, run diagnostics on the
attached devices.  If a hub or SCSI-to-FC convertor is
attached, refer to the product documentation for problem
resolution.


---------------------------------------------------------------------------
LABEL:          FCP_ERR4
IDENTIFIER:     4B436A3D

Date/Time:       Mon Mar 14 02:33:03 EDT 2016
Sequence Number: 7180
Machine Id:      000D6210D600
Node Id:         mdsnim01
Class:           H
Type:            TEMP
WPAR:            Global
Resource Name:   fscsi0
Resource Class:  driver
Resource Type:   efscsi
Location:        U787B.001.DNWCB61-P1-C1-T1


Description
LINK ERROR

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0000 0010 0000 002C 0000 0000 0301 0000 0000 0000 0000 0000 0000 0000 0000 2000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0001 0000 0000 0000 0002 0000 0000 0000 0000
2101 001B 32A1 8121 2001 001B 32A1 8121 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 012F 0000 0002 0000 0100 0000 0000 0000 0000 0301 0100 0000 0000 0002 0000
0000 0000 0000 0001 0000 0000 0000 0061 0000 0412 0000 0000 0000 0000 2A58 A000
2400 0000 48E0 8B28 0000 0000 0001 0001 0000 0000 0000 0000 2015 0100 069C 0200
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
2400 0000 48E0 8B28 0000 0000 0001 0001 0000 0000 0000 0000 2015 0100 069C 0200
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
2000 0000 0000 8C28 0000 0000 1801 0000 0010 8C28 0000 0000 0000 0000 0000 2015
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
1000 0000 C962 1B82 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

Diagnostic Analysis
Diagnostic Log sequence number: 1025
Resource tested:        fscsi0
Menu Number:            2602902
Description:


Error Log Analysis has detected multiple communication
errors.  These errors can be caused by attached devices,
a switch, a hub, or a SCSI-to-FC convertor.

If connected to a switch, refer to the Storage Area
Network (SAN) problem determination procedures for
additional problem resolution.

If not connected to a switch, run diagnostics on the
attached devices.  If a hub or SCSI-to-FC convertor is
attached, refer to the product documentation for problem
resolution.


---------------------------------------------------------------------------
LABEL:          FCP_ERR4
IDENTIFIER:     4B436A3D

Date/Time:       Mon Mar 14 02:22:14 EDT 2016
Sequence Number: 7160
Machine Id:      000D6210D600
Node Id:         mdsnim01
Class:           H
Type:            TEMP
WPAR:            Global
Resource Name:   fscsi0
Resource Class:  driver
Resource Type:   efscsi
Location:        U787B.001.DNWCB61-P1-C1-T1


Description
LINK ERROR

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0000 0010 0000 002C 0000 0000 0301 0000 0000 0000 0000 0000 0000 0000 0000 2000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0001 0000 0000 0000 0002 0000 0000 0000 0000
2101 001B 32A1 8121 2001 001B 32A1 8121 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 012F 0000 0002 0000 0100 0000 0000 0000 0000 0301 0100 0000 0000 0002 0000
0000 0000 0000 0001 0000 0000 0000 0061 0000 0412 0000 0000 0000 0000 2A58 A000
2400 0000 48E0 8B28 0000 0000 0001 0001 0000 0000 0000 0000 2008 0100 069C 0200
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
2400 0000 48E0 8B28 0000 0000 0001 0001 0000 0000 0000 0000 2008 0100 069C 0200
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
2000 0000 0000 8C28 0000 0000 1801 0000 0010 8C28 0000 0000 0000 0000 0000 2008
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
1000 0000 C962 1B82 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

Diagnostic Analysis
Diagnostic Log sequence number: 1019
Resource tested:        fscsi0
Menu Number:            2602902
Description:


Error Log Analysis has detected multiple communication
errors.  These errors can be caused by attached devices,
a switch, a hub, or a SCSI-to-FC convertor.

If connected to a switch, refer to the Storage Area
Network (SAN) problem determination procedures for
additional problem resolution.

If not connected to a switch, run diagnostics on the
attached devices.  If a hub or SCSI-to-FC convertor is
attached, refer to the product documentation for problem
resolution.


[ AIX root@mdsnim01:/ ]
[ AIX root@mdsnim01:/ ] errpt -aDj 26623394
---------------------------------------------------------------------------
LABEL:          FCP_ERR12
IDENTIFIER:     26623394

Date/Time:       Mon Mar 14 19:46:19 EDT 2016
Sequence Number: 8606
Machine Id:      000D6210D600
Node Id:         mdsnim01
Class:           H
Type:            TEMP
WPAR:            Global
Resource Name:   fscsi0
Resource Class:  driver
Resource Type:   efscsi
Location:        U787B.001.DNWCB61-P1-C1-T1


Description
COMMUNICATION PROTOCOL ERROR

        Recommended Actions
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA
0000 0010 0000 00A1 0000 0013 0303 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0001 0000 0000 0000 0002 0000 0000 0000 0000
2101 001B 32A1 8121 2001 001B 32A1 8121 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 012F 0000 0002 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0001 0000 0000 0000 0040 0000 0412 0001 0000 0000 0000 2A58 A000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0602 8A13 0200
0019 0000 0000 0000 0000 0000 05D9 CBC8 0000 0001 0000 0000 0000 0000 0000 0000
0000 0001 636D 4643 F100 0A00 2BF5 80E8 F100 0A00 2BF5 815C F100 0A00 2BF5 706C
0000 0000 288B F0E8 0000 0000 288B F15C 0000 0000 288B E06C 0100 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0300 0000 0908 0000 8800 0800 00FF FFFF 0000 07D0 1000 0000 C962 1B82 2000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
1000 0000 C962 1B82 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 0000 0001 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
Duplicates
Number of duplicates
           2
Time of first duplicate
Mon Mar 14 02:22:27 EDT 2016
Time of last duplicate
Mon Mar 14 19:46:19 EDT 2016
[ AIX root@mdsnim01:/ ]
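One thing that can be read straight out of the sense data without knowing its layout: it contains the target's WWPN and WWNN (2101001B32A18121 and 2001001B32A18121), and further down the initiator's port WWN (10000000C9621B82). A small Python sketch for locating a WWN in such a dump follows. The function names are hypothetical, and the byte offsets it reports are just positions within the pasted dump, not a documented field layout.

```python
HEX_DIGITS = set("0123456789abcdef")


def normalize_hex(text: str) -> str:
    """Lower-case the text and keep only hex digits (drops spaces, colons, newlines)."""
    return "".join(ch for ch in text.lower() if ch in HEX_DIGITS)


def find_wwn_offsets(sense_dump: str, wwn: str) -> list:
    """Return the byte offsets at which the WWN appears, byte-aligned, in the dump."""
    haystack = normalize_hex(sense_dump)
    needle = normalize_hex(wwn)
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        if start % 2 == 0:  # two hex digits per byte
            offsets.append(start // 2)
        start = haystack.find(needle, start + 1)
    return offsets


# First four lines of the FCP_ERR4 sense data shown earlier in the thread.
SENSE_HEAD = """\
0000 0010 0000 002C 0000 0000 0301 0000 0000 0000 0000 0000 0000 0000 0000 2000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0001 0000 0000 0000 0002 0000 0000 0000 0000
2101 001B 32A1 8121 2001 001B 32A1 8121 0000 0000 0000 0000 0000 0000 0000 0000
"""

if __name__ == "__main__":
    print(find_wwn_offsets(SENSE_HEAD, "21:01:00:1b:32:a1:81:21"))
    print(find_wwn_offsets(SENSE_HEAD, "20:01:00:1b:32:a1:81:21"))
```

This at least confirms which remote port each error entry refers to when more than one target is attached.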

Cheers,
DH

---------- Post updated at 08:17 PM ---------- Previous update was at 08:15 PM ----------

Just checking the times, I noticed that all of these were logged at exactly the same time:

E86653C3   0314022216 P H LVDD           I/O ERROR DETECTED BY LVM
C62E1EB7   0314022216 P H hdisk2         DISK OPERATION ERROR
26623394   0314022216 T H fscsi0         COMMUNICATION PROTOCOL ERROR
4B436A3D   0314022216 T H fscsi0         LINK ERROR

Just let me know if you need to see the first two. They seem symptomatic however.
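To spot this kind of clustering in a longer error log, the summary lines can be grouped by their timestamp column. A minimal Python sketch, assuming the six-column errpt summary layout shown above (identifier, timestamp, type, class, resource, description); the function name is made up for illustration.

```python
from collections import defaultdict


def group_by_timestamp(summary: str) -> dict:
    """Map each errpt timestamp to the (resource, description) pairs logged under it."""
    groups = defaultdict(list)
    for line in summary.splitlines():
        parts = line.split(None, 5)  # the description may contain spaces
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        _ident, stamp, _type, _cls, resource, description = parts
        groups[stamp].append((resource, description.strip()))
    return dict(groups)


SUMMARY = """\
E86653C3   0314022216 P H LVDD           I/O ERROR DETECTED BY LVM
C62E1EB7   0314022216 P H hdisk2         DISK OPERATION ERROR
26623394   0314022216 T H fscsi0         COMMUNICATION PROTOCOL ERROR
4B436A3D   0314022216 T H fscsi0         LINK ERROR
"""

if __name__ == "__main__":
    for stamp, events in group_by_timestamp(SUMMARY).items():
        print(stamp, [resource for resource, _ in events])
```

Keep in mind that the summary timestamp only has minute resolution, so entries grouped this way happened within the same minute, not necessarily in the same second.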

Cheers,
DH

---------- Post updated at 09:15 PM ---------- Previous update was at 08:17 PM ----------

DISPLAY MICROCODE LEVEL                                                                                               802111
fcs0    FC Adapter

The current microcode level for fcs0 is 271304.

Use Enter to continue.

The first error you receive, FCP_ERR4 4B436A3D, means according to the sense information provided that the AIX driver sent a RESET command to the SAN device and didn't receive an answer. Usually it means you have a SAN problem and you should open a case with your SAN switch vendor or, better, your storage device vendor.

But as far as I can see from the output of lsattr -El fscsi0, you don't have a SAN; you have direct-attached storage. If you do have a SAN fabric rather than direct-attached storage, then you have a problem connecting to the fabric, and a broken cable is the most common cause.

If you really have direct-attached storage, then I have some other questions:

  • how many LDEVs/LUNs do you receive from the storage?
  • does the problem happen only with this LDEV (Nr. 00:00:00:00:00:00:00:02) or also with other LDEVs?
  • is the storage connected through multiple adapters or is it the only adapter to the storage?
  • how many different storage systems are connected using this adapter?

If it is a single storage system directly connected through a single adapter, I would recommend:

  • to check the cable
  • to switch off dyntrk and fc_err_recov
  • to minimize max_xfer_size and corresponding parameters on the hdisk

It's fiber card to fiber card, and I'm zoning a single FILEIO device, which itself sits on RAID 6 / XFS storage (6 disks). I tried disabling dynamic tracking: no luck. I tried changing the cable: no luck. I'll read about the other options you mentioned as well. There's only one LUN involved, and I'm able to write to it fine until some large data is being written, at which point failure is 100% in each case.

The target system is SCST (apologies, I thought I had mentioned that, but reading back I see I hadn't). The funny thing is that on restart of the SCST subsystem, I can get the LUN back following a failure (maybe a memory leak). I might try LIO / targetcli next if the above doesn't work.

Cheers,
DH

Could you download the devscan tool and run it on your server?

https://www-304.ibm.com/support/docview.wss?uid=aixtoolsc9e095f

I see this thread has been open for over a day without resolution so, although I'm not qualified to answer the specifics, I thought I'd chip in anyway.

Firstly, my disclaimer. I'm not an AIX expert by any means and I have no knowledge of the LP11002. However, I do know the QL2464 very well; I was the technical director of a storage distributor many years ago and we shipped loads of fibre channel kit. So all I can do is tell you where I'd be looking in the first instance. I could well be completely wrong, but here goes...

The symptoms you describe indicate that everything is fine until the link gets really busy, then it screws up. The normal FC payload is 2112 bytes, giving an MTU of 2148 bytes in total once headers, etc. are included. Some FC adapters support "jumbo" packets with a payload of up to 9000 bytes, giving an MTU of 9036 bytes with headers. If the adapter supports jumbos, whether they are enabled is a setting in the adapter BIOS. So if one adapter is set for jumbo and the other doesn't support it, everything will work fine with low traffic, but when things really get going one adapter suddenly sends a jumbo packet that the other cannot understand. If I were fighting this issue, I would look at both adapters and set the max payload to 2112, or the max MTU to 2148, or set "support jumbo packets=no". Then test to see whether the problem has gone away.

Needless to say, should you get to a known good working situation only change one thing at a time afterwards and fully test that it hasn't screwed up again.

I have no clue whether this will help you or not.

Good luck anyway.

The max transmit unit on the QLA2464 is 2048 (per the QLA Boot BIOS) and the max on the Emulex LP11002 is 2112, as reported by the above utility. (Hopefully it's correct; that software seems to be a bit dated, so I'm not 100% sure of its accuracy.)

I really haven't seen Jumbo Packets appear anywhere in any of the configs so thinking I'm safe on that side.

Along these lines, so far no one has commented on whether there is any incompatibility between Emulex and QLogic. I'd be curious to tap into others' experiences to see if that could be the case.

Cheers,
DH

---------- Post updated at 11:44 PM ---------- Previous update was at 11:19 PM ----------

Here we go:

[ AIX root@mdsnim01:/devscan ] /usr/local/bin/devscan

devscan v1.0.5
Copyright (C) 2010-2012 IBM Corp., All Rights Reserved

cmd: /usr/local/bin/devscan
Current time: 2016-03-16 03:40:10.336709 GMT
Running on host: mdsnim01

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Processing FC device:
    Adapter driver: fcs0
    Protocol driver: fscsi0
    Connection type: loop
    Link State: up
    Current link speed: 4 Gbps
    Local SCSI ID: 0x000001
    Local WWPN: 0x10000000c9621b82
    Local WWNN: 0x20000000c9621b82
    Device ID: df1000fd
    Microcode level: 271304

000002  0000000000000000 2101001b32a18121 2001001b32a18121
    Vendor ID: SCST_FIO     Device ID: MDSVIOro Rev:  311 NACA: no
    PDQ: Connected          PDT: Block (Disc)
    Name:           hdisk2  VG:           htpcvg
    Device already SCIOLSTARTed
    Status: Available

125 targets found, reporting 1 LUNs,
1 of which responded to SCIOLSTART.
Elapsed time this adapter: 01.044606 seconds

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Processing FC device:
    Adapter driver: fcs1
    Protocol driver: fscsi1
    Connection type: none
    Local SCSI ID: 0x000000
    Device ID: df1000fd
    Microcode level: 271304

No targets found
Elapsed time this adapter: 00.125680 seconds

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Processing iSCSI device:
    Protocol driver: iscsi0

No targets found
Elapsed time this adapter: 00.063027 seconds

Cleaning up...
Total elapsed time: 01.259561 seconds
Completed with error(s)

[ AIX root@mdsnim01:/devscan ]

Wondering why it's saying 125 targets found when only one exists and only one port of the Emulex is plugged into the QLogic port with only one file assigned as a backing device.

Cheers,
DH

---------- Post updated 03-16-16 at 12:07 AM ---------- Previous update was 03-15-16 at 11:44 PM ----------

Out of curiosity, what is the hex value of a 'target reset' that is sent to the target? Having difficulty finding that.

At the time that the failure occurs, there is no response from the target. The software doesn't log anything so it's hard to tell where the issue is.

Cheers,
DH

In my opinion it is very unlikely that the QL 2464 and Emulex have any compatibility issues unless either or both are running old firmware. They are both mainstream FC adapters and the FC standards are solid.

I would think your problem is due to the BIOS setup of the adapters, so I would be checking all parameters for compatibility. As you explain, all is well until you put the adapters under pressure.

Can you change the LDEV from 02 to 0? I've seen strange AIX behaviour when LDEVs don't begin with 0.

---------- Post updated at 09:50 AM ---------- Previous update was at 09:48 AM ----------

Regarding SCSI resets, see e.g. https://kb.netapp.com/support/index?id=3012122&page=content&locale=en_US

@hicksd8
IRT the Max Transfer Sizes, the Emulex has 2112 and the QLogic is set to 2048. It was suggested to match these up. I don't see a way to change this on AIX and I can only see some presets on the QLogic card. Is that difference in the Max Transfer Size significant?

@agent.kgb
Can you elaborate a bit more on your suggestion? I'm thinking you're referring to the scsi_id but want to be 100% sure. And I'm curious why it's reporting 125 targets.

Cheers,
DH

---------- Post updated at 12:56 AM ---------- Previous update was at 12:06 AM ----------

I should also add that the standard path listing that other SAN storage target devices produce just isn't there. That's an oddity I need to dig into too:

[ AIX root@mdsnim01:/ ] mount
  node       mounted        mounted over    vfs       date        options
-------- ---------------  ---------------  ------ ------------ ---------------
         /dev/hd4         /                jfs2   Mar 17 00:31 rw,log=/dev/hd8
         /dev/hd2         /usr             jfs2   Mar 17 00:31 rw,log=/dev/hd8
         /dev/hd9var      /var             jfs2   Mar 17 00:31 rw,log=/dev/hd8
         /dev/hd3         /tmp             jfs2   Mar 17 00:31 rw,log=/dev/hd8
         /dev/hd1         /home            jfs2   Mar 17 00:33 rw,log=/dev/hd8
         /dev/hd11admin   /admin           jfs2   Mar 17 00:33 rw,log=/dev/hd8
         /proc            /proc            procfs Mar 17 00:33 rw
         /dev/hd10opt     /opt             jfs2   Mar 17 00:33 rw,log=/dev/hd8
         /dev/livedump    /var/adm/ras/livedump jfs2   Mar 17 00:33 rw,log=/dev/hd8
         /aha             /aha             ahafs  Mar 17 00:33 rw
         /dev/htpc00lv    /htpc            jfs2   Mar 17 00:45 rw,log=/dev/loglv00
[ AIX root@mdsnim01:/ ]
[ AIX root@mdsnim01:/ ]
[ AIX root@mdsnim01:/ ] errpt
[ AIX root@mdsnim01:/ ] lspath
Enabled hdisk0 scsi1
Enabled hdisk1 scsi1
[ AIX root@mdsnim01:/ ] lsdisk -Cc disk
ksh: lsdisk:  not found.
[ AIX root@mdsnim01:/ ] lsdev -Cc disk
hdisk0 Available 00-08-01-4,0 16 Bit LVD SCSI Disk Drive
hdisk1 Available 00-08-01-8,0 16 Bit LVD SCSI Disk Drive
hdisk2 Available 07-08-02     Other FC SCSI Disk Drive
[ AIX root@mdsnim01:/ ]

Cheers,
DH

Every LUN has its ID, which is specified on the storage side. In this line:

000002  0000000000000000 2101001b32a18121 2001001b32a18121

the first number, 000002, is this ID. AIX requires that all IDs count from 0 upwards. If you don't have ID 0, it can cause some strange errors.

As for your lsdev output, it's OK: you don't have ODM entries for your storage, which is why you see "Other FC SCSI Disk Drive". The usual way is to ask the storage vendor for a fileset with such entries, but I am not sure that Linux has something like this.

Do you mean "2048" or is that a typo?

Standard packets are 2112 payload (ie, actual data) which becomes 2148 MTU (ie, packet size once header, footer, CRC, etc are added).

Therefore 2048 sounds wrong.
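For reference, the 2112 to 2148 arithmetic comes from the fixed per-frame overhead: a 4-byte start-of-frame delimiter, a 24-byte frame header, a 4-byte CRC and a 4-byte end-of-frame delimiter around the data field. A small Python sketch of that sum follows; the constant and function names are just for illustration.

```python
SOF_BYTES = 4              # start-of-frame delimiter
HEADER_BYTES = 24          # FC frame header
CRC_BYTES = 4              # frame check sequence
EOF_BYTES = 4              # end-of-frame delimiter
MAX_DATA_FIELD_BYTES = 2112


def frame_size(data_field_bytes: int) -> int:
    """Total bytes on the wire for one frame carrying the given data field."""
    if not 0 <= data_field_bytes <= MAX_DATA_FIELD_BYTES:
        raise ValueError("data field must be between 0 and 2112 bytes")
    return SOF_BYTES + HEADER_BYTES + data_field_bytes + CRC_BYTES + EOF_BYTES


if __name__ == "__main__":
    print(frame_size(2112))  # 2148
    print(frame_size(2048))  # 2084
```

So a card limited to a 2048-byte data field would send frames of at most 2084 bytes, which is smaller than, not larger than, what a 2112-capable receiver accepts.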

Does the QL board present you with a CTRL-Q option on boot to get into its BIOS?
Is there anything similar on the LP11002?

I'm used to seeing a bank of settable options within the BIOS of FC cards although, as I said before, it's been a few years. I seem to remember one of the options causing multiple duplicate LUNs to be seen in error.

@hicksd8
The QLogic card shows 2048 as the highest Max Transfer Size in the QLogic Boot menu (ctrl-q). You're saying it shouldn't? I'm not sure the LP11002 offers the same boot configuration menu. I could try to check, but ioinfo on Power5 doesn't work as it does on Power6/7.

@agent.kgb
This is interesting. I have it set to LUN 0 in the /etc/scst.conf file. There is no 2 but I wondered about this myself earlier since the SCST subsystem can't find 0x02 LUN and instead delivers it as LUN 0. I'll try to post the logs later tonight.

Is there any option to increase that 2048 to 2148?

Sometimes, when a particular host does not present BIOS entry options for HBAs, I've known people to put the cards in another machine, set and save the BIOS parameters, and then reinstall the HBA in the required machine.

I may well be wrong, but at the moment my money is on the HBAs being unable to communicate because of their configuration settings, and nothing to do with the operating systems, drivers, or anything like that. I say this because you get a "LINK ERROR", which I interpret as a low-level communication screw-up between the cards. The fact that simple interactions work okay (perhaps sending smaller packets to each other) but bigger operations do not (perhaps maxing out the packet size) also leads me to think the cards are the issue.

Yes, O/S provided buffer sizes being exceeded could also cause an issue but I wouldn't expect to see "LINK ERROR" in this case.

Also, please confirm that you don't see anything about jumbo packet support in the CTRL-Q setup?

Confirmed, nothing for jumbo packets and I'll reconfirm tonight about the max frame size again and post here.

---------- Post updated at 07:28 PM ---------- Previous update was at 04:19 PM ----------

@agent.kgb
How would I set the LDEV here?

I've never worked with SCST - I don't know.

@hicksd8
The only options for the Max Frame Size are 512, 1024 or 2048 on the QLogic card. However, what you said makes perfect sense and fits in with the issue well. I need to get the details on the LDEV setting from agent.kgb next. What command do I run to set the LDEV?

---------- Post updated at 08:21 PM ---------- Previous update was at 08:13 PM ----------

Oh you meant on the target, thought you meant on AIX. Here's the config and it looks fine per the developers:

        TARGET 21:01:00:1b:32:a1:81:21 {
                HW_TARGET

                enabled 1
                rel_tgt_id 2

                GROUP IBM01 {
                        LUN 0 MDSVIOroot01

                        INITIATOR 10:00:00:00:C9:62:1B:82

                        INITIATOR 10:00:00:00:C9:62:1B:83
                }
        }

So it's set to LUN 0. Then what you are saying agent.kgb is that this line here:

000002 0000000000000000 2101001b32a18121 2001001b32a18121

should really read:

000000 .....

?

Cheers,
DH

---------- Post updated at 09:59 PM ---------- Previous update was at 08:21 PM ----------

Is there any way to change the Max Frame Size on AIX then? If I can match the two I could find out if they were at play.

---------- Post updated at 10:05 PM ---------- Previous update was at 09:59 PM ----------

Perhaps the frame size of 2112 reported for the Emulex is 2048 + 64 bytes for the headers, and the QLogic doesn't add the 64 bytes to the 2048 it shows in the QLogic Boot Menus? If so they would match. Or am I wrong about this math? The ./hbainfo utility is dated and the devscan doesn't display the frame size. So I'm not 100% sure here.
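
That math can be written out as a quick sketch. It assumes, as guessed above, that the QLogic menu value leaves out 64 bytes of headers while the 2112 figure on the Emulex side includes them. Neither vendor has confirmed that, and the names below are made up for illustration.

```python
# Values taken from the thread; the 64-byte header allowance is a guess.
QLOGIC_MENU_OPTIONS = (512, 1024, 2048)  # Max Frame Size choices in the Ctrl-Q menu
EMULEX_REPORTED = 2112                   # frame size reported for the Emulex card
ASSUMED_HEADER_BYTES = 64                # assumption, not vendor-confirmed


def matching_option(reported, options, header_bytes):
    """Return the menu option equal to (reported - header_bytes), or None."""
    target = reported - header_bytes
    return target if target in options else None


match = matching_option(EMULEX_REPORTED, QLOGIC_MENU_OPTIONS, ASSUMED_HEADER_BYTES)
print(match)  # 2048 if the two cards agree under this assumption
```

If the guess holds, the two cards are already on the same setting and the frame size is probably not what differs between them.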

Cheers,
DH

---------- Post updated at 11:56 PM ---------- Previous update was at 10:05 PM ----------

Apologies, in case I didn't mention it, this is an HBA-to-HBA FC configuration / test.

---------- Post updated 03-18-16 at 03:20 AM ---------- Previous update was 03-17-16 at 11:56 PM ----------

The PCI entries do not show up for me in the Power5 firmware any longer. I'm wondering if there is anything out of the norm with it. How can I check the Power5 PCI bus for any issues?

I was talking about the packet size at link level. The standard FC (original) max frame size is 2148 which is a 2112 payload (data handed over the bus from the OS) plus 36 bytes of frame construction (header, footer, checksum, et al) put on by the adapter in order to transmit to the target. That was always standard.
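
As a rough check of those numbers, here is a small sketch. It splits the 36 bytes of frame construction into start-of-frame (4), frame header (24), CRC (4) and end-of-frame (4), which is the usual FC framing breakdown; the constant and function names are only illustrative.

```python
# Per-frame overhead in bytes; the dictionary keys are illustrative labels.
FC_OVERHEAD = {
    "sof": 4,      # start-of-frame delimiter
    "header": 24,  # frame header
    "crc": 4,      # checksum
    "eof": 4,      # end-of-frame delimiter
}
MAX_DATA_FIELD = 2112  # largest payload handed to the adapter


def frame_size(data_field_bytes):
    """Return the on-the-wire frame size for a given data field size."""
    if not 0 <= data_field_bytes <= MAX_DATA_FIELD:
        raise ValueError("data field must be between 0 and 2112 bytes")
    return data_field_bytes + sum(FC_OVERHEAD.values())


print(sum(FC_OVERHEAD.values()))   # 36
print(frame_size(MAX_DATA_FIELD))  # 2148
```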

Then, because the link could become a bottleneck with packets queuing for transmission, the standard was amended to allow "jumbo" packets (basically the amalgamation of multiple packets) of up to 9000 bytes, to which frame construction added 36 bytes, giving a max frame size of 9036.

Some adapters support jumbo and some don't. Some allow you to switch on/off support in the HBA BIOS, and some don't.

Needless to say, if one HBA is supporting jumbo and the other not, then a frame can arrive that cannot be unpacked.

Now, further up the netstack (in the OS) the payload (just the data element) will be a lower number, perhaps 2048, I'm not sure. That in turn will be included inside a larger packet (perhaps 64 bytes larger) to construct the payload (2112) to be handed down to the HBA for transmission. Therefore, as I said before, the size and number of buffers allocated by the OS also plays a part in this. If the buffer size is configured too small then the incoming packets won't fit in and the packet cannot be captured and unpacked.

However, with a LINK ERROR reported (by the HBA through the hostbus to the OS) I think that the HBAs are having difficulty. Do you have any other cards of the same type on site? We cannot rule out a hardware fault with one of the HBAs. If you have spares I would certainly give it a shot.

So to answer your question, no, I don't understand how the QL is offering just 512, 1024 and 2048 as max frame size options. I'll do some research on that. Perhaps you should give QL support a call to discuss that point.

I'm actually chatting with QL support on this. They've been helping me out over the last 1-2 months and we're having difficulty identifying the cause.

I'm about to order some QL cards for the Power5 (older ones) just to even things out a bit and isolate better.

I'm still looking at the Jumbo Frame size on both cards to see if I have missed anything. That makes a lot of sense.

---------- Post updated at 09:50 AM ---------- Previous update was at 09:39 AM ----------

I reverted to looking on the web and I just don't see 'Jumbo' listed alongside QLA2464 anywhere. Nor do I see anything for the Emulex LPE11002. I'm doing direct Fiber to Fiber here. There is no switch, iSCSI or FCoE here.