Misconfiguration detected Adapter interface name en 3 Adapter offset 0

Hi,
We had a hardware problem with an IBM System p5 server running AIX 5.2.
We restored the last backup we had from tape, but the server does not boot up as expected.

The server tries to mount some filesystems from a storage array but cannot communicate with it; we checked the FC links and everything looks fine.
A second alert reports a misconfiguration on an Ethernet adapter.
After these alerts the server starts looping, trying to find the storage.

I'm stuck with this problem. Any clues?

Thank you in advance.

Here is the log.

Welcome to AIX.
                       boot image timestamp: 22:18 05/29
                 The current time and date: 18:02:15 05/31/2019
                number of processors: 2    size of memory: 1904Mb
boot device: /pci@800000020000003/pci@2,4/pci1069,b166@1/scsi@0/sd@5:2
closing stdin and stdout...
-------------------------------------------------------------------------------

Saving Base Customize Data to boot disk
Starting the sync daemon
Mounting the platform dump file system, /var/adm/ras/platform
Starting the error daemon
System initialization completed.
Setting tunable parameters...complete
Starting Multi-user Initialization
 Performing auto-varyon of Volume Groups
 Activating all paging spaces
0517-075 swapon: Paging device /dev/hd6 is already active.
/dev/rhd1 (/users): ** Unmounted cleanly - Check suppressed
/dev/rhd10opt (/opt): ** Unmounted cleanly - Check suppressed
 Performing all automatic mounts
mount: 0506-324 Cannot mount /dev/fslv00 on /home: A file or directory in the path name does not exist.
Replaying log for /dev/fslv01.
mount: 0506-324 Cannot mount /dev/fslv03 on /home/monitor_logs: A file or directory in the path name does not exist.
mount: 0506-324 Cannot mount /dev/archdwdg_S_01 on /usr/hacmp_fs/archdw_dg: A file or directory in the path name does not exist.
mount: 0506-324 Cannot mount /dev/archdg_S_01 on /usr/hacmp_fs/arch_dg: A file or directory in the path name does not exist.
Multi-user initialization completed
nsmb0 Available
Checking for srcmstr active...complete
Starting tcpip daemons:
0513-059 The syslogd Subsystem has been started. Subsystem PID is 27108.
0513-059 The portmap Subsystem has been started. Subsystem PID is 27546.
0513-059 The inetd Subsystem has been started. Subsystem PID is 24010.
0513-059 The snmpd Subsystem has been started. Subsystem PID is 24672.
May 31 14:03:57 localhost snmpd[24672]: EXCEPTIONS: open_device: Unable to connect to device driver.
May 31 14:03:57 localhost last message repeated 2 times
0513-059 The dpid2 Subsystem has been started. Subsystem PID is 24938.
0513-059 The hostmibd Subsystem has been started. Subsystem PID is 26714.
0513-059 The aixmibd Subsystem has been started. Subsystem PID is 25218.
0513-059 The muxatmd Subsystem has been started. Subsystem PID is 27162.
Finished starting tcpip daemons.
Starting NFS services:
May 31 14:04:09 localhost syslog: /usr/sbin/ifconfig -l
0513-059 The biod Subsystem has been started. Subsystem PID is 25658.
0513-059 The nfsd Subsystem has been started. Subsystem PID is 29758.
0513-059 The rpc.mountd Subsystem has been started. Subsystem PID is 28002.
0513-059 The rpc.lockd Subsystem has been started. Subsystem PID is 22564.
May 31 14:04:13 localhost syslog: /usr/sbin/ifconfig -l
May 31 14:04:13 localhost automountd[25508]: svc_create: no well known address for autofs on transport udp
May 31 14:04:13 localhost syslog: dlopen(/usr/ldap/lib/libibmldapn.a) failed: A file or directory in the path name does not exist.
May 31 14:04:13 localhost syslog: WARNING: ldap is not loaded
May 31 14:04:13 localhost syslog: WARNING: ldap is not configured
May 31 14:04:13 localhost unix:
May 31 14:04:13 localhost unix:
May 31 14:04:13 localhost unix:
May 31 14:04:13 localhost unix:
Completed NFS services.
May 31 14:04:20 localhost no[30460]: Network option tcp_keepinit was set to the value 40
May 31 14:04:20 localhost no[30462]: Network option tcp_keepidle was set to the value 20
May 31 14:04:23 localhost no[30208]: Network option tcp_keepintvl was set to the value 15
May 31 14:04:26 localhost no[30210]: Network option tcp_sendspace was set to the value 262144
May 31 14:04:29 localhost no[30212]: Network option tcp_recvspace was set to the value 262144
May 31 14:04:32 localhost no[30214]: Network option udp_sendspace was set to the value 65536
May 31 14:04:35 localhost no[30216]: Network option udp_recvspace was set to the value 262144
May 31 14:04:38 localhost no[30218]: Network option rfc1323 was set to the value 1
May 31 14:04:41 localhost no[30220]: Network option udp_sendspace was set to the value 65536
May 31 14:04:41 localhost no[30222]: Network option udp_recvspace was set to the value 262144
May 31 14:04:42 localhost su: from root to imdba at /dev/tty??
0513-059 The clcomdES Subsystem has been started. Subsystem PID is 31510.
0513-059 The nmbd Subsystem has been started. Subsystem PID is 31274.
0513-059 The smbd Subsystem has been started. Subsystem PID is 30226.
May 31 14:04:42 localhost last message repeated 6 times
May 31 14:04:50 localhost no[33086]: Network option routerevalidate was set to the value 1
May 31 14:05:03 localhost no[33864]: Network option nonlocsrcroute was set to the value 1
May 31 14:05:03 localhost no[33866]: Network option ipsrcroutesend was set to the value 1
May 31 14:05:03 localhost no[33866]: Network option ipsrcrouterecv was set to the value 1
May 31 14:05:03 localhost no[33866]: Network option ipsrcrouteforward was set to the value 1
May 31 14:05:06 localhost topsvcs[24034]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6OP0ZW0GnKwQ/fhG./...z/...................:::Reference ID: :::Template ID: a29426da::: Details File: :::Location: rsct,nim_control.C,1.39.1.2,4359 :::TS_MISCFG_ER Local adapter misconfiguration detected Adapter interface name en 3 Adapter offset 0 Adapter expected IP address 10.189.125.27
May 31 14:05:06 localhost topsvcs[24034]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6OP0ZW0GnKwQ/p/I./...z/...................:::Reference ID: :::Template ID: a29426da::: Details File: :::Location: rsct,nim_control.C,1.39.1.2,4359 :::TS_MISCFG_ER Local adapter misconfiguration detected Adapter interface name en 1 Adapter offset 1 Adapter expected IP address 10.189.126.27
May 31 14:05:06 localhost topsvcs[24034]: (Recorded using libct_ffdc.a cv 2):::Error ID: 6FYVDG0GnKwQ/GlJ./...z/...................:::Reference ID: :::Template ID: 923e1911::: Details File: :::Location: rsct,nim_control.C,1.39.1.2,4428 :::TS_NIM_OPEN_ERROR_ER Failed to open NIM connection Interface name rhdisk29 Description 1 SYSCALL Description 2 openx() Value 1 19 Value 2 0
May 31 14:05:19 localhost clstrmgrES[36390]: Fri May 31 14:05:19 HACMP/ES Cluster Manager Started
May 31 14:05:19 localhost clstrmgrES[36390]: Fri May 31 14:05:19 IpcInit: called
0513-029 The ctrmc Subsystem is already active.
Multiple instances are not supported.
May 31 14:05:53 localhost HACMP for AIX: EVENT START: node_up hd2z
May 31 14:05:55 localhost HACMP for AIX: EVENT FAILED: 1: node_up hd2z 1
May 31 14:05:55 localhost HACMP for AIX: EVENT START: event_error 1 TE_JOIN_NODE
WARNING: Cluster his_cluster Failed while running event [JOIN], exit status was 1
                                                                                 
May 31 14:05:55 localhost HACMP for AIX: EVENT FAILED: -1: event_error 1 TE_JOIN_NODE -1

Fri May 31 14:10:39 AST 2019
                   Automatic Error Log Analysis for sysplanar0 has detected a problem.
                  The Service Request Number is B7006970: I/O subsystem (hub, bridge, bus) Unrecovered Error, general. Refer to the system service documentation for more information.
                   Additional Words: 2-00000062 3-00010002 4-24030230 5-00000000 6-00001140 7-00020000 8-00000000 9-00000000.

May 31 14:11:50 localhost HACMP for AIX: EVENT START: config_too_long 360 /usr/es/sbin/cluster/events/node_up.rp
WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 360 seconds. Please check cluster status.
WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 390 seconds. Please check cluster status.
WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 420 seconds. Please check cluster status.
WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 450 seconds. Please check cluster status.
WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 480 seconds. Please check cluster status.

Fri May 31 14:13:54 AST 2019
                   Automatic Error Log Analysis for sysplanar0 has detected a problem.
                  The Service Request Number is BA210000: Platform Firmware Unrecovered Error, general. Refer to the system service documentation for more information.
                   Additional Words: 2-20202020 3-20202020 4-20202020 5-20202020 6-20202020 7-20202020 8-20202020 9-20202020.

  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 540 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 600 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 660 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 720 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 780 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 900 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 1020 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 1140 seconds. Please check cluster status.

Firstly, let me say that I'm not familiar with your specific hardware platform so this is a generic answer.

What kind of hardware problem did you have? Did it result in changing the motherboard (mobo)?

If so, did you put the add-on adapters into the same slots as on the old one?

Is there a BIOS or other configuration program that allows you to set the bus address and IRQs for those slots?

If so, and they're now set differently, then that's probably your issue.

Thank you, hicksd8, for your answer.

Basically the main hard disk (the one with the OS) failed, so we replaced it with a new one of the same model and type and restored the data from a backup.
The backup was supposed to work, because we changed only the hard disk.
We don't know why this isn't working.
Thanks again
trev

First off, AIX 5.2 is NOT my cup of tea.
I'm limited to 6.1 and 7.1, so there might be some differences, and I can't promise all commands will work as expected on AIX 5.2.

Let's try to dive into the first issue: missing mount points.

 Performing all automatic mounts
mount: 0506-324 Cannot mount /dev/fslv00 on /home: A file or directory in the path name does not exist.
Replaying log for /dev/fslv01.
mount: 0506-324 Cannot mount /dev/fslv03 on /home/monitor_logs: A file or directory in the path name does not exist.
mount: 0506-324 Cannot mount /dev/archdwdg_S_01 on /usr/hacmp_fs/archdw_dg: A file or directory in the path name does not exist.
mount: 0506-324 Cannot mount /dev/archdg_S_01 on /usr/hacmp_fs/arch_dg: A file or directory in the path name does not exist.

So, you posted above that you had a disk failure. How many disks are attached to this system?
Looking above, it's unable to find some of your logical volumes/filesystems. Do you know where the underlying physical disks that backed those filesystems reside?

What product, or what method, did you use to back the system up to tape? Are there any exclude statements?
Let's take a look at the output of the following commands:
lspv
lsvg

For each volume group listed by lsvg, let's do an lsvg <volume group name>, followed by an lsvg -l <volume group name>, as in the loop sketched below.
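
For example, a quick loop like this should cover every VG in one pass (a minimal sketch for ksh/sh; it just feeds each name that a bare lsvg prints back into lsvg and lsvg -l):

# show the summary and the logical volumes of every defined volume group
for vg in $(lsvg); do
    echo "===== $vg ====="
    lsvg $vg       # VG state, PP size, total/free PPs, quorum
    lsvg -l $vg    # LVs in the VG, their type, state and mount point
done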

lsfs

And lastly, let's see if there's an exclude file for your rootvg:

cat /etc/exclude.rootvg

If all those happy logical volumes lived on SAN and everything is zoned correctly, there's a very good chance that they are still there and fine; you just need to use importvg, roughly as sketched below.
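
Roughly, bringing a SAN volume group back in would look like this (just a sketch; datavg and hdisk4 are placeholder names, so use whatever lspv and your documentation say the real ones are, and remember that on an HACMP node the shared VGs are normally varied on by the cluster, so don't force anything the cluster expects to manage):

# rescan the buses so any LUNs the FC adapters can reach get configured
cfgmgr

# SAN disks that belong to a not-yet-imported VG show a PVID but "None" in the VG column
lspv

# import the VG definition from one of its member disks, bring it online, mount its filesystems
importvg -y datavg hdisk4
varyonvg datavg
mount /usr/hacmp_fs/arch_dg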

The adapter configuration errors appear to be related to the RSCT (HACMP) issues, and that might be related to your missing logical volumes.


Hi RecoveryOne, and thank you.

How many disks? 4.
The filesystems reside on a storage array. The storage is connected to a fiber-optic switch, which in turn connects to the server through fiber-optic patch cords.
What product was used for the backup? mksysb. (I don't know whether the people who did the backup used exclude statements.)

Right now I'm using PuTTY, connected to the console port of the server, and working from there. I don't get any image on the monitor.

After this part

WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 540 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 600 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 660 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 720 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 780 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 900 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 1020 seconds. Please check cluster status.
  WARNING: Cluster his_cluster has been running recovery program '/usr/es/sbin/cluster/events/node_up.rp' for 1140 seconds. Please check cluster status.

I don't know how to stop it and let the OS continue, so I can run the lspv or lsvg commands.

Thanks

Oh. Not good.
So, it looks like TCP/IP is up and running, so you are pretty far along in the boot. A Ctrl-C and/or hammering Enter a few times might get you to a login prompt, if you time it between the messages printed to the console.

If you do manage to get in, you will likely be spammed with all the happy fun AIX stuff it prints to the console. I think as of 5.3 TL7 (??) there's the swcons command: you could do a swcons /tmp/console.out to temporarily send all messages printed to the console to a file. To revert it back, just type swcons on its own.
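
In other words, something along these lines (assuming swcons exists at your level; /tmp/console.out is just an example path):

# temporarily redirect console messages to a file so your shell stays readable
swcons /tmp/console.out

# ...run your troubleshooting here: lspv, lsvg, errpt, etc....

# switch console output back to the original console device
swcons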
If you have telnet/ssh, the box might still be responsive over the network. Not sure, however.
So, was this lpar part of a cluster?

Next, let's talk about single-user mode!
Also, do you have any sort of HMC or is this a standalone frame?

If standalone, this should work (again, 5.2 is not my cup of tea). Let's bring this LPAR up in single-user mode. With your console still attached, bounce the LPAR (nicely via the HMC, or, well... press the magical power button if you must). You should see the system start to IPL; if you are sitting near it, it should beep on the frame (if so equipped) and beep/flash your console. Alternatively, if you have a monitor/keyboard hooked up, it would likely flash/beep there.
Now, during the beeping/flashing you should see an option for choosing another boot list. It should be the boot selection menu, with an option like 'Press 6 to start diagnostics'.
The normal AIX boot header of 'Welcome to AIX' will display, and you will be presented with a diagnostic screen which you can press past. The next screen should have an option for Start Shell, or Single-User Mode.

IBM has some good docs about what can and cannot be done in that mode and how to exit single-user and get into multi-user mode. Here's one now: IBM - Booting AIX in Single-User Mode.
From there, I can't say which commands will work and which won't. I only had to go into single-user mode once, for a really messed-up inittab issue that a vendor left me with!

Again, I could be way off on the process, as I only have 6.1 and 7.1 under my belt. The 5.3 systems and our lowly 4.3.3 box were retired as I came in.

Anyway, do you have other backups of this LPAR?
Good news: it sounds like your filesystems are protected if they are on SAN; you just need to get to a point where you can bring them in.
Bad news: depending on whether any excludes were used, you may end up with a painful rebuild if other data wasn't backed up by other means.

If you have the knowledge base for 5.2, I'd really look at setting up a mirrored rootvg once you are up and running (roughly as sketched below). I've mirrored a local hdisk0 against a SAN-provided hdisk1 before on 6.1 TL7. Not sure if that's applicable to 5.2; I know a lot of things have changed.
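
On 6.1/7.1 the rough sequence is the one below (a sketch only; hdisk1 stands for whatever second disk you dedicate to the mirror, and you should verify each step against the documentation for your level before running it):

# add the second disk to rootvg and mirror all of its logical volumes onto it
extendvg rootvg hdisk1
mirrorvg rootvg hdisk1

# create a boot image on the new disk and make both disks bootable
bosboot -ad /dev/hdisk1
bootlist -m normal hdisk0 hdisk1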

Unfortunately a lot of information regarding 5.2 and even 5.3 is lost in the mix, so, not to be the 'you should look at updating' guy, but... you should look at updating :slight_smile: A small POWER9, or POWER8, heck even the POWER7 line, will run rings around a p5. In fact, I'm pushing for that now; we still have a few p5s and they are dropping like flies. We lost two RAID controllers (drives OK, but sissas0 AND sissas1 on the same frame went belly up), another had sysplanar0 errors, and a third, which I keep under my desk in the office as a heater in the winter months, has been reporting a predictive failure on the service processor for at least a year now, if not longer.

Sorry trevian, I may not be much help in the long run.
Feel free to post what you can and I'll try to assist when I can, but this is all before my time. Perhaps one of the mods here who've worked with AIX for a really long time may have some other ideas?

First off: AIX 5.2 is at least 10 years out of support, and support for its successor, 5.3, ended in 2012. Support for some POWER5 systems (if they have been upgraded with POWER6 and POWER6+ processors) also ended at the beginning of this year, and even POWER7 support will end in September.

So, whatever you do, you should URGENTLY consider updating to anything recent.

You said the LUNs you are missing come from an FC-connected storage: did you make sure the zoning is correct? If you restored the system from an mksysb image, then chances are you are now using another FC adapter with a different WWPN, and hence you need to modify the zones accordingly (the sketch below shows how to read the WWPN on the AIX side). You *do* know what "zoning" means, yes? Everything else is, I suppose, just fallout from the disks not being found.
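
For reference, the WWPN of each FC adapter can be read on the AIX side roughly like this (fcs0 is just an example device name; repeat for every fcsN the system shows):

# list the fibre channel adapters the OS knows about
lsdev -Cc adapter | grep fcs

# print the adapter's VPD; the "Network Address" field is the WWPN the zoning must reference
lscfg -vl fcs0 | grep -i "network address"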

By the way, you should always give your LVs, VGs and FSes meaningful names: names like /dev/fslv00 say nothing and will, over time, contribute to the overall confusion.

I hope this helps.

bakunin


bakunin, all very good points.
I still think single-user mode, if they can't get control of the console, would be beneficial, at least from a troubleshooting standpoint. It would let them see the adapters, if any... See below:

Something else that concerned me from the boot log was this line:

The Service Request Number is B7006970: I/O subsystem (hub, bridge, bus) Unrecovered Error, general

I looked that up and it comes out to be a PCI host bridge failure message. Likely the system doesn't see its adapters at all. I might be off, and my IBMer isn't available to confirm it for me. Depending on what's causing the failure, you may need a system backplane or I/O backplane replacement, which would explain why your directly attached console isn't working.

There are other PCI I/O adapters that can cause that error; however, I feel such efforts might be beyond the scope of this forum, and you'd be best off finding third-party support or being willing to shell out some insane dollar amount to IBM for service.
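
If you do get a shell, the AIX error log should spell this out in more detail; something like the following (the resource name is taken from the console message above, and the exact output will differ):

# one-line summary of recent errors, newest first
errpt | more

# full detail of everything logged against the system planar
errpt -a -N sysplanar0 | more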

@RecoveryOne: You are right, this looks like a HW problem. Still, as long as we know nothing about his environment (is this an LPAR? Is it an unvirtualised system? Is there a VIOS? Did it log errors during boot? ...) we have no chance of contributing anything purposeful to this problem.

To add to his problems: he has an HACMP cluster running, and this will not recover itself from a "script error" like the one it has already run into. You need to use HACMP recovery procedures to get out of this state, and that requires a sound understanding of how HACMP (= SystemMirror, = PowerHA) works. It is difficult to troubleshoot something as complex as a cluster over the internet, and in fact it is bordering on criminal to have someone who is obviously way out of his league administrate a system deemed valuable and mission-critical enough to warrant a (rather costly) HA solution. This is like buying a Formula One racing car and putting a 5-year-old behind the steering wheel.

@trevian3969: sorry if this sounds like a personal attack - it is not! But from the phrasing of your questions I can tell that you have never had any in-depth experience with AIX, and a system with HACMP installed is most likely too valuable (otherwise why bother with high availability at all?) to be administrated by someone with no working experience of the system. You are attempting things even some seasoned AIX administrators would be uncomfortable with. Any newbie would be in exactly the same situation as you are now.

I hope this helps.

bakunin

Yeah, the good news is that, depending on the other node, the cluster should be up but unstable, and one can rebuild/reload a node and bring it back into the cluster. Not a fun task by any means. That said, yes, troubleshooting HACMP via forums would be a nightmare. I was hoping at the very least to help them get the system up enough to see what was going on.

@trevian3969:
Depending on the criticality of this system, and to parrot what bakunin said, it very likely will be worthwhile to look at upgrading to something newer. If you have a good IBM rep, see if they can get you a discount on some AIX classes with the purchase of new frames. The Redbooks for HACMP and AIX are also a wealth of knowledge. When I joined the sysadmin team to work with AIX I knew next to nothing about it, other than that it was COOL! I still haven't been to an official class, but man pages, Redbooks, sites like this place here, and other AIX blogs have been indispensable to me. Sadly, I feel my coming into AIX may have been 10 years too late for all the hot sysadmin stuff! :slight_smile:

Feel free to post whatever questions you may have, though, and I'll still try to help out as much as I can where I can; just keep in mind that in the end it's your environment, I don't know how it is set up, the OS is way out of date, and all that other fun stuff, right? I figure if nothing else, bakunin can smack me for being wrong! Hey, I still have a lot to learn!

Take care and good luck!

Hi bakunin
You are right, our systems are very old, but for our boss this system could work at least one more year. These servers are 12 years old, and some of their parts were bought refurbished...

As you said, there may be a problem with the zoning. How can I check this?
I know it's a dumb question, but I'm new in this area.

Thanks again

trevian,

You'll need to be able to get a usable console so you can read the WWNs from the adapters. The problem that I see, and posted about above, is that your PCI bus appears to have a fault, so it is very likely AIX won't be able to see your adapters at all to get the zoning information (a quick check is sketched below). IBM, or someone with intimate knowledge of AIX/Power error codes, could really decode the full errpt output to know what else needs to be replaced on this frame.
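
If you do get a prompt, a quick sanity check on whether the OS can see the FC path at all would look something like this (device names are examples; yours may differ):

# are the FC adapters and their protocol devices present, and are they Available or only Defined?
lsdev -C | egrep "fcs|fscsi"

# are any SAN disks visible beyond the internal ones?
lsdev -Cc disk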

Not at all, in fact this is a very good question. Alas, it is one that can't be answered from the system itself, because the zoning isn't done on the system - it is configured on the SAN switch.

Here is a general explanation about what "zoning" is

If you still have questions, just ask.

I hope this helps.

bakunin