IPMP group failed on Solaris 9

solaris_1977 · September 5, 2019, 9:41pm

Hi,

I have Solaris-9 server, V240.
I got alert that one of the interface on IPMP configuration, is failed. Found that two IPs (192.168.120.32 and 192.168.120.35) are not pingable from this server. These two IPs were plumbed on another server and that is decommissioned now. That is the reason, they are not pingable. For immediate fix, I plumbed both these IPs on another server and after that I was able to ping. I have seen this behaviour in other server, so I knew this may be the cause. But even after all IPs are pingable from routing table, I can't remove FAILED flag from ce0 interface.

# netstat -nr | grep 192.168.120.3
192.168.120.31 192.168.120.31 UGH 1 0
192.168.120.32 192.168.120.32 UGH 1 3
192.168.120.33 192.168.120.33 UGH 1 0
192.168.120.34 192.168.120.34 UGH 1 0
192.168.120.35 192.168.120.35 UGH 1 5
#
# ifconfig -a
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
bge0: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
inet 192.168.120.51 netmask ffffff00 broadcast 192.168.120.255
groupname sbprd_data
ether 0:3:flag_ba:c4:51:dd
bge0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 192.168.120.50 netmask ffffff00 broadcast 192.168.120.255
ce0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 192.168.67.50 netmask ffffff00 broadcast 192.168.67.255
ether 0:3:flag_ba:85:5e:bd
ce2: flags=39040803<UP,BROADCAST,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FAILED,STANDBY> mtu 1500 index 4
inet 192.168.120.52 netmask ffffff00 broadcast 192.168.120.255
groupname sbprd_data
ether 0:3:flag_ba:85:5e:bf
# if_mpadm -d bge0
Offline failed as there is no other functional interface available in the multipathing group for failing over the network access.
#
# snoop -d ce2
Using device /dev/ce (promiscuous mode)
^C
#
# cat /etc/hostname.bge0
sbprda-app1-bge0 group sbsd_data netmask + broadcast + -failover deprecated up \
addif sbprda-app1-prod netmask + broadcast + failover up
# cat /etc/hostname.ce2
sbprda-app1-ce2 group sbsd_data netmask + broadcast + deprecated -failover standby up
# cat /etc/hostname.ce0
sbprda-app1-ce0
#
# cat /etc/hosts| egrep "ce0|ce2|bge0" | grep -v "#"
192.168.120.51  sbprda-app1-bge0 sbprda-app1-bge0.xypoint.com
192.168.120.52  sbprda-app1-ce2 sbprda-app1-ce2.xypoint.com
192.168.67.50   sbprda-app1-ce0 sbprda-app1-ce0.xypoint.com sbprda-app1-bkp
#

I ran "pkill -HUP in.mpathd" on one terminal twice and checked /var/adm/messages on another session

Sep  5 18:26:25 sbprda-app1-prod in.mpathd[1290]: [ID 111610 daemon.error] SIGHUP: restart and reread config file
Sep  5 18:26:25 sbprda-app1-prod in.mpathd[18166]: [ID 215189 daemon.error] The link has gone down on ce2
Sep  5 18:26:25 sbprda-app1-prod in.mpathd[18166]: [ID 832587 daemon.error] Successfully failed over from NIC ce2 to NIC bge0

Sep  5 18:26:34 sbprda-app1-prod in.mpathd[18166]: [ID 111610 daemon.error] SIGHUP: restart and reread config file
Sep  5 18:26:34 sbprda-app1-prod in.mpathd[18347]: [ID 215189 daemon.error] The link has gone down on ce2
Sep  5 18:26:34 sbprda-app1-prod in.mpathd[18347]: [ID 832587 daemon.error] Successfully failed over from NIC ce2 to NIC bge0

Please suggest, what I am missing here and should check ?

Thanks

Neo · September 6, 2019, 11:21am

Is it a production / continuity problem if you simply clear by rebooting the server?

hicksd8 · September 6, 2019, 12:10pm

I have read your post#1 countless times and I must confess that I am at a loss to understand your question. Sorry about that I cannot give you a specific answer as a result.

So what I will do is bash some keys a provide some general network interface information as it pertains to Solaris 9. I apologize if you already know all this but we have to start somewhere. This might be a long post before I'm finished, I don't know, it's just going to be as it comes (into my head).

Why are you seemingly just plumbing missing IP addresses that you can't ping onto another system? With IPMP the same IP address is aggregated across two or more NICs (on the same machine).

If you want to configure IPMP you would do that BEFORE you 'plumb'. For example if you have interfaces bge0 and bge1, you would create an aggregate interface 'aggr1' for example and after that you would plumb and configure only aggr1. You would not try to configure bge0 and bge1 individually any more.

Now Solaris 9 will look for files /etc/hostname.<interface> at boot time and try to plumb those interfaces. If this system was restored from a different hardware platform, then you might for example have a file /etc/hostname.ce0 existing causing Solaris to try to plumb ce0 at boot-time when ce0 doesn't actually exist on this hardware. To stop Solaris from trying to plumb ce0 simply delete the /etc/hostname.ceo file.

When Solaris finds a file /etc/hostname.<interface> at boot-time, it reads the hostname from this file and then (assuming the interface is not configured for DHCP of course) goes to /etc/hosts and looks up the IP address it should use on this interface.

If you aggregate bge0 and bge1 into aggr1, then a file /etc/hostname.aggr1 is created which Solaris will try to plumb at boot-time.

Now, you are trying to get a FAIL message for ce0 to disappear, yes? I can think of only two possibilities why a system would complain about ce0 FAIL:

File /etc/hostname.ce0 exists but actual interface ce0 does not exist on this hardware. Delete the file.
The interface ce0 does not exist on this platform but is included in an aggregate IPMP configuration that has been restored from a different hardware platform. Down the aggregate interface and delete the IPMP configuration, then recreate the aggregate with interfaces that do exist on this platform and exclude ce0 which doesn't.

Aggregating interfaces has nothing to do with other systems on the LAN. Provided the network cables from the aggregated interfaces go to network switch(es) that understand multi-pathing then all should be well.

I'm going to stop there. If I've completely misunderstood your question then please give us a clue what this is about please.

Hope that helps in some way.

solaris_1977 · September 6, 2019, 5:11pm

I am sorry to have confused you. I clubbed two issues in one. I will re-word this issue.

IPMP is already configured on this server. Suddenly I got alert that IPMP group is failed over due to some error. When I logged into the server, I found that ce2 was in FAILED status, instead of the usual INACTIVE state.

/etc/hostname.ce2 file is there and the physical interface is also present. There was never any change in its setup. Physically I can see light blinking on network port behind the server. But since this interface is in FAILED states, IPMP is broken. Running snoop on ce2, is not giving me any result. To test this, I tried to detach bge0 and it is not working

# if_mpadm -d bge0
Offline failed as there is no other functional interface available in the multipathing group for failing over the network access.
#
# cat /etc/hostname.ce2
sbprda-app1-ce2 group sbsd_data netmask + broadcast + deprecated -failover standby up
#

---------------------UPDATE-----------------
Found that cable had problem. After replacing that, I was able to fix this issue

MadeInGermany · September 7, 2019, 5:34am

Thanks for the update.
If a NIC suddenly fails, and no admin did something to your system or to the LAN switch then the next idea is hardware.

The IPMP concept is quite different from the port aggregation concept.
Does the latter exist in Solaris 9 at all? In the early days you had to purchase SunTrunking software.

hicksd8 · September 7, 2019, 8:58am

@MadeInGermany........................That's an interesting point you make. AFAIR port aggregation was around long before multi-pathing (IPMP) as it's a simpler technology (isn't it??).

I assumed that since this is Solaris 9 we were talking aggregation and, from the posts, it sounded to me that one port going down (perhaps by unplugging the cable) stopped all communication thereby indicating that the other aggregated port was already down.

Perhaps I misunderstood the question in the first place. I had real difficulty getting a handle on it.

Yes, okay, I know that we techies are continuing a thread that's already tagged as solved.