IPMP NIC fails to come back up.

Hi,

I have recently configured IPMP on a Solaris 10 server. Everything was working OK: when I removed the network cable from interface ce0, the floating IP failed over to ce1.

Unfortunately, when I plugged the network cable back into ce0 about two minutes later, it did not come back up. Is there a hardware setting within the PROM that prevents a NIC from coming back up?

ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
ce0: flags=19040803<UP,BROADCAST,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FAILED> mtu 1500 index 2
        inet 10.0.0.77 netmask ffffff00 broadcast 10.0.0.255
        groupname netgroup
        ether 0:3:ba:33:5a:59
ce1: flags=29040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,STANDBY> mtu 1500 index 3
        inet 10.0.0.78 netmask ffffff00 broadcast 10.0.0.255
        groupname netgroup
        ether 0:3:ba:33:5a:5a
ce1:1: flags=21000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,STANDBY> mtu 1500 index 3
        inet 10.0.0.142 netmask ffffff00 broadcast 10.0.0.255

# more /etc/hostname.ce0
dummy1 netmask + broadcast + group netgroup deprecated -failover up addif hostname netmask + broadcast + failover up

# more /etc/hostname.ce1
dummy2 netmask + broadcast + group netgroup deprecated -failover standby up

Thanks

Your config files are a little strange... maybe you should have a look at the IPMP admin guide...

IPMP (System Administration Guide: IP Services) - Sun Microsystems

It's advisable to set local-mac-address?=true in the EEPROM.

Run the snoop command to troubleshoot. Also, what errors do you see in the messages file?
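For example (a sketch only — substitute your own interface name for ce0):

```shell
# Watch the ICMP probe traffic in.mpathd sends out of the suspect NIC
snoop -d ce0 icmp

# And pull any IPMP-related entries from the messages file
grep in.mpathd /var/adm/messages | tail -20
```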

Hi,

I have used the setting local-mac-address?=true. I also rebooted after setting it.

# eeprom | grep local
local-mac-address?=true

Messages file.

Apr 30 19:02:16 hostname in.mpathd[397]: [ID 168056 daemon.error] All Interfaces in group netgroup have failed
Apr 30 19:02:27 hostname genunix: [ID 408789 kern.notice] NOTICE: ce1: fault cleared external to device; service available
Apr 30 19:02:27 hostname genunix: [ID 451854 kern.notice] NOTICE: ce1: xcvr addr:0x01 - link up 1000 Mbps full duplex
Apr 30 19:02:27 hostname in.mpathd[397]: [ID 820239 daemon.error] The link has come up on ce1
Apr 30 19:02:59 hostname in.mpathd[397]: [ID 299542 daemon.error] NIC repair detected on ce1 of group netgroup
Apr 30 19:02:59 hostname in.mpathd[397]: [ID 237757 daemon.error] At least 1 interface (ce1) of group netgroup has repaired
Apr 30 19:02:59 hostname in.mpathd[397]: [ID 832587 daemon.error] Successfully failed over from NIC ce0 to NIC ce1
May 1 11:33:30 hostname in.mpathd[27703]: [ID 215189 daemon.error] The link has gone down on ce0

I can ping the failed interface. I reckon the in.mpathd daemon has simply marked this interface as failed and not repaired it. I'm just wondering if there is a way to manually repair it and bring it back online. I have a call logged with Sun on this one as well. Thanks.
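In the meantime, a couple of things I'm going to try (a sketch only — I'm not certain either of these clears a FAILED flag in all cases):

```shell
# Ask if_mpadm to reattach the interface to its IPMP group
if_mpadm -r ce0

# Or force in.mpathd to re-evaluate the interface by dropping it
# out of the group and adding it back ("netgroup" is the group
# name from my ifconfig output)
ifconfig ce0 group ""
ifconfig ce0 group netgroup
```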

Kindly post your /etc/hosts file.

# more /etc/hosts

127.0.0.1 localhost
10.0.0.142 hostname.eu.domain.com hostname loghost
10.0.0.77 dummy1 #Multipathing Test Address
10.0.0.78 dummy2 #Multipathing Test Address

As already mentioned in my first post, your config file mixing "-failover" and "failover" is a little strange. Please stick to the admin guide!

Hi DukeNuke2. The -failover / failover mix in the config file should be OK. I had the Sun engineer take a look and he said it was fine; I'll ask him to double-check it. The first -failover is for the test address, which won't fail over. The second failover is for the virtual address, which will. I have implemented this on three other servers with exactly the same configuration and it has worked.

I have used if_mpadm to test the failover instead of unplugging the network cable. It worked with no problem. IPMP configuration aside, this failed interface problem has happened on a server I have without using IPMP.

We did a network test where the primary network was switched off. This server had one interface connected to that network. When the network was switched back on the interface on the Sun-Fire-V210 had a failed flag on it.

I couldn't get it working again, so I plumbed a second interface and configured that instead. It only happened on that server; all the other servers were fine. Even unplumbing and replumbing the problem interface did not work. There has to be a hardware/software setting somewhere that causes this. The Sun engineer has so far come back with nothing. :(

Thanks.

Sparcman

Just for a test, use the settings from the admin guide, like:

dummy1 netmask + broadcast + group netgroup up \
	addif hostname deprecated -failover netmask + broadcast + up

dummy2 netmask + broadcast + deprecated group netgroup -failover standby up

hth,
DN2

Mmmm IPMP, we meet again. :slight_smile: I had issues with this, and it was down to the config. Here's a copy of my setup :-

::::::::::::::
hostname.ce0
::::::::::::::
group bsun50-ipmp-grp0
set 148.253.138.36/27 broadcast + -failover standby deprecated up
::::::::::::::
hostname.ce2
::::::::::::::
group bsun50-ipmp-grp0
set 148.253.138.37/27 broadcast + -failover deprecated up
addif bsun50/27 broadcast +

I don't see any definition of a group in your settings. Mind you, I've seen a ton of different ways of configuring IPMP.

Here are some dos / don'ts :-

IPMP requires a default gateway to be set (a remote target to ping, to ensure things are working).
Clients must connect to the logical interfaces, not the physical ones.
Local MAC addresses must be used. * (# eeprom local-mac-address?=true)
Load balancing is performed on a per-connection basis, not per packet.
When a failover occurs you will see the logical address of the failed card or link appear on the working interface card.
Can be made permanent using /etc/hostname.(networkcardtype+number), e.g. hostname.ce0
* Applications should not bind to the physical IP addresses but use the logical IP address.

There's a command that allows you to fail over the cards, e.g. if_mpadm:

# if_mpadm -d ce0 - detaches ce0
# if_mpadm -r ce0 - re-attaches ce0

HTH

SBK

This is how I prefer to do it. The following example provides failover with one public IP. The advantages are easier debugging, one less IP used, and easier firewalling.

Primary Interface
# cat /etc/hostname.ce0
DUMMY1 netmask + broadcast + group production deprecated -failover up \
addif REALNAME netmask + broadcast + failover up

Standby Interface
# cat /etc/hostname.ce1
DUMMY2 netmask + broadcast + group production deprecated -failover standby up

/etc/hosts file
# Internet host table
#
127.0.0.1 localhost
192.168.1.10 REALNAME loghost
192.168.1.11 DUMMY1
192.168.1.12 DUMMY2

The above sets up two dummy (private) IP addresses that are fixed to their interfaces, and a failover group named production. It adds the REALNAME IP to the group and marks it as the failover IP that will be migrated; ce1 is set as the standby interface. In most situations, ce0 will be used to transmit and receive packets. In the case of a failure (interface, switch, cable, router, etc.), the REALNAME IP will migrate to the ce1 interface. When ce0 recovers, the IP will migrate back.
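To sanity-check the result once the hostname files have been read at boot (a rough sketch — interface names are from the example above, and a reboot is the safe way to re-apply the files):

```shell
# ce0 should carry the DUMMY1 test address with DEPRECATED,NOFAILOVER,
# plus a logical interface (e.g. ce0:1) holding the REALNAME address
ifconfig ce0
ifconfig ce0:1

# ce1 should show STANDBY as well as DEPRECATED,NOFAILOVER
ifconfig ce1
```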

Thanks for the input. I do have my group set up in the config files. It's called netgroup. I will change the configuration slightly and test it that way.

I'm implementing IPMP across all of my test servers at the moment so I can play around with the config a little to get it right before implementing on Prod.

I'm confident that the interface on this server hasn't physically failed; Solaris has just flagged it as failed. There must be some way of changing that flag manually, or of getting Solaris to probe the interface again?

I have the default router set up, etc., which is used to check the health of the interface. Basically, all I did was unplug the network cable for about two minutes and then plug it back in. Won't be doing that again. :( Still working with the Sun engineer to resolve the problem. I will post the resolution if I get it.
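For reference, here's how I'm checking the probe side (a sketch — I believe in.mpathd takes its probe targets from the routing table and its timing from /etc/default/mpathd on Solaris 10):

```shell
# Confirm a default router exists for in.mpathd to probe
netstat -rn | grep default

# Failure/repair detection timing (FAILURE_DETECTION_TIME, in ms)
grep -v '^#' /etc/default/mpathd
```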

Thanks.