AIX with 2 network interfaces loses connectivity

hi guys

We have an AIX server with TSM installed.
This server has en0 for administration purposes and we have en1 for backup stuff.

en0 subnet 10.10.10.x
en1 subnet 10.10.20.x

The issue we are having is that, all of a sudden, the LPARs we are backing up lose connectivity to the AIX-TSM server. For example,
the en1 IP is 10.10.20.10 and a TSM LPAR client that we are backing up is 10.10.20.11.

The backup fails, and we found that LPAR client 10.10.20.11 can't ping 10.10.20.10. But if I go to the AIX-TSM server and ping the client, the ping works, and ping from the LPAR client works again.

It's like a ping from the AIX-TSM box reactivates connectivity.

Something important to mention: we found the problem while running backups, but today we were testing and the connectivity went away while nothing was running, and pinging from the AIX-TSM server to the LPAR clients reactivated it.

Has anyone seen an issue like this?

en0 is always fine

thanks a lot

This sounds quite strange.

First, do you have a VIOS or are you using HEAs in your LPAR?

Second, do you have any "energy saving" enabled in your system profiles? Believe it or not, there is such a thing, and the hardware might find it worthwhile to switch off a (seemingly unused) adapter to conserve energy.

I hope this helps.

bakunin

Well, I forgot to mention this is a Power 701, where I installed AIX 7.1; there is no VIO.

Second, the LPARs are in a PureFlex. This is a p260 Power server which has EtherChannel configured, and on this server I have VIO 2.2.3.

The LPARs have 2 virtual NICs as well. The administration NIC is always OK; it is the backup network interfaces which lose connectivity.

AIX-TSM Server
en0 - 10.10.10.x GW: 10.10.10.1
en1 - 10.10.20.x

LPAR
en0 10.10.11.x
en1 - 10.10.20.x

The LPAR loses connectivity to the en1 IP. BTW, I've tested en2: same issue.

Found something: in this PureFlex I also have Intel servers with VMware which host some Windows VMs. I configured a Windows VM in the same VLAN, and the Windows VM does not lose connectivity to the AIX-TSM IP.

Something else I found doing some more testing: if I change the IP of the AIX-TSM server, the AIX LPARs won't ping it.

For example:

The AIX-TSM server IP is 10.10.20.10. If I change it to 10.10.20.30, the AIX LPARs won't ping it, while the Windows VM pings without any problem.
To get the AIX LPARs to ping, I have to go to the AIX-TSM server and ping any IP in the 10.10.20.x segment.

So far it looks like an AIX issue...?
Should I add a static route or something? The thing is, the AIX LPAR I am testing right now only has one interface, in the 10.10.20.x segment, unlike the production LPARs.

thanks

How many default gateways are configured?

Can you paste the output of netstat -nr?
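
If it helps, here is a minimal Python sketch for counting the default routes in saved netstat -nr text. The function name default_routes and the sample lines are illustrative assumptions, shaped like AIX netstat -nr output; more than one entry in the result would mean more than one default gateway is configured.

```python
def default_routes(netstat_output):
    """Return (gateway, interface) for every 'default' line in netstat -nr text."""
    routes = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Columns: Destination Gateway Flags Refs Use If Exp Groups
        if len(fields) >= 6 and fields[0] == "default":
            routes.append((fields[1], fields[5]))
    return routes


# Illustrative sample only (hypothetical, not taken from a real host).
sample = """\
Destination        Gateway           Flags   Refs     Use  If   Exp  Groups
default            10.10.10.1        UG        34  35227001 en0      -      -
10.10.20/24        10.10.20.11       U          0 568733107 en1      -      -
"""

print(default_routes(sample))  # [('10.10.10.1', 'en0')]
```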

Like I said, I configured an extra AIX LPAR just for testing, using just one NIC, not 2 like the other LPARs being backed up.

This AIX LPAR for testing is 10.10.20.16, and this is the output you are looking for:

# netstat -nr                                                                      
Routing tables                                                                     
Destination        Gateway           Flags   Refs     Use  If   Exp  Groups        
                                                                                   
Route Tree for Protocol Family 2 (Internet):                                       
10.10.20.0          10.10.20.16        UHSb      0         0 en0      -      -   =>  
10.10.20/24         10.10.20.16        U         0     18306 en0      -      -       
10.10.20.16         127.0.0.1         UGHS      0       670 lo0      -      -       
10.10.20.255        10.10.20.16        UHSb      2      1172 en0      -      -       
127/8              127.0.0.1         U         6    184693 lo0      -      -       
                                                                                   
Route Tree for Protocol Family 24 (Internet v6):                                   
::1%1              ::1%1             UH        1     14560 lo0      -      -     

By the way, the VIO in my p260 and the AIX-TSM Power 701 are using EtherChannel.

Just to let you know, I removed the EtherChannel on the Power 701 to test more: same issue. I am not able to remove the EtherChannel on the p260 since it is production, and I am sure network 10.10.10.x has been working fine for quite some time.

Where is the default gateway in here? I see 10.10.20.16 as the en0 interface IP.
I feel it's a gateway issue.

OK, run this and paste the output.

lsattr -El inet0 | grep route

Thanks a lot for your help.

OK, I was able to access one of the production servers which has both NICs, and I ran both commands again:

Server1PRD: /> netstat -nr                                                           
Routing tables                                                                     
Destination        Gateway           Flags   Refs     Use  If   Exp  Groups        
                                                                                   
Route Tree for Protocol Family 2 (Internet):                                       
default            10.10.10.1         UG       34  35227001 en0      -      -       
10.10.10.0          10.10.10.11        UHSb      0         0 en0      -      -   =>  
10.10.10/24         10.10.10.11        U         2      6962 en0      -      -       
10.10.10.11         127.0.0.1         UGHS     61 560916318 lo0      -      -       
10.10.10.255        10.10.10.11        UHSb      0         4 en0      -      -       
10.10.20.0          10.10.20.11        UHSb      0         0 en1      -      -   =>  
10.10.20/24         10.10.20.11        U         0 568733107 en1      -      -       
10.10.20.11         127.0.0.1         UGHS      2       763 lo0      -      -       
10.10.20.255        10.10.20.11        UHSb      0         0 en1      -      -       
127/8              127.0.0.1         U        17   2471843 lo0      -      -       
                                                                                   
Route Tree for Protocol Family 24 (Internet v6):                                   
::1%1              ::1%1             UH        2    426257 lo0      -      -       
Server1PRD: /> lsattr -El inet0 | grep route                                          
route         net,-hopcount,0,,0,10.10.10.1 Route                               True 

By the way, both en0 and en1 have GW 10.10.10.1.

I cannot remove gateway 10.10.10.1 from en1. Should I?
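
As a rough way to reason about that question, here is a small Python sketch of longest-prefix route selection over the network routes from the netstat -nr output above. It is a simplification under stated assumptions, not how AIX implements routing: host and broadcast entries are left out, "default" is written as 0.0.0.0/0, and the name pick_route and the off-subnet address 10.10.30.5 are hypothetical.

```python
import ipaddress

# Network routes transcribed from the netstat -nr output above
# (host/broadcast entries omitted; "default" written as 0.0.0.0/0).
ROUTES = [
    ("0.0.0.0/0", "10.10.10.1", "en0"),
    ("10.10.10.0/24", "10.10.10.11", "en0"),
    ("10.10.20.0/24", "10.10.20.11", "en1"),
]


def pick_route(destination, routes=ROUTES):
    """Return the most specific route whose network contains destination."""
    dest = ipaddress.ip_address(destination)
    matches = [r for r in routes if dest in ipaddress.ip_network(r[0])]
    # The longest prefix (largest prefixlen) wins.
    return max(matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)


print(pick_route("10.10.20.10"))  # TSM server backup IP
print(pick_route("10.10.30.5"))   # hypothetical off-subnet address
```

Under this model, traffic to 10.10.20.10 matches the directly connected 10.10.20/24 route on en1, so the default gateway would only matter for addresses outside both local subnets.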

thanks a lot

You are confusing me.
You showed a different server first, which does not have a gateway, and now you are showing another server.

Paste the output from the problematic server (mention the hostname and/or IP); I don't need output from others.
Are you saying you set up EtherChannel and configured a different subnet on each adapter? Have you done VLAN tagging? For multiple VLANs to exist on the same ent port we need VLAN tagging, NOT just EtherChannel.

Explain the above to me.

The output I just pasted is from a problematic server which has 2 NICs. I was using a "TEST" server which has just one NIC in the 10.10.20.x segment, but to be realistic I requested access to one of the production servers having the issue, since it was our customer who found the issue.

this server is 10.10.10.23 - Production IP
10.10.20.11 - Backup IP

Yes, both ends have EtherChannels. I mean, the Power 701 (AIX-TSM) is using 2 adapters for EtherChannel, and yes, I used smitty vlan to create VLAN 500 for backups.

The VIO in the p260 (PureFlex), which hosts the production LPARs, is using EtherChannel, and VLANs are created as well.

Hmmm,
Ok, did you try to login to the client using 10.10.20.11 IP?

Have you done a traceroute from TSM server to client (10.10.20.11)?

Or have you tried adding the TSM server IP/Hostname in /etc/hosts file of client?

Yes, I can log in from AIX-TSM to 10.10.20.11 using ssh.

this is the traceroute

Power-TSM:/> traceroute 10.10.20.11
trying to get source for 10.10.20.11
source should be 10.10.20.10
traceroute to 10.10.20.11 (10.10.20.11) from 10.10.20.10 (10.10.20.10), 30 hops max
outgoing MTU = 1500
 1  10.10.20.11 (10.10.20.11)  1 ms  0 ms  0 ms

I have not tested /etc/hosts, but do you think it will work?

thanks

Ah! OK, let me ask you this: have you looked at the DNS?

nslookup <hostname of tsmserver> from client
Also do the reverse nslookup:
nslookup <IP of tsmserver> from client (x.x.20.11)

Is it resolving to the correct IP/hostname?

Check your /etc/resolv.conf file.

Also, in your /etc/netsvc.conf file, what is the value of "hosts=xxxx"?

Let me know, what you see.

No, we are not using DNS here. We are hosting those LPARs in our VIO and there is no DNS to resolve between 10.10.20.x addresses.

So, when you do
nslookup <tsm server>, what do you get?

If you are relying on the local /etc/hosts file for name resolution, then go ahead and add the entry for the TSM server to the client's hosts file.

No, we are not using DNS to resolve this.

DNS resolves only the 10.10.10.x IPs that are used by our customer, and they added an entry in /etc/hosts to resolve this AIX-TSM server by name and IP.

OK. So, this x.x.20.11 is on a private network.

So, how did you configure the IP (en1 - 10.10.20.11)?

I assume you did

smitty chinet --> selected en1 -->
  INTERNET ADDRESS (dotted decimal)                  [10.10.20.11]
  Network MASK (hexadecimal or dotted decimal)       [xx.xx.xx.xx]
Rest of the values default

Now the question is, have you entered the subnet mask value correctly?

Show me the subnet mask value from the client and also from the TSM server.
You can get it by running
ifconfig en1 (you can grep for netmask if you want)
The value will be something like 0xffffff00 (this is an example)

Post the values from the TSM server for the 20.x network and also from the client for the 20.x network.
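
To compare the two values once you have them, here is a minimal Python sketch that converts an ifconfig-style hexadecimal netmask to dotted decimal and checks whether two addresses land in the same subnet under that mask. The function names hex_mask_to_dotted and same_subnet are just illustrative; the addresses are the ones from this thread and the mask is the example value above.

```python
import ipaddress


def hex_mask_to_dotted(hex_mask):
    """Convert an ifconfig-style mask such as '0xffffff00' to dotted decimal."""
    return str(ipaddress.IPv4Address(int(hex_mask, 16)))


def same_subnet(ip_a, ip_b, hex_mask):
    """True if both addresses fall in the same network under the given mask."""
    mask = hex_mask_to_dotted(hex_mask)
    # strict=False lets us pass a host address instead of the network address.
    net_a = ipaddress.ip_network(f"{ip_a}/{mask}", strict=False)
    return ipaddress.ip_address(ip_b) in net_a


print(hex_mask_to_dotted("0xffffff00"))                         # 255.255.255.0
print(same_subnet("10.10.20.10", "10.10.20.11", "0xffffff00"))  # True
```

If the mask on one side were wrong, the same check run with each side's own mask could give different answers for the same pair of addresses.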

Yes I did; there is nothing difficult in setting the mask: 255.255.255.0

OK,
Nothing more comes to mind right now. At this point what I can say is:

Add the TSM server (20.x) to the /etc/hosts file of the client.

Make sure you follow the correct naming convention for both; if you go with the default hostname, it will try to connect using the 10.10.10.x network.

Also, check what the IP address of the TSM server is in the client's dsm.sys file.
You can find it under

/usr/tivoli/tsm/client/ba/bin(64)

In earlier versions it will be bin, and in later versions it will be bin64.
Look for TCPServeraddress in the dsm.sys file.
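
If you want to check that on several clients, here is a rough Python sketch that pulls the TCPServeraddress out of dsm.sys-style text, per server stanza. The stanza below is a hypothetical example, not your real file, and the sketch assumes the option names are spelled out in full (it does not handle abbreviated option names) and that lines starting with * are comments.

```python
def tcp_server_addresses(dsm_sys_text):
    """Return {servername: tcpserveraddress} from dsm.sys-style text."""
    result = {}
    current = None
    for raw in dsm_sys_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):  # skip blanks and comment lines
            continue
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        option, value = parts[0].lower(), parts[1].strip()
        if option == "servername":
            current = value          # start of a new server stanza
        elif option == "tcpserveraddress":
            result[current] = value
    return result


# Hypothetical stanza for illustration only.
sample_dsm_sys = """\
* backup network stanza
SErvername  TSM_BACKUP
   COMMMethod        TCPip
   TCPPort           1500
   TCPServeraddress  10.10.20.10
"""

print(tcp_server_addresses(sample_dsm_sys))  # {'TSM_BACKUP': '10.10.20.10'}
```

If the address that comes back is a hostname or a 10.10.10.x IP, the client would be going over the administration network rather than the backup one.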

Run dsmc from the client and see if it connects to the TSM server; then you can run some queries, for example:
q files

Thanks a lot, ibmtechh. Yes, it's weird, and seeing the Windows virtual machines working fine makes me think it is something about AIX.

I will try the last things you suggested.