solaris ping timeout

Hi,

I have two Solaris 9 servers on the same switch:

primary
int0: 10.35.65.51
int1: 10.35.65.53

warm standby
int0: 10.35.65.52
int1: 10.35.65.54

The primary server communicates with the standby for DB replication over the int0 interfaces.

We also use a web interface running on int0, and sometimes it doesn't respond.

To investigate, I started pinging int0 and int1 on both servers and saw that int0 sometimes doesn't respond to ping, while int1 on the same switch keeps responding.

I swapped the switch to rule out a switch fault, but the ping problem appeared again.

Frequency of the problem: roughly every 10 minutes, I cannot reach the first box via ping for 10-15 seconds.

Is it possible that ping doesn't respond because of high-volume traffic on int0 of both servers?

Or is it due to a problem with the interface itself? How can I check?

Could you help?

P.S. I checked the system log files and there were no hardware errors.
Thanks in advance

I would make sure you have at least a basic level of "Recommended Patches" on the box - these can often include kernel patches.
I would also swap out the UTP cables if you have spares - it's a simple and cheap check that can often be missed.
Also, are there likely to be any boxes that might have been allocated a duplicate IP address? I've seen in the past, because of the regularity of ARP messages, two machines flip-flopping in answering ping requests.
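
To check for the duplicate-IP flip-flop, one option is to watch which MAC address a third host on the same subnet has cached for the affected IP. Below is a minimal sketch, not a definitive tool: mac_for_ip is a hypothetical helper name, and it assumes the IP appears as a whole field and the MAC is the last field on the line, as in Solaris `arp -a` output. On Solaris the stock awk does not accept -v, so nawk or /usr/xpg4/bin/awk may be needed.

```shell
# Hypothetical helper: read `arp -a` output on stdin and print the
# hardware address listed for the IP given as $1.
# Assumes the MAC is the last whitespace-separated field on the line.
mac_for_ip() {
    awk -v ip="$1" '{
        for (i = 1; i <= NF; i++)
            if ($i == ip) { print $NF; exit }
    }'
}

# Usage sketch (run from a third host; the ping refreshes the ARP entry,
# and the Solaris-style "ping host timeout" form is assumed here):
#   while :; do
#       ping 10.35.65.51 2 >/dev/null
#       arp -a | mac_for_ip 10.35.65.51
#       sleep 5
#   done | uniq
```

If the loop ever prints more than one distinct MAC for the same IP, two machines are answering for that address.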

hi,

2 NICs on the same subnet are likely to cause you problems. What sort of routing entries do you have for these interfaces?

Can you paste your routing table please.

netstat -rn
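
If the table is long, a small filter can flag any destination that is directly attached on more than one interface. This is only a sketch under assumptions: dup_subnets is a made-up name, and it expects the six-column Solaris layout (Destination, Gateway, Flags, Ref, Use, Interface) with plain "U" flags for directly attached networks.

```shell
# Hypothetical helper: read `netstat -rn` output on stdin and print every
# destination that has a direct route (flags exactly "U") on more than
# one interface, followed by the interfaces involved.
dup_subnets() {
    awk '$3 == "U" && NF == 6 {
             count[$1]++
             ifaces[$1] = ifaces[$1] " " $6
         }
         END {
             for (dest in count)
                 if (count[dest] > 1) print dest ":" ifaces[dest]
         }'
}

# Usage sketch:
#   netstat -rn | dup_subnets
```

Any line it prints is a subnet reachable through two NICs at once, which is the setup that tends to produce asymmetric routing.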

I would go out on a limb and suspect asymmetric routing as the cause of the erratic behaviour.

You could check this from the machine itself by running the following command a few times in succession:

traceroute -p 22 some_box_with_sshd_running

And look at which interface is chosen. If it tends to differ between runs, that lends credence to the asymmetric routing theory.

Hi,

Thanks for quick replies.

Below is the netstat output

# netstat -rn

Routing Table: IPv4
Destination          Gateway              Flags Ref   Use    Interface
-------------------- -------------------- ----- ----- ------ ---------
10.35.65.0           10.35.65.55          U     1     4      ce1
10.35.65.0           10.35.65.51          U     1     0      ce2
10.60.4.0            10.35.65.1           UG    1     2
10.35.0.0            10.35.65.1           UG    1     0
224.0.0.0            10.35.65.51          U     1     0      ce2
default              192.168.196.1        UG    1     15
127.0.0.1            127.0.0.1            UH    2     8202   lo0

I am going to check "traceroute -p 22 some_box_with_sshd_running".

By the way, we tried a different cable, but still got the same problem.

Ah yes... I seem to remember somewhere in my brain that Solaris does/did the odd thing of having the same hardware MAC address for multiple NICs, doesn't it? Could that be confusing the switch?

You mean local-mac-address in eeprom.

That could be it, yep -

@ OP, please run the following and post.

eeprom | grep local-mac

Let us also know what the traceroute threw up.

This article discusses the one MAC/multiple NIC issue, and how to disable that....might be worth a try if you're allowed a reboot.

Thanks for the link citaylor. I will try that option as soon as possible.

Below is the traceroute output
traceroute: Warning: Multiple interfaces found; using 10.35.65.51
traceroute to 10.35.65.52 (10.35.65.52), 30 hops max, 40 byte packets
1 vfelig2 (10.35.65.52) 0.323 ms 0.263 ms 0.227 ms

Below is the latest netstat output (the previous netstat output was an older one)

----------------------------------------------------
Routing Table: IPv4
Destination          Gateway              Flags Ref   Use    Interface
-------------------- -------------------- ----- ----- ------ ---------
10.10.4.0            10.35.65.1           UG    1     644
10.35.65.0           10.35.65.55          U     1     9529   ce1
10.35.65.0           10.35.65.51          U     1     0      ce2
10.35.67.0           10.35.65.1           UG    1     736
192.168.196.0        192.168.196.196      U     1     30263  ce0
10.60.4.0            10.35.65.1           UG    1     2263
10.10.0.0            10.35.65.1           UG    1     4526
10.35.0.0            10.35.65.1           UG    1     435
224.0.0.0            10.35.65.51          U     1     0      ce2
default              192.168.196.1        UG    1     2784
127.0.0.1            127.0.0.1            UH    39289846423 lo0

eeprom | grep local-mac
local-mac-address?=true
(So it is okay, no need to change it.)

OK, from the netstat output it looks like ce1 is getting the use rather than ce2, but what I wanted to know from the traceroutes was what happens when you run the command several times in succession (one after another), maybe 5-8 times.

You'd be looking for this line changing:

traceroute: Warning: Multiple interfaces found; using 10.35.65.51

If the "using" part changes to another interface/IP, then back again across the iterations, then you know you have a routing issue due to having 2 NICs on the same subnet.
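
To save eyeballing each run, the warning lines can be tallied. A rough sketch, with assumptions: using_summary is a hypothetical helper name, the warning text is matched exactly as it appears in the output posted above, and the warning is assumed to go to stderr (hence the 2>&1 in the usage line).

```shell
# Hypothetical helper: read captured traceroute output on stdin and print
# each source address named in the "Multiple interfaces found; using ..."
# warning, followed by how many runs picked it.
using_summary() {
    sed -n 's/.*Multiple interfaces found; using \([0-9.]*\).*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage sketch (8 runs, one after another):
#   for i in 1 2 3 4 5 6 7 8; do
#       traceroute 10.35.65.52 2>&1
#   done | using_summary
```

A single output line means the same source address was picked every time; two or more lines mean the choice is flipping between interfaces.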