Serious un-pingable stumper of a problem...

I have been busting my head over a network issue at work recently. I believe the problem to be in the L2 domain, but "the powers that be" believe that it looks more like a server port related problem. And the biggest problem of all is that EVERYBODY in the Engineering Department uses this file-server...

The symptoms are as follows:

  • A Samba share is exported from "FileServ_1" to my desktop. While I have a file open for read/write, I lose the file (i.e. the connection drops), and my app prompts me to save a local copy (lucky me).
  • From that point, I immediately (being prepared) switch to a shell and kick off a ping to "FileServ_1"; in another shell I bypass DNS and ping the IP directly; in a third shell I have a remote connection from a totally different subnet, also pinging "FileServ_1"; and finally I run a traceroute from both my desktop and the remote connection.
  • ALL pings time out, and in every trace the last hop is the dead zone.
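The manual checks above could be left running unattended with something like the following. This is only a rough sketch, not a tested tool: it assumes Linux iputils `ping` (where `-W` is a reply timeout in seconds), and the function names `probe`, `outage_windows` and `watch` are made up for illustration.

```python
import subprocess
import time


def probe(host, timeout_s=1):
    """Send one ICMP echo with the system ping; True if a reply came back."""
    # -c 1: a single packet; -W: reply timeout in seconds (Linux iputils flags,
    # an assumption here; other ping implementations use different units).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (first_fail, last_fail) spans."""
    windows = []
    start = None
    last_fail = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts          # first failure of a new outage
            last_fail = ts
        elif start is not None:
            windows.append((start, last_fail))  # outage ended with this reply
            start = None
    if start is not None:
        windows.append((start, last_fail))      # still down at end of samples
    return windows


def watch(targets, interval_s=1.0, rounds=60):
    """Probe every target once per interval; return samples keyed by target."""
    samples = {t: [] for t in targets}
    for _ in range(rounds):
        now = time.time()
        for t in targets:
            samples[t].append((now, probe(t)))
        time.sleep(interval_s)
    return samples
```

Calling `watch(["FileServ_1", "192.0.2.10"])` (the IP is a placeholder, not the real address) and passing each target's samples to `outage_windows` gives timestamped outage spans. Those can then be lined up against the topology-change messages on the switch.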

Although "the powers that be" make a strong case for their point, I have noticed "network topology changes" being reported at the switch (indicating a loop?), and I have been able to attach a serial console to "FileServ_1" and watch it while it is supposedly "down". The only problem is that it never thinks it is down.

  • Eth1 (until last week the only port plugged in) never reports any issues (at least not at any default log level), and from what I can see there is no way to tell whether the ICMP packets are dying on the way in or on the way out.

Finally, as if things were not bad enough, they decided last week to make Eth0 a redundant fail-over for Eth1... which amazingly seemed to lighten the problem from "a few minutes of un-ping" to "a few seconds of un-ping"... and now, instead of happening 10 times a day it happens only once or twice.

So first things first (unless you have better ideas): I am wondering how to turn up the logging of ICMP (that's kernel level, right?) and possibly Eth* logging, so that I don't have to resort to sniffing all day until it happens. If nothing else, I would like to diagnose this problem correctly and get something done about it.

Any Help?

ICMP logging can be done on the router side, but that would certainly require the net-admins to get involved. On your end, you may find the "ngrep" utility useful for tracking ICMP traffic. You can also run "netstat -s", which shows per-protocol network statistics.
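To get at the "in or out" question without sniffing, one option is to snapshot the kernel's ICMP counters before and during an outage and compare them. The sketch below assumes a Linux server, where `/proc/net/snmp` exposes counters such as `InEchos` and `OutEchoReps` (as far as I know these are the same numbers "netstat -s" summarises). The helper names `parse_snmp`, `read_snmp` and `icmp_delta` are made up for illustration.

```python
def parse_snmp(text):
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}."""
    tables = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The file comes in pairs: a line of counter names, then a line of values,
    # both prefixed with the protocol name and a colon.
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, nums = values.split(":", 1)
        if proto != value_proto:
            continue  # unexpected layout; skip rather than guess
        tables[proto] = dict(zip(names.split(), (int(n) for n in nums.split())))
    return tables


def read_snmp(path="/proc/net/snmp"):
    """Take one snapshot of the kernel counters (Linux-only path, assumed)."""
    with open(path) as fh:
        return parse_snmp(fh.read())


def icmp_delta(before, after, fields=("InEchos", "OutEchoReps")):
    """Return how much each chosen ICMP counter grew between two snapshots."""
    return {f: after["Icmp"][f] - before["Icmp"][f] for f in fields}
```

Take one `read_snmp()` snapshot when the pings start failing and another a little later, then pass both to `icmp_delta`. If `InEchos` keeps climbing while the pings time out, the requests are reaching the server and the reply path is the suspect. If it stalls, the requests are probably dying before they arrive. Treat this as a hint rather than proof, since other hosts pinging the server move the same counters.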