How to demonstrate network problems?

Hi,
I work on several Sun servers running Solaris (SunOS 5.10). All of them are Application Servers (AS) running a proprietary software package.
Sometimes (not regularly or deterministically, and not often, roughly twice a month) we register what I think are network problems.
I say this because the messages appearing in the alarm log of the AS say:

  • Can't connect to NE
  • AS not available
  • Connection to NE time out
Also, after such alarms we once had a huge number of frozen TCP connections stuck in CLOSE_WAIT. Following this, the file descriptors for the process that handles the TCP connections reached the system limit of 5120 and the AS started to transmit over UDP, causing transaction failures since the other side did not expect UDP packets.
That behavior (frozen CLOSE_WAIT) is now fixed with a patch from the vendor, but the network (?) problems causing the alarms remain.

So since the first time it happened I have had a fixed idea in my mind: network problems (on the Catalyst side, to be clear).
Whatever it may be (traffic switch, down link ...), it affects the SIP traffic of our AS.
The only problem is that the network guys say: from our point of view everything is fine! Every single time the problem happens!

At the end of the story:
What kind of data can I collect (and HOW!?) to DEMONSTRATE that it's a problem of the network (or the AS)?
Do you know a script or tool (one that does not affect traffic) that can monitor the described scenario?

Thank you very much!

Regards,
Evan

First, read up on snoop. You can capture and analyze packets, then run reports against those captures for network verification when you have a problem. The main downside to this strategy is disk usage, not performance. Your point of interest is to be able to relate the send time of a packet from your app to the time it takes for the expected return packet to appear. This is not the same thing as ping return times.
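For example (just a sketch, assuming SIP on the standard port 5060 and an interface named bge0; substitute your own values):

# capture only the SIP signalling to a file
snoop -q -d bge0 -o /var/tmp/sip.cap port 5060

# read the capture back later with delta timestamps, so you can see the
# gap between a request leaving the AS and the reply coming back
snoop -t d -i /var/tmp/sip.cap | more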

You can also use netstat -i to gather information on packet collisions. When collisions rise to a significant level, traffic is impeded to a large degree.
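For instance (bge0 and the 10-second interval below are only examples):

# one-off snapshot of per-interface packet, error and collision counters
netstat -i

# running counters for one interface, sampled every 10 seconds
netstat -i -I bge0 10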

However, most network problems I have seen are the result of crappy application code.
Once a network is set up correctly, and is not subject to huge torrents of random data, you do not see problems except for hardware issues or intrusions.

A common application fault is not implementing a circular buffer (queue) large enough to deal with full-bore traffic. In other words, between two apps, one faster and one slower, the network outruns the slower application.
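One cheap, non-intrusive check on the receiving box is to watch the Recv-Q column of netstat: if it stays large, the application is not draining its socket as fast as the network fills it. A sketch, assuming the AS listens on the standard SIP port 5060 (substitute your real port):

# Solaris prints Swind Send-Q Rwind Recv-Q State for each TCP socket
netstat -an -P tcp | grep '\.5060 '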

Hi Jim, thanks for your reply.

Regarding snoop, I'll brush up a previous script: an infinite loop that creates one snoop capture file of 5000 packets at a time:

snoop -q -o ./trace.cap -d eth0 -c 5000

Each file is then kept or deleted based on other stats, so there is no disk space problem.
The script didn't affect the system much from a CPU and memory point of view.
I'll tune the number of packets.
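Roughly like this (a sketch from memory; the paths and the keep/delete test are placeholders):

#!/bin/sh
# rotate snoop capture files of 5000 packets each
i=0
while true
do
    snoop -q -d eth0 -c 5000 -o /var/tmp/trace.$i.cap
    # ... decide here whether to keep or delete /var/tmp/trace.$i.cap ...
    i=`expr $i + 1`
done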

Your last statement is not very clear to me.
Do you mean that in some cases the network could be slower than the application traffic and that some packets are then lost?
So you suppose that the application has a limited (undersized) buffer and that during traffic peaks it could lose packets, leading to the mentioned errors.
Is this right?

In the meanwhile I'm studying kstat and dtrace as described here:
H.K. Jerry Chu's Weblog
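For example, I was thinking of polling the TCP MIB counters, something like this (just a sketch; the counter name and the interval are only examples):

# print the TCP retransmitted-segments counter every 10 seconds
kstat -m tcp -s tcpRetransSegs 10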

Do you think I can get useful information with these tools?

Thanks,
Evan

No - the reverse. The network is not the problem. The application loses data because it cannot handle the traffic.

If your network guy knows what he/she is doing, the likelihood of your applications being at fault is pretty good.

Your goal should be: correlate times of network data with good performance and bad performance.
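Something as simple as this gives you timestamped samples to line up against the AS alarm log afterwards (the path and the 60-second interval are just placeholders):

#!/bin/sh
# append timestamped interface and TCP counters once a minute
while true
do
    date >> /var/tmp/netstats.log
    netstat -i >> /var/tmp/netstats.log
    netstat -s -P tcp >> /var/tmp/netstats.log
    sleep 60
done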

Hi Jim,
sorry for not updating this post.
A TCP issue turned out to be the problem!

Let me explain: the core network sends TCP packets with the window parameter set to 0 (this happens when a process restarts; it is still not clear why it restarts).
The AS handles this scenario in the standard manner: it sends no more traffic toward the core NE.
This led to some asserts in the log files reporting disconnection from all the nodes.
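If anyone hits the same thing, the zero-window advertisements should show up in a snoop capture; something like this would list them (the Win= label is how snoop's summary line prints the advertised window, check a line of your own capture first):

# list packets advertising a zero receive window in a saved capture
snoop -i /var/tmp/trace.cap | grep 'Win=0'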

Regards,
Evan