Bizarre TCP/IP problem

pileofrogs · February 1, 2011, 8:29pm

Hi all.

I have a really really weird problem that I've been working on for days.

The problem shows up as users being unable to connect to our web servers via SSH when they're on our wireless network. Here's where it gets weird:

Clients from anywhere other than the wireless subnet can connect fine
Wireless clients can connect to ssh servers on subnets other than the one our web servers are on (both onsite and offsite)
I can run nc -l 22 on one of the web servers and transfer big files from a wireless client with cat bigfile | nc webserver 22.
If I run telnetd on port 22 on one of our web servers, I cannot connect. It fails in a very similar way to ssh.
Update (Three days later) I can recreate the problem in netcat by typing into the client and server alternately. If I just send one-way in netcat, the problem never comes up.
The TCP handshake succeeds, then packets stop arriving and the client starts resending packets. The server seems to be waiting.
When I kill the ssh or sshd process, a bunch of tcp packets start flowing. If I kill the client, the server will actually show a completed key exchange (ssh obviously). Said another way, the connection stalls, I kill the client, the connection continues a bit with the client dead and then closes.
Googling around I found lots of folks who recommended fiddling with MTU and some IP /proc variables, but that did not help. The problem is too consistent to be that anyway. And I can nc big files (10Mb) with no problem. (md5 checked)
I thought it might be a DNS problem, but tcpdump shows no DNS queries while the connection hangs (set UseDNS no in sshd_config).
Update (Two days later...) - I plugged in a machine that is not a Xen host or client, and it shows the same behaviour, so we can rule out any Xen strangeness as the culprit.
Update (Three days later...) - After the TCP handshake, the client can send as many packets as it wants UNTIL the server sends anything (again, after the initial handshake), after which any packets from the client do not reach the server.
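
The alternating send pattern described above could be scripted rather than typed into netcat by hand. What follows is only a minimal sketch, not something from the original testing: the names serve_once and probe, the line-based protocol, and the default counts and timeout are all made up for illustration. The server stays silent for the first few lines, then speaks once and echoes everything after that, so the client can report whether it stalled before or after the server's first data packet. Both sides must be given the same value for quiet.

```python
import socket
import threading


def serve_once(listener, quiet):
    """Accept one client, read `quiet` lines silently, then ack and echo."""
    conn, _ = listener.accept()
    with conn, conn.makefile("rwb") as stream:
        for _ in range(quiet):
            if not stream.readline():
                return  # client went away during the one-way phase
        # First data the server sends after the TCP handshake.
        stream.write(b"got %d\n" % quiet)
        stream.flush()
        # From here on, echo every line back to the client.
        for line in stream:
            stream.write(line)
            stream.flush()


def probe(host, port, quiet=5, rounds=5, timeout=3.0):
    """Return "ok", or a string naming the phase where the exchange stalled."""
    phase = "connect"
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock, \
                sock.makefile("rwb") as stream:
            phase = "one-way send"
            for i in range(quiet):
                stream.write(b"one-way %d\n" % i)
            stream.flush()
            phase = "first server reply"
            if stream.readline() != b"got %d\n" % quiet:
                return "unexpected data during " + phase
            for i in range(rounds):
                phase = "round %d after server spoke" % i
                msg = b"ping %d\n" % i
                stream.write(msg)
                stream.flush()
                if stream.readline() != msg:
                    return "unexpected data during " + phase
    except OSError as exc:  # includes socket timeouts
        return "stalled during %s (%s)" % (phase, exc)
    return "ok"


if __name__ == "__main__":
    # Loopback demo; on a real network, run serve_once on the web server
    # and probe from a wireless client instead.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    threading.Thread(target=serve_once, args=(listener, 5), daemon=True).start()
    print(probe("127.0.0.1", port))
```

If the behaviour matches the description above, the one-way phase and the first server reply should succeed from a wireless client, and the stall should be reported in one of the rounds after the server has spoken.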

Other important info:
I only control the client I'm testing with and the web servers. I do not control the wireless setup or the routers or the firewalls. Those are all controlled by my boss. He's checked his config and it looks good to him, so if it really is something wrong on his end, I need really good evidence before I waste his time some more. Really, the clues so far point to my servers being the source of the problem.

The servers are all CentOS 5.5. They are virtualized under Xen. (Tcpdump shows the same stuff on the Xen host/Dom0 as on the client/DomU, so I don't think it's a Xen problem, but then again....) Update: My client is also Linux, Fedora 11. The problem was initially reported by a Mac user, version unknown.

Okay, I gotta go soak my head.... Thanks All!

-Pileofrogs

methyl · February 3, 2011, 5:20pm

Did you mention what Operating System and ssh software the clients use? If Microsoft is involved, please be specific about software versions.

Corona688 · February 3, 2011, 5:50pm

It sounds like an MTU problem. A while ago we had another fellow with a similar-looking problem -- he could connect on FTP, but the socket would transfer a few kilobytes then timeout, because his client's MTU was too large.

Early in the session, when they're still negotiating, they'll mostly be sending small packets and the problem goes unnoticed, but when you start transferring bulk data (or ssh keys?), some link between your clients and your web server chokes on packets larger than it's configured to handle and drops them into hyperspace, leaving both ends waiting for the other. Retransmits also get dropped, so the connection chokes and eventually dies.

It should be able to handle that gracefully -- compliant routers send an ICMP reply which says "too big! fragment them more!" But there are unfortunately lots and lots of firewalls set up by people convinced that all ICMP is bad.

Try reducing the MTU on your clients and see if that helps.

Try pinging hosts from the wireless with huge packets to see if some links start dropping before others and, if they do drop, whether anything sends an ICMP reply.
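
As a rough sketch of that ping test: the small Python helper below works out the ICMP payload size that produces a given total packet size and prints ping commands with the don't-fragment bit set. It assumes Linux iputils ping (the -M do, -s and -c options), an IPv4 header with no options, and the hostname "webserver" as a placeholder; the function names and the list of MTU values to try are illustrative only.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def ping_payload(mtu):
    """Payload size for `ping -s` so the whole IPv4 packet is `mtu` bytes."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_commands(host, mtus=(1500, 1492, 1400, 1280, 576)):
    """Build iputils ping commands that forbid fragmentation (-M do)."""
    return [
        "ping -c 3 -M do -s %d %s" % (ping_payload(mtu), host)
        for mtu in mtus
    ]


if __name__ == "__main__":
    for command in ping_commands("webserver"):
        print(command)
```

Running the printed commands from a wireless client, largest size first, should show at what size replies stop and whether a "fragmentation needed" ICMP error comes back or the packets simply vanish.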

methyl · February 3, 2011, 6:08pm

If it is an MTU problem, try ftp with the parameter "-B 1". I have seen dramatic speed improvements because "-B 1" prevents "jumbo packets" which can be extremely slow unless every software and hardware component in the network was expecting this "enhancement" to the TCP/IP protocol.

@Corona688
Hmm sounds like a classic unix-to-Microsoft ftp problem. It is a firewall problem because Imho Microsoft don't implement ftp correctly. In unix you can transmit small files on port 21 but need port 20 open to transmit large files. Nuff said.

If it's unix-to-unix, lowering the MTU with the "-B" parameter to "ftp" can produce serious speed improvements on a mixed-manufacturer network.

pileofrogs · February 4, 2011, 1:13pm

Thanks for taking the time to answer! Sadly, I've already played around with the MTU and it didn't help. It's actually balking on packets with almost no data at all in them.

---------- Post updated at 10:13 AM ---------- Previous update was at 10:06 AM ----------

Sorry! Client is Fedora 11 linux using OpenSSH. The problem was originally reported by someone using a Mac, I don't know the OS version. He tried using the command line ssh & something called cyberduck.

Since I can recreate the problem using telnet and now, netcat, I think it's not specific to any versions or OS.