Solaris 11 Express NAT/Router IP Fragments

After replacing my Linux router/server with a Solaris one, I've noticed very poor network performance. The server itself has no issues connecting to the net, but clients using the server as a router are generating a lot of IP fragments, as indicated by some packet sniffing I did.

Here was my old setup.
<DSL_Modem>-<Linux Router>-<switch>-<wifi>-<macbook>

  • this setup works fine, with no fragmentation or performance issues

Setup 1
<DSL_Modem>-<Sol 11 Router>-<switch>-<wifi>-<macbook>

  • this setup has major packet fragmentation

Setup 2 (taking wifi out of the flow)
<DSL_Modem>-<Sol 11 Router>-<switch>-<macbook>

  • this setup has major packet fragmentation

I played with various MTU settings on the Solaris server's internal NIC, but it made no difference, so I tried a couple of things on the client box.

I determined that the largest payload I could send from my MacBook without fragmentation was 1464 bytes, using:
ping -D -s 1464 <any internet ip>
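As a rough sanity check on that number (assuming ping's -s counts only the ICMP payload, which is how the BSD/macOS ping behaves), the probe works out to the usual PPPoE MTU:

# 1464 bytes of ICMP payload + 8-byte ICMP header + 20-byte IP header = 1492 on the wire
ping -D -s 1464 <any internet ip>   # largest size that still gets replies unfragmented
ping -D -s 1465 <any internet ip>   # one byte more should stop getting through once the path MTU is exceeded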

Once I manually set the MTU on my MacBook down to 1464 instead of the default 1500, web pages started loading normally. So here's the problem: why do I have to manually set the MTU on the client MacBook when I have my Solaris server set up as a router? Is there some network-related tuning I can perform on the server that will address these issues?

Maybe some firewall or setting is not allowing Path MTU Discovery, the process where routing tables are used to record, for specific hosts on normal routes, the discovered maximum MTU of the path to that host. This is done by sending packets with the DF (don't fragment) flag set and getting ICMP "fragmentation needed" messages back, or no response at all (a Path MTU black hole, where a firewall or setting blocks the ICMP message).
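One quick way to check for that on the Solaris box is to watch ICMP while a client loads a page; snoop ships with Solaris, and the interface name below is just a placeholder for your external NIC:

# Watch for ICMP "fragmentation needed"/"packet too big" messages (or their absence)
# while a client browses; silence here with stalled transfers suggests a black hole.
snoop -d e1000g0 icmp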

Packet fragmentation is not uncommon with a VPN, for instance, since the VPN wrapper expands the packet size. NAT just rewrites packets in place and does not expand them, unless NAT has grown new features since I last played with it.

Normally, the MTU is 1500 on Ethernet; the 802.3 MTU is 1492. I wonder what is trimming the MTU to 1464? Is a VPN in play? http://en.wikipedia.org/wiki/Maximum_transmission_unit

Packet fragmentation should not be the end of the world, speed-wise, just a bit less than optimal given all the additional small fragments. Could you, or someone, have turned off reassembly to avoid a related denial of service?

Extremely low MSS or RWIN (window size) settings can lower packet size. A low RWIN means the recipient does not have the buffer to hold the data in the packet, which seems very silly, but here we are. A "nice" TCP stack could ack the part it could digest (once it has some space) and discard the part it has no buffer for, but who knows?

At one time, for Internet traffic, servers wanted an RWIN of about 4 * MSS, where MSS = MTU - IP header (20 bytes for IPv4) - TCP header (20 bytes, plus any options in 4-byte multiples, one of which advertises the MSS), so 4 packets are sent before an ack is waited for. You can go much higher to boost performance, at the cost of more potential retransmission in case of error. Originally, RWIN maxed out at 65535, but later (RFC 1323) window scaling allowed it to go higher. http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Window_scaling

RWIN represents the size of an end's TCP socket receive buffer (set in code with setsockopt(sock, SOL_SOCKET, SO_RCVBUF, (char *)&sock_buf_size, sizeof(sock_buf_size))), and RAM has gotten cheaper and more ample. RWIN needs to accommodate all the data you can normally send before the ack of the first packet returns, or throughput chokes. Big transmit socket buffers (SO_SNDBUF) are nice but not as critical to net throughput; they ensure the sending application can hand all the data of one write to the API and move on without blocking. Of course, both ends have an MSS, but MSS matters most at the end receiving the bulk of the data, so the sourcing system can keep sending at the maximum rate without delays. Welcome to the full-duplex world of TCP, simulated if not real. Be careful to tune both ends! http://en.wikipedia.org/wiki/Maximum_segment_size http://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_segment_structure
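On the Solaris side, the default socket buffer sizes that back RWIN can be inspected and raised with ndd; the parameter names are the usual Solaris TCP tunables, and the 128 KB figure below is only illustrative, not a recommendation:

# Current defaults for the receive/transmit socket buffers and the per-connection cap
ndd -get /dev/tcp tcp_recv_hiwat
ndd -get /dev/tcp tcp_xmit_hiwat
ndd -get /dev/tcp tcp_max_buf

# Example only: raise the default receive (RWIN) and transmit buffers to 128 KB
ndd -set /dev/tcp tcp_recv_hiwat 131072
ndd -set /dev/tcp tcp_xmit_hiwat 131072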

So, once you find the choke points in the MTU, you need to tune the RWIN and MSS so TCP will use it, tune any apps for big buffers, and ensure Path MTU Discovery and black hole detection are properly configured; then you can get close to the throughput you paid for, at least in the more popular direction.


Some good detail in there. I also found some useful information here: MSS Problems with Sun PPPoE. Additionally, I reviewed my Linux router config to see what may be "working" there and found that the following firewall rule was likely addressing the issue I'm now experiencing with Solaris.

iptables -I FORWARD 1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
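For reference, that's the same rule split across lines with a note on what each piece does:

# Match forwarded TCP packets with SYN set and RST clear (i.e. connection setup)
# and rewrite their MSS option down to the discovered path MTU.
iptables -I FORWARD 1 \
    -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --clamp-mss-to-pmtu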

Too bad it wasn't that simple with the Solaris setup :). I'll attempt to tune the Solaris setup and see how I make out.
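For anyone who wants to keep chasing the firewall angle instead: some IPFilter releases document an mssclamp keyword on ipnat map rules, which sounds like the moral equivalent of the iptables rule above. I haven't verified it on Solaris 11 Express, so treat the line below as an unconfirmed sketch to check against ipnat(5); the interface name and subnet are placeholders.

# Unverified -- confirm mssclamp support in your ipnat man page before relying on it.
# e1000g0 = external (PPPoE-facing) interface, 192.168.1.0/24 = internal LAN.
map e1000g0 192.168.1.0/24 -> 0/32 portmap tcp/udp auto mssclamp 1452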

---------- Post updated 04-09-11 at 08:04 PM ---------- Previous update was 04-08-11 at 08:16 PM ----------

It's working now, and it appears to be performing well, but is it optimal? I'm not sure yet. For those who wish to tackle using Solaris as a firewall/router in front of a PPPoE connection, I'll put my details here.

By default, the negotiated MTU over PPPoE is going to be 1492.
Using that 1492 MTU as the model, I've knocked 40 bytes off (20 for the IPv4 header and 20 for the TCP header) for a maximum MSS of 1452 for the TCP stack to use.

ndd -set /dev/tcp tcp_mss_max_ipv4 1452
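The change can be confirmed with the matching -get:

# Verify the new cap on the TCP MSS
ndd -get /dev/tcp tcp_mss_max_ipv4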

In addition to this I'll want to turn off Path MTU Discovery.

ndd -set /dev/ip ip_path_mtu_discovery 0
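Same deal here; and keep in mind that plain ndd settings are lost at reboot, so once the tuning proves out it has to be reapplied from a startup script or service:

# 0 means Path MTU Discovery is disabled
ndd -get /dev/ip ip_path_mtu_discovery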

Path MTU Discovery is nice on a varied intranet, but not so good for the Internet, where short-lived connections might make it not worth the effort.

Use a sniffer to see what options are in your typical TCP packets (not SYN or FIN), and add their length to the 40 before subtracting. Sometimes the RWIN is called the max MSS. Try various options with a long stream between two local hosts.
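On Solaris, snoop's verbose mode decodes TCP headers including the options; something like the following while pushing a long stream between two local hosts will show what's actually being negotiated (interface and port are placeholders):

# Full verbose decode, including TCP options, for traffic on port 22
snoop -d e1000g0 -v tcp and port 22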

Normally, fragments come from big UDP packets, or from full-size packets going over a VPN.