TCP Tuning and Network Troubleshooting
Setting the Maximum TCP Buffer Size
For many links, it is critical to be able to increase the system-defined maximum TCP buffer size to obtain good throughput. For example, assume a 100Mbps link between California and the United Kingdom, which has an RTT of 150ms. The optimal TCP buffer size for this link is 1.9MB, which is 30 times larger than the default buffer, and 7.5 times bigger than the default maximum TCP buffer for Linux.
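The 1.9MB figure is the bandwidth-delay product of the link: bandwidth multiplied by round-trip time, converted to bytes. A minimal sketch of that arithmetic follows; the function name and the millisecond unit are illustrative choices, not part of any library.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Return the bandwidth-delay product in bytes.

    bandwidth_bps: link bandwidth in bits per second
    rtt_ms: round-trip time in milliseconds
    """
    # bits/s * ms = millibits; divide by 1000 for bits and by 8 for bytes
    return round(bandwidth_bps * rtt_ms / 8000)


if __name__ == "__main__":
    # The 100Mbps California-to-UK link with a 150ms RTT
    size = bdp_bytes(100_000_000, 150)
    print(f"{size} bytes (about {size / 1e6:.1f}MB)")
```

Dividing the result by 30 gives roughly 64KB, and dividing by 7.5 gives roughly 256KB, which matches the default and default-maximum buffer sizes implied by the ratios quoted above.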
To change TCP settings in Linux, add the entries below to the file /etc/sysctl.conf, and then run sysctl -p. The system will also set these values at boot time.
# increase TCP maximum buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and maximum number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
Set the maximum buffer sizes to a value large enough to handle the longest, fastest network link you think that host will encounter (16MB in the example above). Windows does not require any modifications, as the default maximum TCP buffer size (GlobalMaxTcpWindowSize) is not defined. My TCP Tuning Guide web site has information on how to set the maximum buffer size for other OSes.
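One way to sanity-check a set of sysctl entries against the link you care about is to parse them and compare each maximum with the buffer size the link needs. The sketch below assumes the simple "key = values" layout of the entries shown earlier; parse_sysctl and undersized are hypothetical helper names, and lines with non-numeric values are simply skipped.

```python
def parse_sysctl(text):
    """Parse 'key = v1 v2 ...' lines into {key: [ints]}, ignoring # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        try:
            settings[key.strip()] = [int(v) for v in value.split()]
        except ValueError:
            continue  # not a purely numeric setting; out of scope for this sketch
    return settings


def undersized(settings, required_bytes):
    """Return the keys whose maximum (the last value) is below required_bytes."""
    return sorted(
        key for key, values in settings.items()
        if values and values[-1] < required_bytes
    )
```

For example, calling undersized(parse_sysctl(open("/etc/sysctl.conf").read()), 1_875_000) would list any of these settings that are too small for the 100Mbps, 150ms link discussed above. Note that it only inspects the file, not the values the running kernel is actually using.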
By now you probably have questions such as: How do I actually use this in practice? Do I trust users to set the buffer size? Do I try to compute the optimal buffer size for the user? Or should I just set a big buffer size and not worry about it?
In general, I suggest the following strategy for many applications designed to operate over high-bandwidth (> 40Mbps), high-latency (RTT > 10ms) networks. Your client should run ping to determine the RTT and then just assume the bandwidth is 100Mbps. Because ping traffic is blocked by a number of sites, you can also use the tool synack, which uses TCP instead of ICMP to estimate the RTT. If your users are fairly network savvy, then providing them with a way to set the buffer size is useful. Be sure to also provide a way to tell them when the maximum buffer size is too small, so that they can have the system administrator increase it. Just setting large buffers for all network paths is not a good idea, particularly if the application might be run over slow links such as DSL or modem connections.
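The strategy above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a drop-in implementation: the function names are hypothetical, the RTT is estimated by timing a TCP connect (similar in spirit to synack, but not that tool), and the bandwidth is fixed at the 100Mbps assumed above.

```python
import socket
import time

ASSUMED_BANDWIDTH_BPS = 100_000_000  # assumption from the text: 100Mbps


def tcp_connect_rtt(host, port, timeout=5.0):
    """Estimate RTT in seconds by timing a TCP connect.

    If host is a name rather than an address, the DNS lookup is included.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start


def buffer_size_for_rtt(rtt_seconds, bandwidth_bps=ASSUMED_BANDWIDTH_BPS):
    """Bandwidth-delay product in bytes for the given RTT."""
    return round(bandwidth_bps * rtt_seconds / 8)


def request_buffers(sock, size):
    """Request send and receive buffers of `size` bytes; return what the OS reports.

    Call this before connect() or listen(), because the TCP window scale
    is negotiated when the connection is set up.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return granted_snd, granted_rcv


def buffer_warning(requested, granted):
    """Return a message for the user if the OS granted less than requested."""
    if granted < requested:
        return (
            f"Requested a {requested}-byte TCP buffer but the system allowed "
            f"only {granted}. Ask your administrator to raise the maximum."
        )
    return None
```

Linux reports a doubled value from getsockopt() to account for bookkeeping overhead, so a granted size below the request is a reliable sign of clamping, but a modest clamp can go unnoticed by this comparison. Some other operating systems raise an error instead of clamping silently, which a real client would need to catch.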
Linux to the Rescue
Starting with version 2.4, Linux added a feature called sender-side TCP buffer autotuning. This means that for the sender, you no longer need to worry about doing the setsockopt() call. However, you still need to do the setsockopt() at the receiver side, and you still have to adjust the system default maximum autotuning buffer, which by default is only 128K. Starting with Linux 2.6.7, Linux added receiver-side autotuning, so you no longer need to worry about the receiver either. Hooray! Unfortunately, the default maximum TCP buffer size is still way too small--but at least now it is only a system administration problem, and not a programmer problem.
My initial results are quite impressive. After increasing the maximum TCP buffers, on a 1Gbps link across the United States (RTT = 67ms), performance went from 10Mbps using Linux 2.4 to 700Mbps using Linux 2.6.12, for a speedup of 70X. On a link from California to the United Kingdom (RTT = 150ms), performance went from 4Mbps on Linux 2.4 to 560Mbps, for a speedup of 140X! Think about what this means for network backbone utilization when there are lots of Linux 2.6 hosts connected to fast networks running BitTorrent. To emphasize the point: these performance improvements are obtainable only by increasing the default maximum TCP buffer size.
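A rough way to see why small buffers hurt so much: a sender can have at most one buffer's worth of data in flight per round trip, so throughput is capped at about buffer size divided by RTT. The sketch below applies that simplified model, which ignores packet loss and slow start, and assumes a 64KB buffer, the approximate default implied by the "30 times" ratio earlier in this article. The function name is illustrative.

```python
def window_limited_mbps(buffer_bytes, rtt_seconds):
    """Upper bound on throughput (Mbps) with one buffer in flight per RTT."""
    return buffer_bytes * 8 / rtt_seconds / 1e6


if __name__ == "__main__":
    for label, rtt in (("across the US", 0.067), ("California to UK", 0.150)):
        small = window_limited_mbps(64 * 1024, rtt)
        large = window_limited_mbps(16 * 1024 * 1024, rtt)
        print(f"{label}: 64KB caps at {small:.1f}Mbps, 16MB caps at {large:.0f}Mbps")
```

Under these assumptions the 64KB caps come out at a few Mbps for both paths, the same order of magnitude as the untuned figures quoted above, while a 16MB buffer puts the cap well above what either link achieved, so the buffer stops being the bottleneck.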
Linux 2.6 also includes several other TCP enhancements, so this speedup is not completely due to TCP buffer tuning. In particular, Linux 2.6 now uses the BIC congestion control algorithm, which is intended to improve TCP performance over high-bandwidth, high-latency links. Hand tuning of Linux 2.4 over the same links gives throughputs of 300Mbps across the United States and 70Mbps to the United Kingdom. Hopefully these features will appear in future versions of Windows as well.
If you still have trouble getting high throughput, the problem may well be in the network. First, use netstat -s to see if there are a lot of TCP retransmissions. TCP retransmits usually indicate network congestion, but they can also happen with defective network hardware or misconfigured networks. You may also see some TCP retransmissions if the sending host is much faster than the receiving host, but TCP flow control should make the number of retransmits relatively low. Also look at the number of errors reported by netstat, as a large number of errors may also indicate a network problem. A surprisingly common source of LAN trouble with 100BT networks is when the host is set to full duplex but the Ethernet switch is set to half duplex, or vice versa. Newer hardware will autonegotiate this, but with some older hardware, autonegotiation will sometimes fail, with the result being a working but very slow network (typically only 1Mbps to 2Mbps). It's best for both to be in full duplex if possible, but some older 100BT equipment only supports half duplex. See the TCP Tuning Guide for some ways to check your computer's duplex setting.
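On Linux, the retransmission check described above can also be scripted by reading the kernel's TCP counters from /proc/net/snmp, which exposes segment counts such as OutSegs and RetransSegs. The sketch below is Linux-specific, the helper names are hypothetical, and what counts as "a lot" of retransmissions is left to the reader since the article gives no threshold.

```python
def tcp_retransmit_ratio(snmp_text):
    """Return RetransSegs / OutSegs from /proc/net/snmp-style text.

    Returns None if the Tcp counters are missing or no segments were sent.
    """
    tcp_lines = [
        line.split() for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        return None
    # The first Tcp: line holds the field names, the second the values.
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    counters = dict(zip(names, (int(v) for v in values)))
    sent = counters.get("OutSegs", 0)
    if sent == 0 or "RetransSegs" not in counters:
        return None
    return counters["RetransSegs"] / sent


def local_retransmit_ratio(path="/proc/net/snmp"):
    """Read the counters for this host (Linux only)."""
    with open(path) as f:
        return tcp_retransmit_ratio(f.read())
```

The counters are cumulative since boot, so sampling the ratio before and after a test transfer says more about the path under test than a single reading does.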
The Internet2's Network Diagnostic Tool (NDT) is a nice tool to detect both congestion problems and duplex problems. NDT is a Java applet, and you can run it by connecting to one of the NDT servers.
A Warning on the scp Program
A common tool for copying files across the internet is scp. Unfortunately, TCP tuning does not help with scp throughput, because scp uses OpenSSH, which uses statically defined internal flow control buffers. These buffers act as a bottleneck on network throughput, especially on long-delay, high-bandwidth network links. The Pittsburgh Supercomputing Center's High Performance SSH/SCP page explains this in more detail and also has a patch for OpenSSH to fix this problem.
For More Information
I maintain a TCP Tuning web site with more information, including some specific techniques for very high speed networks (1Gbps and faster). I try to continually update the web site as new versions of operating systems are released. Please email me with additions and corrections.
Brian Tierney is a researcher in the Distributed Systems Department at Lawrence Berkeley National Laboratory (not to be confused with Lawrence Livermore National Laboratory, the weapons lab!).