The other day my friend Bob came to me with a question. He'd written a Java program to copy 100MB data files from his Windows XP computer at his office in Sunnyvale, California, to a Linux server at his company's East Coast office in Reston, Virginia. He knew both offices had 100Mbps Ethernet networks that connected over a 155Mbps Virtual Private Network (VPN). When he measured the speed of the transfers, he found out that his files were transferring at less than 4Mbps, and wondered if I had any idea why.
I wrote this article to explain why this is the case, and what Bob needs to do to achieve the maximum network throughput. This article is aimed mainly at software developers. All too often software developers blame the network for poor performance, when in fact the problem is untuned software. However, there are times when the network is the problem. This article also explains some network troubleshooting tools that can give software developers the evidence needed to make network engineers take them seriously.
The most common network protocol used on the internet is the Transmission Control Protocol, or TCP. TCP uses a "congestion window" to determine how many packets it can send at one time. The larger the congestion window size, the higher the throughput. The TCP "slow start" and "congestion avoidance" algorithms determine the size of the congestion window. The maximum congestion window is related to the amount of buffer space that the kernel allocates for each socket. For each socket there is a default value for the buffer size, which programs can change by using a system library call just before opening the socket. For some operating systems there is also a kernel-enforced maximum buffer size. You can adjust the buffer size for both the sending and receiving ends of the socket.
To achieve maximum throughput, it is critical to use optimal TCP socket buffer sizes for the link you are using. If the buffers are too small, the TCP congestion window will never open up fully, so the sender will be throttled. If the buffers are too large, the sender can overrun the receiver, which will cause the receiver to drop packets and the TCP congestion window to shut down. This is more likely to happen if the sending host is faster than the receiving host. An overly large window on the sending side is not a big problem as long as you have excess memory.
Assuming there is no network congestion or packet loss, network throughput is directly related to TCP buffer size and the network latency. Network latency is the amount of time for a packet to traverse the network. To calculate maximum throughput:
Throughput = buffer size / latency
Typical network latency from Sunnyvale to Reston is about 40ms, and Windows XP has a default TCP buffer size of 17,520 bytes. Therefore, Bob's maximum possible throughput is:
17520 Bytes / .04 seconds = .44 MBytes/sec = 3.5 Mbits/second
The default TCP buffer size for Mac OS X is 64K, so with Mac OS X he would have done a bit better, but still nowhere near the 100Mbps that should be possible.
65536 Bytes / .04 seconds = 1.6 MBytes/sec = 13 Mbits/second
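These back-of-the-envelope figures are easy to check in code. Here is a small sketch (the buffer sizes and the 40ms latency are just the defaults quoted above, not measured values):

```java
// Check the article's arithmetic: throughput = buffer size / latency.
public class ThroughputCheck {
    // bytes/sec -> Mbits/sec
    static double mbps(int bufferBytes, double latencySeconds) {
        return bufferBytes / latencySeconds * 8 / 1e6;
    }

    public static void main(String[] args) {
        // Windows XP default buffer (17,520 bytes) over a 40ms path
        System.out.printf(java.util.Locale.ROOT, "%.1f%n", mbps(17520, 0.04));
        // Mac OS X default buffer (64K) over the same path
        System.out.printf(java.util.Locale.ROOT, "%.1f%n", mbps(65536, 0.04));
    }
}
```

The first line prints 3.5 (Mbps) and the second 13.1 (Mbps), matching the numbers above.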
(Network people always use bits per second, but the rest of the computing world thinks in terms of bytes, not bits. This often leads to confusion.)
Most networking experts agree that the optimal TCP buffer size for a given network link is double the value for delay times bandwidth:
buffer size = 2 * delay * bandwidth
The ping program will give you the round trip time (RTT) for the network link, which is twice the delay, so the formula simplifies to:
buffer size = RTT * bandwidth
For Bob's network, ping returned an RTT of 80ms. This means that his TCP buffer size should be:
.08 seconds * 100 Mbps / 8 = 1 MByte
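In code, the rule of thumb is a one-liner. This sketch just reproduces the arithmetic above (the RTT and bandwidth are Bob's figures, not values the program measures):

```java
public class OptimalBuffer {
    // buffer size (bytes) = RTT (seconds) * bandwidth (bits/sec) / 8
    static long optimalBufferBytes(double rttSeconds, long bandwidthBitsPerSec) {
        return (long) (rttSeconds * bandwidthBitsPerSec) / 8;
    }

    public static void main(String[] args) {
        // 80ms RTT on a 100Mbps path -> 1,000,000 bytes (about 1 MByte)
        System.out.println(optimalBufferBytes(0.08, 100_000_000L));
    }
}
```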
Bob knew the speed of his company's VPN, but often you will not know the capacity of the network path. Determining this can be difficult. These days, most wide area backbone links are at least 1Gbps (in the United States, Europe, and Japan anyway), so the bottleneck links are likely to be the local networks at each endpoint. In my experience, most office computers connect to 100Mbps Ethernet networks, so when in doubt, 100Mbps (12MBps) is a good value to use.
Tuning the buffer size will have no effect on networks that are 10Mbps or less; for example, with the hosts connected to a DSL link, cable modem, ISDN, or T1 line. There is a program called pathrate that does a good job of estimating network bandwidth. However, this program works on Linux only, and requires the ability to log in to both computers to start the program.
There are two TCP settings to consider: the default TCP buffer size and the maximum TCP buffer size. A user-level program can modify the default buffer size, but the maximum buffer size requires administrator privileges. Note that most of today's Unix-based OSes by default have a maximum TCP buffer size of only 256K. Windows does not have a maximum buffer size by default, but the administrator may set one. TCP negotiates the effective window to be the smaller of the send and receive buffers, so changing only the send buffer will have no effect. However, it is not necessary to set both ends to exactly the optimal value. A common technique is to set the buffer in the server quite large (for example, 1,024K) and then let the client determine and set the correct "optimal" value for that network path. To set the TCP buffer, use the setSendBufferSize and setReceiveBufferSize methods in Java, or the setsockopt call in C. Here is an example of how to set TCP buffer sizes within your application using Java:
java.net.Socket skt = new java.net.Socket();
int sndsize = 1024 * 1024;   /* requested buffer size */
int sockbufsize;

/* set send buffer */
skt.setSendBufferSize(sndsize);
/* check to make sure you received what you asked for */
sockbufsize = skt.getSendBufferSize();

/* set receive buffer */
skt.setReceiveBufferSize(sndsize);
/* check to make sure you received what you asked for */
sockbufsize = skt.getReceiveBufferSize();
It is always a good idea to call getSendBufferSize (or getReceiveBufferSize) after setting the buffer size. This will ensure that the OS supports buffers of that size. The setsockopt call will not return an error if you use a value larger than the maximum buffer size, but will just use the maximum size instead of the value you specify. Linux mysteriously doubles whatever value you pass in for the buffer size, so when you do a getReceiveBufferSize you will see double what you asked for. Don't worry, as this is "normal" for Linux.
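Putting these pieces together, the client side of the "set the server large, let the client pick the optimal value" technique might look like the sketch below. The sizes are illustrative (80ms RTT at an assumed 100Mbps), and the socket is left unconnected so the sketch stands alone; a real client would set the buffers and then connect:

```java
import java.net.Socket;

public class TunedSocketSketch {
    public static void main(String[] args) throws Exception {
        // Client side: buffer = RTT * bandwidth (80ms at an assumed 100Mbps)
        int optimal = (int) (0.08 * 100_000_000 / 8); // 1,000,000 bytes

        // Set the buffers before connecting, then verify what the OS granted.
        // (On Linux, expect to see double what you asked for.)
        Socket skt = new Socket();
        skt.setSendBufferSize(optimal);
        skt.setReceiveBufferSize(optimal);
        int granted = skt.getReceiveBufferSize();
        if (granted < optimal) {
            System.out.println("maximum buffer size too small; ask the admin to raise it");
        }
        System.out.println(optimal);
        skt.close();
        // A real client would now call skt.connect(...) and transfer the file.
    }
}
```

On a host with an untuned maximum buffer, the warning branch will fire, which is exactly the signal to have the system administrator raise the limit as described below.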
Here is the same example in C:
int skt;
int err;
int sndsize = 1024 * 1024;   /* requested buffer size */

err = setsockopt(skt, SOL_SOCKET, SO_SNDBUF, (char *)&sndsize, (int)sizeof(sndsize));
err = setsockopt(skt, SOL_SOCKET, SO_RCVBUF, (char *)&sndsize, (int)sizeof(sndsize));
Here is the sample C code for checking the current buffer size:
int sockbufsize = 0;
socklen_t size = sizeof(int);

err = getsockopt(skt, SOL_SOCKET, SO_RCVBUF, (char *)&sockbufsize, &size);
For many links, it is critical to be able to increase the system-defined maximum TCP buffer size to obtain good throughput. For example, assume a 100Mbps link between California and the United Kingdom, which has an RTT of 150ms. The optimal TCP buffer size for this link is 1.9MB, which is 30 times larger than the default buffer, and 7.5 times bigger than the default maximum TCP buffer for Linux.
To change TCP settings in Linux, add the entries below to the file /etc/sysctl.conf, and then run sysctl -p. The system will also set these values at boot time.
# increase TCP maximum buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and maximum number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
Set the maximum buffer sizes to a value large enough to handle the longest, fastest network link you think that host will encounter (16MB in the example above). Windows does not require any modifications, as the default maximum TCP buffer size (GlobalMaxTcpWindowSize) is not defined. My TCP Tuning Guide web site has information on how to set the maximum buffer size for other OSes.
By now you probably have questions such as: How do I actually use this in practice? Do I trust users to set the buffer size? Do I try to compute the optimal buffer size for the user? Or should I just set a big buffer size and not worry about it?
In general, I suggest the following strategy for applications designed to operate over high-bandwidth (> 40Mbps), high-latency (RTT > 10ms) networks. Your client should run ping to determine the RTT, and then just assume the bandwidth is 100Mbps. Because ping traffic is blocked by a number of sites, you can also use the tool synack, which uses TCP instead of ICMP to estimate the RTT. If your users are fairly network savvy, then providing them with a way to set the buffer size is useful. Be sure to also provide a way to tell them when the maximum buffer size is too small, so that they can have the system administrator increase it. Just setting large buffers for all network paths is not a good idea, particularly if the application might be run over slow links such as DSL or modem connections.
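The synack-style idea — estimating RTT with a TCP handshake rather than ICMP — is easy to approximate from inside an application by timing connect(). This is only a sketch (a connect includes setup overhead beyond one RTT, and the demo probes a local listener so it is self-contained; in practice you would probe your own server's port):

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class RttProbe {
    // Rough RTT estimate: time a TCP connect() handshake to the server.
    static double connectMillis(String host, int port) throws Exception {
        long t0 = System.nanoTime();
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 2000);
        }
        return (System.nanoTime() - t0) / 1e6;
    }

    public static void main(String[] args) throws Exception {
        // Demo against a local listener so the sketch runs anywhere.
        try (ServerSocket srv = new ServerSocket(0)) {
            double rttMs = connectMillis("127.0.0.1", srv.getLocalPort());
            // buffer = RTT * assumed 100Mbps
            long buf = (long) (rttMs / 1000.0 * 100_000_000 / 8);
            System.out.println(rttMs >= 0 && buf >= 0 ? "ok" : "error");
        }
    }
}
```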
Starting with version 2.4, Linux added a feature called sender-side TCP buffer autotuning. This means that for the sender, you no longer need to worry about doing the setsockopt() call. However, you still need to do the setsockopt() call at the receiver side, and you still have to adjust the system default maximum autotuning buffer, which by default is only 128K. Starting with version 2.6.7, Linux added receiver-side autotuning as well, so you no longer need to worry about the receiver either. Hooray! Unfortunately, the default maximum TCP buffer size is still way too small--but at least now it is only a system administration problem, and not a programmer problem.
My initial results are quite impressive. After increasing the maximum TCP buffers, on a 1Gbps link across the United States (RTT = 67ms), performance went from 10Mbps using Linux 2.4 to 700Mbps using Linux 2.6.12, for a speedup of 70X. On a link from California to the United Kingdom (RTT = 150 ms), performance went from 4Mbps on Linux 2.4 to 560Mbps, for a speedup of 140X! Think about what this means for network backbone utilization when there are lots of Linux 2.6 hosts connected to fast networks running BitTorrent. To emphasize the point: these performance improvements are obtainable only by increasing the default maximum TCP buffer size.
Linux 2.6 also includes several other TCP enhancements, so this speedup is not completely due to TCP buffer tuning. In particular, Linux 2.6 now uses the BIC congestion control algorithm, which is intended to improve TCP performance over high-bandwidth, high-latency links. Hand tuning of Linux 2.4 over the same links gives throughputs of 300Mbps across the United States and 70Mbps to the United Kingdom. Hopefully these features will appear in future versions of Windows as well.
If you still have trouble getting high throughput, the problem may well be in the network. First, use netstat -s to see if there are a lot of TCP retransmissions. TCP retransmits usually indicate network congestion, but they can also happen with defective network hardware or misconfigured networks. You may also see some TCP retransmissions if the sending host is much faster than the receiving host, but TCP flow control should make the number of retransmits relatively low. Also look at the number of errors reported by netstat, as a large number of errors may also indicate a network problem. A surprisingly common source of LAN trouble with 100BT networks is when the host is set to full duplex but the Ethernet switch is set to half duplex, or vice versa. Newer hardware will autonegotiate this, but with some older hardware, autonegotiation will sometimes fail, with the result being a working but very slow network (typically only 1Mbps to 2Mbps). It's best for both to be in full duplex if possible, but some older 100BT equipment only supports half duplex. See the TCP Tuning Guide for some ways to check your computer's duplex setting.
The Internet2's Network Diagnostic Tool (NDT) is a nice tool to detect both congestion problems and duplex problems. NDT is a Java applet, and you can run it by connecting to one of the NDT servers.
A common tool for copying files across the internet is scp. Unfortunately, TCP tuning does not help with scp throughput, because scp uses OpenSSH, which uses statically defined internal flow-control buffers. These buffers act as a bottleneck on network throughput, especially on long-delay, high-bandwidth network links. The Pittsburgh Supercomputing Center's High Performance SSH/SCP page explains this in more detail and also has a patch for OpenSSH to fix this problem.
I maintain a TCP Tuning web site with more information, including some specific techniques for very high speed networks (1Gbps and faster). I try to continually update the web site as new versions of operating systems are released. Please email me with additions and corrections.
Brian Tierney is a researcher in the Distributed Systems Department at Lawrence Berkeley National Laboratory (not to be confused with Lawrence Livermore National Laboratory, the weapons lab!).
Copyright © 2009 O'Reilly Media, Inc.