Problem Area
============

TCP is widely used for reliable end-to-end network communication, yet users must often perform tedious manual tuning simply to achieve acceptable performance. One important task is determining and setting appropriate socket buffer sizes (on both the send side and the receive side). This task is not only tedious but also difficult: it requires the intervention of experienced administrators or application developers who can balance the need for larger socket buffers (to maximize TCP connection throughput) against the limits of available memory. With the widespread arrival of bandwidth-intensive applications such as bulk-data transfers and multi-media web streaming, the default socket buffer size is often suboptimal, and manual intervention is frequently required to achieve better performance. Therefore, we identify the need for an auto-sizing algorithm in our TCP/IP stack that can automatically adjust the buffer size based on the current state of each connection. The goal is to achieve preferable transfer rates on each connection out of the box, without manual intervention.

Important Terminology in Socket Buffer Auto-sizing
==================================================

- BDP

  To optimize TCP throughput (assuming a reasonably error-free transmission path), the sender should send enough packets to fill the logical pipe between the sender and receiver. The capacity of the logical pipe can be calculated by the following formula:

      Capacity (in bits) = path bandwidth (in bits per second) * RTT (in seconds)

  The capacity is known as the bandwidth-delay product (BDP). The pipe can be fat (high bandwidth) or thin (low bandwidth), short (low RTT) or long (high RTT). Pipes that are both fat and long have the highest BDP.
  Examples of high-BDP transmission paths are those across satellites or enterprise wide area networks (WANs) that include intercontinental optical fiber links. The goal of the socket buffer auto-sizing project is to automatically choose a good socket buffer size, so that the connection's BDP capacity is not limited by either the receive buffer or the send buffer size and can be fully utilized.

- Receive Socket Buffer

  The receiver's socket buffer is used to reassemble the data in sequential order, queuing it for delivery to the application. The amount of available space in the receive buffer determines the receiver's advertised window (receive window), which is an important characteristic of a TCP connection.

  |<------------------ Receive Buffer ------------------>|
  |                                                      |
                     |<---- Current Receive Window ----->|
  +------------------+------------------+----------------+
  | Acknowledged but | Received but not | Not yet        |
  | not retrieved by | acknowledged     | received       |
  | the application  |                  |                |
  +------------------+------------------+----------------+

  The receive window limits the amount of data that can be sent at any one time, which provides receiver-side flow control. This flow control works together with TCP congestion control to control the sender's transmit rate. When the receiver's advertised window is smaller than the congestion window (cwnd) on the sender, the connection is receive-window limited; otherwise it is congestion-window limited. A good auto-sizing implementation allows the receive buffer to be large enough that the connection is not limited by the receive window, but not so large that it wastes the memory required for the connection. As discussed above, in order to obtain full throughput, the advertised window must be at least as large as the BDP of the connection.
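As a quick illustration of the BDP formula above, the following is a sketch (not part of the stack; the link parameters are arbitrary example values):

```python
# Sketch: computing the bandwidth-delay product (BDP) from the formula
# above.  The link parameters below are arbitrary example values.

def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Pipe capacity in bytes: bandwidth (bits/s) * RTT (s), over 8."""
    return int(bandwidth_bps * rtt_seconds / 8)

# A "fat and long" pipe: 100 Mbit/s with an 80 ms intercontinental RTT.
window = bdp_bytes(100e6, 0.080)   # 1,000,000 bytes
```

To keep such a pipe full, the advertised receive window must be at least this large, well beyond the 65535-byte limit of an unscaled TCP window.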
- Send Socket Buffer

  In most operating systems, the sender's socket buffer holds data that the application has passed to TCP until the receiver has acknowledged receipt of that data; that is, it holds both unsent data and unacknowledged data. In Solaris, however, the send buffer holds only unsent data, and the maximum amount of unacknowledged data is limited not by the send buffer size but by the receive window size. Although Solaris interprets the send socket buffer differently from other operating systems, its interpretation has some advantages: it removes yet another variable that may restrict performance, and even with an irresponsible receiver that advertises a large receive window, the amount of unacknowledged data is still bounded by the congestion control enforced on the sender itself. Further experiments also showed that increasing the default send buffer size (49152 bytes) on a Solaris system does not change performance. Therefore, this project will focus on buffer auto-sizing on the receive side.

Comparison of existing TCP receive buffer auto-sizing algorithms
================================================================

- User-level vs. Kernel-level

  User-level and kernel-level auto-sizing refer to whether the buffer size tuning is accomplished as an application-level solution or as a change to the kernel. There are two forms of user-level auto-sizing. The first is optimization for specific types of applications (such as FTP). The second may need an additional daemon to monitor each TCP connection and automatically adjust the buffer size based on the characteristics of each specific connection. The first form is not general enough, while the second is less efficient, since the kernel always has access to more network state and higher-resolution timing information.

- Static vs. Dynamic

  Static and dynamic auto-sizing refer to whether the buffer size is set to a constant at the start of a connection, or can change over the lifetime of the connection. A dynamic solution is clearly preferable: network state changes dynamically, and a constant buffer size is always too large or too small on "live" networks. On the other hand, dynamic buffer changes imply changes in the advertised window size, which could break TCP semantics (data legally sent for a given window may be in flight when the window is reduced, i.e., the receive window shrinks) if the implementation is not done carefully. This will be discussed later.

- In-Band vs. Out-of-Band

  In-band and out-of-band refer to whether the current network state used to determine the buffer size is obtained from the connection itself or gathered separately. Ideally, the connection's in-band data should be used, to assure the correctness of time-dependent and path-dependent information.

  Several kernel, in-band, dynamic buffer auto-sizing techniques were investigated.

  * PSC

    The PSC[1] technique focuses on send-side buffer auto-sizing; receive buffer tuning is only briefly discussed. Specifically, PSC concludes that with over-subscription of receive buffers, receive buffer tuning is completely unnecessary. It asserts that even when a large receive buffer size is advertised, the receive buffer space that can actually be consumed is bounded by the congestion window size of the sender. This assertion is not really true, since receive buffer consumption can become unbounded if the receiving application retrieves data at an extremely slow rate: buffer space is needed to hold data that has been received and acknowledged by the TCP/IP stack but is still waiting for the receiving application.
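The flaw in PSC's assertion can be illustrated with a toy model (a sketch with arbitrary numbers, not a TCP implementation):

```python
# Toy model: per-RTT arrival is bounded by the sender's cwnd, but the
# application only drains `drain` bytes per RTT.  Acknowledged data
# waiting for the application accumulates, so receive buffer occupancy
# is NOT bounded by cwnd when the reader is slow.

def buffered_after(rtts, cwnd, drain):
    buffered = 0
    for _ in range(rtts):
        buffered += cwnd                   # received and ACKed this RTT
        buffered -= min(buffered, drain)   # slow application reads
    return buffered

# cwnd = 64 KB per RTT, application reads only 16 KB per RTT:
# occupancy grows by 48 KB every RTT, far beyond cwnd over time.
```

With these example numbers the occupancy after ten RTTs is already more than seven times the cwnd, which is why receive-side tuning cannot simply be ignored.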
  * DRS

    Dynamic Right-Sizing (DRS[2]) first discusses the incorrect PSC assertion above, then describes a new technique to auto-tune the receive buffer size. With DRS, the receiver measures the number of bytes received during each RTT. Assuming the connection is currently sender-cwnd limited, that measurement can be regarded as the sender's congestion window size and can be used to dynamically adjust the size of the receiver's window advertisement. With this technique, the sender will be congestion-window limited rather than flow-control-window limited. DRS is the algorithm we chose to implement; more details will be discussed later.

  * FreeBSD

    FreeBSD uses the following criteria to increase the receive buffer size:

    1. The number of bytes received during the RTT is greater than 7/8 of the current receive socket buffer size;
    2. The receive buffer size has not hit the maximum size that is allowed.

    Whenever the above criteria are met, the receive buffer size is increased by a fixed (configurable) value. Like DRS, this technique auto-tunes the receive socket buffer size based on an estimate of the sender's cwnd, but the constant (7/8) in the algorithm appears to be either arbitrarily chosen or an empirical value.

  * Linux 2.4.x [3]

    In Linux 2.4.x, the receive buffer size is updated based on the size of the received segment whenever data is received by the host. The algorithm works on the basis that the TCP stack will not increase the advertised window if lots of small segments (i.e., interactive data flow) are received (otherwise, the buffer could be exhausted if a large window of data spread across many small segments were received). Conversely, when the received segment is large (i.e., bulk data flow), the window is increased linearly. There are two different linear incremental rates.
    The difference ensures that the window size grows more gradually for medium-sized segments (between 128 and 647 bytes) than for large segments (larger than 647 bytes).

  * Comparisons

    Paper [4] comprehensively compares several existing popular auto-sizing techniques and gives guidelines on selecting which one to use. Specifically, comparing DRS with Linux 2.4.x auto-sizing: DRS tends to overbook memory, so it may perform worse with large numbers of parallel small connections, whereas Linux 2.4.x auto-sizing is targeted at web servers and does not perform well when a large window size is required, as for FTP or bulk data transfers. No single algorithm is perfect for every network scenario. In experiments with the DRS prototype, we found that it does not regress performance in the case of large numbers of parallel small connections, and it brings significant benefit when the RTT is high (a long pipe); therefore, we chose to deploy the DRS approach for receive-side buffer auto-sizing.

DRS in Solaris
==============

As discussed above, DRS lets the receiver estimate the sender's congestion window size (cwnd) and use that estimate to dynamically change the size of the receiver's window advertisements. Specifically, DRS assumes that the sender is cwnd limited and estimates the sender's cwnd by calculating the amount of data received over one round-trip time (RTT) (the BDP measurement). The result is a lower bound on the sender's cwnd. DRS then advertises a window twice as large as the estimated cwnd. Since the sender's cwnd at most doubles once per RTT, DRS ensures the advertised receive window is larger than the sender's next cwnd. Our experiments showed that advertising a window twice as large as the estimated cwnd (BDP) is not sufficient.
We found the reason: during the time between the receiver sending back an ACK (in which rwnd = 2*BDP) and the sender receiving that ACK, the sender continues to send more data, so when the sender receives that ACK it will not be able to send 2*BDP of data as the DRS algorithm expects (all it can send is 2*BDP minus the unacknowledged data). By laying out the time sequence of events for a fully-network-throttled TCP connection, we found that, in order not to limit the sender by the receive window:

1. The initial rwnd should be set to 1.5 times the initial cwnd;
2. Each RTT, the rwnd needs to be updated to 3.5 times the measured BDP (3.5*BDP).

Unfortunately, there is no way to guess the initial cwnd of the peer. Currently Solaris sets the initial cwnd to 4*mss, but people are considering increasing the initial cwnd to 10*mss. Therefore, the initial rwnd will be set to 15*mss for now (this will be made configurable; see the tcp_recv_autosize_initmss TCP property discussed below).

- High-resolution time

  In today's Solaris, the TCP timestamps option is filled in with a low-resolution time unit (llbolt), which has a precision of 10ms. This results in imprecise RTT measurements. For example, in a typical back-to-back setup the RTT is usually less than 1ms, but with the current low-resolution time a minimum RTT of 10ms will be assumed, resulting in a much larger BDP measurement and thus an unnecessarily large receive buffer size. This not only wastes memory, defeating the purpose of the auto-sizing technique, but also introduces more latency, which hurts performance. Therefore, high-resolution time (i.e., the gethrtime() function) will be used instead to fill in the current timestamps in our TCP/IP stack. Since gethrtime() returns a 64-bit result expressed in nanoseconds, the high-resolution time will be right-shifted 20 bits to fit into the 32-bit timestamps option.
  This assures a time precision of about 1ms (1ms is the finest precision required by PAWS[5]), hence preventing overbooking of the receive buffer.

- Window Scaling

  Receive buffer auto-sizing enables TCP window scaling by default, in order to allow a maximum receive window size of up to tcp_max_buf (an existing TCP property, default 1M). As data flows over the connection, the TCP/IP stack monitors the connection, measures its current state, and adjusts the receive window size to optimize throughput. Since the window scale is negotiated at connection setup time and cannot be changed over the connection's lifetime, the scale must be carefully chosen to balance the maximum size against the granularity of the buffer size. In the future, we will choose the smallest window scale value that can represent the maximum receive buffer size (defined by tcp_max_buf). If the window scaling negotiation fails (i.e., the peer does not support the window scale option), TCP buffer auto-sizing will be disabled for that connection. In that case, the receive buffer size is limited by the maximum receive window allowed, which is 65535 bytes.

- Loopback connections

  For loopback TCP connections, thread scheduling affects performance more than the receive buffer size does. To avoid unnecessary code complexity (especially in the TCP fusion code path), receive buffer auto-sizing is disabled for loopback connections.

- Round-trip time (RTT) estimate

  In order to calculate the amount of data received over an RTT, we obviously first need to know what the RTT is.

  * Timestamps option approach

    In a typical TCP implementation, if the timestamps option is enabled, the RTT can be measured by observing the time difference when a timestamp is reflected back from the peer. This approach works on both the send side and the receive side.
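The 20-bit right shift described in the high-resolution time section above can be checked with a quick sketch (names are ours, not from the stack):

```python
# Sketch (not the Solaris implementation): converting gethrtime()-style
# nanosecond values into the 32-bit TCP timestamp units described above.

TS_SHIFT = 20  # right shift applied to the 64-bit nanosecond clock

def ns_to_ts(ns):
    """Convert nanoseconds to timestamp units, truncated to 32 bits."""
    return (ns >> TS_SHIFT) & 0xFFFFFFFF

# One timestamp unit is 2**20 ns = 1.048576 ms, i.e. roughly the 1 ms
# granularity that PAWS requires at minimum.
UNIT_MS = (1 << TS_SHIFT) / 1e6

one_second = ns_to_ts(1_000_000_000)   # ~953 units per second
```

The choice of 20 bits trades a small rounding error (1.048576 ms rather than exactly 1 ms per unit) for a shift that is cheap to compute in the kernel.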
  * Send-side ACK-based approach

    When the TCP timestamps option is not enabled, during a typical bidirectional TCP connection the RTT can be measured by observing the time between when data is sent and when an acknowledgment is returned.

  * Receive-side sequence-number-based approach

    During a bulk-data transfer, one system may mostly receive packets and may not be sending any data. In that case, it cannot get a good RTT measurement using the ACK-based approach discussed above. Such a system can still measure the RTT by observing the time between the acknowledgment of sequence number S, which announces receive window W, and the receipt of a data segment containing a sequence number of at least S + W + 1. This approach assumes the sender is receive-window limited, and the RTT measurement is a valid estimate only when the sender is throttled by the network. The measured result could be much larger than the actual RTT if the sender did not have any data to send; thus this measurement acts only as an upper bound on the RTT.

  Solaris today already has send-side RTT measurement based on both the timestamps option and the send-side ACK-based approach. Note that because of the choice of the initial value, the send-side averaged RTT (tcp_rtt_sa * 8) only becomes statistically useful when there are enough send-side RTT samples:

  1. When there are more than tcp_xmit_rtt_autosize_start send-side samples (a new TCP property, default 20; see section "Relevant TCP/IP protocol properties"), the send-side averaged RTT will be used as one source of the RTT estimates.

  In addition, we will add receive-side RTT measurement (based on both the timestamps option and the receive-side sequence-number-based approach). The receive-side RTT will be updated whenever:

  2. A timestamp is echoed back from the peer, if the timestamps option is enabled;
  3.
  Data is received that is one window beyond the sequence number that was just acknowledged.

  Since the receive-side RTT measurement acts only as an upper bound on the RTT, each time the receive-side RTT is updated in cases 2 and 3, it is set to the minimum of the old receive-side RTT and the current measurement value. Both the receive-side RTT and the send-side RTT will be considered as sources of the RTT estimate when they are available. The final RTT will be set to the average of the two values.

- Receive buffer auto-sizing

  At the start of a connection, if receive buffer auto-sizing is enabled and the connection is not a loopback connection, the receive buffer is initialized to the initial receive buffer size, which is tcp_recv_autosize_initmss * MSS (tcp_recv_autosize_initmss is another new TCP property, default 15; see section "Relevant TCP/IP protocol properties"). This attempts to ensure that the sender is not receive-window limited during the first RTT. Then, each time a packet is received, the current time is compared to the last measurement time for that connection. If more than the current estimated RTT has passed, the highest sequence number seen is compared to the next sequence number expected at the beginning of the measurement. Assuming that all packets are received in order, the result is the number of bytes received during the RTT period (the BDP). The receive buffer space for the connection is then updated to advertise a receive window (rwnd) that is 3.5 times the amount of data received during this measurement period. We need to be careful in the case that the calculated receive buffer size is smaller than the previous value:

  1. To avoid fluctuation, the receive buffer size is only decreased when the decrease is at least one fourth of the previous value;
  2. The receive buffer is set to no less than half of its previous value, to decrease the penalty for brief stalls in transmission;
  3.
  The receive window size will not be updated immediately based on the decreased receive buffer size, to prevent the right edge of the advertised receive window from moving backwards. Instead, the connection will be marked; the mark is only cleared when the receive window is reopened and the last receive window advertised is smaller than the new receive buffer size.

- Relevant TCP/IP protocol properties

  1. tcp_recvbuf_autosize (new)

     A new tcp_recvbuf_autosize property will be introduced. The possible values are currently "off" and "drs"; in the future, other values may be introduced as more auto-sizing algorithms are implemented. If tcp_recvbuf_autosize is set to "drs", DRS receive buffer auto-sizing will be attempted on all non-loopback connections; if it is set to "off", auto-sizing is disabled. By default, this property is set to "drs". Like other ipadm properties, configuring it requires the 'sys_ip_config' privilege.

  2. tcp_recv_autosize_initmss (new)

     If receive buffer auto-sizing is enabled, a TCP connection's initial receive buffer size will be set to tcp_recv_autosize_initmss * MSS. The default value of tcp_recv_autosize_initmss is 15.

  3. tcp_xmit_rtt_autosize_start (new)

     As discussed, the RTT estimate is essential to the receive buffer auto-sizing technique. Further, the send-side RTT estimate only becomes statistically useful when there are enough samples: once there are more than tcp_xmit_rtt_autosize_start send-side RTT measurement samples, we start to use the averaged send-side RTT as one source of the RTT estimates of the connection. The default value of tcp_xmit_rtt_autosize_start is 20.

  All three new properties will be Consolidation Private for now and are expected to be used only for diagnostic purposes. They will be made public only if they prove useful to customers.

  4.
  tcp_max_buf

     The existing tcp_max_buf property will be used as the maximum receive buffer size that can be set (or auto-sized to). The same value will be used to determine the connection's window scale value if auto-sizing is enabled.

  5. recv_maxbuf

     This is an existing property that is general to transport protocols. For TCP, it currently defines the default receive buffer size for a connection. After this project is integrated, if auto-sizing is enabled for a specific TCP connection, the recv_maxbuf property becomes irrelevant, since the receive buffer size will be automatically adjusted based on network conditions.

- Socket options

  1. SO_RCVBUF

     In today's Solaris, this socket option is used to set/get the current receive buffer size of a specific TCP connection. In the future, when auto-sizing is enabled, the receive buffer size will change dynamically during the lifetime of the connection. We could choose to return the current receive buffer size when SO_RCVBUF is used to query it, but this may confuse some applications, since the return value would keep changing while auto-sizing is enabled. Further, our experiments show this may cause performance issues for some applications, because they use SO_RCVBUF at startup to query the expected receive buffer size and then use the returned value as the length of the receive buffer for subsequent recv()/recvfrom() system calls. Since the initial receive buffer starts from a relatively small value with auto-sizing, this significantly regresses the performance of such receivers. As a result, the semantics of SO_RCVBUF will change: if auto-sizing is disabled, SO_RCVBUF is still used to set/get the current receive buffer size; if auto-sizing is enabled, SO_RCVBUF is used to set the maximum receive buffer size that a connection can be auto-sized up to.

  2.
  TCP_RCVBUF_AUTOSIZE (new)

     A new IPPROTO_TCP level TCP_RCVBUF_AUTOSIZE socket option will be introduced to enable or disable receive buffer auto-sizing for a specific TCP connection. For now, it can be set to either TCP_AUTOSIZE_OFF (disable) or TCP_AUTOSIZE_DRS (enable the DRS algorithm). When TCP_RCVBUF_AUTOSIZE is used to disable receive buffer auto-sizing, the current receive buffer size is set to the value specified by the recv_maxbuf property, matching current Solaris behavior. When TCP_RCVBUF_AUTOSIZE is used to enable receive buffer auto-sizing, the maximum receive buffer size is set to the value specified by the tcp_max_buf property, while the current receive buffer size stays the same and is then automatically adjusted based on current network conditions.

  3. TCP_CUR_RCVBUF (new)

     A new IPPROTO_TCP level TCP_CUR_RCVBUF socket option will be introduced to get the current receive buffer size of a specific TCP connection. This option is read-only and its return value may change during the lifetime of a TCP connection.

  Both the TCP_RCVBUF_AUTOSIZE and TCP_CUR_RCVBUF socket options will be Consolidation Private interfaces for now and will be made public only if they prove useful to customers.

- Observability

  1. pfiles(1)

     pfiles(1) will be extended to show whether receive buffer auto-sizing is enabled, and the current receive buffer size, for a TCP connection. We will also make several other minor changes to the output format so that it reads more clearly:

       9: S_IFSOCK mode:0666 dev:337,0 ino:10285 uid:0 gid:0 size:0
          O_RDWR|O_NONBLOCK
            SOCK_STREAM
            send buffer: 49152 bytes
            receive buffer: 21720 bytes (maximum: 1048576 bytes)
            auto-size: drs
            sockname: AF_INET 129.146.104.83  port: 65506
            peername: AF_INET 192.18.34.10  port: 5001

  2. DTrace probes

     sdt DTrace probes will be provided to observe the RTT measurement process, the BDP calculation process, and the current receive buffer size.
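The measurement-interval update and the decrease rules described in the "Receive buffer auto-sizing" section can be summarized in a simplified model (a sketch, not the kernel code; the function and parameter names are ours):

```python
# Simplified model of the per-RTT receive buffer update described above:
# advertise 3.5x the bytes received in the measurement interval, with
# damping rules applied when the new estimate is smaller than before.

RWND_FACTOR = 3.5  # rwnd = 3.5 * measured BDP, per the analysis above

def next_rcvbuf(prev, bytes_per_rtt, max_buf):
    """Return the new receive buffer size after one measurement interval."""
    target = int(RWND_FACTOR * bytes_per_rtt)
    if target >= prev:
        # Growth is capped by the tcp_max_buf-style maximum.
        return min(target, max_buf)
    # Rule 1: ignore decreases smaller than one fourth of the old value.
    if prev - target < prev // 4:
        return prev
    # Rule 2: never drop below half of the previous value.
    return max(target, prev // 2)
```

A worked case: with a previous size of 400000 bytes and only 10000 bytes received in the interval, the raw target would be 35000 bytes, but rule 2 floors the result at 200000 bytes, softening the penalty for a brief stall.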
Interface Table
===============

+-----------------------------+-----------------------+----------------+
| Interface                   | Stability             | Description    |
+-----------------------------+-----------------------+----------------+
| TCP_RCVBUF_AUTOSIZE         | Consolidation Private | socket option  |
| TCP_CUR_RCVBUF              | Consolidation Private | socket option  |
+-----------------------------+-----------------------+----------------+
| tcp_recvbuf_autosize        | Consolidation Private | ipadm property |
| tcp_recv_autosize_initmss   | Consolidation Private | ipadm property |
| tcp_xmit_rtt_autosize_start | Consolidation Private | ipadm property |
+-----------------------------+-----------------------+----------------+
| sdt probes for auto-sizing  | Project Private       | DTrace probes  |
+-----------------------------+-----------------------+----------------+

References
==========

[1] J. Semke, J. Mahdavi, and M. Mathis, "Automatic TCP Buffer Tuning," ACM SIGCOMM 1998, 28(4), Oct. 1998.
[2] M. Fisk and W. Feng, "Dynamic Right-Sizing in TCP," Oct. 2001.
[3] M. Smith and S. Bishop, "Flow Control in the Linux Network Stack," Feb. 2005.
[4] E. Weigle and W. Feng, "A Comparison of TCP Automatic Tuning Techniques for Distributed Computing."
[5] "TCP Extensions for High Performance," RFC 1323.