Problem Area
============

TCP is widely used for reliable end-to-end network communication, yet users must often perform tedious manual tuning simply to achieve acceptable performance. One important task is determining and setting appropriate socket buffer sizes (on both the send side and the receive side). This task is not only tedious but also difficult: it requires the intervention of experienced administrators or application developers who can balance the need for larger socket buffers (to maximize TCP connection throughput) against the limits of available memory. With the widespread arrival of bandwidth-intensive applications such as bulk-data transfers and multi-media web streaming, the default socket buffer size is often suboptimal, and manual intervention is frequently required to achieve better performance. Therefore, we identify the need for an auto-sizing algorithm in our TCP/IP stack that can automatically adjust the buffer size based on the current state of each connection. The goal is to achieve preferable transfer rates on each connection out of the box, without manual intervention.

Important Terminology in Socket Buffer Auto-sizing
==================================================

- BDP

  To optimize TCP throughput (assuming a reasonably error-free transmission path), the sender should send enough packets to fill the logical pipe between the sender and receiver. The capacity of the logical pipe can be calculated by the following formula:

      Capacity (in bits) = path bandwidth (in bits per second) * RTT (in seconds)

  The capacity is known as the bandwidth-delay product (BDP). The pipe can be fat (high bandwidth) or thin (low bandwidth), short (low RTT) or long (high RTT). Pipes that are both fat and long have the highest BDP.
  Examples of high-BDP transmission paths are those across satellites or enterprise wide area networks (WANs) that include intercontinental optical fiber links. The goal of the socket buffer auto-sizing project is to automatically choose a good socket buffer size, so that the connection's BDP capacity is not limited by either the receive buffer or the send buffer size and can be fully utilized.

- Receive Socket Buffer

  The receiver's socket buffer is used to reassemble the data in sequential order, queuing it for delivery to the application. The amount of available space in the receive buffer determines the receiver's advertised window (receive window), which is an important characteristic of a TCP connection.

  |<------------------ Receive Buffer ------------------>|
  |                                                      |
                     |<---- Current Receive Window ----->|
  +------------------+------------------+----------------+
  | Acknowledged but | Received but not | Not yet        |
  | not retrieved by | acknowledged     | received       |
  | the application  |                  |                |
  +------------------+------------------+----------------+

  The receive window limits the amount of data that can be sent at any one time, which provides receiver-side flow control. This flow control works together with TCP congestion control to control the sender's transmit rate. When the receiver's advertised window is smaller than the congestion window (cwnd) on the sender, the connection is receive-window limited; otherwise it is congestion-window limited. A good auto-sizing implementation allows the receive buffer to be large enough that the connection is not limited by the receive window, but not so large that it wastes the memory required for the connection. As discussed above, in order to obtain full throughput, the advertised window must be at least as large as the BDP of the connection.
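As a quick illustration of the BDP formula above, the following is a sketch (not part of the stack; the link parameters are arbitrary example values):

```python
# Sketch: computing the bandwidth-delay product (BDP) from the formula
# above.  The link parameters below are arbitrary example values.

def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Pipe capacity in bytes: bandwidth (bits/s) * RTT (s), over 8."""
    return int(bandwidth_bps * rtt_seconds / 8)

# A "fat and long" pipe: 100 Mbit/s with an 80 ms intercontinental RTT.
window = bdp_bytes(100e6, 0.080)   # 1,000,000 bytes
```

To keep such a pipe full, the advertised receive window must be at least this large, well beyond the 65535-byte limit of an unscaled TCP window.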
- Send Socket Buffer

  In most operating systems, the sender's socket buffer holds data that the application has passed to TCP until the receiver has acknowledged receipt of that data; that is, it holds both unsent data and unacknowledged data. In Solaris, however, the send buffer holds only unsent data, and the maximum amount of unacknowledged data is limited not by the send buffer size but by the receive window size. Although Solaris interprets the send socket buffer differently from other operating systems, its interpretation has some advantages: it removes yet another variable that may restrict performance, and even with an irresponsible receiver that advertises a large receive window, the amount of unacknowledged data is still bounded by the congestion control enforced on the sender itself. Further experiments also showed that increasing the default send buffer size (49152 bytes) on a Solaris system does not change performance. Therefore, this project will focus on buffer auto-sizing on the receive side.

Comparison of existing TCP receive buffer auto-sizing algorithms
================================================================

- User-level vs. Kernel-level

  User-level and kernel-level auto-sizing refer to whether the buffer size tuning is accomplished as an application-level solution or as a change to the kernel. There are two forms of user-level auto-sizing. The first is optimization for specific types of applications (such as FTP). The second may need an additional daemon to monitor each TCP connection and automatically adjust the buffer size based on the characteristics of each specific connection. The first form is not general enough, while the second is less efficient, since the kernel always has access to more network state and higher-resolution timing information.

- Static vs. Dynamic

  Static and dynamic auto-sizing refer to whether the buffer size is set to a constant at the start of a connection, or can change over the lifetime of the connection. A dynamic solution is clearly preferable: network state changes dynamically, and a constant buffer size is always too large or too small on "live" networks. On the other hand, dynamic buffer changes imply changes in the advertised window size, which could break TCP semantics (data legally sent for a given window may be in flight when the window is reduced, i.e., the receive window shrinks) if the implementation is not done carefully. This will be discussed later.

- In-Band vs. Out-of-Band

  In-band and out-of-band refer to whether the current network state used to determine the buffer size is obtained from the connection itself or gathered separately. Ideally, the connection's in-band data should be used, to assure the correctness of time-dependent and path-dependent information.

  Several kernel, in-band, dynamic buffer auto-sizing techniques were investigated.

  * PSC

    The PSC[1] technique focuses on send-side buffer auto-sizing; receive buffer tuning is only briefly discussed. Specifically, PSC concludes that with over-subscription of receive buffers, receive buffer tuning is completely unnecessary. It asserts that even when a large receive buffer size is advertised, the receive buffer space that can actually be consumed is bounded by the congestion window size of the sender. This assertion is not really true, since receive buffer consumption can become unbounded if the receiving application retrieves data at an extremely slow rate: buffer space is needed to hold data that has been received and acknowledged by the TCP/IP stack but is still waiting for the receiving application.
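The flaw in PSC's assertion can be illustrated with a toy model (a sketch with arbitrary numbers, not a TCP implementation):

```python
# Toy model: per-RTT arrival is bounded by the sender's cwnd, but the
# application only drains `drain` bytes per RTT.  Acknowledged data
# waiting for the application accumulates, so receive buffer occupancy
# is NOT bounded by cwnd when the reader is slow.

def buffered_after(rtts, cwnd, drain):
    buffered = 0
    for _ in range(rtts):
        buffered += cwnd                   # received and ACKed this RTT
        buffered -= min(buffered, drain)   # slow application reads
    return buffered

# cwnd = 64 KB per RTT, application reads only 16 KB per RTT:
# occupancy grows by 48 KB every RTT, far beyond cwnd over time.
```

With these example numbers the occupancy after ten RTTs is already more than seven times the cwnd, which is why receive-side tuning cannot simply be ignored.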
  * DRS

    Dynamic Right-Sizing (DRS[2]) first discusses the incorrect PSC assertion above, then describes a new technique to auto-tune the receive buffer size. With DRS, the receiver measures the number of bytes received during each RTT. Assuming the connection is currently sender-cwnd limited, that measurement can be regarded as the sender's congestion window size and can be used to dynamically adjust the size of the receiver's window advertisement. With this technique, the sender will be congestion-window limited rather than flow-control-window limited. DRS is the algorithm we chose to implement; more details will be discussed later.

  * FreeBSD

    FreeBSD uses the following criteria to increase the receive buffer size:

    1. The number of bytes received during the RTT is greater than 7/8 of the current receive socket buffer size;
    2. The receive buffer size has not hit the maximum size that is allowed.

    Whenever the above criteria are met, the receive buffer size is increased by a fixed (configurable) value. Like DRS, this technique auto-tunes the receive socket buffer size based on an estimate of the sender's cwnd, but the constant (7/8) in the algorithm appears to be either arbitrarily chosen or an empirical value.

  * Linux 2.4.x [3]

    In Linux 2.4.x, the receive buffer size is updated based on the size of the received segment whenever data is received by the host. The algorithm works on the basis that the TCP stack will not increase the advertised window if lots of small segments (i.e., interactive data flow) are received (otherwise, the buffer could be exhausted if a large window of data spread across many small segments were received). Conversely, when the received segment is large (i.e., bulk data flow), the window is increased linearly. There are two different linear incremental rates.
    The difference ensures that the window size grows more gradually for medium-sized segments (between 128 and 647 bytes) than for large segments (larger than 647 bytes).

  * Comparisons

    Paper [4] comprehensively compares several existing popular auto-sizing techniques and gives guidelines on selecting which one to use. Specifically, comparing DRS with Linux 2.4.x auto-sizing: DRS tends to overbook memory, so it may perform worse with large numbers of parallel small connections, whereas Linux 2.4.x auto-sizing is targeted at web servers and does not perform well when a large window size is required, as for FTP or bulk data transfers. No single algorithm is perfect for every network scenario. In experiments with the DRS prototype, we found that it does not regress performance in the case of large numbers of parallel small connections, and it brings significant benefit when the RTT is high (a long pipe); therefore, we chose to deploy the DRS approach for receive-side buffer auto-sizing.

DRS in Solaris
==============

As discussed above, DRS lets the receiver estimate the sender's congestion window size (cwnd) and use that estimate to dynamically change the size of the receiver's window advertisements. Specifically, DRS assumes that the sender is cwnd limited and estimates the sender's cwnd by calculating the amount of data received over one round-trip time (RTT) (the BDP measurement). The result is a lower bound on the sender's cwnd. DRS then advertises a window twice as large as the estimated cwnd. Since the sender's cwnd at most doubles once per RTT, DRS ensures the advertised receive window is larger than the sender's next cwnd. Our experiments showed that advertising a window twice as large as the estimated cwnd (BDP) is not sufficient.
We found the reason: during the time between the receiver sending back an ACK (in which rwnd = 2*BDP) and the sender receiving that ACK, the sender continues to send more data, so when the sender receives that ACK it will not be able to send 2*BDP of data as the DRS algorithm expects (all it can send is 2*BDP minus the unacknowledged data). By laying out the time sequence of events for a fully-network-throttled TCP connection, we found that, in order not to limit the sender by the receive window:

1. The initial rwnd should be set to 1.5 times the initial cwnd;
2. Each RTT, the rwnd needs to be updated to 3.5 times the measured BDP (3.5*BDP).

Unfortunately, there is no way to guess the initial cwnd of the peer. Currently Solaris sets the initial cwnd to 4*mss, but people are considering increasing the initial cwnd to 10*mss. Therefore, the initial rwnd will be set to 15*mss for now (this will be made configurable; see the tcp_recv_autosize_initmss TCP property discussed below).

- High-resolution time

  In today's Solaris, the TCP timestamps option is filled in with a low-resolution time unit (llbolt), which has a precision of 10ms. This results in imprecise RTT measurements. For example, in a typical back-to-back setup the RTT is usually less than 1ms, but with the current low-resolution time a minimum RTT of 10ms will be assumed, resulting in a much larger BDP measurement and thus an unnecessarily large receive buffer size. This not only wastes memory, defeating the purpose of the auto-sizing technique, but also introduces more latency, which hurts performance. Therefore, high-resolution time (i.e., the gethrtime() function) will be used instead to fill in the current timestamps in our TCP/IP stack. Since gethrtime() returns a 64-bit result expressed in nanoseconds, the high-resolution time will be right-shifted 20 bits to fit into the 32-bit timestamps option.
  This assures a time precision of about 1ms (1ms is the finest precision required by PAWS[5]), hence preventing overbooking of the receive buffer.

- Window Scaling

  Receive buffer auto-sizing enables TCP window scaling by default, in order to allow a maximum receive window size of up to tcp_max_buf (an existing TCP property, default 1M). As data flows over the connection, the TCP/IP stack monitors the connection, measures its current state, and adjusts the receive window size to optimize throughput. Since the window scale is negotiated at connection setup time and cannot be changed over the connection's lifetime, the scale must be carefully chosen to balance the maximum size against the granularity of the buffer size. In the future, we will choose the smallest window scale value that can represent the maximum receive buffer size (defined by tcp_max_buf). If the window scaling negotiation fails (i.e., the peer does not support the window scale option), TCP buffer auto-sizing will be disabled for that connection. In that case, the receive buffer size is limited by the maximum receive window allowed, which is 65535 bytes.

- Loopback connections

  For loopback TCP connections, thread scheduling affects performance more than the receive buffer size does. To avoid unnecessary code complexity (especially in the TCP fusion code path), receive buffer auto-sizing is disabled for loopback connections.

- Round-trip time (RTT) estimate

  In order to calculate the amount of data received over an RTT, we obviously first need to know what the RTT is.

  * Timestamps option approach

    In a typical TCP implementation, if the timestamps option is enabled, the RTT can be measured by observing the time difference when a timestamp is reflected back from the peer. This approach works on both the send side and the receive side.
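The 20-bit right shift described in the high-resolution time section above can be checked with a quick sketch (names are ours, not from the stack):

```python
# Sketch (not the Solaris implementation): converting gethrtime()-style
# nanosecond values into the 32-bit TCP timestamp units described above.

TS_SHIFT = 20  # right shift applied to the 64-bit nanosecond clock

def ns_to_ts(ns):
    """Convert nanoseconds to timestamp units, truncated to 32 bits."""
    return (ns >> TS_SHIFT) & 0xFFFFFFFF

# One timestamp unit is 2**20 ns = 1.048576 ms, i.e. roughly the 1 ms
# granularity that PAWS requires at minimum.
UNIT_MS = (1 << TS_SHIFT) / 1e6

one_second = ns_to_ts(1_000_000_000)   # ~953 units per second
```

The choice of 20 bits trades a small rounding error (1.048576 ms rather than exactly 1 ms per unit) for a shift that is cheap to compute in the kernel.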
  * Send-side ACK-based approach

    When the TCP timestamps option is not enabled, during a typical bidirectional TCP connection the RTT can be measured by observing the time between when data is sent and when an acknowledgment is returned.

  * Receive-side sequence-number-based approach

    During a bulk-data transfer, one system may mostly receive packets and may not be sending any data. In that case, it cannot get a good RTT measurement using the ACK-based approach discussed above. Such a system can still measure the RTT by observing the time between the acknowledgment of sequence number S, which announces receive window W, and the receipt of a data segment containing a sequence number of at least S + W + 1. This approach assumes the sender is receive-window limited, and the RTT measurement is a valid estimate only when the sender is throttled by the network. The measured result could be much larger than the actual RTT if the sender did not have any data to send; thus this measurement acts only as an upper bound on the RTT.

  Solaris today already has send-side RTT measurement based on both the timestamps option and the send-side ACK-based approach. Note that because of the choice of the initial value, the send-side averaged RTT (tcp_rtt_sa * 8) only becomes statistically useful when there are enough send-side RTT samples:

  1. When there are more than tcp_xmit_rtt_autosize_start send-side samples (a new TCP property, default 20; see section "Relevant TCP/IP protocol properties"), the send-side averaged RTT will be used as one source of the RTT estimates.

  In addition, we will add receive-side RTT measurement (based on both the timestamps option and the receive-side sequence-number-based approach). The receive-side RTT will be updated whenever:

  2. A timestamp is echoed back from the peer, if the timestamps option is enabled;
  3.
  Data is received that is one window beyond the sequence number that was just acknowledged.

  Since the receive-side RTT measurement acts only as an upper bound on the RTT, each time the receive-side RTT is updated in cases 2 and 3, it is set to the minimum of the old receive-side RTT and the current measurement value. Both the receive-side RTT and the send-side RTT will be considered as sources of the RTT estimate when they are available. The final RTT will be set to the average of the two values.

- Receive buffer auto-sizing

  At the start of a connection, if receive buffer auto-sizing is enabled and the connection is not a loopback connection, the receive buffer is initialized to the initial receive buffer size, which is tcp_recv_autosize_initmss * MSS (tcp_recv_autosize_initmss is another new TCP property, default 15; see section "Relevant TCP/IP protocol properties"). This attempts to ensure that the sender is not receive-window limited during the first RTT. Then, each time a packet is received, the current time is compared to the last measurement time for that connection. If more than the current estimated RTT has passed, the highest sequence number seen is compared to the next sequence number expected at the beginning of the measurement. Assuming that all packets are received in order, the result is the number of bytes received during the RTT period (the BDP). The receive buffer space for the connection is then updated to advertise a receive window (rwnd) that is 3.5 times the amount of data received during this measurement period. We need to be careful in the case that the calculated receive buffer size is smaller than the previous value:

  1. To avoid fluctuation, the receive buffer size is only decreased when the decrease is at least one fourth of the previous value;
  2. The receive buffer is set to no less than half of its previous value, to decrease the penalty for brief stalls in transmission;
  3.
  The receive window size will not be updated immediately based on the decreased receive buffer size, to prevent the right edge of the advertised receive window from moving backwards. Instead, the connection will be marked; the mark is only cleared when the receive window is reopened and the last receive window advertised is smaller than the new receive buffer size.

- Relevant TCP/IP protocol properties

  1. tcp_recvbuf_autosize (new)

     A new tcp_recvbuf_autosize property will be introduced. The possible values are currently "off" and "drs"; in the future, other values may be introduced as more auto-sizing algorithms are implemented. If tcp_recvbuf_autosize is set to "drs", DRS receive buffer auto-sizing will be attempted on all non-loopback connections; if it is set to "off", auto-sizing is disabled. By default, this property is set to "drs". Like other ipadm properties, configuring it requires the 'sys_ip_config' privilege.

  2. tcp_recv_autosize_initmss (new)

     If receive buffer auto-sizing is enabled, a TCP connection's initial receive buffer size will be set to tcp_recv_autosize_initmss * MSS. The default value of tcp_recv_autosize_initmss is 15.

  3. tcp_xmit_rtt_autosize_start (new)

     As discussed, the RTT estimate is essential to the receive buffer auto-sizing technique. Further, the send-side RTT estimate only becomes statistically useful when there are enough samples: once there are more than tcp_xmit_rtt_autosize_start send-side RTT measurement samples, we start to use the averaged send-side RTT as one source of the RTT estimates of the connection. The default value of tcp_xmit_rtt_autosize_start is 20.

  All three new properties will be Consolidation Private for now and are expected to be used only for diagnostic purposes. They will be made public only if they prove useful to customers.

  4.
  tcp_max_buf

     The existing tcp_max_buf property will be used as the maximum receive buffer size that can be set (or auto-sized to). The same value will be used to determine the connection's window scale value if auto-sizing is enabled.

  5. recv_maxbuf

     This is an existing property that is general to transport protocols. For TCP, it currently defines the default receive buffer size for a connection. After this project is integrated, if auto-sizing is enabled for a specific TCP connection, the recv_maxbuf property becomes irrelevant, since the receive buffer size will be automatically adjusted based on network conditions.

- Socket options

  1. SO_RCVBUF

     In today's Solaris, this socket option is used to set/get the current receive buffer size of a specific TCP connection. In the future, when auto-sizing is enabled, the receive buffer size will change dynamically during the lifetime of the connection. We could choose to return the current receive buffer size when SO_RCVBUF is used to query it, but this may confuse some applications, since the return value would keep changing while auto-sizing is enabled. Further, our experiments show this may cause performance issues for some applications, because they use SO_RCVBUF at startup to query the expected receive buffer size and then use the returned value as the length of the receive buffer for subsequent recv()/recvfrom() system calls. Since the initial receive buffer starts from a relatively small value with auto-sizing, this significantly regresses the performance of such receivers. As a result, the semantics of SO_RCVBUF will change: if auto-sizing is disabled, SO_RCVBUF is still used to set/get the current receive buffer size; if auto-sizing is enabled, SO_RCVBUF is used to set the maximum receive buffer size that a connection can be auto-sized up to.

  2.
  TCP_RCVBUF_AUTOSIZE (new)

     A new IPPROTO_TCP level TCP_RCVBUF_AUTOSIZE socket option will be introduced to enable or disable receive buffer auto-sizing for a specific TCP connection. For now, it can be set to either TCP_AUTOSIZE_OFF (disable) or TCP_AUTOSIZE_DRS (enable the DRS algorithm). When TCP_RCVBUF_AUTOSIZE is used to disable receive buffer auto-sizing, the current receive buffer size is set to the value specified by the recv_maxbuf property, matching current Solaris behavior. When TCP_RCVBUF_AUTOSIZE is used to enable receive buffer auto-sizing, the maximum receive buffer size is set to the value specified by the tcp_max_buf property, while the current receive buffer size stays the same and is then automatically adjusted based on current network conditions.

  3. TCP_CUR_RCVBUF (new)

     A new IPPROTO_TCP level TCP_CUR_RCVBUF socket option will be introduced to get the current receive buffer size of a specific TCP connection. This option is read-only and its return value may change during the lifetime of a TCP connection.

  Both the TCP_RCVBUF_AUTOSIZE and TCP_CUR_RCVBUF socket options will be Consolidation Private interfaces for now and will be made public only if they prove useful to customers.

- Observability

  1. pfiles(1)

     pfiles(1) will be extended to show whether receive buffer auto-sizing is enabled, and the current receive buffer size, for a TCP connection. We will also make several other minor changes to the output format so that it reads more clearly:

       9: S_IFSOCK mode:0666 dev:337,0 ino:10285 uid:0 gid:0 size:0
          O_RDWR|O_NONBLOCK
            SOCK_STREAM
            send buffer: 49152 bytes
            receive buffer: 21720 bytes (maximum: 1048576 bytes)
            auto-size: drs
            sockname: AF_INET 129.146.104.83  port: 65506
            peername: AF_INET 192.18.34.10  port: 5001

  2. DTrace probes

     sdt DTrace probes will be provided to observe the RTT measurement process, the BDP calculation process, and the current receive buffer size.
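The measurement-interval update and the decrease rules described in the "Receive buffer auto-sizing" section can be summarized in a simplified model (a sketch, not the kernel code; the function and parameter names are ours):

```python
# Simplified model of the per-RTT receive buffer update described above:
# advertise 3.5x the bytes received in the measurement interval, with
# damping rules applied when the new estimate is smaller than before.

RWND_FACTOR = 3.5  # rwnd = 3.5 * measured BDP, per the analysis above

def next_rcvbuf(prev, bytes_per_rtt, max_buf):
    """Return the new receive buffer size after one measurement interval."""
    target = int(RWND_FACTOR * bytes_per_rtt)
    if target >= prev:
        # Growth is capped by the tcp_max_buf-style maximum.
        return min(target, max_buf)
    # Rule 1: ignore decreases smaller than one fourth of the old value.
    if prev - target < prev // 4:
        return prev
    # Rule 2: never drop below half of the previous value.
    return max(target, prev // 2)
```

A worked case: with a previous size of 400000 bytes and only 10000 bytes received in the interval, the raw target would be 35000 bytes, but rule 2 floors the result at 200000 bytes, softening the penalty for a brief stall.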
Interface Table
===============

+-----------------------------+-----------------------+----------------+
| Interface                   | Stability             | Description    |
+-----------------------------+-----------------------+----------------+
| TCP_RCVBUF_AUTOSIZE         | Consolidation Private | socket option  |
| TCP_CUR_RCVBUF              | Consolidation Private | socket option  |
+-----------------------------+-----------------------+----------------+
| tcp_recvbuf_autosize        | Consolidation Private | ipadm property |
| tcp_recv_autosize_initmss   | Consolidation Private | ipadm property |
| tcp_xmit_rtt_autosize_start | Consolidation Private | ipadm property |
+-----------------------------+-----------------------+----------------+
| sdt probes for auto-sizing  | Project Private       | DTrace probes  |
+-----------------------------+-----------------------+----------------+

References
==========

[1] J. Semke, J. Mahdavi, and M. Mathis, "Automatic TCP Buffer Tuning," ACM SIGCOMM 1998, 28(4), Oct. 1998.
[2] M. Fisk and W. Feng, "Dynamic Right-Sizing in TCP," Oct. 2001.
[3] M. Smith and S. Bishop, "Flow Control in the Linux Network Stack," Feb. 2005.
[4] E. Weigle and W. Feng, "A Comparison of TCP Automatic Tuning Techniques for Distributed Computing."
[5] "TCP Extensions for High Performance," RFC 1323.