Reliable Datagram Sockets                                rds(3SOCKET)

NAME
rds - Reliable Datagram Sockets

SYNOPSIS
cc [ flag ... ] file ... -lsocket -lnsl [ library ... ]
#include
#include
#include

rds_socket = socket(AF_INET_OFFLOAD, SOCK_SEQPACKET, 0);

DESCRIPTION
This is an implementation of the Reliable Datagram Sockets API on
InfiniBand. It provides reliable, in-order datagram delivery between
sockets.

A new RDS socket has no local address when it is first returned from
socket(3SOCKET). It must be bound to a local address by calling
bind(3SOCKET) before any messages can be sent or received. An RDS
socket can be bound to only one address, and only one socket can be
bound to a given address. If no port is specified in the binding
address, an unbound port is selected at random.

RDS supports unicast and loopback communication using IPv4-style
addresses. Broadcast, INADDR_ANY, and multicast addresses are not
supported. RDS uses sockaddr_in to describe addresses; sin_family
should be set to AF_INET_OFFLOAD. The address supplied to
bind(3SOCKET) must be a loopback address or an address hosted on an
RDS-capable interface.

Messages can be sent using sendmsg(3SOCKET) once the RDS socket is
bound. Message length cannot exceed 4 gigabytes. RDS does not support
out-of-band data of any kind.

A successful sendmsg(3SOCKET) call puts the message in the socket's
transmit queue, where it remains until the destination acknowledges
receipt of the message. If an attempt is made to send a message to a
destination whose buffer does not have room for the new message,
EAGAIN is returned.

If sendmsg(3SOCKET) succeeds, RDS guarantees that the message will be
visible to recvmsg(3SOCKET) on a socket bound to the destination
address, as long as that destination socket remains open. If no
socket is bound to the destination address, the message is silently
dropped.

All messages sent from the same socket to the same destination are
delivered in the order in which they were sent.
Messages sent from different sockets, or to different destinations,
may be delivered in any order.

The sizes of an RDS socket's send and receive buffers can be set
using the SO_SNDBUF and SO_RCVBUF options.

poll(2) is supported on RDS sockets. POLLIN is returned when there is
a message waiting in the socket's receive queue. POLLOUT is always
returned, so an application that uses poll(2) to trigger sends must
back off on its own.

The following ioctl(2) requests are supported on an RDS socket:

    O_SIOCGIFCONF
    SIOCGIFCONF
    SIOCGIFNUM
    SIOCGIFMTU
    SIOCGIFFLAGS
    TI_GETMYNAME

When a socket is closed, all pending messages are freed.

The RDMA interface is currently based on control messages (ancillary
data) sent or received through the sendmsg(3SOCKET) and
recvmsg(3SOCKET) functions.

Initially, the client sends RDMA requests along with an
RDS_CMSG_RDMA_MAP control message. The control message contains the
address and length of the memory region for which to obtain a handle,
some flags, and a pointer to a memory location (in the caller's
address space) where the kernel will store the RDMA cookie.

Alternatively, if the application has already obtained an RDMA cookie
for the memory range it wants to RDMA to or from, it can hand this
cookie to the kernel using the RDS_CMSG_RDMA_DEST control message.

When the server receives the RDMA request, the kernel delivers the
cookie wrapped inside an RDS_CMSG_RDMA_DEST control message.

The server then initiates the data transfer by sending the RDMA ACK
message along with an RDS_CMSG_RDMA_ARGS control message. This
message contains the RDMA cookie and the local memory to copy to or
from.

The server process may request a notification when an RDMA operation
completes. Notifications are delivered as RDS_CMSG_RDMA_STATUS
control messages. When an application calls recvmsg(3SOCKET), it
receives either a regular RDS message (possibly with other
RDMA-related control messages) or an empty message with one or more
status control messages.
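
As a rough sketch of how an application might tell a regular message
from a status-only message after a receive call returns, the helpers
below walk the ancillary data with the standard CMSG macros. The
names count_cmsgs and is_status_only are hypothetical and not part of
RDS. The caller is assumed to pass SOL_RDS and RDS_CMSG_RDMA_STATUS,
taken from the system's RDS header, as level and type.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Count the control messages with the given level and type that are
 * attached to a message header filled in by a receive call.
 * Hypothetical helper; not part of the RDS API.
 */
size_t count_cmsgs(struct msghdr *msg, int level, int type)
{
    size_t n = 0;
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == level && c->cmsg_type == type)
            n++;
    }
    return n;
}

/*
 * Return 1 if the message carries no payload but at least one control
 * message of the given status type, 0 otherwise. payload_len is the
 * value returned by the receive call.
 */
int is_status_only(struct msghdr *msg, ssize_t payload_len,
                   int level, int status_type)
{
    return payload_len == 0 && count_cmsgs(msg, level, status_type) > 0;
}
```

Passing level and type as arguments keeps the sketch independent of
any particular header; a real program would normally use the RDS
constants directly.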

When an RDMA operation fails for some reason and is discarded, the
application can ask to receive notifications for failed messages as
well, regardless of whether it asked for a success notification for
an individual message. This behavior is turned on by setting the
RDS_RECVERR socket option.

In addition to the control message interface, RDS allows a process to
register and release memory ranges for RDMA through calls to
setsockopt(3SOCKET).

RDS_GET_MR
    To obtain an RDMA cookie for a given memory range, the
    application can use setsockopt with RDS_GET_MR. This operates
    essentially the same way as the RDS_CMSG_RDMA_MAP control
    message: the argument contains the address and length of the
    memory range to be registered, and a pointer to an RDMA cookie
    variable in which the call stores the cookie for the registered
    range.

RDS_FREE_MR
    Memory ranges can be released by calling setsockopt with
    RDS_FREE_MR, giving the RDMA cookie and additional flags as
    arguments.

RDS_RECVERR
    This is a boolean option that can be set as well as queried
    (using getsockopt). When enabled, RDS sends RDMA notification
    messages to the application for any RDMA operation that fails.
    This option defaults to off.

For all of these calls, the level argument to setsockopt is SOL_RDS.

RDMA cookie

    typedef u_int64_t rds_rdma_cookie_t;

This encapsulates a memory location in the client process. In the
current implementation, it contains the R_Key of the remote memory
region and the offset into it (so that the application does not have
to worry about alignment). The RDMA cookie is used in several struct
types described below. The RDS_CMSG_RDMA_DEST control message
contains an rds_rdma_cookie_t all by itself as payload.
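
A control message whose payload is a bare cookie can be assembled
with the standard CMSG macros. The following is a minimal sketch, not
the reference implementation: attach_cookie_cmsg is a hypothetical
helper, the typedef is repeated locally only so that the fragment
stands alone, and the caller is assumed to supply SOL_RDS and
RDS_CMSG_RDMA_DEST from the system's RDS header as level and type.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Local stand-in; normally provided by the system's RDS header. */
typedef uint64_t rds_rdma_cookie_t;

/*
 * Attach a single control message carrying only a cookie to msg.
 * buf must be aligned for struct cmsghdr and at least
 * CMSG_SPACE(sizeof(rds_rdma_cookie_t)) bytes long.
 * Returns 0 on success, -1 if buf is too small.
 * Hypothetical helper; not part of the RDS API.
 */
int attach_cookie_cmsg(struct msghdr *msg, void *buf, size_t buflen,
                       int level, int type, rds_rdma_cookie_t cookie)
{
    struct cmsghdr *c;

    if (buflen < CMSG_SPACE(sizeof(cookie)))
        return -1;

    memset(buf, 0, buflen);
    msg->msg_control = buf;
    msg->msg_controllen = CMSG_SPACE(sizeof(cookie));

    c = CMSG_FIRSTHDR(msg);
    c->cmsg_level = level;
    c->cmsg_type = type;
    c->cmsg_len = CMSG_LEN(sizeof(cookie));
    /* Copy rather than cast: CMSG_DATA may not be 8-byte aligned. */
    memcpy(CMSG_DATA(c), &cookie, sizeof(cookie));
    return 0;
}
```

The message header prepared this way would then be handed to
sendmsg(3SOCKET) together with the destination address and any data.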

The following data types are used with RDS_CMSG_RDMA_MAP control
messages and with the RDS_GET_MR socket option:

    struct rds_iovec {
            u_int64_t        addr;
            u_int64_t        bytes;
    };

    struct rds_get_mr_args {
            struct rds_iovec vec;
            u_int64_t        cookie_addr;
            u_int64_t        flags;
    };

The cookie_addr member specifies a memory location in which to store
the RDMA cookie.

The flags value is a bitwise OR of any of the following flags:

RDS_RDMA_USE_ONCE
    This tells the kernel that the allocated RDMA cookie is to be
    used exactly once. When the RDMA ACK message arrives, the kernel
    automatically unbinds the memory area and releases any resources
    associated with the cookie. If this flag is not set, it is the
    application's responsibility to release the memory region at a
    later time using the RDS_FREE_MR socket option.

RDS_RDMA_INVALIDATE
    Normally, RDMA memory mappings are invalidated lazily, because
    invalidation requires relatively costly synchronization with the
    HCA. However, this means that the server application can
    continue to access the registered memory for some indeterminate
    amount of time. If this flag is set, the RDS code invalidates
    the mapping at the time it is released (either upon arrival of
    the RDMA ACK, if USE_ONCE was specified, or when the application
    destroys it using FREE_MR).

RDMA operations are initiated by the server using the
RDS_CMSG_RDMA_ARGS control message, which takes the following data as
payload:

    struct rds_rdma_args {
            rds_rdma_cookie_t cookie;
            struct rds_iovec  remote_vec;
            u_int64_t         local_vec_addr;
            u_int64_t         nr_local;
            u_int64_t         flags;
            u_int32_t         user_token;
    };

The cookie argument contains the RDMA cookie received from the
client. The local memory is given as an array of rds_iovec
structures. The array address is given in local_vec_addr, and its
number of elements is given in nr_local.
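
For the registration structure struct rds_get_mr_args, a minimal
sketch of filling it in might look like the following. The helper
fill_get_mr_args is hypothetical, and the types are redeclared
locally (with uint64_t standing in for u_int64_t) only to keep the
fragment self-contained; a real program would take them from the
system's RDS header. Flag values such as RDS_RDMA_USE_ONCE are
assumed to come from that header as well.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins mirroring the definitions in this page. */
typedef uint64_t rds_rdma_cookie_t;

struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

struct rds_get_mr_args {
    struct rds_iovec vec;
    uint64_t cookie_addr;
    uint64_t flags;
};

/*
 * Describe the memory range [region, region + len) and tell the
 * kernel where to store the resulting cookie.
 * Hypothetical helper; not part of the RDS API.
 */
void fill_get_mr_args(struct rds_get_mr_args *args, void *region,
                      size_t len, rds_rdma_cookie_t *cookie_out,
                      uint64_t flags)
{
    memset(args, 0, sizeof(*args));
    /* Pointers travel as 64-bit integers in these structures. */
    args->vec.addr = (uint64_t)(uintptr_t)region;
    args->vec.bytes = (uint64_t)len;
    args->cookie_addr = (uint64_t)(uintptr_t)cookie_out;
    args->flags = flags;
}
```

The filled structure would then be passed to setsockopt with level
SOL_RDS and option RDS_GET_MR, or carried as the payload of an
RDS_CMSG_RDMA_MAP control message.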

The struct member remote_vec specifies a location relative to the
memory area identified by the cookie: remote_vec.addr is an offset
into that region, and remote_vec.bytes is the length of the memory
window to copy to or from. This length must match the size of the
local memory area, that is, the sum of bytes in all members of the
local iovec.

The flags field contains the bitwise OR of any of the following
flags:

RDS_RDMA_READWRITE
    If set, an RDMA WRITE is initiated from the server's memory to
    the client's memory. If not set, RDS performs an RDMA READ from
    the client's memory to the server's memory.

RDS_RDMA_FENCE
    By default, InfiniBand makes no guarantee about the ordering of
    an RDMA READ with respect to subsequent SEND operations. Setting
    this flag asks that the RDMA READ be fenced off from the
    subsequent RDS ACK message. Setting this flag requires an
    additional round trip over the IB fabric, but it is a good idea
    to set it by default unless you are sure you do not need it.

RDS_RDMA_NOTIFY_ME
    This flag requests a notification upon completion of the RDMA
    operation (successful or otherwise). The notification contains
    the value of the user_token field passed in by the application.
    This allows the application to release resources (such as
    buffers) associated with the RDMA transfer.

The user_token field can be used to pass an application-specific
identifier to the kernel. This token is returned to the application
when a status notification is generated (see the following section).

The RDS kernel code is able to notify the server application when an
RDMA operation completes. These notifications are delivered through
RDS_CMSG_RDMA_STATUS control messages.

By default, no notifications are generated. There are two ways an
application can request them. On one hand, status notifications can
be enabled on a per-operation basis by setting the RDS_RDMA_NOTIFY_ME
flag in the RDMA arguments.
On the other hand, the application can request notifications for all
RDMA operations that fail by setting the RDS_RECVERR socket option
(described above).

In both cases, the format of the notification is the same, and at
most one notification is sent per completed operation. The message
format is:

    struct rds_rdma_notify {
            u_int32_t user_token;
            int32_t   status;
    };

The user_token field contains the value previously given to the
kernel in the RDS_CMSG_RDMA_ARGS control message. The status field
contains a status value, with 0 indicating success and non-zero
indicating an error.

The following status codes are currently defined:

RDS_RDMA_SUCCESS
    The RDMA operation succeeded.

RDS_RDMA_REMOTE_ERROR
    The RDMA operation failed due to a remote access error. This is
    usually due to an invalid R_Key, offset, or transfer size.

RDS_RDMA_CANCELED
    The RDMA operation was canceled by the application. (This error
    code is not yet generated.)

RDS_RDMA_DROPPED
    RDMA operations were discarded after the connection broke and
    was re-established. The RDMA operation may have been processed
    partially.

RDS_RDMA_OTHER_ERROR
    Any other failure.

When using the RDS_GET_MR socket option to register a memory range,
the application passes a pointer to a struct rds_get_mr_args
variable, described above.

The RDS_FREE_MR call takes an argument of type struct
rds_free_mr_args:

    struct rds_free_mr_args {
            rds_rdma_cookie_t cookie;
            u_int64_t         flags;
    };

The cookie member specifies the RDMA cookie to be released. RDMA
access to the memory range is usually not invalidated instantly,
because the operation is rather costly. However, if the flags
argument contains RDS_RDMA_INVALIDATE, RDS invalidates the indicated
mapping immediately, as described for the mapping arguments above.

If the cookie argument is 0 and RDS_RDMA_INVALIDATE is set, RDS
invalidates old memory mappings on all devices.

RETURN VALUES
A -1 is returned if an error occurs. Otherwise the return value is a
descriptor referencing the socket.
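
EXAMPLES
DESCRIPTION states that remote_vec.bytes in struct rds_rdma_args must
equal the sum of the bytes members of the local rds_iovec array. The
following sketch shows one way an application might check this before
issuing a request. The helper local_len_matches is hypothetical, and
struct rds_iovec is redeclared locally (with uint64_t standing in for
u_int64_t) only so that the fragment stands alone.

```c
#include <stdint.h>

/* Local stand-in mirroring the definition in DESCRIPTION. */
struct rds_iovec {
    uint64_t addr;
    uint64_t bytes;
};

/*
 * Return 1 if the lengths of all nr_local entries add up to exactly
 * remote_bytes, 0 otherwise. A sum that would overflow 64 bits is
 * treated as a mismatch.
 * Hypothetical helper; not part of the RDS API.
 */
int local_len_matches(const struct rds_iovec *local, uint64_t nr_local,
                      uint64_t remote_bytes)
{
    uint64_t total = 0;
    uint64_t i;

    for (i = 0; i < nr_local; i++) {
        if (local[i].bytes > UINT64_MAX - total)
            return 0;               /* sum would overflow */
        total += local[i].bytes;
    }
    return total == remote_bytes;
}
```

For instance, two local buffers of 4096 and 1024 bytes match a remote
window of 5120 bytes and no other length.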

ERRORS
The socket() function will fail if:

EACCES
    Permission to create a socket of the specified type or protocol
    is denied.

EAFNOSUPPORT
    The specified address family is not supported by the protocol
    family.

EMFILE
    The per-process descriptor table is full.

ENOMEM
    Insufficient user memory is available.

EPFNOSUPPORT
    The specified protocol family is not supported.

EPROTONOSUPPORT
    The protocol type is not supported by the address family.

EPROTOTYPE
    The socket type is not supported by the protocol.

ENODEV
    The transport driver failed to attach, or there is no transport
    hardware.

ATTRIBUTES
See attributes(5) for descriptions of the following attributes:

_____________________________________________________________
| ATTRIBUTE TYPE              | ATTRIBUTE VALUE             |
|_____________________________|_____________________________|
| MT-Level                    | Safe                        |
|_____________________________|_____________________________|

SEE ALSO
socket(3SOCKET), bind(3SOCKET), sendmsg(3SOCKET), recvmsg(3SOCKET),
getsockopt(3SOCKET), setsockopt(3SOCKET), ib(7D)