S10C-NET
B. Kenkre
R. Srivatsavai
Oracle Corporation
March 24, 2010

S10C Networking with exclusive IP stack

Table of Contents

1. Introduction
2. Running native binaries in S10C
3. Components with issues
  3.1. IPMP
  3.2. IPFilter(5) and IPNAT
  3.3. netstat(1M)
  3.4. MIBs
  3.5. autopush(1M)
  3.6. DHCP Client and in.ndpd
  3.7. IP Tunneling
  3.8. snoop(1M) and libdlpi(3LIB)
  3.9. Socket Options
  3.10. Brussels phase II (ipadm)
4. Components not supported
5. Components that work
6. Other components
7. Relevant ARC cases and CRs
8. References
Authors' Addresses

1. Introduction

1.1. Project/Component Working Name: S10C Networking with exclusive IP stack

1.2. Name of Document Author/Supplier: Baban Kenkre, Rishi Srivatsavai

1.3. Date of This Document: 24 Mar, 2010

Branded zones emulate user environments from non-native operating systems. Solaris 10 Containers (S10C) [1] are branded zones that host Solaris 10 (S10) user environments on Solaris.Next. In other words, S10C runs S10 user-level binaries on top of the Solaris.Next kernel. Only S10u8 and beyond are supported and tested in such zones. S10C with shared IP stack support [2] was integrated in ONNV build 127.

In this paper, we document the results of our investigation of various networking features in S10C with an exclusive IP stack and propose solutions to the issues encountered.

2. Running native binaries in S10C

In the following sections, many of the proposed solutions involve running a native binary in S10C. Running a native binary in S10C means running the Solaris.Next version of the binary in the S10C zone instead of the S10 version. The native binaries are already mounted and accessible in S10C using appropriate wrappers. Use of a native binary is allowed as long as it does not break any documented interfaces in S10. Moreover, this helps us avoid adding complex code to Solaris.Next, as well as re-adding code that was previously removed from it. For more information, please refer to the S10C development guide [1].

3. Components with issues

3.1. IPMP

3.1.1. Problem

Currently, IPMP does not work in S10C. IPMP has undergone considerable change in Solaris.Next [3], wherein the ifconfig(1M) command and the kernel communicate back and forth using ioctls and error codes to create IPMP groups. The ifconfig(1M) command, the in.mpathd(1M) daemon and the kernel need to be in sync for IPMP to work as expected. This is not the case in S10C, where the userland binaries are S10-based and the kernel is Solaris.Next.

3.1.2. Proposed Solution

We propose to run native ifconfig(1M), in.mpathd(1M) and ipmpstat(1M) in S10C to support IPMP. S10C will also run native dhcpagent(1M) and the ipmgmt service.

The alternate solution is to modify the S10 ifconfig(1M), the Solaris.Next kernel, or both so that they can talk to each other to set up IPMP, but such an implementation is complex and would make the source code difficult to maintain. On the other hand, the documented features of ifconfig(1M) and the documented features of IPMP, such as administration, configuration and semantics, have not changed from S10 to Solaris.Next. Therefore, S10 applications and scripts that use these documented interfaces will continue to work in S10C without any modification when using native commands.

3.1.3. Changes in ifconfig(1M) output

There are some differences in the output of the native ifconfig(1M) command compared to S10 when IPMP is configured. We do not consider this an issue because parsing IPMP information from ifconfig(1M) output is not supported. Moreover, the native output resolves much of the confusion and many of the issues that were associated with S10. These differences will be documented for the benefit of administrators and are listed below:

o One of the fundamental IPMP-related differences between Solaris.Next and S10 is that in Solaris.Next the IPMP group is modeled as an IP interface, and all the IP data addresses associated with the IPMP group are hosted on the IPMP IP interface.
  In S10, there is no IPMP interface and the data addresses are associated with the underlying interfaces that are in the IPMP group. Therefore, the output of native ifconfig(1M) shows the IPMP interface and the output of S10 ifconfig(1M) does not.

o Since the data addresses are not associated with the underlying interfaces in Solaris.Next, the native ifconfig(1M) command does not show the binding of the underlying interfaces to IP addresses. However, this information can be viewed using the "arp -an" command. Moreover, Solaris.Next includes a new tool, ipmpstat(1M), to display information about the IPMP subsystem. This tool will be included in S10C.

o If an interface is plumbed for IPv6 and address autoconfiguration succeeds, then it gets its own global address. In S10, each physical interface in an IPMP group has its own global address, so the IPMP group has as many global addresses as it has interfaces. In Solaris.Next, only the IPMP interface has its own global address; the underlying interfaces do not.

3.1.4. IPMP Singleton configuration

An IPMP singleton configuration is one in which there is only one interface in the IPMP group. An S10 IPMP singleton group allows a single IP address to be both a data address and a test address. However, this is not supported by Solaris.Next, which requires the two to be different. This configuration is used by Sun Cluster to monitor interface failures. It is unlikely that anyone outside of Sun Cluster is using this feature. Therefore, we propose to document that S10C will not support an IPMP singleton group in which the test address and the data address are the same. If such a configuration exists, then an explicit test address must be configured to perform probe-based failure detection.

3.2. IPFilter(5) and IPNAT

3.2.1. Problem

Tools such as ipfstat(1M) and ipnat(1M) fail with a version mismatch error in S10C.
This is because the IPFilter version used by the S10 tools does not match the IPFilter version used by the Solaris.Next kernel in S10C. The IPFilter version was changed in Solaris.Next by PSARC/2008/250, which introduced IPv6 NAT support for IPFilter. The version was changed because the public data structures nat_t and natlookup_t associated with the SIOCNATL and SIOCSTPUT ioctls were modified.

3.2.2. Proposed Solution

We propose to run native versions of the ipf(1M), ipfs(1M), ipfstat(1M), ipmon(1M), ipnat(1M) and ippool(1M) commands, which are all backward compatible with their S10 counterparts.

3.3. netstat(1M)

3.3.1. Problem

Multiple failures were observed when running netstat(1M) in S10C. Examples:

o Garbage is shown when printing UNIX sockets, i.e. "netstat -f unix". This is because an additional field "si_faddr_noxlate" was added to "struct sockinfo" (see socketvar.h) in Solaris.Next and the S10 netstat (see netstat/unix.c) does not handle this.

o Garbage in the output of "netstat -ra".

o Jumbled output of "netstat -r".

3.3.2. Proposed Solution

We propose to run the native version of netstat(1M) in S10C, which is backward compatible with netstat(1M) in S10.

3.4. MIBs

3.4.1. Problem

The data structures associated with the TCP, UDP and IP MIBs have changed in Solaris.Next. The affected structures are (see $SRC/uts/common/inet/mib2.h):

o mib2_ipAddrEntry_t, mib2_ipv6AddrEntry_t

o mib2_tcp_t, mib2_tcpConnEntry_t, mib2_tcp6ConnEntry_t

o mib2_udp_t, mib2_udpEntry_t, mib2_udp6Entry_t

o mib2_ipRouteEntry_t

All the above changes except for mib2_ipRouteEntry_t are additive in nature, i.e. new fields were added at the end of the data structures by the Updated MIBs project [4]. These changes have been backported to an S10 update and the header file changes are guarded by the NEW_MIB_COMPLIANT macro.

Two fields were removed from mib2_ipRouteEntry_t by the EOF of Mobile IP project [5] because they were not being used.
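
To make the effect of an additive change concrete, the following is a minimal sketch that models two C structure layouts with Python's ctypes. The structures LegacyEntry and UpdatedEntry, their field names, and the helper functions are hypothetical stand-ins chosen for illustration; they are not the real mib2.h definitions, and the sketch does not reproduce the kernel interface. It only shows that appending a field leaves the offsets of the existing fields unchanged, while a consumer that steps through a packed buffer of entries using the size it was compiled with ends up misreading every entry after the first.

```python
import ctypes


class LegacyEntry(ctypes.Structure):
    # Hypothetical entry layout as an older consumer was compiled to see it.
    _fields_ = [("addr", ctypes.c_uint32),
                ("port", ctypes.c_uint32)]


class UpdatedEntry(ctypes.Structure):
    # Same leading fields, plus one field appended at the end (additive change).
    _fields_ = [("addr", ctypes.c_uint32),
                ("port", ctypes.c_uint32),
                ("creator_pid", ctypes.c_uint32)]


def pack_updated(entries):
    """Pack (addr, port, creator_pid) tuples back to back in the updated layout."""
    array = (UpdatedEntry * len(entries))(*[UpdatedEntry(*e) for e in entries])
    return bytes(array)


def walk(buf, entry_type, stride):
    """Read (addr, port) from buf, advancing `stride` bytes per entry."""
    size = ctypes.sizeof(entry_type)
    results = []
    offset = 0
    while offset + size <= len(buf):
        entry = entry_type.from_buffer_copy(buf, offset)
        results.append((entry.addr, entry.port))
        offset += stride
    return results


if __name__ == "__main__":
    buf = pack_updated([(0x0A000001, 80, 111), (0x0A000002, 443, 222)])
    # Stepping by the producer's entry size reads the old fields correctly.
    print(walk(buf, LegacyEntry, ctypes.sizeof(UpdatedEntry)))
    # Stepping by the consumer's own sizeof drifts out of alignment.
    print(walk(buf, LegacyEntry, ctypes.sizeof(LegacyEntry)))
```

In this model the first walk recovers both entries, while the second walk returns values assembled from neighbouring fields, which is the kind of mismatch an unmodified S10 consumer could encounter when handed updated structures.
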

These MIBs can be directly requested using the undocumented T_SRV4_OPTMGMT_REQ or T_OPTMGMT_REQ putmsg(2) request and thereafter retrieved using the getmsg(2) request. Known consumers of this undocumented feature are netstat(1M), SMA (net-snmp) and the quagga (zebra) routing suite.

The snmpd daemon (SMA) dumps core when trying to snmpwalk the TCP, UDP or IP MIBs (see CR 6912339). This is because snmpd uses the sizeof operator to determine the size of these MIBs. These sizes are then used to create and traverse an internal cache, and the daemon dumps core because the incoming MIBs are of a different size. Note that SMA 1.0 is based on net-snmp 5.0.9, which has been EOLed and replaced by net-snmp 5.4.1 in Solaris.Next. net-snmp 5.4.1 works fine in S10C; however, as per RPE, there are no plans to replace SMA 1.0 due to API stability issues.

The quagga software suite tries to read mib2_ipRouteEntry_t entries using the undocumented interface.

3.4.2. Proposed Solution

As noted earlier, an application retrieves these MIBs using the getmsg(2) call. It is not possible for the brand module to determine what type of messages are being exchanged by the getmsg(2) call by looking at the associated data structures. Only the application and the kernel know how to interpret that data. Therefore, the solution is either to modify and recompile the application to handle the updated MIBs or to modify the kernel to return legacy data.

We propose to modify the Solaris.Next kernel to return legacy MIBs if the user process requesting the MIB is running in an S10C zone and the 'len' field in the request is 0. In all other cases, it would return updated MIBs. In S10u8, the 'len' field in the getmsg(2) request is used to distinguish between a request for legacy MIBs and a request for updated MIBs. In Solaris.Next, the 'len' field is ignored. The native netstat binary will be modified to make the request with 'len' = 1. This will allow it to get updated MIBs even when running in S10C.

3.5. autopush(1M)

3.5.1. Problem

The autopush(1M) command is used to configure a list of modules to be automatically pushed onto the stream when a STREAMS device is opened. In S10, sockets are mapped to STREAMS devices. However, in Solaris.Next (and therefore S10C), socket types such as tcp, udp and icmp are mapped to modules by default [6] using the sock2path(4) file and are referred to as module-based or function call based sockets. When a function call based socket is opened, the autopush configuration is ignored because there is no stream involved. Therefore, in S10C, the autopush configuration will be ignored when tcp/udp/icmp sockets are opened.

This affects any application or site that requires specific modules to be pushed onto the stream when opening tcp/udp/icmp sockets. At this point we do not know whether any applications will be affected, nor whether supporting this feature in S10C is desired. Nevertheless, a workaround exists and is documented in Section 3.5.2.

3.5.2. Solution

Solaris.Next (and S10C) will use STREAMS-based sockets, and therefore honor the autopush configuration, if:

o Sockets are explicitly mapped to STREAMS-based devices. See the sock2path(4) file and soconfig(1M). Both configurations, i.e. sock2path(4) and autopush(1M), are machine-wide.

OR

o The application that opened the function call based socket issues I_POP, I_PUSH or I_LIST ioctls, in which case tcp, udp and icmp sockets fall back to using streams.

3.5.3. Future alternatives

If there is sufficient demand and justification, we could modify the Solaris.Next kernel to always open STREAMS devices (using hard-coded device information) when sockets are opened by S10C and a corresponding autopush configuration exists. On the other hand, a separate project or RFE can pursue making the sockparams table (the in-kernel table that stores the socket to device/module mappings) zones-aware so that non-global zones can provide their own mappings.
However, at this point it is not clear whether these alternatives are desired for tcp, udp and icmp sockets, because we are moving away from streams.

3.6. DHCP Client and in.ndpd

3.6.1. Problem

Failures were observed when using the DHCP client in S10C. A change in the size of the IPC request structure in Solaris.Next causes IPC failures between native ifconfig and the S10C dhcpagent. In addition, dhcpagent in S10C does not look up network links under /dev/net. Similarly, in.ndpd uses the same IPC mechanism to communicate with dhcpagent in S10C.

3.6.2. Proposed Solution

To address the failures, we propose to use the native versions of dhcpagent(1M) and in.ndpd(1M) in S10C.

3.7. IP Tunneling

IP Tunneling does not work in S10C without the use of native ifconfig, due to changes in the ioctls used to configure IP Tunneling in the Solaris.Next kernel. Creating IPv4, IPv6 and 6to4 tunnels using the native ifconfig in S10C is supported. However, automatic tunnels created using the atun STREAMS module are not supported in S10C: automatic tunnels have been EOL'ed and the atun STREAMS module has been removed in Solaris.Next.

3.8. snoop(1M) and libdlpi(3LIB)

snoop(1M) is unable to access /dev/net links in S10C due to missing support for /dev/net links in the S10 libdlpi. We propose to use the native version of snoop(1M) in S10C. In addition, we will update the dlpi_open function in libdlpi(3LIB) on Solaris 10 to look up device links under /dev/net and /dev. S10C customers will be advised to install the patch to support /dev/net links in libdlpi(3LIB). Applications that do not use the patched libdlpi(3LIB), or a libpcap (1.0.0+) built with the patched libdlpi, will not be able to access /dev/net links in S10C. We will document that S10 DLPI applications that do not use the patched libdlpi will not work over /dev/net links in S10C.

3.9. Socket Options

3.9.1. Problem

The IP_XMIT_IF socket option has been removed from Solaris.Next.
IP_XMIT_IF is consolidation private and in.routed is the only application impacted in Solaris 10.

3.9.2. Proposed Solution

We propose to run the native version of in.routed in S10C, which uses IP_PKTINFO.

3.10. Brussels phase II (ipadm)

3.10.1. Problem

Brussels phase II [7], targeting ONNV integration in build 135, introduces ipadm to ease the administration of IP on Solaris. ipadm relies on a new daemon (ipmgmtd) for IP configuration storage that runs per exclusive IP stack on the system. Legacy commands such as ifconfig and ndd, and network services such as network/loopback and network/physical, rely on ipmgmtd for IP administration and configuration. With the integration of Brussels phase II, administering IP in S10C with an exclusive IP stack will require adding support for ipmgmtd.

3.10.2. Proposed Solution

We propose to add native versions of ndd, ipmgmtd, and the ip-interface-management SMF service and method in S10C. The native versions of ndd and ipmgmtd, and the SMF manifest and method for ip-interface-management, are added to S10C during zone boot (similar to how native wrappers of commands and daemons are introduced in S10C). The network/loopback service manifest in S10C includes a dependency on the ip-interface-management service in S10C.

The following ndd tunables are not available on Solaris.Next and are therefore not supported in S10C (the commitment level documented in Solaris 10 for each tunable is noted in parentheses):

o ip_squeue_fanout (unstable)

o ip_soft_rings_cnt (obsolete)

o ip_ire_pathmtu_interval (unstable & not recommended)

o tcp_mdt_max_pbufs (unstable)

3.10.3. ipmgmtd manifest import in S10C

The new ip-interface-management SMF service introduced in S10C should run before network/loopback during boot. However, the new service must be imported early in boot; otherwise network/loopback will fail to run and the multi-user milestone will fail to enter the online state.
To ensure network/loopback runs only after the import of the new ip-interface-management SMF service, we will deliver a new net-loopback method script for S10C in Solaris.Next as part of the system/zones/brand/s10 package. The new net-loopback method installed in S10C zones waits for the ip-interface-management service to enter the online state when run during a reconfigure boot.

4. Components not supported

The following components will not be supported in S10C:

o NL7C, because it is currently not supported in non-global zones.

o IPQoS, because it is currently not supported in non-global zones.

o Mobile IP, which has been EOL'ed in Solaris.Next; we are not aware of any customers using Mobile IP in Solaris 10.

o Automatic tunnels using the atun STREAMS module.

o S10C-specific autopush(1M) configuration. See Section 3.5.

o S10 DLPI applications that do not use libdlpi. See Section 3.8.

o IPMP singleton groups in which the test address and the data address are the same. See Section 3.1.4.

5. Components that work

The following components work fine in S10C:

o libsocket(3LIB)

o Socket I_PUSH, I_POP, I_LIST ioctls

o Kernel SSL proxy (Greyhound)

o Manual IP addressing with ifconfig(1M)

o ping(1M)

o Dynamic Routing

o route(1M)

o arp(1M)

o DAD

o IPv6 Stateless Address Autoconfiguration

6. Other components

IPSEC is being investigated by the IPSEC team.

7. Relevant ARC cases and CRs

o 6912339 snmpd cores in S10 Container when I try to snmpwalk TCP, UDP or IP MIBs

o 6915141 autopush(1M) configuration ignored by S10 Containers

o PSARC/2009/253 S10C

o PSARC/2009/306 Brussels II - ipadm and libipadm

8. References

[1] "S10C Developer Guide".

[2] "PSARC/2009/253 S10C".

[3] "PSARC/2007/272 Project Clearview: IPMP Rearchitecture".

[4] "PSARC/2006/314 Updated MIBs".

[5] "PSARC/2007/311 EOF of Mobile IP".

[6] "PSARC/2007/587 Volo -- Low Latency Socket Framework".

[7] "PSARC/2009/306 Brussels II - ipadm and libipadm".

Authors' Addresses

Baban Kenkre
Oracle Corporation
Menlo Park, CA
USA

Rishi Srivatsavai
Oracle Corporation
Burlington, MA
USA