Background:
----------
RFE 4958215 describes a source address selection mechanism that is
needed for a virtual IP based multipathing solution. This case deals
with the exported interfaces needed in order to implement this
solution.

Details:
-------
The technical details can be found here:

  http://jurassic.eng/home/amehta/work/proj/vipa/design.txt

Note, since packets are neither transmitted nor received on the vni
interfaces, no changes to snoop or IP filtering are required for the
new interface type.

Interfaces exported:
-------------------
Interface                Classification    Comments
---------                --------------    --------
struct lifsrcof          Evolving          if, if_tcp (7P) net/if.h
IFF_VIRTUAL              Evolving          if, if_tcp (7P) net/if.h
SIOCGLIFUSESRC           Evolving          if, if_tcp (7P) sys/sockio.h
SIOCSLIFUSESRC           Evolving          if, if_tcp (7P) sys/sockio.h
SIOCGLIFSRCOF            Evolving          if, if_tcp (7P) sys/sockio.h
SUNW_DL_VNI              Project Private   sys/dlpi.h
usesrc                   Evolving          ifconfig(1M) option
usesrc none              Evolving          ifconfig(1M) option
ifconfig -a output       Unstable          ifconfig(1M)
kernel/drv/sparcv9/vni   Project Private   vni(7D)
kernel/drv/vni.conf      Project Private
kernel/drv/vni           Project Private
vni                      Evolving          vni(7D) IP interface

man page diffs (describes operation of new interfaces; see also
comparison with other multipathing technologies below for details):
---------------

Devices                                                      vni(7D)

NAME
     vni - STREAMS virtual network interface driver

DESCRIPTION
     The vni pseudo device is a multi-threaded, loadable, clonable,
     STREAMS pseudo-device supporting the connectionless Data Link
     Provider Interface, dlpi(7P) Style 2. Note, DLPI access is
     primarily for interacting with IP and is not extended to
     applications; that is, DLPI access by applications is not
     supported. So, for example, snoop would fail on the vni
     interface. Snoop will intentionally skip over all IFF_VIRTUAL
     interfaces when started without a specific interface specified
     by "-d".

     vni is like a physical interface, but does not send or receive
     data.
     The device provides a DLPI upper interface that identifies
     itself to IP with a private media type. It can be configured via
     ifconfig(1M), and can have IP addresses assigned to it, that is,
     aliases are possible.

     This pseudo device is particularly useful in hosting an IP
     address when used in conjunction with the usesrc ifconfig
     subcommand (see ifconfig(1M) for examples). The logical
     instances of the device can also be used to host addresses as an
     alternative to hosting them over the loopback interface.

     Multicast is not supported on this device. More specifically,
     the following options would return an error when used with an
     address specified on vni: IP_MULTICAST_IF, IP_ADD_MEMBERSHIP,
     IP_DROP_MEMBERSHIP, IPV6_MULTICAST_IF, IPV6_JOIN_GROUP,
     IPV6_LEAVE_GROUP. Broadcast, similarly, is not supported.

     Traffic is neither received through nor transmitted on a virtual
     interface. There is no physical hardware configured below the
     virtual interface, so there's no way to transmit or receive. All
     packet transmission and reception is accomplished with physical
     interfaces and tunnels. This means that all applications that
     deal with packet transmission and reception, such as packet
     filters, cannot filter traffic on virtual interfaces. Attempting
     to set up a packet filter on a virtual interface will be
     ineffective, as the packets never go through the interface.
     Instead, configure the policy rules to apply to the physical
     interfaces and tunnels, and, if necessary, use the virtual IP
     addresses themselves as part of the rule configuration.

FILES
     /dev/vni

SEE ALSO
     ifconfig(1M), ip(7P), ip6(7P)

--
ifconfig man page addition:

usesrc [ | none ]
     Specify a physical interface used for source address selection
     on a given physical interface. If the keyword "none" is used,
     then any previous selection is cleared.
     When an application does not choose a non-zero source address
     using bind(3SOCKET), the system will select an appropriate
     source address based on the outbound interface and the address
     selection rules (see ipaddrsel(1M)).

     When this option is specified and the interface is selected by
     the forwarding table for output, the system will look first to
     the specified physical interface and its associated logical
     interfaces when selecting a source address. If no usable address
     appears there, then the ordinary selection rules apply. For
     example, if we do 'ifconfig hme0 usesrc vni0', and vni0 has
     address 10.0.0.1 on it, then the system will prefer 10.0.0.1 as
     the source address for any packets originated by local
     connections on the system that are sent through hme0. More
     concrete examples are provided below.

     Any physical interface (or even loopback) may be specified, but
     note that the virtual IP interface (see vni(7D)), which is not
     associated with any physical hardware (and hence immune to
     hardware failures), is a useful way to specify these addresses.
     Any number of physical interfaces can be specified to use the
     source address hosted on this virtual interface, allowing for
     multipathing to happen; that is, if one of the physical
     interfaces were to fail, communications would continue through
     one of the other physical interfaces. This assumes that the
     reachability of the address hosted on the virtual interface is
     advertised in some manner (e.g., through a routing protocol).

     The 'ifconfig preferred' option is applied to all interfaces, so
     it is coarser-grained than the 'usesrc' subcommand. It will be
     overridden by 'usesrc' and 'setsrc' [2] (route subcommand), in
     that order.

     example% ifconfig qfe0 usesrc vni0

     sets up the source address selection such that every packet that
     is locally generated with no bound source address and going out
     on qfe0 prefers a source address hosted on vni0.
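
The preference order described above (an explicit bind first, then an
address hosted on the usesrc interface, then the ordinary rules) can
be illustrated with a small user-level model. This is only a sketch
under stated assumptions, not the kernel implementation: struct
model_if, model_select_source() and model_example() are hypothetical
names, the address 192.168.1.5 is invented for illustration, and the
"ordinary selection rules" of ipaddrsel(1M) are reduced to "take the
first address on the outbound interface".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_ADDRS 4

/* Hypothetical, simplified model of an interface; not a kernel type. */
struct model_if {
    const char *name;
    const char *addrs[MODEL_MAX_ADDRS]; /* usable addresses hosted here */
    size_t naddrs;
    const struct model_if *usesrc;      /* set by 'usesrc', NULL if none */
};

/*
 * Pick a source address for a locally generated packet leaving 'out'.
 * 'bound' is the non-zero address the application bound to, or NULL.
 */
const char *
model_select_source(const struct model_if *out, const char *bound)
{
    /* An explicit non-zero bind(3SOCKET) is always honored. */
    if (bound != NULL)
        return bound;

    /* Look first at the interface named by 'usesrc', if it has an address. */
    if (out->usesrc != NULL && out->usesrc->naddrs > 0)
        return out->usesrc->addrs[0];

    /*
     * Assumption: stand-in for the ordinary selection rules, which
     * this sketch reduces to the first address on the outbound interface.
     */
    if (out->naddrs > 0)
        return out->addrs[0];

    return NULL;
}

/* Mirrors 'ifconfig hme0 usesrc vni0' with 10.0.0.1 hosted on vni0. */
int
model_example(void)
{
    struct model_if vni0 = { "vni0", { "10.0.0.1" }, 1, NULL };
    struct model_if hme0 = { "hme0", { "192.168.1.5" }, 1, &vni0 };

    /* No bind: the address hosted on vni0 is preferred. */
    assert(strcmp(model_select_source(&hme0, NULL), "10.0.0.1") == 0);

    /* No usable address on vni0: fall back to the ordinary rules. */
    vni0.naddrs = 0;
    assert(strcmp(model_select_source(&hme0, NULL), "192.168.1.5") == 0);

    /* 'usesrc none' clears the selection. */
    hme0.usesrc = NULL;
    assert(strcmp(model_select_source(&hme0, NULL), "192.168.1.5") == 0);

    return 0;
}
```

The model deliberately ignores address scope, zones, and the
'preferred' and 'setsrc' settings; it only shows where the usesrc
lookup sits relative to bind(3SOCKET) and the fallback rules.
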
     The ifconfig -a output for qfe0 and vni0 would look like:

     qfe0: flags=1100843 mtu 1500 index 3
             usesrc vni0
             inet 1.2.3.4 netmask ffffff00 broadcast 1.2.3.255
             ether 0:3:ba:17:4b:df
     vni0: flags=11100c1 mtu 0 index 5
             srcof qfe0
             inet 3.4.5.6 netmask ffffffff

     Notice the usesrc and srcof keywords in the output. These
     keywords also appear on the logical instances of the physical
     interface even though this is a per-physical interface
     parameter.

     example% ifconfig qfe0 inet6 usesrc vni0

     sets up the similar source address selection for IPv6, and the
     corresponding output would look like:

     qfe0: flags=2000841 mtu 1500 index 3
             usesrc vni0
             inet6 fe80::203:baff:fe17:4bdf/10
             ether 0:3:ba:17:4b:df
     qfe0:1: flags=2080841 mtu 1500 index 3
             usesrc vni0
             inet6 fec0:56::203:baff:fe17:4bdf/64
     qfe0:2: flags=2080841 mtu 1500 index 3
             usesrc vni0
             inet6 2000:56::203:baff:fe17:4bdf/64
     vni0: flags=2210041 mtu 0 index 5
             srcof qfe0
             inet6 fe80::203:baff:fe17:4444/10
     vni0:1: flags=2200041 mtu 0 index 5
             srcof qfe0
             inet6 fec0::203:baff:fe17:4444/64
     vni0:2: flags=2200041 mtu 0 index 5
             srcof qfe0
             inet6 2000:56::203:baff:fe17:4444/64

     Depending on the scope of the destination of the packet going
     out qfe0, the appropriately scoped source address is selected
     from vni0 and its aliases [3].

     The following is an example of how the usesrc feature can be
     used with zones.
     If the following were executed in the global zone:

     example% ifconfig hme0 usesrc vni0
     example% ifconfig eri0 usesrc vni0
     example% ifconfig qfe0 usesrc vni0

     then the ifconfig -a output for the virtual interfaces would
     look like:

     vni0: flags=10008d1 mtu 0 index 23
             srcof: hme0 eri0 qfe0
             inet 10.0.0.1 netmask ffffffff
     vni0:1: flags=11100c1 mtu 0 index 23
             zone test1
             srcof: hme0 eri0 qfe0
             inet 10.0.0.2 netmask ffffffff
     vni0:2: flags=11100c1 mtu 0 index 23
             zone test2
             srcof: hme0 eri0 qfe0
             inet 10.0.0.3 netmask ffffffff
     vni0:3: flags=11100c1 mtu 0 index 23
             zone test3
             srcof: hme0 eri0 qfe0
             inet 10.0.0.4 netmask ffffffff

     Note, there is one virtual interface alias per zone (test1,
     test2, test3). So, depending on the zone from which the packet
     is being sent, the source address from the virtual interface
     alias in the same zone is selected.

     The virtual interface aliases would have been created through
     zonecfg(1M):

     example% zonecfg -z test1
     zonecfg:test1> add net
     zonecfg:test1:net> set physical=vni0
     zonecfg:test1:net> set address=10.0.0.2

     In a similar manner the test2 and test3 zone interfaces and
     addresses are created.

---
The following text will be added to the ipaddrsel(1M) man page:

     Note, if the usesrc ifconfig(1M) subcommand is applied to a
     particular physical interface, then a preference is given to the
     addresses hosted on the specified interface before the source
     address selection rules specified by ipaddrsel(1M) are applied
     for packets going out on that interface. This holds true for
     packets that are locally generated and for applications that do
     not choose a non-zero source address using bind(3SOCKET).

----
The following text will be added to the if, if_tcp (7P) man pages:

     SIOCSLIFUSESRC
          Set the interface from which to choose a source address.
          The lifr_index field has the interface index corresponding
          to the interface whose address is to be used as the source
          address for packets going out on the interface whose name
          is provided by lifr_name.
          If the lifr_index field is set to zero, then the previous
          setting is cleared. See ifconfig(1M) for examples of the
          'usesrc' subcommand.

     SIOCGLIFUSESRC
          Get the interface index of the interface whose address is
          used as the source address for packets going out on the
          interface provided by the lifr_name field. The value is
          retrieved in the lifr_index field.

     SIOCGLIFSRCOF
          Get the interface configuration list for interfaces that
          use an address hosted on the interface, provided by the
          lifs_ifindex field in the lifsrcof struct (see below), as a
          source address. The application sets lifs_maxlen to the
          size (in bytes) of the buffer it has allocated for the
          data. On return, the kernel sets lifs_len to the actual
          size required. Note, the application could set lifs_maxlen
          to zero to query the kernel for the required buffer size
          instead of estimating a buffer size. The application tests
          'lifs_len <= lifs_maxlen' -- if that's true, then the
          buffer was big enough, and the application has an accurate
          list. If that's false, then it needs to allocate a bigger
          buffer and try again, and lifs_len is a hint of how big to
          make the next trial.

     The lifsrcof structure has the form:

     /*
      * Structure used in SIOCGLIFSRCOF to get the interface
      * configuration list for those interfaces that use an address
      * hosted on the interface (set in lifs_ifindex), as the source
      * address.
      */
     struct lifsrcof {
             uint_t  lifs_ifindex;  /* addr on this interface used as the src addr */
             size_t  lifs_maxlen;   /* size of buffer: input */
             size_t  lifs_len;      /* size of buffer: output */
             union {
                     caddr_t         lifsu_buf;
                     struct lifreq   *lifsu_req;
             } lifs_lifsu;
     #define lifs_buf lifs_lifsu.lifsu_buf   /* buffer address */
     #define lifs_req lifs_lifsu.lifsu_req   /* array returned */
     };

The following text will be added to the ERRORS section:

ERRORS
     ENXIO     For SIOCGLIFSRCOF, the lifs_ifindex member of the
               lifsrcof structure contains an invalid value.

               For SIOCSLIFUSESRC, this error is returned if the
               lifr_index is set to an invalid value.
     EINVAL    For SIOCSLIFUSESRC, this error is returned if either
               the lifr_index or lifr_name identifies an interface
               that is already part of an existing IPMP group.

The following text for the new flag will be added:

     #define IFF_VIRTUAL 0x0100000000 /* Does not send or recv pkts */

The IFF_VIRTUAL flag will be applied to vni interfaces and lo0.

Relationship with other Multipathing technologies:
-------------------------------------------------
This feature, used in conjunction with IP routing protocols, provides
a multipathing mechanism facilitating network redundancy. RFE 4923356
(need a vipa-like network multipathing technology) describes some of
the trade-offs between IPMP and this routing based multipathing
scheme. For example, with IP routing based multipathing the
interfaces in a group can be on different subnets (unlike IPMP, in
which they are on the same subnet), thus providing greater protection
against network failures, since the paths to network interfaces on
different subnets can be made distinct more easily.

Of the various multipathing solutions, IPMP and CGTP [2] protect
paths in a layer 2 (bridging) environment, but not paths through the
network layer (i.e., beyond the first router). This solution
addresses the latter category; it works through an internet. There
are also other multipathing solutions, such as AP and Sun Trunking,
that work at the interface layer and can be used to protect link
facilities. Network designers may choose to use none, one of them, or
several of them to cover for various kinds of equipment failure,
depending on network topology. So, this is not a replacement for IPMP
or any of the other multipathing solutions. The IPMP team is aware of
this feature and has in fact contributed to the specification.

When redundancy is implemented using IP routing, the systems
participating in the network each run an interior gateway routing
protocol (such as RIP, OSPF, or IS-IS).
They communicate with each other about the failures of particular
nodes and links, and are able to adjust the routes to forward around
the damaged portions. Moreover, they're able to compute multiple
independent paths through the same network to get to any destination,
and load balance traffic among those paths (a feature known as
Equal-Cost Multi-Path, or "ECMP").

This feature is a portion of the work that's necessary for a complete
IP routing redundancy solution. It does not implement the routing
protocols themselves. Instead, it provides operating system support
for a feature commonly (though somewhat confusingly) called a
"loopback address." These are addresses assigned to the system
itself, rather than to any interface. The advantage of such an
address is that it doesn't go away when an interface is DR'd away or
renumbered. It's thus desirable for long-lived TCP/IP connections.

Without this feature, applications must somehow "know" about the
loopback addresses (likely through manual configuration of each
application) and issue bind() calls to attach themselves to these
addresses. Otherwise, they're assigned interface-related addresses
automatically by the kernel. This feature allows the administrator to
exert some influence over the kernel's choice of addresses.

This feature, in combination with IP routing, forms a common solution
implemented by many other vendors. For example, IBM successfully
markets this combination under the name "VIPA." Customers are already
quite aware of this.

Future work:
-----------
An RFE (5060063) has been filed as a placeholder for a generic
interface that will allow applications to expand and track group
memberships.
References:
----------
[1] IP Network Multipathing (PSARC/1999/225)
[2] Carrier Grade Transport Protocol (CGTP) (PSARC/2000/539)
[3] IPv6 Default Address Selection (PSARC/2002/390)

Acknowledgment:
--------------
Thanks to James Carlson for providing the bulk of the text of the
"Relationship with other Multipathing technologies" section.