0. Introduction

This project adds basic Ethernet (layer two) bridging support to OpenSolaris. It consists of a Project Private kernel module and daemon, some Project Private SMF properties, and Committed dladm and SMF control interfaces. It is targeted for a Minor release of an OpenSolaris distribution; the change to the default "dladm show-link" output that causes Minor binding is detailed below.

The architecture described in this document is based on the Clearview UV (PSARC 2006/499) terminology and dladm command-line design. In particular, Clearview obsoletes the idea of "network devices" and instead relies on "links" that may themselves be of varying types.

The bridging protocol referred to in this document includes both the IEEE 802.1D-1998 "Spanning Tree Protocol," abbreviated in this document as "STP," and the IEEE 802.1D-2004 "Rapid Spanning Tree Protocol," abbreviated as "RSTP." The newer and far more complex "Multiple Spanning Tree Protocol" (802.1Q-2005; MSTP) is intended to be backward compatible with STP. MSTP is not part of this project, however, and may be the subject of a future one.

1. Administration

All of the administration of this feature is based on dladm and SMF. The SMF portion provides the ability to enable, disable, and monitor bridge instances using the instance URIs described in section 3 below. The dladm portion creates and destroys bridges, assigns links to them, and removes links from them.

1.1 New dladm subcommands

These commands are patterned after the existing aggregation commands in dladm, but without the "-t" support, as SMF doesn't adequately support temporary instances. (If required later, temporary instance and parameter support could be added.)

    dladm create-bridge [-R ] [-p ] [-m ] [-h ] [-d ] [-f ] [-l ]...

This command creates a bridge instance and optionally assigns one or more network links to the new bridge.
By default, no bridge instances are present on the system, and OpenSolaris will thus not bridge between network links by default. In order to bridge between links, you must create at least one bridge instance. Each bridge instance is separate: there is intentionally no forwarding connection between bridges, and a link is a member of at most one bridge. Note that a pair of internal bridges that are somehow interconnected are actually equivalent to a single larger bridge instance, so such a configuration should never be needed in ordinary practice. (These cases may be created using special test tools, however.) The provided is chosen by the administrator and arbitrary, but must at least be a legal SMF service instance name. For purposes of documentation, this is a URI component without escape sequences, meaning that the following characters may not be present: ; / ? : @ & = + $ , % < > # " including whitespace and ASCII control characters. The name "default" is reserved, as are all names beginning with the string "SUNW". Names with trailing digits are not permitted, in order to allow for creation of "observability devices." (For more about "observability," see section 2 below.) Because of the use of the observability devices, the names of legal bridge instances are further constrained to be a legal dlpi(7P) name, which matches: [A-Za-z_][A-Za-z0-9_]*[A-Za-z_] Names not matching that pattern will cause the command to fail and report "illegal name" to the user. Options are: -R Specify an alternate root directory. This allows the configuration of bridge instances in alternate roots, as with Live Upgrade and with jumpstart installs. Note that error checking for link type isn't possible when administering an alternate root, because the definition of the link itself may exist only in deferred commands in that alternate root. -p Specify the Bridge Priority. This sets the STP priority value for determining the root bridge node in the network. 
The default value is (per the specification) 32768, and legal values are 0 (highest priority) to 61440 (lowest priority), in increments of 4096. (This granularity is required per section 9.2.5 of IEEE 802.1D-2004; the lower 12 bits are now used for MSTP instances and treated as an ID extension.) If a value with any of the lower 12 bits set is used, then the system will silently ignore those bits and round downward to the next lower value divisible by 4096.

-m Specify the maximum age for configuration information. This sets the STP Bridge Max Age parameter. Information older than this (in seconds) is discarded by all bridges in the network if this node is the root bridge. It defaults to 20 seconds. Legal values are from 6 to 40 seconds. (See the "-d" parameter for additional constraints.)

-h Specify the Bridge Hello Time. This sets the STP Bridge Hello Time parameter. If this node is the root node, it sends Configuration BPDUs at this interval throughout the network. It defaults to 2 seconds. Legal values are from 1 to 10 seconds. (See the "-d" parameter for additional constraints.)

-d Specify the Bridge Forward Delay. This sets the STP Bridge Forward Delay parameter. This timer is used to sequence the link states when a port is enabled anywhere in the network if this node is the root bridge. It defaults to 15 seconds. Legal values are from 4 to 30 seconds. Bridges must obey the following two constraints:

    2 * (forward_delay - 1.0) >= max_age
    max_age >= 2 * (hello_time + 1.0)

Any parameter setting that would violate those constraints will be treated as an error and cause the command to fail with a diagnostic message.

-f Specify the forced maximum supported protocol. This sets the MSTP maximum supported protocol number, and must be non-negative. The default is 3. The current implementation doesn't support RSTP or MSTP, so this currently has no effect.
However, if the user wishes to prevent MSTP from being used in the future when implemented, the parameter may be set to 0 (STP only) or 2 (allow STP or RSTP). -l Add a link to the newly-created bridge. This is equivalent to creating the bridge and then adding one or more links, as with the "add-bridge" option below, except that if any of the links cannot be added, then the entire command fails, and the new bridge itself isn't created. The option is repeated to add multiple links at once. Bridges may be created without links if desired. See the "add-bridge" subcommand for details on link assignment. Bridge creation and link assignment require PRIV_SYS_DL_CONFIG. dladm modify-bridge [-R ] [-p ] [-m ] [-h ] [-d ] [-f ] This subcommand modifies the operational parameters of a given bridge instance. All of the options are the same as for the "create-bridge" subcommand above, except that the "-l" option is not permitted. To add links to an existing bridge, use the "add-bridge" subcommand below. Bridge parameter modification requires PRIV_SYS_DL_CONFIG. dladm delete-bridge [-R ] This subcommand deletes a bridge instance. Unlike the bridge creation subcommand, which can add links while creating, the delete subcommand does not have the option to remove links during the deletion process. The bridge being deleted must not have any attached links. If it does, then an error is returned and no action is taken. The user must use "remove-bridge" first to deactivate the links. Bridge deletion requires PRIV_SYS_DL_CONFIG. The "-R" option is the same as for the "create-bridge" subcommand. dladm add-bridge [-R ] -l [-l ]... This subcommand adds one or more links to a bridge instance. If multiple links are specified, and adding any one of them results in an error, then no changes are made to the system and the command fails. Link addition to a bridge requires PRIV_SYS_DL_CONFIG. A link may be a member of at most one bridge. 
It's an error to attempt to add a link that already belongs to another bridge. To move a link from one bridge instance to another, remove it from the current bridge before adding it to a new one. The links assigned to a bridge must not themselves be VLANs, VNICs, or tunnels. Only links that would be acceptable as part of an aggregation or links that are aggregations themselves may be assigned to a bridge. Other link types will result in error messages, and no action taken. (A future project may provide bridging over tunnels using GRE, and over PPP using BCP. Those features are not part of this project, but nothing this project is doing will preclude those features from being supported in the future.) Links assigned to a bridge must all have the same MTU. This is checked when the link is assigned, and the link will be rejected if it is not the first link on the bridge and has a differing MTU in order to avoid inadvertent errors. Note that Solaris also allows the MTU on a link to be changed on an existing link. In this case, we will log an error and the bridge instance will go into maintenance state. The user may then remove or change the assigned links so that the MTU matches and then restart. In this initial version, the links must also be Ethernet type, which includes 802.3 and 802.11 media. Bridging is well-defined over a few other media, and there are some dodgy ways to make it work on still others, but those cases are subjects for a future release. (It is remotely possible that there may be some drivers that do not permit transmission of frames with a user-chosen source MAC address. None have been found yet, but if any are found in testing, these will be listed in the documentation as unsupported drivers.) When links are added to a bridge, the bridging protocol in use (STP) will be notified, and the links will behave as though just created. For STP, this means that the link will be shut down and then brought back up using the standard protocol. 
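The link eligibility rules and the all-or-nothing behavior of "add-bridge" described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: the Link record, the class names "phys" and "aggr", and the function names are hypothetical, and the MTU of an empty bridge is assumed to be set by the first link accepted.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical link classes; only the eligibility rules come from the text.
BRIDGEABLE_CLASSES = {"phys", "aggr"}


@dataclass
class Link:
    name: str
    link_class: str            # e.g. "phys", "aggr", "vlan", "vnic", "tunnel"
    media: str                 # "ethernet" covers 802.3 and 802.11 here
    mtu: int
    bridge: Optional[str] = None


def check_add_bridge(new_links: List[Link], members: List[Link]) -> List[str]:
    """Return one error string per rejected link; empty means all are acceptable."""
    errors = []
    # The MTU to match is that of the existing members, if any.
    mtu = members[0].mtu if members else None
    for link in new_links:
        if link.link_class not in BRIDGEABLE_CLASSES:
            errors.append(f"{link.name}: link type {link.link_class} cannot be bridged")
        elif link.media != "ethernet":
            errors.append(f"{link.name}: not an Ethernet link")
        elif link.bridge is not None:
            errors.append(f"{link.name}: already assigned to bridge {link.bridge}")
        elif mtu is not None and link.mtu != mtu:
            errors.append(f"{link.name}: MTU {link.mtu} differs from bridge MTU {mtu}")
        elif mtu is None:
            mtu = link.mtu     # first link on the bridge fixes the MTU
    return errors


def add_bridge(bridge: str, new_links: List[Link], members: List[Link]) -> None:
    """Add every link or none: any single failure leaves the bridge unchanged."""
    errors = check_add_bridge(new_links, members)
    if errors:
        raise ValueError("; ".join(errors))
    for link in new_links:
        link.bridge = bridge
        members.append(link)
```

Validating every link before changing any state is one simple way to obtain the "no changes are made to the system" guarantee when a single link is rejected.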
The options are the same as for the "create-bridge" subcommand.

    dladm remove-bridge [-R ] -l [-l ]...

This subcommand removes one or more links from a bridge instance. If multiple links are specified, and removing any one of them would result in an error, then none are removed and the command fails. Link removal from a bridge requires PRIV_SYS_DL_CONFIG.

When links are removed from a bridge, the bridging protocol (STP) is notified, and will likely recalculate a new network topology, unless those links were unused due to loop-pruning activity by the bridging protocol.

The options are the same as for the "create-bridge" subcommand.

    dladm show-bridge [-p] [-o ,...] []

This subcommand shows the running status and configuration of bridges. When given a bridge name, it shows the status of that one bridge. If no bridge name is given, then it shows summary status of all bridges on the system.

The '-o' option allows the user to specify a comma-separated, case-insensitive list of fields to display. The field name may be "all" to display all fields, or any combination of:

    BRIDGE      Assigned name of the bridge (same as , if provided)
    ADDRESS     Bridge Unique Identifier value (MAC)
    PRIORITY    Configured priority value (-p)
    BMAXAGE     Configured bridge maximum age (-m)
    BHELLOTIME  Configured bridge hello time (-h)
    BFWDDELAY   Configured forwarding delay (-d)
    FORCEPROTO  Configured forced maximum protocol (-f)
    TCTIME      Time since last topology change in seconds
    TCCOUNT     Count of the number of topology changes
    TCHANGE     Topology change detected ("yes" or "no")
    DESROOT     Bridge Identifier of the root node (MAC + priority)
    ROOTCOST    Cost of the path to the root node
    ROOTPORT    Port used to reach root node
    MAXAGE      Maximum age value from root node
    HELLOTIME   Hello time value from root node
    FWDDELAY    Forward delay value from root node
    HOLDTIME    Minimum BPDU interval

The default set of fields when -o is not specified is "BRIDGE," "ADDRESS," "PRIORITY," and "DESROOT."

Note the lack of a "-R" option for show-bridge.
It is not possible to list bridge configuration information in an alternate root, in keeping with the rest of the dladm user interface. The reason for this restriction is to allow the data to be represented in SMF, where "writing" to an alternate root is supported by way of copying appropriate commands to $ROOT/var/svc/profile/upgrade, but "reading" is not feasible because the repository on the alternate root may be incompatible with the running system. dladm show-bridge -l [-p] [-o ,...] This variant of the show-bridge subcommand displays link-related status information for a single bridge instance. Note that configured parameters are shown through show-linkprop. The relevant field names for the "show-bridge -l" subcommand are: LINK Link name INDEX Port (link) index number on the bridge STATE "disabled", "listening", "learning", "forwarding", or "blocking" UPTIME Number of seconds since last reset or initialize OPERCOST Actual cost in use (1-65535) OPERP2P P2P mode flag ("yes" or "no") OPEREDGE Edge mode flag ("yes" or "no") DESROOT Root Bridge Identifier (MAC + priority) seen on this port DESCOST Path cost to root node through designated port DESBRIDGE Bridge Identifier (MAC + priority) DESPORT Port ID and priority of port used to transmit configuration messages for this port TCACK Topology Change Acknowledge flag ("yes" or "no") The default set of fields when -o is not specified is "LINK," "STATE," "UPTIME," and "DESROOT." dladm show-bridge -s [-p] [-o ,...] [-i ] [] This variant shows statistics for the bridge given, or, if no bridge name is supplied, then for all bridges in the system. 
The relevant field names are: BRIDGE Bridge name DROPS Number of packets dropped due to resource problems FORWARDS Number of packets forwarded to another link RECV Number of packets received on all links SENT Number of packets sent on all links UNKNOWN Number of packets with unknown destination; sent to all links The default set of fields when -o is not specified is "BRIDGE," "DROPS," and "FORWARDS." dladm show-bridge -ls [-p] [-o ,...] [-i ] This variant shows statistics for all of the links on the bridge named. The relevant field names are: LINK Link name CFGBPDU Number of configuration BPDUs received TCNBPDU Number of topology change BPDUs received RSTPBPDU Number of Rapid Spanning Tree BPDUs received TXBPDU Number of BPDUs transmitted DROPS Number of packets dropped due to resource problems RECV Number of packets received by bridge XMIT Number of packets sent by bridge The default set of fields when -o is not specified is "LINK," "DROPS," "RECV," and "XMIT." 1.2 New dladm Link Properties These may be used with the existing dladm set-linkprop, reset-linkprop, and show-linkprop subcommands. Note the use of underscores in the names; this is to match the existing variable naming practice among dladm properties. "stp" This is a boolean property. It defaults to 1 (true), which enables STP and RSTP. When set to 0 (false), the link will not use any type of Spanning Tree, and will be placed into forwarding mode (with BPDU guarding) at all times. The "false" setting is appropriate for point-to-point links connected to end nodes. Only non-VLAN, non-VNIC type links have this property. "forward" This is a boolean property on all but VNIC links. It defaults to 1 (true). When set to 0 (false), the VLAN associated with the link instance will not forward traffic through the bridge. 
Setting the property to "false" is equivalent to removing the VLAN from the "allowed set" for a traditional bridge, which means that VLAN-based I/O to the underlying link from local clients still operates, but no bridge-based forwarding is done. "default_tag" This is a numeric property with range 0 to 4094. It defaults to 1. It defines the default VLAN ID that's assumed for untagged packets sent to and received from this link. Only non-VLAN, non-VNIC type links have this property. Setting this value to 0 disables the forwarding of untagged packets to and from the port. "stp_priority" This is a numeric property with range 0 to 255. It defaults to 128. It corresponds to the STP and RSTP Port Priority value, which is used to determine the preferred root port on a bridge by prepending to the port identifier. Lower numerical values are higher priority. "stp_cost" This is a numeric property with range 1 to 65535; zero is not allowed by the standard, and is used to signal "auto" (default) cost computed by link type. It represents the STP and RSTP cost for using the link, and is equal to (per the standard) 100 for 10Mbps, 19 for 100Mbps, 4 for 1Gbps, and 2 for 10Gbps. "stp_edge" This is a boolean property. It defaults to 1 (true). If set to 0 (false), the daemon will assume that the port is connected to other bridges even if no bridge PDUs of any type are seen. "stp_p2p" This is an enumerated value. Legal values are "true", "false", and "auto". When set to "auto" (the default), point-to-point connections are automatically discovered. Otherwise, the port mode is forced to point-to-point mode (for "true") or normal multipoint mode (for "false"). 1.3 New Kstats Each bridge instance will have a set of statistics, named "bridge::0:", where: Arbitrary instance number assigned by the kernel and not necessarily retained across reboot. Administrator-specified bridge name. 
Name of statistic; at least the following: learn_source Number of sources learned learn_expire Number of learnt entries expired learn_size Current count of learnt entries forward_direct Directly forwarded packet count forward_unknown Forwarded with unknown destination forward_mbcast Forwarded multicast/broadcast Each link instance will also have new kstats, named "bridge::0-:", where the names will be: xmit Packets forwarded to the link by bridging rcvd Packets received from the link (and forwarded elsewhere) by bridging All of these statistics are considered Volatile for now. The existence of the statistics will be documented for users, but with warnings that the names and definitions of the statistics may change incompatibly. A future case for the overall RBridges project will elevate these in stability. 1.4 dladm show-link Changes A new "BRIDGE" field is added to the "dladm show-link" output. If a link is a member of a bridge, then this field identifies the name of the bridge of which it's a member. This field is shown by default, right before the larger "OVER" field. For links that are not part of a bridge, the field is displayed as a blank string (if parseable output is selected) or as "--" if non-parseable. The addition of the "BRIDGE" field in the default output format may require Minor release binding. The utility is intended to be used in parseable mode when run in cases where the output format matters, and in these cases the changes proposed here are compatible, but we should still err on the side of caution. The bridge observability node also appears in the "dladm show-link" output as a separate link. For this node, the existing "OVER" field will list the links that are members of the bridge. 2. Kernel Features 2.1 Packet Observability Each bridge instance will be assigned an "observability device," in a manner similar to the DLPI nodes created for "Clearview: IP Observability Devices" (PSARC 2006/475). 
These nodes will appear under the /dev/bridge/ directory, named by the bridge name plus a trailing "0".

The observability node is intended for use with snoop and wireshark. It behaves as a standard Ethernet interface, but does not permit the transmission of packets. All transmitted packets are silently dropped. It's not possible to plumb IP on top; attempts to do DL_BIND_REQ without using the passive option will fail.

The user of this node will get a single unmodified copy of every packet handled by the bridge, similar to a "monitoring" port on a traditional bridge, and subject to the usual DLPI "promiscuous mode" rules. Filtering on VLAN ID is accomplished by the use of pfmod(7M) or features in snoop and wireshark; the VLAN PPA hack mechanism (PSARC 2000/147) is not supported. (Note that Crossbow [PSARC 2006/357] has withdrawn support for the VLAN PPA hack.)

The packets delivered will represent the data received by the bridge. In the cases where the bridging process will add, remove, or modify a VLAN tag, the data shown will be before this process takes place, which may be confusing if there are distinct default_tag values used on different links. This isn't often the case, but it's an important caution.

To see the packets transmitted and received on a particular link (after the bridging process is complete), snoop on the individual links rather than the bridge observability node.

Due to the vanity naming support in Clearview, no special changes are needed to dlpi_open(3DLPI) to make it work with these observability nodes. They "just work."

2.2 DLPI Behavior

When a bridge is enabled on a datalink, the link behaves slightly differently in order to accommodate bridging behavior.

a. Link up/down notifications (DL_NOTE_LINK_{UP,DOWN}) are delivered in the aggregate. This means that when all external links are showing link-down status, the upper-level clients using the MAC layers will see link-down events as well.
When any external link on the bridge shows link-up status, all upper-level clients see link-up. There are several reasons for this behavior. When link-down is seen, it means that nodes on the link are no longer reachable. That is no longer true when the bridging code can still send and receive packets through another link. Administrative applications that need the actual status of links can use the existing MAC-layer kstats to reveal the status. These applications are unlike ordinary clients (such as IP) in that they report hardware status information and do not get involved in forwarding. In the case where all external links are down, we let the status show through as though the bridge itself were shut down. In this special case, we allow the system to recognize that nothing could possibly be reachable. The trade-off is that bridges can't be used to allow local-only communication in the case where all interfaces are "real" (not virtual) and all are disconnected. (This could be made an option in the future if desired; the result would be that bridge links, like VNICs, are always "running.") b. All link-specific features are made generic. Links that support special hardware acceleration features will be unable to use those features because actual output link determination is not made entirely by the client: the bridge forwarding function has to choose an output link based on the destination MAC address, and this can be any link on the bridge. It may be possible in the future to handle various acceleration modes with bridging enabled. Doing so would mean either emulating the acceleration logic on links that lack it, or exposing the per-L2-destination nature of the behavior to MAC clients. Such extensions are not part of this project and not currently planned. One reason we are not planning to support these features is that enabling bridging fundamentally requires that the interfaces all be placed into promiscuous mode. 
In that mode, the system must handle all packets on the wire, and most hardware devices disable optimizations as this is the "slow mode." 3. STP Daemon Each bridge (created via "dladm create-bridge") is represented as an identically-named SMF instance of svc:/network/bridge. Each instance runs a copy of /usr/lib/bridged, which implements the Spanning Tree Protocol (STP). For example, if the user runs: # dladm create-bridge mybridge The system will have an SMF service named: svc:/network/bridge:mybridge and (per section 2 above) an observability node named: /dev/bridge/mybridge0 By default, all ports run standard STP. This is done for safety reasons: a bridge that does not run some form of bridging protocol (such as STP) can form long-lasting forwarding loops in the network. Because Ethernet has no hop-count or TTL on packets, any such loops are fatal to the network. When the administrator knows that a particular port is not connected to another bridge (for example, a direct point-to-point connection to a host system), STP can be disabled administratively for that port. Even if all ports on a bridge have STP disabled, the STP daemon still runs; this is in case new ports are added, for implementation of BPDU guarding, and because the daemon is responsible for enabling and disabling forwarding on the ports. When a port has STP disabled, the daemon will still listen for BPDUs (BPDU guarding). It will flag an error (via syslog) if any are seen, and disable forwarding on the port, as this typically indicates a serious network misconfiguration. The link will be reenabled when link status goes down and then up again, or when the administrator manually reenables by removing the link and readding it. (This implementation does not include Cisco's "portfast" feature.) If the SMF service instance for a bridge is disabled, then bridge forwarding stops on those ports as the STP daemon is stopped. If the instance is restarted, STP starts from its initial state. 
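The relationship between a bridge name, its SMF instance, and its observability node can be sketched as follows, together with the name rules from section 1.1. This is a minimal illustration, not the dladm implementation; the function names are hypothetical. The dlpi(7P) pattern is taken directly from section 1.1, and it already excludes every character on the forbidden URI list as well as names ending in a digit.

```python
import re

# Pattern quoted in section 1.1 for a legal bridge (dlpi) name.
_BRIDGE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*[A-Za-z_]")


def check_bridge_name(name: str) -> None:
    """Raise ValueError("illegal name") if the name breaks the section 1.1 rules."""
    reserved = name == "default" or name.startswith("SUNW")
    if reserved or _BRIDGE_NAME.fullmatch(name) is None:
        raise ValueError("illegal name")


def smf_instance(name: str) -> str:
    """SMF instance that represents the bridge, as in the example above."""
    check_bridge_name(name)
    return "svc:/network/bridge:" + name


def observability_node(name: str) -> str:
    """Observability node: the bridge name plus a trailing "0" under /dev/bridge/."""
    check_bridge_name(name)
    return "/dev/bridge/" + name + "0"
```

For the example bridge "mybridge", the two helpers return the same strings shown in the text above; a name such as "br0" is rejected because the trailing digit would collide with the observability node naming.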
The bridge daemon runs as UID/GID "daemon" with PRIV_SYS_DL_CONFIG in order to access the raw network devices, but with most other basic privileges (e.g., PRIV_PROC_FORK and PRIV_PROC_EXEC) removed. This is set up by the SMF profile for the service. The user does not invoke the daemon directly.

The existing "Network Management" RBAC profile is sufficient for the privileges required to administer bridges using dladm. No new RBAC or Least Privilege changes are required.

4. VLANs

4.1 VLAN Administration

In general, administrators will want the VLANs they configure on the system to be forwarded among all the ports on a bridge instance, so this will be the default for VLANs. When the administrator invokes Clearview's "dladm create-vlan", and the underlying link is part of a bridge, that command will also enable forwarding of the specified VLAN on that bridge link.

An administrator who wants to configure a VLAN on a link but not allow forwarding to or from other links on the bridge must take specific action to do so, by disabling forwarding with "set-linkprop" -- see the "forward" parameter in section 1.2 above.

Clearview UV provides two mechanisms for the creation of VLANs. The primary means of configuration is the new "dladm create-vlan" subcommand, which automatically enables the VLAN for bridging as described above, if the underlying link is configured as part of a bridge.

The second mechanism is a legacy feature called the "PPA hack." This allows a user to create a VLAN simply by opening a DLPI provider and specifying a VLAN ID number as part of the PPA. This feature has been removed by Crossbow. However, in the event that bridging integrates without Crossbow, we will disable VLAN forwarding for these VLANs by default. In this case, the user may be doing nothing other than snooping on that VLAN, so adding the VLAN to the allowed set automatically is likely not the right answer.
Administrators with legacy PPA hack VLANs will need to reconfigure to use the new Clearview VLANs to take full advantage of bridging, and, if Crossbow does not remove them first, this issue will be included in the documentation. Architecturally, this also means that all VLAN operations for bridging (enabling and disabling forwarding paths) can be driven by the user space libdladm, and do not need special support from the VLAN portions of the kernel dls module. In standards-compliant Spanning Tree, VLANs are ignored. The bridging protocol computes just one loop-free topology using tag-free BPDU messages and uses this tree to enable and disable links. Administrators are required (by the standard) to configure any "duplicate" links they may provision in their networks such that when those links are automatically disabled by STP, the configured VLANs are not disconnected. This means very careful administrative attention: either run all VLANs everywhere on your bridged backbone, or examine all loop-forming links carefully. MSTP (not included in this project) is somewhat similar, but allows administrators to assign each VLAN to a small number of distinct spanning tree "instances," and allows instances within an identically-configured "region" to have distinct topologies. In terms of this project, additional bridge and link properties would be required to enable MSTP operation. 4.2 VLAN Behavior The bridge performs forwarding by examining the allowed set of VLANs (as described above) and the default_tag parameter for each link. The steps involved are input VLAN determination, link membership check, and then tag update. Input VLAN determination begins with a received packet on a link. When a packet is received, it is checked for a VLAN tag. If that tag is not present or the tag is priority-only (tag zero), then the default_tag configured on that link (if not set to zero) is taken as the internal VLAN tag. 
If the tag is not present or zero and the default_tag is zero, then the packet is ignored; no untagged forwarding is performed. If the tag is present and it's equal to the default_tag, then the packet is also ignored; this is an error case. Otherwise, the input tag is taken to be the input VLAN.

Next, the link membership check is performed. If the input VLAN is not configured as an allowed VLAN on this link, then the packet is ignored. Forwarding is then computed, and the same check is made for the output link.

Finally, the tag update is done. If the VLAN (non-zero at this point) is equal to the default_tag on the output link and the priority value is zero, then the tag on the packet (if any) is removed. If the priority value is non-zero, then the output tag is set to zero. If the VLAN is not equal to the default_tag on the output link, then a tag is added if not currently present, and the tag is set for the output packet.

Note that in the case where forwarding sends to multiple interfaces (for broadcast, multicast, and unknown destinations), the output link check and tag update must be done independently for each output link.

5. SMF Properties

These parameters are all Project Private. They will not be documented, and the documented administrative interface will be the dladm command.

5.1 STP

    SMF Property Name       Type        Default
    ---------------------   --------    -----------------
    config/priority         ushort_t    32768
    config/max-age          ushort_t    5120 (20 seconds)
    config/hello-time       ushort_t    512 (2 seconds)
    config/forward-delay    ushort_t    3840 (15 seconds)
    config/force-protocol   int         3

All of these properties (and their default values and granularities) are defined by the STP and related standards.

The "force-protocol" parameter is specified to allow for an upgrade path. Users who do not want to see the use of MSTP when it is implemented can set this parameter to 0 or 2 (as specified in IEEE 802.1D-2004) to select STP or RSTP as the maximum allowed protocol.
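As an aside on the values in the table above: the timer defaults are consistent with seconds scaled by 256 (20 seconds is stored as 5120), and they are subject to the range and ordering constraints given for "create-bridge" in section 1.1. The following sketch checks those rules. It is only an illustration: the function names are hypothetical, the factor of 256 is inferred from the listed defaults rather than stated in the text, and accepting any 16-bit priority before rounding is an assumption.

```python
from typing import List

# Assumption inferred from the defaults table: 20 s -> 5120, 2 s -> 512.
TIMER_SCALE = 256


def encode_seconds(seconds: float) -> int:
    """Convert a timer in seconds to the stored property value."""
    return int(round(seconds * TIMER_SCALE))


def round_priority(priority: int) -> int:
    """Ignore the lower 12 bits, rounding down to a multiple of 4096."""
    if not 0 <= priority <= 0xFFFF:
        raise ValueError("priority out of range")
    return priority & 0xF000


def check_timers(max_age: float = 20, hello_time: float = 2,
                 forward_delay: float = 15) -> List[str]:
    """Return a list of violated rules; an empty list means the set is legal."""
    errors = []
    if not 6 <= max_age <= 40:
        errors.append("max_age must be from 6 to 40 seconds")
    if not 1 <= hello_time <= 10:
        errors.append("hello_time must be from 1 to 10 seconds")
    if not 4 <= forward_delay <= 30:
        errors.append("forward_delay must be from 4 to 30 seconds")
    if 2 * (forward_delay - 1.0) < max_age:
        errors.append("2 * (forward_delay - 1.0) >= max_age is violated")
    if max_age < 2 * (hello_time + 1.0):
        errors.append("max_age >= 2 * (hello_time + 1.0) is violated")
    return errors
```

With the defaults, both ordering constraints hold (28 >= 20 and 20 >= 6). Raising max_age to 40 while leaving forward_delay at 15 violates the first constraint, which is the kind of setting the command is described as rejecting.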
In this project, the parameter will have no effect, as only STP is implemented. 5.2 Datalink Configuration Current storage for datalink configuration information is in /etc/dladm/datalink.conf, and is manipulated by dladm. To this existing file format, we will add the following keyword: bridge=string, When we are eventually able to switch over to SMF for link configuration (not this project), the parameters will be: Property Name Type Default -------------- ---- ------- config/bridge string "" On a Nemo driver (physical device), legacy device, or aggregation, the link parameters are used as above. On a VLAN, the "bridge" parameter is reserved for use with MSTP, where it will select an instance. This parameter is not used on VNICs, as each VNIC is constructed atop a VLAN or regular datalink. 6. Relationship To Other Projects And Futures 6.1 Virtual Switches Several other projects, including Crossbow and LDOMs, have independently introduced "virtual switches" into Solaris. Though superficially similar, these are not the same thing as 802.1D bridges. The differences include: a. A virtual switch cannot forward between physical interfaces on a given machine. It lacks the learning and loop-avoidance (Spanning Tree) mechanisms necessary to do that. b. A virtual switch doesn't need a forwarding database. It simply looks up the unicast destination among the known clients (virtual NICs), and delivers to one of them if a match is found or to the single external link, if none is found. c. A virtual switch can optimize substantially for the case of known local MAC addresses (using multiple receive functions and hardware support in the MAC layer), and for driver-specific features such as hardware checksum. A bridge cannot do this, as it must listen in promiscuous mode at all times and must be able to transmit a packet on any interface regardless of hardware support. 
(IP would have to understand a per-MAC-address capability list rather than per-link in order to use hardware features.)

A useful analogy is that VNICs are the MAC layer equivalent of the L3 concept of IP aliases. They allow the user to create multiple MAC address instances on a single datalink, and, if on the same subnet, each instance can communicate with the external world through one datalink and with the others through internal loopback, but that internal communication is not the same thing as (or even related to) IP forwarding.

If one ignores the performance issues noted in (c) above, the basic forwarding and learning features of an 802.1D bridge are a functional superset of those provided by a "virtual switch," except that a real bridge has no way to add virtual links.

Again ignoring performance, it may be possible to replace these virtual switches with real bridge instances. The work required would include some way to configure "fake" links into a bridge. The "etherstub" feature provided by Crossbow may be one simple way to do that. Future projects may address this area; however, the expectation is that performance of a bridge configured for this case will be below that of a dedicated virtual switch.

6.2 Zones, Xen, and LDoms

Bridging is a feature similar to link aggregations in terms of its position in the network and usage in a data center. It will not be accessible from non-global zones (of any sort), but should be accessible from within virtualized environments such as Xen and LDoms.

When Xen or LDoms is used with virtual NICs, a bridge running inside the DomU will see a link that doesn't go down when the external link goes down. The configuration is similar to interposing a repeater between the actual Ethernet port and the internal virtual port, except that because the normal I/O path is obscured, the bridging daemon will not see the half-duplex state that an actual repeater would produce, and thus will not determine stp_p2p state properly.
Users attempting to run bridges in Xen DomU or under LDoms will need to force stp_p2p to "false" instead of "auto."

6.3 RBridges/TRILL

A future project will introduce RBridge support with TRILL encapsulation. This project will extend the *-bridge features proposed here for dladm, and will add new parameters to control TRILL and RBridge behavior.

6.4 Forwarding Tables

It may be useful to be able to display and manipulate the MAC-level forwarding tables used within bridging. This project does not define a mechanism to do this, but such a feature is likely to be included with RBridges. BSD uses protocol "bdg" in netstat to display forwarding tables, and it seems reasonable that we should do the same, though that is not part of this project.

6.5 "local-mac-address?"

When set to "false," SPARC platforms will errantly use the same MAC address on multiple ports. Most modern platforms use "true" by default, but a few older ones use "false" (and some completely obsolete ones can't use "true" at all due to non-IEEE compliant hardware limitations).

This shouldn't be a significant issue for bridging, as we can identify multiple local receivers for an inbound frame, but this will be included as a "don't do this" in the user documentation.

6.6 Kernel Integration

This project is affected by the Crossbow changes to Nemo/GLDv3, and is planned to integrate after that project.

The key portion of kernel integration work is with the MAC layer, where a bridge must function much like an aggregation, except that it does not open the links underneath exclusively and the node on top is for observability only. To implement the latter feature, we will make sure that "active" clients get an error when attempting to bind, just as is done today to prevent IP from being used on individual links within an aggregation.

In terms of processing order, input packets must be handled by 802.3ad aggregation first, then by bridging, then VLAN segregation, and finally VNIC matching.
Output packets go in the reverse order: VNIC matching, VLAN tagging, bridging, and finally 802.3ad load balancing.

In terms of functionality, bridging must limit the hardware features reported back to clients (such as IP) so that features that are not equivalent among all the links are not used, and it must set all of the active links into promiscuous mode.

Additional details are available in the project design documents.

6.7 Bridging Gaps

Some things will not be handled as well as they could be with this project due to resource constraints. Notable among these are a couple of bridge-related features.

a. Bridges should be able to preserve CRCs end-to-end and not regenerate them during forwarding. The IEEE specifications allow regeneration, but rightly note that it's safer to do incremental update (where possible). Doing this would require MAC layer extensions, and might not be possible with all network adapters.

b. When a link is disconnected from a bridge (through the administrative commands), it would be useful to force it into a link-down state externally, so that the link partner correctly detects the event and updates its state promptly. The Solaris MAC layer has no such feature, and adding one would involve extensive changes to a large number of drivers.

6.8 L2 Filters

It would be wise to prevent any L2 Filtering feature from ever dropping MAC layer control messages (such as Bridge PDUs) in order to avoid known pathological cases in Spanning Tree that result in network failure. No filter that drops frames addressed to the 01:80:c2:00:00:0x range (16 specific multicast addresses) should ever be permitted by the system.

6.9 Solaris Audit

All of the commands issued to the bridging daemon reflect the setting of parameters in dlmgmtd's datalink.conf storage or in SMF via configd. As a result, auditing changes is the responsibility of those components.
The bridging daemon itself is responsible only for protecting the integrity of the door-based interface so that processes calling it have the same privileges as those modifying the configuration parameters.

An existing issue here is that dlmgmtd does not audit parameter changes into local storage. This should be the subject of a separate project.

ARC Note: the above assertions may not be complete; the project team intends to consult with the Solaris Auditing team to make sure that the right events are audited.

7. Implementation Alternatives

7.1 Using A Separate Command

An alternative administrative command set design might be to create a new bridge control command (bridgeadm), rather than using dladm.

The main problem with this command separation is that the configuration of the bridge would end up being split between two different utilities in a somewhat incoherent manner. Why would IEEE 802 aggregations be part of dladm but IEEE 802 bridges be configured elsewhere?

Parts of the configuration of a bridge (such as the set of allowed VLANs and the default VLAN tag for a given link) are naturally part of the link configuration, and not a common property of the bridge. The creation of VLANs (logically located "above" links and bridges) and regular Ethernet links (logically located "below" VLANs and bridges) via dladm while bridging itself is in bridgeadm seems like a very strange result. It would be more natural only if VLAN and VNIC administration were in a separate command as well.

We could still create a separate bridgeadm despite the above conceptual problems, but then we'd likely have to deal with the VLAN issues (enabling and disabling forwarding for each VLAN) some other way. Most likely, we would end up with either duplicate configuration in bridgeadm or the bulk of bridge configuration actually going on in dladm per-link properties, and only bridge create/destroy done via bridgeadm.
While there are several IEEE-specified parameters for bridges, they're rarely of much interest, so that proposed separate utility wouldn't do very much in ordinary use.

The main thing users need to manipulate for bridges is the set of VLANs, currently a dladm object, and we need to figure out how to represent that manipulation. We have chosen to equate dladm-created-VLAN with bridge-allowed-VLAN because it seems to produce the most natural results: there's only one way to "create" or "destroy" a VLAN in the system.

The alternative is to break those apart, and allow users to create VLANs for potential use with IP via dladm (or some other command), and separately assign VLANs to bridge ports via bridgeadm, but that runs the very likely risk of misconfiguration: either forgetting to enable a bridge link for a VLAN while having IP plumbed atop, or thinking that destroying the VLAN removes it from the bridge. In any event, it creates multiple steps for the user to follow rather than one. Since neither of those misconfiguration scenarios seems to be particularly useful, allowing for them doesn't seem like a worthwhile goal.

Or, for a really short answer: dladm is the location of all things datalinkish, and bridging is (like VLANs and aggregations) a datalink function.

7.2 Link Configuration Storage

Alternative designs for the configuration information include having the set of links for a bridge listed as part of the bridge configuration, and using non-SMF files for storing configuration.

The former approach (putting a list of links in the bridge) would work, and would have the advantage that during start-up of the STP daemon it would be easy to find the list of links configured for that instance. That's a benefit over the proposed design, in which we will need to iterate over all links to get the list needed for a single instance. However, there are two reasons this approach wasn't chosen:

a. A link may be a member of at most one bridge.
This semantic is easy to enforce with a link property, as there's just one instance of the property, but is hard to enforce across multiple bridges. We end up needing to scan all bridge instances during configuration changes, and configuration transactions become more complex because two objects need to be changed at one time, so locking order matters.

b. We want to have all configuration parameters for a link stored with the link itself. Having parameters for a link stored elsewhere in the system means that utilities that manipulate links or just display system configuration may end up needing to scan through these other locations in order to make coherent system changes. (For this project, we would be forced to change the existing Clearview "dladm delete-link" functionality so that it scanned the bridge instances and removed any links found there. Storing the data with the link instance removes that requirement.)

The second approach of using non-SMF files would also work, and we could make use of the Clearview UV "link IDs" to avoid problems inherent with link renaming. However, longer term, the Clearview and NWAM teams are refactoring link configuration into SMF. Having native bridging designed for OpenSolaris but not actually integrated with its core administrative mechanisms seems like a poor recipe for the future. Placing these parameters with the rest of the link parameters means that when that day comes, the transition should be simple.

7.3 Obscuring Datalinks

It would be possible to make bridging's use of links be exclusive (much as is done with member links in an aggregation), and force IP to use virtual links on the bridge for access rather than using the underlying links.

Doing this would make bridging more like other solutions, but it would disable key features.
Users today can open raw DLPI devices and talk to a given datalink instance; that would go away when bridging is used, and users would be forced to rely on bridging to transport packets to the desired interfaces. In particular, there'd be no obvious way to use a non-forwarded (interface local) VLAN on any interface.

Doing this would also mean that VLAN membership would have to be configured on those exclusive-use links through some other parallel mechanism, resulting in the same sorts of problems as documented in 7.1 above.

8. Interface Summary

    Interface               Stability               Comments
    ---------               ---------               --------
    dladm *-bridge          Committed               new subcommands
    field names             Committed               dladm show-bridge -o
    link properties         Committed
    show-link BRIDGE        Committed               new field
    kstats                  Volatile                Should be raised later
    /dev/bridge/            Committed               Observability node
    control ioctls          Project Private
    /usr/lib/bridged        Project Private
    svc:/network/bridge     Committed               SMF URI
    config/*                Project Private         SMF properties
    bridge module           Project Private         Kernel bridging module
    /var/run/bridge_door/   Project Private         Doors interface to daemons
    librstp.so.1            Project Private         RSTP implementation
    mac, dls, dld           Consolidation Private   Kernel APIs
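
For illustration only, the input tag handling, link membership check, and output tag update described just before section 5 might be sketched in C as follows. This is not the project's implementation: the names link_cfg_t, frame_tag_t, allow_vlan, input_vlan, and output_tag are hypothetical, the allowed-VLAN bitmap is an assumed representation, and the use of default_tag as the input VLAN for untagged frames is an assumption, since that rule is not stated in this part of the document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-link VLAN settings; not the project's data structures. */
typedef struct {
    uint16_t default_tag;   /* 0 means no untagged forwarding on this link */
    uint32_t allowed[128];  /* assumed bitmap of allowed VLAN IDs 0..4095 */
} link_cfg_t;

/* Hypothetical view of a frame's 802.1Q tag. */
typedef struct {
    bool     tagged;        /* true if the frame carries a tag */
    uint16_t vid;           /* VLAN ID in the tag (0 = priority tag) */
    uint8_t  pri;           /* priority value; 0 when untagged */
} frame_tag_t;

static void allow_vlan(link_cfg_t *l, uint16_t vlan)
{
    l->allowed[vlan / 32] |= 1u << (vlan % 32);
}

static bool vlan_allowed(const link_cfg_t *l, uint16_t vlan)
{
    return (vlan < 4096) && (((l->allowed[vlan / 32] >> (vlan % 32)) & 1u) != 0);
}

/* Returns the input VLAN, or 0 if the frame must be ignored. */
static uint16_t input_vlan(const link_cfg_t *in, const frame_tag_t *t)
{
    uint16_t vlan;

    if (!t->tagged || t->vid == 0) {
        if (in->default_tag == 0)
            return 0;           /* no untagged forwarding is performed */
        vlan = in->default_tag; /* ASSUMPTION: not stated in this section */
    } else if (t->vid == in->default_tag) {
        return 0;               /* error case: explicit tag equals default_tag */
    } else {
        vlan = t->vid;          /* the input tag is the input VLAN */
    }
    /* Link membership check on the input link. */
    return vlan_allowed(in, vlan) ? vlan : 0;
}

/*
 * Applies the membership check and tag update for one output link.
 * Returns false if the frame must not be sent on that link.  When a
 * frame is forwarded to several links, call this once per link on a
 * separate copy of the tag state.
 */
static bool output_tag(const link_cfg_t *out, uint16_t vlan, frame_tag_t *t)
{
    assert(vlan != 0);          /* non-zero at this point */
    if (!vlan_allowed(out, vlan))
        return false;
    if (vlan == out->default_tag) {
        if (t->pri == 0) {
            t->tagged = false;  /* remove the tag, if any */
            t->vid = 0;
        } else {
            t->vid = 0;         /* keep the priority, zero the output tag */
        }
    } else {
        t->tagged = true;       /* add a tag if not currently present */
        t->vid = vlan;
    }
    return true;
}
```

A caller would run input_vlan() once on the receiving link, compute forwarding, and then call output_tag() for each candidate output link, dropping the frame on any link for which it returns false.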