Background
==========
Packet capture on Solaris is currently built around the use of DLPI.
Whilst the introduction of libdlpi (PSARC/2006/436[1]) has made it easier
to program using DLPI, and the IP Observability Project (PSARC/2006/475[2])
introduced the means by which packets that are local to the host could be
intercepted, neither did anything to address the primary problems with
DLPI: compared to other mechanisms, it is slow, its in-kernel filtering is
either not used or very primitive, and it provides very little useful
information about the packet capturing itself by way of statistics.

Introduction
============
The architecture of BPF[7] lends itself to a more efficient means of doing
packet capture, where a single read can transfer large numbers of packets
per call. It also allows the sniffer to choose how much data from each
packet it wishes to copy, be it the entire packet or just the first 128
bytes to capture headers.

Internal Architecture
~~~~~~~~~~~~~~~~~~~~~
Internally, the architecture of BPF is very simple: it has a lower half
that receives packets from the NIC drivers, copying matching packets into
a static buffer, and an upper half that implements a character
pseudo-device.

Buffers
-------
The backing for the character pseudo-device is a buffer allocated by the
driver for storing packet data. The size of this buffer is set by the
application. By default, libpcap sets it to the same size as the driver's
default: 32k. The maximum this project allows is 16M. Two buffers of this
size are allocated by the driver: an "active" buffer and a "hold" buffer.
This supports applications doing sleeping reads, if they aren't using
poll, and reading an entire buffer of data whilst the system continues to
catch new packets. Applications can set the buffer size using libpcap or
with the BIOCSBLEN ioctl (see man page.)

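The "active"/"hold" arrangement described under Buffers can be sketched in
user-space C as follows. This is only an illustration under stated
assumptions: struct capture_desc, capture_init(), capture_store() and
capture_read() are hypothetical names invented for the sketch and are not
the driver's data structures or entry points, and locking and sleeping are
omitted entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a per-open descriptor; not the real bpf_d. */
struct capture_desc {
    unsigned char *active;   /* buffer currently receiving packets */
    unsigned char *hold;     /* full buffer waiting to be read, or NULL */
    unsigned char *free_buf; /* spare buffer, NULL while one is held */
    size_t active_len;       /* bytes used in the active buffer */
    size_t hold_len;         /* bytes available in the hold buffer */
    size_t buf_size;         /* size of each of the two buffers */
    unsigned long dropped;   /* packets dropped for lack of room */
};

/* The active buffer becomes the hold buffer; the spare becomes active. */
static void rotate(struct capture_desc *d)
{
    assert(d->free_buf != NULL && d->hold == NULL);
    d->hold = d->active;
    d->hold_len = d->active_len;
    d->active = d->free_buf;
    d->active_len = 0;
    d->free_buf = NULL;
}

void capture_init(struct capture_desc *d, unsigned char *a,
    unsigned char *b, size_t size)
{
    d->active = a;
    d->free_buf = b;
    d->hold = NULL;
    d->active_len = 0;
    d->hold_len = 0;
    d->buf_size = size;
    d->dropped = 0;
}

/* Lower half: store one captured packet. Returns 1 if stored, 0 if dropped. */
int capture_store(struct capture_desc *d, const void *pkt, size_t len)
{
    if (len > d->buf_size) {
        d->dropped++;
        return 0;
    }
    if (d->active_len + len > d->buf_size) {
        if (d->free_buf == NULL) {   /* hold buffer not yet read */
            d->dropped++;
            return 0;
        }
        rotate(d);
    }
    memcpy(d->active + d->active_len, pkt, len);
    d->active_len += len;
    return 1;
}

/* Upper half: hand one whole buffer of data to the reader. */
size_t capture_read(struct capture_desc *d, void *out, size_t out_size)
{
    size_t n;

    if (d->hold == NULL) {
        if (d->active_len == 0)
            return 0;                /* nothing captured yet */
        rotate(d);
    }
    n = d->hold_len;
    if (n > out_size)
        return 0;                    /* caller's buffer is too small */
    memcpy(out, d->hold, n);
    d->free_buf = d->hold;           /* buffer is reusable again */
    d->hold = NULL;
    d->hold_len = 0;
    return n;
}
```

In this sketch a reader always receives a whole buffer, and capture only
starts dropping packets when the active buffer is full and the hold buffer
has not yet been read.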
List of Interfaces
------------------
BPF maintains an internal list of the network interfaces on which it
supports capturing packets. What distinguishes this list from those in
the mac and ip modules is that it uses the datalink type as part of the
key for determining what is an identical entry. Additionally, on
OpenSolaris the device structure used inside the ip module is different
to that of the mac module, preventing either one from being used as a
master list by BPF. Answering queries such as returning the complete list
of datalink types supported by a device (BIOCGDLTLIST) would be much more
complicated without that internal list.

Packet Capture
--------------
When BPF is called from the mac layer, it is handed the packet as it is
received from the NIC driver, as part of the promiscuous callback handling
in the mac layer. It is the same mblk_t for the packet that will later be
passed on through the stack, and neither the mblk_t's nor the dblk_t's are
duplicated. Thus the capturing of the packet becomes part of the execution
of the datapath for each packet.

Interactions with existing technology in Solaris
================================================
This section goes into detail about what impact this project has on other
areas of Solaris or what impact they have on this project.

Vanity Naming
~~~~~~~~~~~~~
The Vanity Naming Project[6] introduced the means by which link names
could be changed to be different from the underlying mac name. This
project will only support packet capture on interfaces using the interface
name allocated by the dls module that was delivered by the vanity naming
project.

IP Observability
~~~~~~~~~~~~~~~~
The IP Observability project introduced the ability to capture packets
from within IP, presenting them through device files in /dev/ipnet for
libdlpi to use. This project will update some of the interfaces introduced
by IP Observability.

Updating IPNET
--------------
In the IP Observability project[2], a new way to capture packets on
Solaris was introduced. The packets that are observed on network devices
in /dev/ipnet have a structure prepended to them that is described by
ipnet(7d). It was introduced as "version 1" of IPNET.

This project proposes to obsolete version 1 of IPNET and to replace
it with "version 2". This is perceived as being feasible due to the
IPNET largely being an internal interface that has no external use yet.

In addition this project will update the internal mechanism used to
supply packets to interested hooks internally to use netinfo
(PSARC/2008/219[3]). This change will allow BPF to easily subscribe
to the IPNET packet events and provide packets to applications.

A single, new, event will be added for packets supplied using this
mechanism - NH_OBSERVING.

Callbacks that are activated with this event will receive a pointer to
a hook_pkt_observe_t structure as the hook_data_t value.

Zones
~~~~~
The following table details what type of access is available from
a zone on Solaris. The three zone categories are "Global" (the
global zone), "Shared" (local zone shares its networking instance
with the global zone) and "Exclusive" (local zone has its own
instance of networking.)

                         | Shared | Exclusive  |   Global   |
-------------------------+--------+------------+------------+
DLT_IPNET in local zone  |  Read  |    Read    |    Read    |
-------------------------+--------+------------+------------+
Raw access to zone's NIC |  None  | Read/Write | Read/Write |
-------------------------+--------+------------+------------+
Raw access to all NICs   |  None  |    None    | Read/Write |
-------------------------+--------+------------+------------+

mac networking layer
~~~~~~~~~~~~~~~~~~~~
The design of BPF on BSD places it in the MAC layer where it has easy
access to functions to enable/disable promiscuous mode and the data
structures used to represent network interfaces.

Promiscuous callbacks
---------------------
This project does not add or change any of the existing promiscuous
callbacks that exist in the mac module.

On the receive side, the promiscuous callbacks have been placed in the
mac module early in the receive path, before any classification work is
done on the packet. This means that packet sniffing will happen at line
rate but still in accordance with how the NIC has been programmed to queue
packets on its rings.

On the transmit side, the promiscuous callbacks are activated before a
packet is classified and put in the appropriate descriptor ring for
transmission.

Existing Interfaces (mac layer)
-------------------------------
With the mac layer presented by Crossbow, most of the necessary features
are provided, albeit via private interfaces:

dls_mgmt_get_linkid
mac_client_close
mac_client_handle_t
mac_client_name       - returns the name associated with the mac_client
                        object
mac_client_open
mac_handle_t          - value returned by a successful call to mac_open
                        that is used with calls to other functions in
                        the mac layer
mac_multicast_add     - add another multicast address to the interface
mac_multicast_remove  - remove a multicast address from the interface
mac_name              - returns the name of the MAC device (can be
                        different to the name returned by
                        mac_client_name)
mac_open
mac_promisc_add       - enable receipt of packets "promiscuously" via
                        callbacks
mac_promisc_handle_t
mac_promisc_remove    - disable callbacks for packets
mac_sdu_get           - returns the MTU for a MAC
mac_tx                - deliver a packet

New Interfaces (dls layer)
--------------------------
BPF requires notification about each NIC that gets added, its datalink
type and the length of its hardware address. It uses this information to
maintain an internal list of network interfaces that are available for
capturing packets on. To support this, it is necessary to call into BPF
from dls inside of dls_devnet_create() and dls_devnet_destroy(). This is
done using function pointers that are set by calling a new function called
dls_set_bpfattach(). This function will take two parameters: one is
the function to call when attaching (dls_devnet_create), the other is the
function to call when detaching (dls_devnet_destroy). To account for the
fact that drivers may have already called dls_devnet_create() before
dls_set_bpfattach() is called, dls_set_bpfattach() will walk through all
existing attached devices and call bpfattach() for each one of those.

This approach is necessary due to the lack of a mechanism equivalent to
netinfo's NE_PLUMB/NE_UNPLUMB events for IP, which will be used to
support IPNET.

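The registration scheme described above for dls_set_bpfattach() can be
modelled in a few lines of user-space C. This is a sketch only: the
function names devnet_create(), devnet_destroy() and set_bpfattach(), the
callback signature, the fixed-size link table and the counting callbacks
are all hypothetical stand-ins, and the real kernel functions take
different arguments and hold locks that are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LINKS 8   /* arbitrary limit for this sketch */

/* Hypothetical callback type; the real signature is not given here. */
typedef void (*link_cb_t)(const char *name);

static const char *links[MAX_LINKS];   /* links created so far */
static size_t nlinks;
static link_cb_t attach_cb, detach_cb; /* set by set_bpfattach() */

/* Stand-in for dls_devnet_create(): record the link, then notify. */
int devnet_create(const char *name)
{
    if (nlinks == MAX_LINKS)
        return -1;
    links[nlinks++] = name;
    if (attach_cb != NULL)
        attach_cb(name);
    return 0;
}

/* Stand-in for dls_devnet_destroy(): notify, then forget the link. */
void devnet_destroy(const char *name)
{
    for (size_t i = 0; i < nlinks; i++) {
        if (strcmp(links[i], name) == 0) {
            if (detach_cb != NULL)
                detach_cb(name);
            links[i] = links[--nlinks];
            return;
        }
    }
}

/*
 * Stand-in for dls_set_bpfattach(): remember both callbacks and replay
 * the attach callback for every link that already exists.
 */
void set_bpfattach(link_cb_t attach, link_cb_t detach)
{
    attach_cb = attach;
    detach_cb = detach;
    if (attach == NULL)
        return;
    for (size_t i = 0; i < nlinks; i++)
        attach(links[i]);
}

/* Example consumer callbacks that simply count notifications. */
static int attach_count, detach_count;

void count_attach(const char *name)
{
    (void)name;
    attach_count++;
}

void count_detach(const char *name)
{
    (void)name;
    detach_count++;
}
```

The replay loop in set_bpfattach() is the part that covers links created
before the consumer registered; without it, those links would never be
reported.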
Whilst the interface mentioned here is private, it's listed here for
completeness.

dls_set_bpfattach - set function pointers to be called when drivers call
                    dls_devnet_create

New Interfaces (mac)
--------------------
An additional change is required in the mac module to improve
performance: the introduction of a new promiscuous callback flag,
MAC_PROMISC_FLAGS_NO_COPY. By default, a callback registered with the mac
module using mac_promisc_add receives a copy of every packet, made with
copymsg(). When only peeking at packets, as BPF does, it isn't necessary
to create a new mblk_t because BPF does not perform any destructive
operations on the packet. When this flag is presented together with
MAC_PROMISC_FLAGS_VLAN_TAG_STRIP, the behaviour is to fall back to doing
the copymsg().

MAC_PROMISC_FLAGS_NO_COPY - promiscuous callback is well behaved and
                            does not need a copy of the packet

Two additional accessor functions for the mac_impl_t are being added,
mac_addr_length() to return the MAC address length and mac_type() to
return the MAC type. Both values come from the mac plugin used by the
mac in question. Both functions have been designed to return the
respective value, rather than use pass by reference.

libpcap
~~~~~~~
This project will update the copy of libpcap delivered by
PSARC/2008/288[5] to use the BPF interface delivered with this project.

A follow on project may look at extending the libpcap filtering language
to easily allow filtering on the IPNET header fields.

snoop
~~~~~
While the implementation of snoop will continue to use DLPI via libdlpi
to retrieve packets, it will be updated to understand version 2 of the
IPNET packet headers and by default will use this version when
communicating with devices in /dev/ipnet. Support for understanding the
existing version 1 headers will remain.

libnet
~~~~~~
The implementation of libnet delivered into the SFW consolidation by
PSARC/2008/409 will not be modified as a part of this case.

etherstubs
~~~~~~~~~~
When an etherstub is being used as the locus for a vnic, it will be
possible to see the network traffic on the vnic that:
- is broadcast or multicast at the link layer
- is moving between zones that are using vnic's on top of the etherstub
  driver to support an exclusive instance of IP

It will not be possible to see link layer traffic for IP traffic between
two local zones that are using a shared stack, or even multiple vnics in
the global zone. To observe that traffic, the IPNET link layer must be
specified.

New Interfaces
==============
This section looks at each of the new interfaces being introduced. Those
that are described in the man page, bpf.7d, are not discussed here.

Loopback DLT type
~~~~~~~~~~~~~~~~~
With the provision of access to loopback data by adapting the mechanism
used for ipnet, access to packets on the loopback interface, as well as
those inside of IP moving between zones using a shared stack instance
model, can be achieved.

For each network interface that is used with the IP protocols, a tap
point will be created with a new datalink type. The datalink
type for this will be DLT_IPNET. Both the structure contents and
datalink type name will be registered with the tcpdump project[4].
The packet header structure used with DLT_IPNET will be the same
as used with version 2 of the IP Observability devices in the
/dev/ipnet directory. The proposed structure will be called
dl_ipnetinfo_v2_t and the details pertaining to it can be found
below.

/dev/bpf
~~~~~~~~
This driver ships using /dev/bpf as the device file for applications to
open. Whilst the device cannot be created as a clone (it is not a STREAMS
driver), it is still possible for the driver to assign a new minor number
each time the device is opened.
Thus even though the driver isn't a clone driver, it is not necessary to
have /dev/bpf0-15 for proper BPF semantics.

driver.conf file
~~~~~~~~~~~~~~~~
This project intends to use the driver.conf file as the means by which
the default and maximum buffer sizes can be changed. By default these are
32k and 16M respectively. They can be changed using the names "buf_size"
and "max_buf_size". It is not envisaged that these will ever need
changing, but scope is provided for those that either need or wish to.

Whilst there are numerous new avenues being explored for datalink and IP
administration, it needs to be remembered that whilst this device is
centered around networking, it is neither a datalink nor an IP interface
and thus isn't managed by dladm and friends. The need to change
(increase) the value from that shipped is expected to be a rare event.


~~~~~~~~~~~
This file contains all of the structure and ioctl definitions that make
up the programming interface for BPF. Four of the ioctls listed below are
being introduced as "Project Private" as they form the foundation for
supporting 32bit applications running against 64bit kernels. This is
necessary because some of the structures exchanged between the bpf driver
and applications contain pointers.


~~~~~~~~~~~~~~~
This file contains the definitions for structures used in the kernel and
is thus private to the project.

Observability
~~~~~~~~~~~~~

dtrace probes
-------------
This case will introduce one new SDT dtrace probe, "bpf-capture", that is
uncommitted.
The arguments for this probe are:

Arg#  Type             Comment
----  ---------------  -----------------------------------------------
0     struct bpf_if *  pointer to the BPF interface structure
1     struct bpf_d *   pointer to the BPF descriptor holding all of
                       the application relevant information
2     void *           pointer to buffer holding the packet
3     u_int            actual packet length
4     u_int            length to copy into capture buffer
                       (0 == packet did not match the filter)

kstats
------
A single set of statistics will be exported via kstats that cover the
operation of the entire BPF module. The kstats will be installed under a
new module name, "bpf", with the name "global" and class "misc". The
kstats that will be introduced are:

receive    - the number of packets that are passed into the BPF filter
capture    - the number of packets successfully captured and written
             into a capture buffer
dropped    - the number of captured packets that are not recorded by BPF
             because there is not enough room left in the capture buffer
readWait   - this counter is bumped up every time read(2) is called and
             the BPF driver has to wait for data
writeOk    - a packet was successfully delivered to the mac layer for
             transmission by the driver
writeError - an error occurred trying to write the packet via mac_tx

The first three, receive, capture and dropped, are module wide
accumulations of the per open file descriptor counters that are kept.

Manual page
~~~~~~~~~~~
The manual page for bpf will be delivered to section 7d. A copy from BSD
is provided with this case.

Exported Interface Table
~~~~~~~~~~~~~~~~~~~~~~~~
+---------------------------+-----------------------+-------------+
| Interface                 | Commitment            | Comments    |
+---------------------------+-----------------------+-------------+
| usr/kernel/drv/bpf        | Project Private       |             |
| usr/kernel/drv/bpf.conf   | Project Private       |             |
+---------------------------+-----------------------+-------------+
| /dev/bpf                  | Uncommitted           |             |
|                           | Committed             |             |
|                           | Project Private       |             |
+---------------------------+-----------------------+-------------+
| BPF_MAJOR_VERSION         | Committed             |             |
| BPF_MINOR_VERSION         | Committed             |             |
| BIOCGBLEN                 | Committed             |             |
| BIOCSBLEN                 | Committed             |             |
| BIOCSETF                  | Committed             |             |
| BIOCFLUSH                 | Committed             |             |
| BIOCPROMISC               | Committed             |             |
| BIOCGDLT                  | Committed             |             |
| BIOCGETIF                 | Committed             |             |
| BIOCSETIF                 | Committed             |             |
| BIOCGETLIF                | Committed             |             |
| BIOCSETLIF                | Committed             |             |
| BIOCSORTIMEOUT            | Committed             |             |
| BIOCGORTIMEOUT            | Committed             |             |
| BIOCGSTATS                | Committed             |             |
| BIOCIMMEDIATE             | Committed             |             |
| BIOCVERSION               | Committed             |             |
| BIOCSTCPF                 | Committed             |             |
| BIOCSUDPF                 | Committed             |             |
| BIOCGHDRCMPLT             | Committed             |             |
| BIOCSHDRCMPLT             | Committed             |             |
| BIOCSDLT                  | Committed             |             |
| BIOCGDLTLIST              | Committed             |             |
| BIOCGSEESENT              | Committed             |             |
| BIOCSSEESENT              | Committed             |             |
| BIOCSRTIMEOUT             | Committed             |             |
| BIOCGRTIMEOUT             | Committed             |             |
| BIOCSETF32                | Project Private       |             |
| BIOCGDLTLIST32            | Project Private       |             |
| BIOCSRTIMEOUT32           | Project Private       |             |
| BIOCGRTIMEOUT32           | Project Private       |             |
+---------------------------+-----------------------+-------------+
| struct bpf_dltlist        | Committed             |             |
| struct bpf_hdr            | Committed             |             |
| struct bpf_insn           | Committed             |             |
| struct bpf_program        | Committed             |             |
| struct bpf_stat           | Committed             |             |
| struct bpf_timeval        | Committed             |             |
| struct bpf_version        | Committed             |             |
+---------------------------+-----------------------+-------------+
| NH_OBSERVING              | Committed             |             |
| hook_pkt_observe_t        | Committed             |             |
| DLT_IPNET                 | Committed             |             |
| dl_ipnetinfo_v2_t         | Committed             |             |
+---------------------------+-----------------------+-------------+
| bpf-capture               | Uncommitted           | SDT probe   |
+---------------------------+-----------------------+-------------+
| MAC_PROMISC_FLAGS_NO_COPY | Consolidation Private |             |
| dls_set_bpfattach         | Project Private       |             |
| mac_addr_length           | Consolidation Private |             |
| mac_type                  | Consolidation Private |             |
+---------------------------+-----------------------+-------------+

Structure Definitions
~~~~~~~~~~~~~~~~~~~~~

struct bpf_dltlist
------------------
struct bpf_dltlist {
        u_int   bfl_len;        /* number of entries in bfl_list array */
        u_int   *bfl_list;      /* array of DLTs */
};

struct bpf_hdr
--------------
struct bpf_hdr {
        struct bpf_timeval bh_tstamp;   /* time stamp */
        uint32_t        bh_caplen;      /* length of captured portion */
        uint32_t        bh_datalen;     /* original length of packet */
        uint16_t        bh_hdrlen;      /* length of bpf header (this struct
                                           plus alignment padding) */
};

struct bpf_insn
---------------
struct bpf_insn {
        uint16_t  code;         /* Instruction */
        u_char    jt;           /* Jump true */
        u_char    jf;           /* Jump false */
        uint32_t  k;            /* space for constant */
};

struct bpf_stat
---------------
struct bpf_stat {
        uint64_t bs_recv;       /* number of packets received */
        uint64_t bs_drop;       /* number of packets dropped */
        uint64_t bs_capt;       /* number of packets captured */
        uint64_t bs_padding[13];
};

struct bpf_timeval
------------------
struct bpf_timeval {
        int32_t tv_sec;
        int32_t tv_usec;
};

struct bpf_version
------------------
struct bpf_version {
        u_short bv_major;
        u_short bv_minor;
};

struct bpf_program
------------------
struct bpf_program {
        u_int bf_len;                   /* length of program to load */
        struct bpf_insn *bf_insns;      /* pointer to program to load */
};

Structure supplied with NH_OBSERVING events
-------------------------------------------
typedef struct hook_pkt_observe_s {
        uint8_t         hpo_version;
        uint8_t         hpo_family;
        uint16_t        hpo_htype;
        uint32_t        hpo_pktlen;
        uint32_t        hpo_ifindex;
        uint32_t        hpo_grifindex;
        uint32_t        hpo_zsrc;
        uint32_t        hpo_zdst;
        mblk_t          *hpo_pkt;
} hook_pkt_observe_t;

hpo_family    - protocol family (AF_INET/AF_INET6)
hpo_zsrc      - zone identifier for the source of the packet
hpo_zdst      - zone identifier for the destination of the packet
hpo_ifindex   - interface index number
hpo_grifindex - group interface index number (for IPMP interfaces)
hpo_htype     - hook type (in, out, local)
hpo_pkt       - start of the mblk_t chain with the packet

struct dl_ipnetinfo {
        uint8_t         dli_version;
        uint8_t         dli_family;
        uint16_t        dli_htype;
        uint32_t        dli_pktlen;
        uint32_t        dli_ifindex;
        uint32_t        dli_grifindex;
        uint32_t        dli_zsrc;
        uint32_t        dli_zdst;
};
typedef struct dl_ipnetinfo dl_ipnetinfo_t;

dli_version   - version number (2)
dli_family    - protocol family (AF_INET, AF_INET6, etc)
dli_htype     - hook type (in, out, local)
dli_pktlen    - length of the packet excluding this header
dli_ifindex   - interface index number
dli_grifindex - group interface index number (for IPMP interfaces)
dli_zsrc      - zone identifier for the source of the packet
dli_zdst      - zone identifier for the destination of the packet

Imported Interface Table
~~~~~~~~~~~~~~~~~~~~~~~~
+---------------------------+
| Interface                 |
+---------------------------+
| dls_mgmt_get_linkid       |
| mac_client_close          |
| mac_client_handle_t       |
| mac_client_name           |
| mac_client_open           |
| mac_handle_t              |
| mac_multicast_add         |
| mac_multicast_remove      |
| mac_name                  |
| mac_open                  |
| mac_promisc_add           |
| mac_promisc_handle_t      |
| mac_promisc_remove        |
| mac_sdu_get               |
| mac_tx                    |
| mac_perim_enter_by_mh     |
| mac_perim_exit            |
| dls_devnet_macname2linkid |
+---------------------------+
| MAC_PROMISC_FLAGS_NO_PHYS |
| MAC_CLIENT_PROMISC_ALL    |
+---------------------------+

References
==========
[1] http://sac.eng.sun.com/sac/PSARC/2006/436
[2] http://sac.eng.sun.com/sac/PSARC/2006/475,
    http://opensolaris.org/os/project/clearview/ipnet/
[3] http://arc.opensolaris.org/caselog/PSARC/2008/219
[4] http://www.tcpdump.org/
[5] http://arc.opensolaris.org/caselog/PSARC/2008/288
[6] http://opensolaris.org/os/project/clearview/uv/
[7] http://en.wikipedia.org/wiki/Berkeley_Packet_Filter