1. What specifically is the proposal that we are reviewing?

   - What is the technical content of the project?

     This project delivers the resource management feature originally defined in PSARC/2004/253 "Advanced DDI Interrupt Functions" with minor changes. It provides a mechanism for device drivers to get more interrupt vectors (and increased performance). When a driver participates in the new feature, its number of available interrupts becomes dynamic and can increase or decrease.

     One goal is to maximize utilization of interrupt vectors in a fair manner. Another goal is to avoid exhausting the finite number of interrupts in the system so that new devices can still be hotplugged. The project rebalances the interrupts given to other devices when devices are added or removed.

     The project defines new DDI interfaces for device drivers to register and unregister a generic callback handler. A registered callback is required for a driver to be notified when more or fewer interrupts are available.

     The project also defines how a driver specifies how many interrupts it wants from the system. During its attach(9E) routine, when a driver initializes its interrupt handling, it requests an initial number of vectors through ddi_intr_alloc(9F). At any later time, if the needs of the driver change, a new DDI function is provided for the driver to explicitly change its total number of requested interrupt vectors.

     The project provides new NDI interfaces for nexus drivers to define the supplies of interrupt vectors that exist in the system, and a new bus nexus interrupt operation (INTROP) command to map devices to their corresponding interrupt supplies.

   - Is this a new product, or a change to a pre-existing one? If it is a change, would you consider it a "major", "minor", or "micro" change?

     This project is a new feature. But it is also a minor change to interfaces in PSARC/2004/253 "Advanced DDI Interrupt Functions" which were previously approved, but never implemented.
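As a rough illustration of the driver-side flow described under the technical content above, the following userland C model sketches how a participating driver might track its request, its current availability, and its allocated vectors, and how it might react to a callback. Every name in it (model_driver_t, model_alloc, model_free, model_callback, MODEL_INTR_ADD, MODEL_INTR_REMOVE) is hypothetical and only stands in for the kernel interfaces; the real signatures and ddi_cb_action_t values are defined by the project's specification, not by this sketch. The model also folds the framework's update of the availability limit into the callback itself, which is a simplifying assumption.

```c
#include <assert.h>

/* Hypothetical callback actions; stand-ins for ddi_cb_action_t values. */
typedef enum { MODEL_INTR_ADD, MODEL_INTR_REMOVE } model_action_t;

/* Hypothetical per-driver bookkeeping, not a kernel structure. */
typedef struct model_driver {
    int nreq;    /* vectors the driver wants (cf. ddi_intr_set_nreq) */
    int navail;  /* upper limit currently granted by the framework */
    int nintrs;  /* vectors actually allocated (cf. ddi_intr_alloc) */
} model_driver_t;

/* Allocate up to `want` more vectors, never exceeding navail.
 * Returns how many vectors were actually allocated. */
int model_alloc(model_driver_t *d, int want)
{
    int room = d->navail - d->nintrs;
    int got = (want < room) ? want : room;

    if (got < 0)
        got = 0;
    d->nintrs += got;
    return got;
}

/* Release `count` previously allocated vectors (cf. ddi_intr_free). */
void model_free(model_driver_t *d, int count)
{
    assert(count >= 0 && count <= d->nintrs);
    d->nintrs -= count;
}

/* Callback handler: grow toward nreq when vectors are added, and
 * release any excess when vectors are taken back. */
int model_callback(model_driver_t *d, model_action_t action, int count)
{
    switch (action) {
    case MODEL_INTR_ADD:
        d->navail += count;
        (void) model_alloc(d, d->nreq - d->nintrs);
        break;
    case MODEL_INTR_REMOVE:
        assert(count <= d->navail);
        d->navail -= count;
        if (d->nintrs > d->navail)
            model_free(d, d->nintrs - d->navail);
        break;
    }
    return 0;
}
```

In this model, a driver that wants 32 vectors but attaches while only 2 are available allocates 2, grows to 32 after an add notification for 30 more, and shrinks to 16 after a remove notification for 16. The release in the remove case is voluntary, mirroring the point that it is up to the driver to free interrupts when asked.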
   - If your project is an evolution of a previous project, what changed from one version to another?

     This project changes the resource management feature originally architected. The callback mechanism is now more generic, instead of being for exclusive use by this feature. And the callback mechanism cannot be temporarily disabled as originally proposed. Finally, the original architecture did not explicitly define how a driver specified or changed its total number of requested interrupts.

   - What is the motivation for it, in general as well as specific terms?

     Drivers can request large numbers of interrupts depending on what their devices support. The sum of all requests in the system could exceed the total number available in the system. Therefore, the current solution in Solaris is to conservatively give interrupts to drivers to avoid exhausting the supply. But Solaris is too conservative, and only gives 2 interrupts per device, thus putting a limit on I/O performance.

     This feature removes that performance limitation by allowing drivers to get more interrupts, potentially receiving their requests in full. But the new callback mechanism and the ability to rebalance the allocations prevent resource exhaustion, because previously given interrupts can be taken back when necessary to allow new devices to attach.

   - What are the expected benefits for Sun?

     Increased I/O performance, but depending upon device drivers properly requesting the right numbers of interrupts and using them efficiently.

   - By what criteria will you judge its success?

     The success of this project is judged by how well the framework dispenses interrupt vectors to drivers in response to what they requested, but only in terms of raw numbers of interrupt vectors. When drivers consume the interfaces, their own individual performance tests will further judge the success of how well they requested and utilized their interrupts.

2. Describe how your project changes the user experience, upon installation and during normal operation.

   - What does the user perceive when the system is upgraded from a previous release?

     N/A

3. What is its plan?

   - What is its current status? Has a design review been done? Are there multiple delivery phases?

     The project only has 1 phase. Delivery to S11 and S10U8 is planned. A design review has been completed by the project team internally, and by driver developers who would consume the interfaces.

4. Are there related projects in Sun?

   - If so, what is the proposal's relationship to their work? Which not-yet-delivered Sun (or non-Sun) projects (libraries, hardware, etc.) does this project depend upon? What other projects, if any, depend on this one?

     A follow-on project exists to convert the existing Atlas/Neptune driver to consume the interfaces from this project. The driver currently uses PSARC/2007/453 "MSI-X interrupt limit override" as a workaround, but it will be changed. The workaround from PSARC/2007/453 is still preserved by this project to accommodate drivers not yet converted.

   - Are you updating, copying or changing functional areas maintained by other groups? How are you coordinating and communicating with them? Do they "approve" of what you propose? If not, please explain the areas of disagreement.

     No.

5. How is the project delivered into the system?

   - Identify packages, directories, libraries, databases, etc.

     The new DDI and NDI interfaces are in the kernel, and are delivered in existing packages. And a new MDB module for debugging is also added to existing packages. The affected packages are:

       - SUNWcakr.i
       - SUNWcakr.u
       - SUNWcakr.v
       - SUNWmdb
       - SUNWmdbr

6. Describe the project's hardware platform dependencies.

   - Explain any reasons why it would not work on both SPARC and Intel?

     The project is only for drivers that use MSI-X interrupts.
     And the project only works when device drivers using the new DDI interfaces are used in combination with nexus drivers using the new NDI interfaces.

     The project provides an implementation in the SPARC PCIe nexus drivers to enable support of this feature for any properly modified drivers consuming MSI-X interrupts attached to SPARC based systems with PCIe buses.

     A follow-on project will later add an implementation for I/O buses on Intel platforms. But that will be part of a much larger project that first must redesign the low-level interrupt handling on Intel, to remove existing limitations. The use of per-CPU vector tables (instead of a single global table shared by all CPUs), and changes to priority assignments to interrupt vectors will be the bulk of that project. Then implementing use of the new NDI interfaces from this project will be a simple step to then enable interrupt resource management features on Intel platforms.

7. System administration

   - How will the project's deliverables be installed and (re)configured?

     Core Solaris packages, during OS install.

   - How will the project's deliverables be uninstalled?

     N/A

   - Does it use inetd to start itself?

     No.

   - Does it need installation within any global system tables?

     No.

   - Does it use a naming service such as NIS, NIS+ or LDAP?

     No.

   - What are its on-going maintenance requirements (e.g. Keeping global tables up to date, trimming files)?

     N/A

   - How does this project's administrative mechanisms fit into Sun's system administration strategies? E.g., how does it fit under the Solaris Management Console (SMC) and Web-Based Enterprise Management (WBEM), how does it make use of roles, authorizations and rights profiles? Additionally, how does it provide for administrative audit in support of the Solaris BSM configuration?

     N/A

   - What tunable parameters are exported? Can they be changed without rebooting the system? Examples include, but are not limited to, entries in /etc/system and ndd(8) parameters.
     What ranges are appropriate for each tunable? What are the commitment levels associated with each tunable (these are interfaces)?

     Two tunables are exported in /etc/system. Both are CONSOLIDATION PRIVATE. The tunables are:

       - int irm_enable;
       - int irm_default_policy;

     The variable 'irm_enable' can be used to globally disable the use of the feature, if set to 0. By default this tunable is set to 1, to enable the use of IRM.

     The variable 'irm_default_policy' can be used to select the algorithm used to rebalance interrupt pools. Setting it to 1 selects the use of the "Reduce Large Requests First" algorithm described in the Design Specifications. Setting it to 2 selects the other algorithm described as "Reduce Evenly." By default this tunable is set to 1.

     Changes to these tunables will not take effect unless the system is rebooted. These parameters cannot be dynamically tuned on a running system.

8. Reliability, Availability, Serviceability (RAS)

   - Does the project make any material improvement to RAS?

     No.

   - How can users/administrators diagnose failures or determine operational state? (For example, how could a user tell the difference between a failure and very slow performance?)

     The failure scenario is related to a driver not correctly implementing its callback feature. If a driver does not release interrupts properly in response to a notification, then the rebalancing of an interrupt pool will not work properly. The result would be that newly attaching drivers would not be able to get their proper share of MSI-X interrupts.

     Drivers could adapt by using FIXED interrupts instead, which could lead to lower performance. That could be detected using the existing MDB command "::interrupts" to see what types of interrupts are used by the device. Or the drivers could fail to attach altogether, in which case the administrator would encounter failures to attach the new drivers during a hotplug operation.

   - What are the project's effects on boot time requirements?
     The project utilizes background threads to maintain the balance of each interrupt pool. The threads remain idle until signalled to wake up and rebalance in response to a configuration change (e.g. due to a driver changing its request size, or drivers being hotplugged).

     Until these threads are spawned, drivers will not be given more than a small default number of interrupts. (The same number they would be given if they did not use the new feature.) And no balancing will occur either, so drivers will not receive any callbacks yet.

     Spawning the threads is deferred until after the system is booted. So no extra load on the system exists during boot time, no balancing operations occur, and no dynamic changes to interrupt handling by device drivers occur. But once the system is booted, there will be one final burst of balancing and callback notifications to drivers to dispense additional interrupts.

   - How does the project handle dynamic reconfiguration (DR) events?

     Whenever a device is added or removed (by either a DR or hotplug operation), if the device was associated with a defined interrupt pool, the interrupt pool may need to be rebalanced. Interrupts previously given to a detaching device may be redistributed to other remaining devices. Or interrupts may need to be taken back from other devices for the sake of a newly attaching device. So there may be callbacks to seemingly unrelated driver instances as a consequence of doing DR or hotplugging.

     These rebalancing operations only occur if the pressure on the interrupt pool is very high. That is, if requests for interrupts from device drivers exceed the total number of interrupts available in the system. If the interrupt pools are not fully utilized, then no rebalancing occurs and simple mathematical operations on counters in each interrupt pool are done to quickly adjust the interrupt usage accounting.

   - What mechanisms are provided for continuous availability of service?
     The callback mechanism specifically facilitates DR and hotplugging so that I/O subsystems can be serviced even while interrupt utilization is very high.

   - Does the project call panic()? Explain why these panics cannot be avoided.

     No.

   - How are significant administrative or error conditions transmitted? SNMP traps? Email notification?

     If a driver does not correctly respond to a callback notification to release interrupts, the following warning message will be sent to the console:

       WARNING: : failed to release interrupts for IRM (nintrs = ##, navail=##).

     If any subsequent devices fail to attach due to a lack of interrupts, those failures can then be traced back to the offending driver that failed to release interrupts in response to its callback.

   - How does the project deal with failure and recovery?

     The project only manages the number of interrupts that should be available to drivers, setting the upper limit on what a driver can then allocate through ddi_intr_alloc(9F). Once a driver allocates an interrupt, it cannot be forcibly taken back. A callback notice may ask the driver to release some of its interrupts, but it is up to the driver to call ddi_intr_free(9F) to do so.

     Therefore, previously attached drivers can remain attached and will not fail as a side effect of a rebalance of the interrupt pool. The failures are limited to newly attaching drivers that may be unable to successfully allocate interrupts if other (previously attached) drivers failed to release them.

   - Does it ever require reboot? If so, explain why this situation cannot be avoided.

     No.

   - How does your project deal with network failures (including partition and re-integration)? How do you handle the failure of hardware that your project depends on?

     N/A

   - Can it save/restore or checkpoint and recover?

     N/A

   - Can its files be corrupted by failures? Does it clean up any locks/files after crashes?

     N/A

9. Observability

   - Does the project export status, either via observable output (e.g., netstat) or via internal data structures (kstats)?

     Internal data structures that model the available pools of interrupts and the individual requests from drivers against those pools can be viewed through two new MDB dcmds that this project provides.

     The dcmd ::irmpools traverses a global list of all defined interrupt pools, and displays overall statistics of interrupt utilization for each pool. The dcmd ::irmreqs, when applied against the address of an individual interrupt pool, displays the list of individual requests contained in that pool, along with statistics on the numbers of interrupts requested and made available to each such request.

     Here is example output of these dcmds:

     > ::irmpools
     ADDR             OWNER   TYPE   SIZE  REQUESTED  RESERVED
     00000600507f1618 px#6    MSI/X  256   0          0
     00000600507f1690 px#5    MSI/X  256   16         16
     00000600507f1708 px#2    MSI/X  256   4          4
     0000060050614458 px#1    MSI/X  256   16         16
     0000060050236548 px#3    MSI/X  256   54         54
     0000060050236d40 px#7    MSI/X  256   4          4
     0000060050236ea8 px#4    MSI/X  256   10         10
     00000300a244de08 px#0    MSI/X  256   10         10

     > 0000060050236548::irmreqs
     ADDR             OWNER     TYPE   CALLBACK  REQUESTED  AVAILABLE
     000006005192f300 nxge#8    MSI-X  Yes       32         32
     0000060053260b80 e1000g#3  MSI    No        2          2
     0000060052047680 e1000g#1  MSI    No        2          2
     000006005271bb80 e1000g#0  MSI    No        2          2
     00000600522c1380 e1000g#2  MSI    No        2          2
     00000600506c2980 nxge#9    MSI-X  No        2          2
     0000060052274f80 nxge#7    MSI-X  No        2          2
     00000600527e3a80 nxge#6    MSI-X  No        2          2
     0000060050f60b80 emlxs#3   MSI    No        2          2
     000006005060c180 emlxs#2   MSI    No        2          2
     00000600507c9580 qlc#1     MSI-X  No        2          2
     00000600503a5600 qlc#0     MSI-X  No        2          2

   - How would a user or administrator tell that this subsystem is or is not behaving as anticipated?

     This project is a behind-the-scenes sort of feature only affecting how many interrupts are given to each device driver.
     However, a savvy and curious administrator could explore the status of this subsystem using the previously described MDB dcmds to gain an understanding of how interrupts are being utilized, potentially finding opportunities to migrate hardware devices from one bus to another to improve the overall utilization.

   - What statistics does the subsystem export, and by what mechanism?

     Just the previously described statistics (visible through MDB), to show how full/empty each interrupt pool is, and possibly showing opportunities to migrate I/O hardware to improve overall utilization of interrupts.

   - What state information is logged?

     None.

   - In principle, would it be possible for a program to tune the activity of your project?

     No. Only moving I/O hardware or changing the rebalancing policies could affect the outcome of interrupt availability to devices.

10. What are the security implications of this project?

   - What security issues do you address in your project?

     N/A

   - The Solaris BSM configuration carries a Common Criteria (CC) Controlled Access Protection Profile (CAPP) -- Orange Book C2 -- and a Role Based Access Control Protection Profile (RBAC) -- rating, does the addition of your project affect this rating? E.g., does it introduce interfaces that make access or privilege decisions that are not audited, does it introduce removable media support that is not managed by the allocate subsystem, does it provide administration mechanisms that are not audited?

     No.

   - Is system or subsystem security compromised in any way if your project's configuration files are corrupt or missing?

     No.

   - Please justify the introduction of any (all) new setuid executables.

     N/A

   - Include a thorough description of the security assumptions, capabilities and any potential risks (possible attack points) being introduced by your project. A separate Security Questionnaire http://sac.sfbay/cgi-bin/bp.cgi?NAME=Security.bp is provided for more detailed guidance on the necessary information.
     Cases are encouraged to fill out and include the Security questionnaire (leveraging references to existing documentation) in the case materials. Projects must highlight information for the following important areas:

     - What features are newly visible on the network and how are they protected from exploitation (e.g. unauthorized access, eavesdropping)

       None.

     - If the project makes decisions about which users, hosts, services, ... are allowed to access resources it manages, how is the requestor's identity determined and what data is used to determine if the access granted. Also how this data is protected from tampering.

       N/A.

     - What privileges beyond what a common user (e.g. 'noaccess') can perform does this project require and why those are necessary.

       None.

     - What parts of the project are active upon default install and how it can be turned off.

       The dynamic balancing of interrupt pools is enabled by default. The entire functionality of the project can be globally disabled through the /etc/system tunable 'irm_enable.'

11. What is its UNIX operational environment:

   - Which Solaris release(s) does it run on?

     Solaris 11 and Solaris 10 Update 8.

   - Environment variables? Exit status? Signals issued? Signals caught? (See signal(3HEAD).)

     None.

   - Device drivers directly used (e.g. /dev/audio)? .rc/defaults or other resource/configuration files or databases?

     None.

   - Does it use any "hidden" (filename begins with ".") or temp files?

     No.

   - Does it use any locking files?

     No.

   - Command line or calling syntax: What options are supported? (please include man pages if available) Does it conform to getopt() parsing requirements?

     N/A

   - Is there support for standard forms, e.g. "-display" for X programs? Are these propagated to sub-environments?

     N/A

   - What shared libraries does it use? (Hint: if you have code use "ldd" and "dump -Lv")?

     N/A

   - Identify and justify the requirement for any static libraries.
     N/A

   - Does it depend on kernel features not provided in your packages and not in the default kernel (e.g. Berkeley compatibility package, /usr/ccs, /usr/ucblib, optional kernel loadable modules)?

     No.

   - Is your project 64-bit clean/ready? If not, are there any architectural reasons why it would not work in a 64-bit environment? Does it interoperate with 64-bit versions?

     The project is 64-bit clean.

   - Does the project depend on particular versions of supporting software (especially Java virtual machines)? If so, do you deliver a private copy? What happens if a conflicting or incompatible version is already or subsequently installed on the system?

     No.

   - Is the project internationalized and localized?

     N/A

   - Is the project compatible with IPV6 interfaces and addresses?

     N/A

12. What is its window/desktop operational environment?

   - Is it ICCCM compliant (ICCCM is the standard protocol for interacting with window managers)?

     N/A

   - X properties: Which ones does it depend on? Which ones does it export, and what are their types?

     N/A

   - Describe your project's support for User Interface facilities including Help, Undo, Cut/Paste, Drag and Drop, Props, Find, Stop?

     N/A

   - How do you respond to property change notification and ICCCM client messages (e.g. Do you respond to "save workspace")?

     N/A

   - Which window-system toolkit/desktop does your project depend on?

     N/A

   - Can it execute remotely? Is the user aware that the tool is executing remotely? Does it matter?

     N/A

   - Which X extensions does it use (e.g. SHM, DGA, Multi-Buffering)? (Hint: use "xdpyinfo")

     N/A

   - How does it use colormap entries? Can you share them?

     N/A

   - Does it handle 24-bit operation?

     N/A

13. What interfaces does your project import and export?

   - Please provide a table of imported and exported interfaces, including stability levels.
     Interfaces Imported

     Interface                           Stability  Comments
     -----------------------------------+----------+-------------------------
     ddi_intr_alloc()                    Committed  Added hooks into IRM
     ddi_intr_free()                     Committed  Added hooks into IRM
     -----------------------------------+----------+-------------------------

     Interfaces Exported

     Interface                           Stability       Comments
     -----------------------------------+---------------+-------------------------
     ndi_irm_create()                    Cons. Private   Create an IRM pool
     ndi_irm_destroy()                   Cons. Private   Destroy an IRM pool
     DDI_INTROP_GETPOOL                  Cons. Private   Get IRM pool INTROP
     ddi_cb_register()                   Cons. Private   Install callback handler
     ddi_cb_unregister()                 Cons. Private   Remove callback handler
     ddi_cb_action_t                     Cons. Private   Callback action type
     ddi_cb_flags_t                      Cons. Private   Callback flags type
     ddi_cb_func_t                       Cons. Private   Callback function type
     ddi_cb_handle_t                     Cons. Private   Callback handle type
     ddi_intr_set_nreq()                 Cons. Private   Set IRM request size
     irm_enable                          Cons. Private   /etc/system tunable
     irm_default_policy                  Cons. Private   /etc/system tunable
     -----------------------------------+---------------+-------------------------

     Interfaces Removed

     Interface                           Stability  Comments
     -----------------------------------+----------+-------------------------
     ddi_intr_register_management_cb()   Committed  Register callback
     ddi_intr_unregister_management_cb() Committed  Unregister callback
     ddi_intr_enable_management_cb()     Committed  Enable callback
     ddi_intr_disable_management_cb()    Committed  Disable callback
     -----------------------------------+----------+-------------------------

   - Exported public library APIs and ABIs

       Protocols (public or private)
       Drag and Drop
       ToolTalk
       Cut/Paste

     None.

   - Other interfaces

     None.

   - What other applications should it interoperate with? How will it do so?

     N/A

   - Is it "pipeable"? How does it use stdin, stdout, stderr?

     N/A

   - Explain the significant file formats, names, syntax, and semantics.

     N/A

   - Is there a public namespace? (Can third parties create names in your namespace?)
     How is this administered?

     No.

   - Are the externally visible interfaces documented clearly enough for a non-Sun client to use them successfully?

     N/A

14. What are its other significant internal interfaces (inter-subsystem and inter-invocation)?

   - Protocols (public or private)

     N/A

   - Private ToolTalk usage

     N/A

   - Files

     N/A

   - Other

     N/A

   - Are the interfaces re-entrant?

     N/A

15. Is the interface extensible? How will the interface evolve?

   - How is versioning handled?

     The generic DDI callback mechanism will define new ddi_cb_action_t values for future use, and define new callback argument data types on a per-action basis.

   - What was the commitment level of the previous version?

     Previously approved interfaces (from PSARC/2004/253 "Advanced DDI Interrupt Functions") have been removed, and replaced by this project. Those interfaces were COMMITTED, but were never implemented.

   - Can this version co-exist with existing standards and with earlier and later versions or with alternative implementations (perhaps by other vendors)?

     N/A

   - What are the clients over which a change should be managed?

     Device drivers utilizing the new DDI interfaces need to be managed if the interfaces change. The interfaces are CONSOLIDATION PRIVATE and therefore only drivers in the ON Consolidation need to be changed.

   - How is transition to a new version to be accomplished? What are the consequences to ISV's and their customers?

     Any changes to the interfaces will include updates to drivers in the ON Consolidation that use the interfaces. There are no consequences to ISVs, since the interfaces are not available to them.

16. How do the interfaces adapt to a changing world?

   - What is its relationship with (or difficulties with) multimedia? 3D desktops? Nomadic computers? Storage-less clients? A networked file system model (i.e., a network-wide file manager)?

     N/A

17. Interoperability

   - If applicable, explain your project's interoperability with the other major implementations in the industry.
     In particular, does it interoperate with Microsoft's implementation, if one exists?

     N/A

   - What would be different about installing your project in a heterogeneous site instead of a homogeneous one (such as Sun)?

     N/A

   - Does your project assume that a Solaris-based system must be in control of the primary administrative node?

     N/A

18. Performance

   - How will the project contribute (positively or negatively) to "system load" and "perceived performance"?

     By giving more interrupt vectors to device drivers, the performance of those devices could be improved. But it depends entirely on the proper usage of interrupts by those drivers, which is beyond the scope of this project.

   - What are the performance goals of the project? How were they evaluated? What is the test or reference platform?

     The performance goal of this project is to maximize the interrupt availability for each device driver using the new DDI interfaces. It is only measurable through the statistics of how many interrupts are given to each driver in relation to each driver's total request size. Performance of individual devices based on properly using the interrupts will be separately measured as drivers are converted.

   - Does the application pause for significant amounts of time? Can the user interact with the application while it is performing long-duration tasks?

     N/A

   - What is your project's MT model? How does it use threads internally? How does it expect its client to use threads? If it uses callbacks, can the called entity create a thread and recursively call back?

     For each defined interrupt pool, there exists a balancing thread which runs in the background. When device driver threads initiate a change against the configuration of a pool (by adding a new request, removing a previous request, or changing their current request size), the driver threads initiate a signal to the balancing thread to wake it up and inform it that the pool needs to be rebalanced.
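The wake-up just described can be modeled in userland with POSIX threads. This is only a sketch under stated assumptions: pool_t, pool_init, pool_change_request, pool_balancer and pool_shutdown are hypothetical names, a pthread mutex and condition variable stand in for the kernel's synchronization primitives, and the rebalance itself is reduced to a counter. The rule that the balancer is signalled only when requests exceed the pool size follows the description of rebalancing under question 8.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one interrupt pool; not the kernel structure. */
typedef struct pool {
    pthread_mutex_t lock;     /* protects every field below */
    pthread_cond_t wake;      /* the balancing thread sleeps here */
    int size;                 /* vectors the pool can supply */
    int requested;            /* sum of all driver requests */
    bool needs_rebalance;     /* set by drivers, cleared by the balancer */
    bool shutdown;            /* asks the balancing thread to exit */
    int rebalances;           /* number of rebalance passes performed */
} pool_t;

void pool_init(pool_t *p, int size)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->wake, NULL);
    p->size = size;
    p->requested = 0;
    p->needs_rebalance = false;
    p->shutdown = false;
    p->rebalances = 0;
}

/* Driver side: adjust the pool's total request by delta. Returns true
 * if the pool is oversubscribed and the balancer was signalled; false
 * if updating the counter was all that was needed. */
bool pool_change_request(pool_t *p, int delta)
{
    bool signalled = false;

    pthread_mutex_lock(&p->lock);
    p->requested += delta;
    assert(p->requested >= 0);
    if (p->requested > p->size) {
        p->needs_rebalance = true;
        pthread_cond_signal(&p->wake);
        signalled = true;
    }
    pthread_mutex_unlock(&p->lock);
    return signalled;
}

/* Balancing thread: idle until signalled, then rebalance with the
 * pool lock held. Pending work is always handled before exiting. */
void *pool_balancer(void *arg)
{
    pool_t *p = arg;

    pthread_mutex_lock(&p->lock);
    for (;;) {
        if (p->needs_rebalance) {
            p->needs_rebalance = false;
            p->rebalances++;  /* a real pass would recompute each share */
            continue;
        }
        if (p->shutdown)
            break;
        pthread_cond_wait(&p->wake, &p->lock);
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Ask the balancing thread to exit once it has no pending work. */
void pool_shutdown(pool_t *p)
{
    pthread_mutex_lock(&p->lock);
    p->shutdown = true;
    pthread_cond_signal(&p->wake);
    pthread_mutex_unlock(&p->lock);
}
```

Because the balancer checks its flag under the lock before sleeping, a signal sent before it first waits is not lost. Several changes that arrive close together may be coalesced into a single pass in this model.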
     The thread synchronization is based on mutexes for each pool. Changes to the pool's configuration can only occur when the mutex is held. And the balancing thread locks the mutex while performing a rebalance.

     During a rebalance, individual request structures are separately locked to synchronize reading the current interrupt availability (by the driver) or writing the current interrupt availability (by the balancing thread). The balancing thread locks the requests to compute new availability, then unlocks them before making callbacks to drivers. Thus, drivers can then individually read their current availability without deadlock during a callback.

   - What is the impact on overall system performance? What is the average working set of this component? How much of this is shared/sharable by other apps?

     The additional data structures to model interrupt pools and interrupt requests consume additional memory in the kernel cage. And the balancing threads consume additional processor cycles when performing a rebalance, and while making callbacks to device drivers.

     The memory usage is less than 100 bytes per device plus ~100 bytes per bus nexus. The processor load only occurs during rebalancing, for a short period once the system is booted, and during each DR or hotplug event.

     Individual rebalancing of interrupt handling by device drivers in response to callbacks would impact performance of those devices. Drivers may need to temporarily quiesce their hardware and reprogram their devices in order to adjust their interrupt usage. But again, this only occurs when rebalance operations occur (which are rare).

   - Does this application "wake up" periodically? How often and under what conditions? What is the working set associated with this behavior?

     The balancing threads wake up when a significant change to the composition of requests mapped to an interrupt pool happens.
     This occurs if drivers specifically adjust their request size, or if devices are added or removed during DR or hotplug operations. And only if pressure on the interrupt pool is large enough to require a rebalance. (Some adjustments to the pool can be done by simply updating accounting information in the pool data structures, without waking up the balancing threads.)

   - Will it require large files/databases (for example, new fonts)?

     N/A

   - Do files, databases or heap space tend to grow with time/load? What mechanisms does the user have to use to control this? What happens to performance/system load?

     No.

19. Please identify any issues that you would like the ARC to address.

   - Interface classification, deviations from standards, architectural conflicts, release constraints...

     N/A

   - Are there issues or related projects that the ARC should advise the appropriate steering committees?

     N/A

20. Appendices to include

   - One-Pager.
   - Prototype specification.
   - References to other documents. (Place copies in case directory.)