Template Version: @(#)sac_nextcase 1.61 05/24/07 SMI
This information is Copyright 2007 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         Driver open-close exclusion guarantees
    1.2. Name of Document Author/Supplier:
         Author:  Chris Horne, Chris Gerhard
    1.3  Date of This Document:
         29 May, 2007
4. Technical Description:

    4.1.1 Summary:

        This proposal clarifies the interaction between the kernel and a
        non-streams driver's open(9E) and close(9E) implementations. The
        proposal focuses on two aspects of this interaction: the
        execution exclusion guarantees between open(9E) and close(9E)
        calls, and the last-reference accounting associated with
        close(9E) calls.

        The proposal includes open(9E), close(9E), and cb_ops(9S) man
        page changes. We are also implementing and providing a man page
        for a new volatile ddi-open-returns-eintr(9P) interface, but
        will not be delivering the ddi-open-returns-eintr(9P) man page
        to customers.

        The proposal requests patch binding.

    4.1.1. Problems:

        UNIX has always used an open-close model where each device open
        results in an open(9E) call, and the last-reference close
        results in a single close(9E) call. While this basic model is
        simple and well understood, what it means for exclusion
        guarantees between open(9E) and close(9E) in a multi-threaded
        preemptive kernel environment like Solaris is not documented.

        UNIX last-reference accounting associated with a close(9E) call
        historically counted successfully completed open(9E) calls as
        'open'. This works well in a single-threaded non-preemptive
        kernel environment, but it does not work well for Solaris.
        Solaris last-reference accounting has always treated in-progress
        open(9E) calls as 'open', but this is not clearly documented.

        Without a clear definition of both the exclusion guarantees and
        last-reference accounting, it is difficult to write a reliable
        driver.

    4.1.2.
    Proposal:

        This proposal defines, for non-streams drivers, execution
        exclusion guarantees between open(9E) and close(9E) calls, and
        last-reference accounting associated with close(9E) calls.

    4.1.2.1 Exclusion:

        To provide open-close exclusion in a multi-threaded preemptive
        kernel environment like Solaris, an executing close(9E) call
        must act as a barrier to all subsequent open(9E) calls: the
        last-reference close(9E) call needs to return before the next
        open(9E) call is allowed to start.

        Today the kernel implements open-close exclusion for streams
        drivers, but not for non-streams drivers. Non-streams drivers
        either incorrectly assume exclusion or are complicated by
        needing to implement their own exclusion.

        Once exclusion is implemented for non-streams drivers, an active
        close(9E) call can hold off a new open(9E) call. Having the
        framework always treat that waiting-open as interruptible is
        unsafe: applications may not be coded to expect a new EINTR
        return from open. This proposal provides new interfaces that
        allow the framework to determine if a waiting-open is safely
        interruptible.

        Exclusion is provided at (dev_t, otyp) granularity, where dev_t
        and otyp refer to open(9E) arguments. The otyp values of
        interest are OTYP_BLK and OTYP_CHR. If this granularity is too
        fine-grained, the driver writer must implement their own
        exclusion and accounting (often at ddi_get_instance(9F)
        granularity). Providing exclusion guarantees at instance
        granularity is outside the scope of this proposal.

    4.1.2.2 Last-reference accounting:

        Last-reference accounting occurs at the same (dev_t, otyp)
        granularity as exclusion. Solaris last-reference accounting has
        always treated in-progress open(9E) calls as 'open', but this is
        not clearly documented.
        No change to last-reference accounting is proposed; however, an
        explanation of how accounting is implemented is necessary,
        especially for implementing 'special behaviors' where the driver
        open(9E) and close(9E) implementations interact.

    4.1.2.3 Special Behaviors:

        Understanding exclusion guarantees and last-reference accounting
        typically simplifies driver writing. However, for some behaviors
        additional guidance is still needed.

        Implementing these behaviors involves 'self-clone', where a
        driver changes the *devp value passed to open(9E). A driver that
        does a self-clone does not necessarily need to call
        ddi_create_minor_node(9F) for the new *devp value.

        o   A driver that supports O_NDELAY (FNDELAY) and blocks in
            open(9E) or close(9E) for an event that takes a long time
            (or may never occur) must use separate minor nodes for
            O_NDELAY and non-O_NDELAY access for the applications to get
            real O_NDELAY behavior. Applications using the device must
            either match the minor node used with their O_NDELAY flag
            use, or the driver must self-clone to match O_NDELAY flag
            use.

            This guidance is related to both exclusion and
            last-reference accounting. For exclusion, this guidance
            prevents a new O_NDELAY open from waiting on completion of a
            non-O_NDELAY close(9E). For last-reference accounting, this
            guidance allows an O_NDELAY close(9E) to occur while there
            is a blocked non-O_NDELAY open(9E) call.

            This is already a de facto Solaris requirement: an example
            is the OUTLINE implementation used by serial communications
            drivers like zs(7D).

            In this situation Solaris-specific DDI considerations
            influence how a driver must implement a POSIX-compliant
            O_NDELAY open(9E). An unmodified SVR4 driver's O_NDELAY
            open(9E) implementation may not be POSIX-compliant under
            Solaris.

            NOTE: Some drivers (such as sd(7D)) use O_NDELAY to support
            administrative commands which need to open the device prior
            to full device initialization.
            These drivers fail their non-O_NDELAY open(9E) instead of
            blocking, so they do not need to use separate minor nodes.

        o   A driver that blocks in open(9E) for an event signaled from
            close(9E) must self-clone.

            This guidance is related to last-reference accounting. If
            not followed, the close(9E) call will never occur since an
            in-progress open(9E) call counts as an 'open'.

            This is already a de facto Solaris requirement: an example
            is a queuing exclusive-use device, like a printer.
            Originally, UNIX printer drivers slept in open(9E) if the
            device was already in use. This provided a driver-based
            queuing system.

        o   A driver that blocks in close(9E) for an event that takes a
            long time (or may never occur) is preventing subsequent
            open(9E) operations. While blocking in close(9E) is not
            prohibited, the driver writer needs to understand the
            ramifications, possibly setting the D_OPEN_RETURNS_EINTR
            cb_ops(9S) flag if it is safe to return EINTR from open.

            As an alternative to D_OPEN_RETURNS_EINTR, we will also
            implement a volatile ddi-open-returns-eintr(9P) interface.
            The motivation for this is to provide relief in the field if
            the new exclusion behavior causes problems. Getting a new
            driver with a patched cb_ops flag to the customer in a
            timely fashion, especially from a third-party driver vendor,
            could prove difficult. Providing ddi-open-returns-eintr(9P)
            gives us a mechanism to help diagnose problems and provide
            temporary relief until a driver patch that uses
            D_OPEN_RETURNS_EINTR is available.

            This guidance is related to exclusion guarantees.

            This is already a de facto Solaris requirement for streams:
            an example is maximum drain times on close for streams. The
            ramifications of blocking indefinitely in close are not new
            for streams since streams currently has exclusion.
            Applications opening streams already expect EINTR, so the
            waiting-open can be interruptible.
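
The second bullet above (a blocked open counts as a reference, so the
awaited close never arrives unless the driver self-clones) can be
illustrated with a small userland model. This is only a sketch of the
accounting rule described in this proposal, not kernel or specfs code.
All names (model_acct_t, model_open_enter, model_open_clone,
model_release) are hypothetical, and the model assumes that a self-clone
simply moves the reference from the original minor to the new one.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Userland model only. Every name here is hypothetical; none of this is
 * a DDI interface or the specfs implementation.
 */
#define MODEL_NMINOR 4

typedef struct {
    int refs[MODEL_NMINOR];    /* in-progress plus completed opens */
    int closes[MODEL_NMINOR];  /* driver close() calls delivered */
} model_acct_t;

void
model_init(model_acct_t *m)
{
    memset(m, 0, sizeof (*m));
}

/* The reference is taken when driver open() is entered, not on return. */
void
model_open_enter(model_acct_t *m, int minor)
{
    assert(minor >= 0 && minor < MODEL_NMINOR);
    m->refs[minor]++;
}

/*
 * Self-clone: the driver changed *devp inside open(). Assumption of this
 * model: the reference simply moves to the new minor.
 */
void
model_open_clone(model_acct_t *m, int from, int to)
{
    assert(from >= 0 && from < MODEL_NMINOR);
    assert(to >= 0 && to < MODEL_NMINOR);
    assert(m->refs[from] > 0);
    m->refs[from]--;
    m->refs[to]++;
}

/*
 * Drop one reference (close(2), exit, or a failed open). Returns true
 * when this was the last reference, so the driver close() is called.
 */
bool
model_release(model_acct_t *m, int minor)
{
    assert(minor >= 0 && minor < MODEL_NMINOR);
    assert(m->refs[minor] > 0);
    if (--m->refs[minor] > 0)
        return (false);
    m->closes[minor]++;
    return (true);
}
```

In this model, if one process holds minor 0 open and a second is blocked
in open() on minor 0, releasing the first reference does not deliver a
driver close(), so the blocked open is never woken. If the first open
had self-cloned to minor 1, releasing minor 1 is a last-reference close
even while the second open is still in progress on minor 0.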

        In the situations above, implementing multiple minor nodes or
        doing a 'self-clone' expands the operation beyond the typical
        (dev_t, otyp) granularity, so exclusion and last-reference
        accounting are no longer an impediment to implementing atypical
        behaviors.

    4.1.2.4 Legacy non-DDI compliant interface issues:

        The Solaris open(9E) close(9E) exclusion guarantee is annulled
        when kernel software, other than specfs, uses the following
        private non-DDI interfaces: dev_open(), dev_close(), cb_ops(9S)
        cb_open, or cb_ops(9S) cb_close.

        If these private non-DDI interfaces are used, no new problems
        occur, but consumers should switch to the Layered Driver
        Interfaces (LDI, PSARC 2001/769). LDI provides a DDI-compliant
        way to perform these operations which does not annul exclusion
        guarantees.

    4.2. Bug/RFE Number(s):

        6343604 specfs race: multiple "last-close" of the same device
        4127807 DDI: Is there a race between open(9e) and close(9e)?

    4.5. Interfaces:

        ------------------------------------------------------------------------
        Interface                     Level       Comments
        ------------------------------------------------------------------------
        Existing:
          open(9E)                    Committed   Define exclusion and
          close(9E)                   "           last-reference behavior.

        New:
          D_OPEN_RETURNS_EINTR        "           cb_ops(9S) cb_flag: Driver
                                                  returns and application
                                                  expects EINTR from device
                                                  open.
          ddi-open-returns-eintr(9P)  Volatile    driver.conf(4) property:
                                                  Driver returns and
                                                  application expects EINTR
                                                  from device open.

6. Resources and Schedule:
    6.4. Product Approval Committee requested information:
        6.4.1. Consolidation or Component Name:
               ON
    6.5. ARC review type: FastTrack

A.
Man page changes

A.1 open(9E) man page changes:

Driver Entry Points                                          open(9E)

NAME
     open - gain access to a device

SYNOPSIS
  Block and Character
     #include
     #include
     #include
     #include
     #include
     #include
     #include

     int prefixopen(dev_t *devp, int flag, int otyp, cred_t *cred_p);

  STREAMS
     #include
     #include
     #include
     #include

     int prefixopen(queue_t *q, dev_t *devp, int oflag, int sflag,
         cred_t *cred_p);

---%<---

DESCRIPTION
     The driver's open() routine is called by the kernel during
     an open(2) or a mount(2) on the special file for the
>    device. A device may be opened simultaneously by multiple
>    processes and the open() driver routine is called for each
>    open. Note that a device is referenced once its associated
>    open(9E) routine is entered, and thus open(9E)'s which have
>    not yet completed will prevent close(9E) from being called.
>
|
     The routine should verify that the minor number component of
     *devp is valid, that the type of access requested by otyp
     and flag is appropriate for the device, and, if required,
     check permissions using the user credentials pointed to by
     cred_p.

>    The kernel provides open() close() exclusion guarantees to the
>    driver at (*devp, otyp) granularity. This delays new open()
>    calls to the driver while a last-reference close() call is
>    executing. If the driver has indicated that an EINTR return
>    is safe via the D_OPEN_RETURNS_EINTR cb_ops(9S) cb_flag then
>    a delayed open() may be interrupted by a signal, resulting in
>    an EINTR return.
>
>    Last-reference accounting and open() close() exclusion
>    typically simplify driver writing, however, in some cases they
>    may be an impediment for certain types of drivers. To overcome
>    any impediment the driver can change minor numbers in open(9E),
>    as described below, or implement multiple minor nodes for the
>    same device - both techniques give the driver control over
>    when close() calls will occur and whether additional open()
>    calls will be delayed while close() is executing.

     The open() routine is passed a pointer to a device number so
     that the driver can change the minor number. This allows
     drivers to dynamically create minor instances of the device.
     An example of this might be a pseudo-terminal driver that
     creates a new pseudo-terminal whenever it is opened. A
     driver that chooses the minor number dynamically normally
     creates only one minor device node in attach(9E) with
     ddi_create_minor_node(9F), then changes the minor number
     component of *devp using makedevice(9F) and getmajor(9F).
     The driver needs to keep track of available minor numbers
>    internally. A driver that dynamically creates minor
>    numbers may want to avoid returning the original minor
>    number, since returning the original minor will result in
>    postponed dynamic opens when the original minor's close()
>    call occurs.

---%<---

SEE ALSO
>    cb_ops(9S)

---%<---

A.2 close(9E) man page changes:

Driver Entry Points                                         close(9E)

NAME
     close - relinquish access to a device

SYNOPSIS
  Block and Character
     #include
     #include
     #include
     #include
     #include
     #include
     #include

     int prefixclose(dev_t dev, int flag, int otyp, cred_t *cred_p);

---%<---

DESCRIPTION
     For STREAMS drivers, the close() routine is called by the
     kernel through the cb_ops(9S) table entry for the device.
     (Modules use the fmodsw table.) A non-null value in the
     d_str field of the cb_ops entry points to a streamtab
     structure, which points to a qinit(9S) containing a pointer
     to the close() routine. Non-STREAMS close() routines are
     called directly from the cb_ops table.

     close() ends the connection between the user process and the
     device, and prepares the device (hardware and software) so
     that it is ready to be opened again.

<    A device may be opened simultaneously by multiple processes
<    and the open() driver routine is called for each open, but
<    the kernel will only call the close() routine when the last
<    process using the device issues a close(2) or umount(2)
<    system call or exits.
<    (An exception is a close occurring
<    with the otyp argument set to OTYP_LYR, for which a close
<    (also having otyp = OTYP_LYR) occurs for each open.)
>    A device may be opened simultaneously by multiple processes
>    and the open() driver routine is called for each open.
>    For all otyp values other than OTYP_LYR the kernel calls
>    the close() routine when the last-reference occurs. For
>    OTYP_LYR each close operation will call the driver.
>
>    Kernel accounting for last-reference occurs at (dev, otyp)
>    granularity. Note that a device is referenced once its
>    associated open(9E) routine is entered, and thus open(9E)'s
>    which have not yet completed will prevent close(9E) from
>    being called. The driver close(9E) call associated with the
>    last-reference going away is typically issued as a result
>    of a close(2), exit(2), munmap(2), or umount(2). However, a
>    failed open(9E) call can cause this last-reference close(9E)
>    call to be issued as a result of an open(2) or mount(2).
>
>    The kernel provides open() close() exclusion guarantees
>    to the driver at the same (dev, otyp) granularity as
>    last-reference accounting. The kernel delays new calls to the
>    open() driver routine while the last-reference close() call is
>    executing - a driver that blocks in close() will not see new
>    calls to open() until it returns from close(). This
>    effectively delays invocation of other cb_ops(9S) driver entry
>    points that depend on an open(9E) established device reference
>    too. If the driver has indicated that an EINTR return
>    is safe via the D_OPEN_RETURNS_EINTR cb_ops(9S) cb_flag then a
>    delayed open() may be interrupted by a signal, resulting in an
>    EINTR return from open() prior to calling open(9E).
>
>    Last-reference accounting and open() close() exclusion typically
>    simplify driver writing, however, in some cases they may be
>    an impediment for certain types of drivers.
>    To overcome any
>    impediment the driver can change minor numbers in open(9E)
>    or implement multiple minor nodes for the same device -
>    both techniques give the driver control over when close()
>    calls will occur and whether additional open() calls will
>    be delayed while close() is executing.

     In general, a close() routine should always check the
     validity of the minor number component of the dev parameter.
     The routine should also check permissions as necessary, by
     using the user credential structure (if pertinent), and the
     appropriateness of the flag and otyp parameter values.

---%<---

SEE ALSO
>    cb_ops(9S)

---%<---

A.3 cb_ops(9S) man page change:

     If the driver properly handles 64-bit offsets, it should
     also set the D_64BIT flag in the cb_flag field. This
     specifies that the driver will use the uio_loffset field of
     the uio(9S) structure.

+    If the driver returns EINTR from open(9E), it should also set the
+    D_OPEN_RETURNS_EINTR flag in the cb_flag field. This lets the
+    framework know that it is safe for it to return EINTR when
+    waiting, to provide exclusion, for a last-reference close(9E)
+    call to complete before calling open(9E).
+
     mt-streams(9F) describes other flags that can be set in the
     cb_flag field.

     cb_rev is the cb_ops structure revision number. This field
     must be set to CB_REV.

A.4 ddi-open-returns-eintr.9p man page (NOT DELIVERED TO CUSTOMERS):

Kernel Properties for Drivers              ddi-open-returns-eintr(9P)

NAME
     ddi-open-returns-eintr - property indicates that device open
     can safely return EINTR.

DESCRIPTION
     When ddi-open-returns-eintr is set the kernel knows that an
     EINTR return from open(9E) is an expected result. This
     allows the kernel, in its implementation of open/close
     exclusion, to be interruptible and fail an open with EINTR
     when an active close(9E) operation, at (dev_t, spectype)
     granularity, is preventing a new open(9E).

     Set this property via driver.conf(4) if the open(9E)
     implementation returns EINTR, especially when waiting for an
     active close(9E) operation. When the property is set, kernel
     behavior is identical to when the D_OPEN_RETURNS_EINTR
     cb_ops(9S) cb_flag is set.

SEE ALSO
     open(9E), close(9E), cb_ops(9S)

     Writing Device Drivers
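
The exclusion decision described for D_OPEN_RETURNS_EINTR and
ddi-open-returns-eintr can be summarized with a small userland model.
This is a sketch under stated assumptions, not the framework
implementation: the names (model_excl_t, model_open_attempt, and the
MODEL_OPEN_* results) are hypothetical, and the model treats the flag
and the property as equivalent, as the man page text above states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userland model; not a DDI or specfs interface. */
typedef enum {
    MODEL_OPEN_PROCEEDS,  /* no active close: driver open() is called */
    MODEL_OPEN_WAITS,     /* held off until the active close() returns */
    MODEL_OPEN_EINTR      /* interrupted wait: open fails with EINTR */
} model_open_result_t;

typedef struct {
    bool close_active;  /* last-reference close() executing for (dev, otyp) */
    bool flag_set;      /* driver set D_OPEN_RETURNS_EINTR in cb_flag */
    bool prop_set;      /* ddi-open-returns-eintr set in driver.conf */
} model_excl_t;

/*
 * Decide what happens to a new open of the same (dev, otyp). The wait is
 * interruptible only when the driver has declared EINTR to be safe.
 */
model_open_result_t
model_open_attempt(const model_excl_t *e, bool signal_pending)
{
    assert(e != NULL);
    if (!e->close_active)
        return (MODEL_OPEN_PROCEEDS);
    if ((e->flag_set || e->prop_set) && signal_pending)
        return (MODEL_OPEN_EINTR);
    return (MODEL_OPEN_WAITS);
}
```

With neither the flag nor the property set, a pending signal does not
interrupt the wait in this model, which matches the stated concern that
applications may not be coded to expect EINTR from open.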