sosd: SCSI OSD Device Driver - 20 Questions
-------------------------------------------

1. What specifically is the proposal that we are reviewing?

   - What is the technical content of the project?

     The main goal is to provide object-based storage device support in
     Solaris. For this, the project delivers the 'sosd' target driver, the
     'scsi_osd' kernel module and the 'libsosd' userland library. The
     libraries are implementations of the API used to send T10 OSD commands
     to the device.

     The 'sosd' driver is a peer to the 'sd', 'st' and 'sgen' device
     drivers in Solaris, which operate between the file system and the
     transport in the host architecture. The main difference between 'sosd'
     and the other drivers is that 'sosd' uses a new object interface,
     whereas the existing drivers use the traditional block interface.

     The project also delivers support for long CDBs (256 bytes) and will
     modify the MPxIO and iSCSI initiator modules.

   - A design document is at:
     http://opensolaris.org/os/project/osd/files/sosd_design_v3.pdf

   - Project one pager is at:
     http://sac.sfbay/Archives/CaseLog/arc/PSARC/2008/097/

   - The opensolaris project page is at:
     http://opensolaris.org/os/project/osd/

   - Is this a new product, or a change to a pre-existing one? If it is a
     change, would you consider it a "major", "minor", or "micro" change?

     It is a new feature addition to the existing Solaris product and would
     be considered a micro change.

   - If your project is an evolution of a previous project, what changed
     from one version to another?

     N/A

   - What is the motivation for it, in general as well as specific terms?
     What are the expected benefits for Sun? By what criteria will you
     judge its success?

     The motivation is to improve the scalability of parallel filesystems
     by pushing block allocation/management down to the device. This
     offloads the metadata server, allowing it to support more clients.

2. Describe how your project changes the user experience, upon installation
   and during normal operation.
   - What does the user perceive when the system is upgraded from a
     previous release?

     No change.

3. What is its plan?

   - What is its current status? Has a design review been done? Are there
     multiple delivery phases?

     An early prototype with the QFS file system has been done. Most of the
     interfaces between the different modules are being identified. The
     project will deliver the kernel driver, kernel module and the userland
     library. There is no requirement for a CLI.

     There is also no current requirement for bidirectional transfers,
     although the OSD protocol supports them. A later Solaris project (not
     yet identified) will deliver bidirectional support.

4. Are there related projects in Sun?

   - If so, what is the proposal's relationship to their work? Which
     not-yet-delivered Sun (or non-Sun) projects (libraries, hardware,
     etc.) does this project depend upon? What other projects, if any,
     depend on this one?

     This project does not depend on other projects at Sun. QFS 5.x is
     dependent on this project for object support.

   - Are you updating, copying or changing functional areas maintained by
     other groups? How are you coordinating and communicating with them?
     Do they "approve" of what you propose? If not, please explain the
     areas of disagreement.

     scsi_vhci (MPxIO) and the iSCSI initiator will have to be modified to
     support long CDBs, and the respective groups approve the proposed
     modifications.

5. How is the project delivered into the system?

   - Identify packages, directories, libraries, databases, etc.
     SUNWsosdr:
       SPARC: kernel/drv/sparcv9/sosd
       SPARC: kernel/drv/sosd.conf
       SPARC: kernel/drv/sparcv9/scsi_osd
       -----------------------------------
       x86:   kernel/drv/amd64/sosd
       x86:   kernel/drv/sosd
       x86:   kernel/drv/sosd.conf
       x86:   kernel/drv/amd64/scsi_osd
       x86:   kernel/drv/scsi_osd

     SUNWosdu:
       SPARC: usr/lib/devfsadm/linkmod/SUNW_sosd_link.so
       SPARC: lib/sparcv9/libsosd.so.1
       SPARC: lib/libsosd.so.1
       SPARC: usr/include/sys/osd.h
       -----------------------------------
       x86:   usr/lib/devfsadm/linkmod/SUNW_sosd_link.so
       x86:   lib/amd64/libsosd.so.1
       x86:   lib/libsosd.so.1
       x86:   usr/include/sys/osd.h

6. Describe the project's hardware platform dependencies.

   None.

7. System administration

   - How will the project's deliverables be installed and (re)configured?

     pkgadd

   - How will the project's deliverables be uninstalled?

     pkgrm

   - Does it use inetd to start itself?

     No.

   - Does it need installation within any global system tables?

     No.

   - Does it use a naming service such as NIS, NIS+ or LDAP?

     No.

   - What are its on-going maintenance requirements (e.g. keeping global
     tables up to date, trimming files)?

     N/A

   - How do this project's administrative mechanisms fit into Sun's system
     administration strategies? E.g., how does it fit under the Solaris
     Management Console (SMC) and Web-Based Enterprise Management (WBEM),
     how does it make use of roles, authorizations and rights profiles?
     Additionally, how does it provide for administrative audit in support
     of the Solaris BSM configuration?

     The client requesting access to the device needs to have the
     SYS_DEVICES privilege. RBAC will be used to enforce access
     restrictions. More details are provided in Security.txt.

     On the initiator node, the sosd device minor nodes will be created
     with permissions "0640 root sys". The device will be enumerated under
     MPxIO by default, and clients will require the SYS_DEVICES privilege
     in their privilege set to gain write access. This is enforced inside
     the driver using drv_priv() for all API entry points.
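     The privilege check described above can be sketched as follows. This
     is an illustrative fragment only; the entry-point name and surrounding
     logic are assumptions, not the actual sosd source. drv_priv(9F) is the
     documented DDI routine for this kind of check.

     ```c
     #include <sys/types.h>
     #include <sys/cred.h>
     #include <sys/errno.h>
     #include <sys/ddi.h>
     #include <sys/sunddi.h>

     /* Hypothetical open(9E) entry point; name is illustrative. */
     static int
     sosd_open(dev_t *devp, int flag, int otyp, cred_t *credp)
     {
             /*
              * drv_priv(9F) returns 0 when the caller's credentials
              * carry sufficient privilege (PRIV_SYS_DEVICES), and
              * EPERM otherwise.  Every API entry point would apply
              * the same gate.
              */
             if (drv_priv(credp) != 0)
                     return (EPERM);

             /* ... normal open processing would follow here ... */
             return (0);
     }
     ```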
     The project uses existing access control profiles to authorize
     usermode clients. The usermode client's role should have the 'Device
     Management' and 'Device Security' profiles. Additionally, if the
     client is a general purpose filesystem utility, it may also choose to
     have the 'File System Management' and 'Object Access Management'
     profiles in its role.

     The initial intended use of the API is over an iSCSI transport, so IP
     network security mechanisms such as IPsec and authentication
     mechanisms like CHAP and RADIUS provide transport-level security.

   - What tunable parameters are exported? Can they be changed without
     rebooting the system? Examples include, but are not limited to,
     entries in /etc/system and ndd(8) parameters. What ranges are
     appropriate for each tunable? What are the commitment levels
     associated with each tunable (these are interfaces)?

     There is no requirement for tunables at this point.

8. Reliability, Availability, Serviceability (RAS)

   - Does the project make any material improvement to RAS?

     No.

   - How can users/administrators diagnose failures or determine
     operational state? (For example, how could a user tell the difference
     between a failure and very slow performance?)

     We will provide static DTrace probes at key points to identify command
     progress. We also plan to provide mdb modules displaying key data
     structure information. kstats information provided by the driver can
     also be used to diagnose performance issues.

   - What are the project's effects on boot time requirements?

     The effect on boot time will range from none to very minimal,
     depending on the number of OSD devices configured. The sosd driver
     only loads when there is a device matching the inquiry dtype. If there
     are no object devices configured for the system, the sosd driver will
     not load.

   - How does the project handle dynamic reconfiguration (DR) events?

     The driver follows the Solaris DDI model for attaching and detaching
     LUNs, as well as itself, to the OS.
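     The DDI attach path referred to above follows the standard attach(9E)
     shape. The skeleton below is an illustrative sketch, not the actual
     sosd implementation; only the entry-point signature and the command
     constants come from the documented DDI.

     ```c
     #include <sys/ddi.h>
     #include <sys/sunddi.h>

     /* Hypothetical attach(9E) skeleton; name and body are illustrative. */
     static int
     sosd_attach(dev_info_t *dip, ddi_attach_cmd_t cmd)
     {
             switch (cmd) {
             case DDI_ATTACH:
                     /*
                      * A real driver would allocate soft state and
                      * create minor nodes for the new LUN here.
                      */
                     return (DDI_SUCCESS);
             case DDI_RESUME:
                     /* Counterpart to DDI_SUSPEND (see question 8). */
                     return (DDI_SUCCESS);
             default:
                     return (DDI_FAILURE);
             }
     }
     ```

     A matching detach(9E) routine would handle DDI_DETACH and DDI_SUSPEND,
     which is how the driver participates in DR events.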
   - What mechanisms are provided for continuous availability of service?

     The driver is not loaded unless an OSD is configured for the system.
     The project does not introduce any SMF service. The driver and API
     will be available until the last open handle is closed. After the last
     close, the driver and API module unload themselves. They are re-loaded
     on demand whenever a LUN comes online or when a filesystem request is
     received.

     The project relies on MPxIO to handle path (or hardware) failures to
     provide availability of the device to the clients through this driver.

   - Does the project call panic()? Explain why these panics cannot be
     avoided.

     It does not call panic().

   - How are significant administrative or error conditions transmitted?
     SNMP traps? Email notification?

     The main interface between the administrator and this project is the
     client, a filesystem. The project has APIs to report error conditions
     to the client, and the filesystem in turn presents the required
     information. In case of subsystem errors that are not perceivable at
     the client level, messages are sent to syslog using the scsi_log(9F)
     API.

   - How does the project deal with failure and recovery?

     The project will implement an internal function called sosd_errmsg(),
     which will output SCSI sense data information relevant to object
     storage commands. The output will be similar to that of scsi_errmsg(),
     which outputs SCSI sense information relevant to block storage
     commands.

   - Does it ever require reboot? If so, explain why this situation cannot
     be avoided.

     It does not require reboot.

   - How does your project deal with network failures (including partition
     and re-integration)? How do you handle the failure of hardware that
     your project depends on?

     Network failures are usually identified at the driver level when SCSI
     commands return with a fatal (tran) error. The driver (and hence the
     clients) will not have access to the LUN(s) during the period of
     error.
     The sosd driver will recover its LUNs when the transport layer retries
     and restores the transport connection. In case of hardware failures,
     the project depends on the Solaris DDI to provide appropriate
     notification events when hardware/LUN/transport state changes. The
     same is true for DR. In case of path failure, the MPxIO software will
     handle failover and failback scenarios to provide availability of the
     device.

   - Can it save/restore or checkpoint and recover?

     sosd will support DDI_SUSPEND and DDI_RESUME.

   - Can its files be corrupted by failures? Does it clean up any
     locks/files after crashes?

     No locks or stale information are kept on the filesystem to clean up.

9. Observability

   - Does the project export status, either via observable output (e.g.,
     netstat) or via internal data structures (kstats)?

     Observability is similar to that provided by other target drivers. The
     project plans to provide DTrace probes for I/O and discovery related
     activity.

     The project also plans to use a function equivalent to scsi_errmsg(9F)
     to display sense data to the user. This is an implementation-specific
     function intended only to match the output of scsi_errmsg() in block
     space; there are no plans to export it as an API.

   - How would a user or administrator tell that this subsystem is or is
     not behaving as anticipated?

     The clients get appropriate SCSI sense conditions indicating the
     nature of the problem and the device/command associated with it.
     scsi_log(9F) and sosd_errmsg() will be used to print messages on the
     console and to syslog.

   - What statistics does the subsystem export, and by what mechanism?

     Basic I/O and error statistics will probably be provided using kstats.

   - What state information is logged?

     Driver attach/detach and LUN discovery information is logged.
     Per-command/request information is not logged. The number of
     outstanding opens on devices and the number of pending commands are
     tracked (but not logged).
     If there is an explicit request for unmount/close of the device, a
     message will be logged stating that the request cannot be completed
     due to pending requests.

   - In principle, would it be possible for a program to tune the activity
     of your project?

     No. No tunables are exported as of now. Should some tunable X be
     introduced in the future, a program would have to be authorized and
     use an API to get/set the value of that tunable.

10. What are the security implications of this project?

   - What security issues do you address in your project?

     None.

   - The Solaris BSM configuration carries a Common Criteria (CC)
     Controlled Access Protection Profile (CAPP) -- Orange Book C2 -- and a
     Role Based Access Control Protection Profile (RBAC) rating. Does the
     addition of your project affect this rating? E.g., does it introduce
     interfaces that make access or privilege decisions that are not
     audited, does it introduce removable media support that is not managed
     by the allocate subsystem, does it provide administration mechanisms
     that are not audited?

     No.

   - Is system or subsystem security compromised in any way if your
     project's configuration files are corrupt or missing?

     No.

   - Please justify the introduction of any (all) new setuid executables.

     N/A

   - Include a thorough description of the security assumptions,
     capabilities and any potential risks (possible attack points) being
     introduced by your project. A separate Security Questionnaire
     http://sac.sfbay/cgi-bin/bp.cgi?NAME=Security.bp is provided for more
     detailed guidance on the necessary information. Cases are encouraged
     to fill out and include the Security questionnaire (leveraging
     references to existing documentation) in the case materials.

     API callers must have the SYS_DEVICES privilege to successfully open
     the device and receive an OSD handle. This handle is used in
     subsequent communication with the API. When the client is finished
     using the device, it uses the osd_close() API to close (releasing)
     the handle to the device.
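     The handle lifecycle just described can be sketched roughly as below.
     The exported interfaces are Contracted Project Private and their exact
     prototypes are not shown in this case, so the argument lists, the
     osd_handle_t type, and the device path here are assumptions for
     illustration only.

     ```c
     /*
      * Hypothetical sketch of the open/use/close handle lifecycle.
      * All prototypes below are assumed; only the interface names
      * (osd_handle_by_name, osd_close) appear in the case materials.
      */
     osd_handle_t oh;
     int err;

     /* Fails unless the caller holds the SYS_DEVICES privilege. */
     err = osd_handle_by_name("/dev/osd/osd0", FREAD | FWRITE, &oh);
     if (err != 0)
             return (err);

     /* ... build and submit OSD requests against 'oh' ... */

     (void) osd_close(oh);   /* releases the handle */
     ```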
     The driver internally uses the LDI interface to open and close the
     device. A userland library (libsosd) is also provided by the project;
     it exports the same interfaces as the kernel module (scsi_osd) does.
     Callers of this library (and callers of the uscsicmd interface) will
     be authenticated using RBAC. The project does not open any new ports
     and does not introduce any SMF services.

11. What is its UNIX operational environment:

   - Which Solaris release(s) does it run on?

     Solaris Nevada and SDX releases. There are currently no plans to
     backport sosd, although such a backport might be requested in the
     future.

   - Environment variables? Exit status? Signals issued? Signals caught?
     (See signal(3HEAD).)

     No signals are issued or caught.

   - Device drivers directly used (e.g. /dev/audio)? .rc/defaults or other
     resource/configuration files or databases?

     None.

   - Does it use any "hidden" (filename begins with ".") or temp files?

     No.

   - Does it use any locking files?

     No.

   - Command line or calling syntax: What options are supported? (please
     include man pages if available) Does it conform to getopt() parsing
     requirements?

     No CLI is provided.

   - Is there support for standard forms, e.g. "-display" for X programs?
     Are these propagated to sub-environments?

     N/A

   - What shared libraries does it use? (Hint: if you have code use "ldd"
     and "dump -Lv")

     libc

   - Identify and justify the requirement for any static libraries.

     N/A

   - Does it depend on kernel features not provided in your packages and
     not in the default kernel (e.g. Berkeley compatibility package,
     /usr/ccs, /usr/ucblib, optional kernel loadable modules)?

     No.

   - Is your project 64-bit clean/ready? If not, are there any
     architectural reasons why it would not work in a 64-bit environment?
     Does it interoperate with 64-bit versions?

     Yes, the project is 64-bit clean/ready.

   - Does the project depend on particular versions of supporting software
     (especially Java virtual machines)? If so, do you deliver a private
     copy?
     What happens if a conflicting or incompatible version is already or
     subsequently installed on the system?

     No.

   - Is the project internationalized and localized?

     There are no internationalization or localization requirements.

   - Is the project compatible with IPv6 interfaces and addresses?

     N/A

12. What is its window/desktop operational environment?

    N/A

13. What interfaces does your project import and export?

    Imported Interfaces

    ----------------+--------------------+-------------------------------
    Interface       | Classification     | Comments
    ----------------+--------------------+-------------------------------
    Solaris DDI     |                    |
    Solaris LDI     |                    |
    scsi_init_pkt   |                    |
    scsi_transport  |                    |
    scsi_destroy_pkt|                    |
    scsi_dmafree    |                    |
    physio          | All are Committed  | Interfaces defined in Sections
    biowait         | interfaces         | 9F, 9S and 3LIB of the Solaris
    biodone         |                    | man pages
    bioerror        |                    |
    buf             |                    |
    bp_mapout       |                    |
    getrbuf         |                    |
    kmem_zalloc     |                    |
    kmem_free       |                    |
    libc            |                    |
    ----------------+--------------------+-------------------------------

    Exported Interfaces

    ---------------+--------------------+--------------------------------
    Interface      | Classification     | Comments
    ---------------+--------------------+--------------------------------
    OSD Kernel API | Volatile           | Kernel module implementing the
    Filesystem     |                    | interface between the filesystem
    interface      |                    | or other kernel applications
                   |                    | and the sosd driver
    ......................................................................
    osd_setup_inquiry            |            |
    osd_setup_list               |            |
    osd_setup_format_lun         |            |
    osd_setup_create_partition   |            | These interfaces are
    osd_setup_remove_partition   |            | available for kernel
    osd_setup_create_object      |            | clients through the
    osd_setup_remove_object      |            | scsi_osd module and to
    osd_setup_create_and_write   | Contracted | user mode clients through
    osd_setup_write              | Project    | libsosd.
    osd_setup_write_pagelist     | Private    |
    osd_setup_read               |            | Contracted Project Private
    osd_setup_read_pagelist      |            | interfaces.
    osd_setup_append             |            |
    osd_setup_set_1page_attr     |            |
    osd_setup_set_get_1page_attr |            |
    osd_add_set_page_attr_to_req |            |
    osd_add_get_page_attr_to_req |            |
    osd_setup_set_get_list_attr  |            |
    osd_submit_req               |            |
    osd_get_result               |            |
    osd_free_req                 |            |
    osd_close                    |            |
    osd_get_max_dma_size         |            |
    ......................................................................
    libsosd        | Contracted         | Library implementing the
                   | Project Private    | interface with usermode
                   |                    | applications / utilities
    USCSI cmd      | Committed          | ioctl interface between the
    ioctl          |                    | user library and the driver
                   |                    | using the USCSI ioctl
    osd_handle     | Contracted         | User level interface used to
    _by_name       | Project Private    | obtain a handle for use in the
                   |                    | OSD API
    OSD ioctls     | Project Private    | Ioctls between libsosd and the
    except for     |                    | driver for the usermode
    osd_handle_*   |                    | implementation
    ---------------+--------------------+--------------------------------

14. What are its other significant internal interfaces (inter-subsystem
    and inter-invocation)?

    N/A

15. Is the interface extensible? How will the interface evolve?

   - How is versioning handled?

     The sosd driver and the scsi_osd/libsosd libraries use version numbers
     for the API that is used to exchange commands. The API is used only if
     the version numbers match.

   - What was the commitment level of the previous version?

     N/A

   - Can this version co-exist with existing standards and with earlier and
     later versions or with alternative implementations (perhaps by other
     vendors)?

     The implementation conforms to the T10 OSD standard and should
     interoperate with other implementations that also conform to the
     standard. There are no prior implementations in Solaris.

   - What are the clients over which a change should be managed?

     Filesystems and filesystem utilities consume the API.
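     A kernel client's use of the exported API listed above might look like
     the sketch below: set up a request, submit it with a completion
     callback, then collect the result and free the request. The exact
     signatures are Contracted Project Private and are not given in this
     case, so every prototype here (argument lists, return types, the
     callback shape) is an assumption for illustration only.

     ```c
     /*
      * Hypothetical asynchronous request flow for a kernel client.
      * Only the function names come from the interface table; all
      * signatures are assumed.
      */
     static void
     my_done(osd_req_t *req, void *arg)
     {
             /* Completion callback, run from the driver's I/O path. */
             int err = osd_get_result(req);

             /* ... propagate 'err' back to the waiting filesystem ... */
             osd_free_req(req);
     }

     static int
     my_create_object(osd_handle_t oh, uint64_t pid, uint64_t oid)
     {
             osd_req_t *req;

             req = osd_setup_create_object(oh, pid, oid);
             if (req == NULL)
                     return (ENOMEM);

             /* Asynchronous: my_done() fires when the command completes. */
             return (osd_submit_req(req, my_done, NULL));
     }
     ```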
     Clients can compare the driver-supported API version (OSD_API_REV in
     the osd.h header file) against the client-supported version to
     validate that they are compatible with the current API.

16. How do the interfaces adapt to a changing world?

    The driver and the API packages are delivered by the same ON build.
    The client (say, a filesystem) may be delivered outside ON and must
    know in advance which version of the API (OSD-1, OSD-2, ...) is
    supported by the driver. The driver has no plans to maintain
    multiple-version compatibility to serve different versions of the API.

17. Interoperability

   - If applicable, explain your project's interoperability with the other
     major implementations in the industry. In particular, does it
     interoperate with Microsoft's implementation, if one exists?

     sosd will interoperate with OSD devices that comply with the T10 OSD-2
     standard.

   - What would be different about installing your project in a
     heterogeneous site instead of a homogeneous one (such as Sun)?

     None.

   - Does your project assume that a Solaris-based system must be in
     control of the primary administrative node?

     No.

18. Performance

   - How will the project contribute (positively or negatively) to "system
     load" and "perceived performance"?

     The driver will not load if no object-based device tries to bind.
     While there is an outstanding open on an object device, the driver is
     used whenever there is I/O (and access) to the device. If there is no
     outstanding open, the driver is automatically unloaded.

   - What are the performance goals of the project? How were they
     evaluated? What is the test or reference platform?

     sosd should provide performance similar to sd in terms of I/Os per
     second and maximum bandwidth. A higher level goal is to increase the
     scalability of parallel filesystem configurations so that more clients
     can be supported by a metadata server.

   - Does the application pause for significant amounts of time?

     No. I/O from a userland client is synchronous (on purpose).
     So the code will block until the I/O request completes. There is no
     high-throughput requirement for usermode clients; typical clients are
     filesystems in kernel mode.

   - Can the user interact with the application while it is performing
     long-duration tasks?

     Users do not interact directly with the sosd driver.

   - What is your project's MT model? How does it use threads internally?
     How does it expect its client to use threads? If it uses callbacks,
     can the called entity create a thread and recursively call back?

     The object device is opened using the ldi_open_* interfaces. The open
     handles are tracked internally, and each open is given a handle with
     which to set up its OSD requests.

     User mode client access is synchronous and blocking. User mode is
     primarily for test programs and filesystem utilities that don't
     demand high throughput.

     In kernel mode request submission, the client registers a callback
     with the driver for asynchronous processing of the I/O request. Each
     I/O request is associated with an osd_req_t datatype that is present
     in both the calling function and the callback function.

     The driver's interface to the transport (HBA) is through
     scsi_init_pkt(9F) to allocate a scsi_pkt(9S) and scsi_transport(9F)
     to send the command. The responses from the HBA are handled in the
     interrupt service routine.

   - What is the impact on overall system performance? What is the average
     working set of this component? How much of this is shared/sharable by
     other apps?

     The impact on system performance depends upon the incoming traffic
     from the clients.

   - Does this application "wake up" periodically? How often and under
     what conditions? What is the working set associated with this
     behavior?

     It does not wake up periodically.

   - Will it require large files/databases (for example, new fonts)?

     No.

   - Do files, databases or heap space tend to grow with time/load? What
     mechanisms does the user have to use to control this? What happens to
     performance/system load?

     N/A

19. Please identify any issues that you would like the ARC to address.

    N/A

20. Appendices to include

   - Project one pager is at:
     http://sac.sfbay.sun.com/PSARC/2008/097/

   - The opensolaris project page is at:
     http://opensolaris.org/os/project/osd/