 
Article

OSD (Object-Based Storage Device) Support in the Solaris OS

 
By Ramana Srikanth, October 2007  

Find out about object-based storage device (OSD) support in the Solaris OS.


Introduction

This article focuses on object-based storage device (OSD) support in the Solaris OS. An overview of OSD in general is provided by a previous Sun Developer Network (SDN) article.

In the Solaris OS, the original motivation to support OSD was to build a petascale high-performance computing (HPC) system for the Defense Advanced Research Projects Agency (DARPA). The High Productivity Computing Systems (HPCS) solicitation specified a requirement for an object-based file system. Object storage and the driver that understands object semantics are essential to achieving high I/O bandwidth and scalability requirements. Additionally, Sun endorses a standards-based approach, and OSD is an approved T10 standard.

For background information about the ANSI T10 SCSI OSD Version 1 command set extensions, which include object-based semantics, refer to the T10/1355-D project standard.

The Storage Network Industry Association (SNIA) OSD Technical Work Group is working on OSD-2, which is a further extension to this command set under project T10/1729-D. Apart from several clarifications to OSD-1, interesting features in OSD-2 include:

  • Multiple-object operations
  • Snapshots
  • Exception management
  • Reservations

Other enhancements include 64-bit CDB and attribute list alignments, enhanced security, reads past end of object, setting attributes without a data buffer, range-based flush, and the CLEAR and PUNCH commands. In summary, OSD-2 gives the storage device more control and responsibility.

Why Is OSD Important?

Object storage plays a critical role in storage and access technology for future computing. Modern data access patterns are closely tied to the attributes associated with data, and block-based file systems bear the heavy burden of applying extra software and hardware resources to retrieve data based on its attributes. OSD, being a SCSI-based protocol, provides an excellent level of abstraction between a file system and its storage devices, and it can be implemented on existing infrastructure without the complex problems of migrating data to different hardware or a different protocol. High-performance computing (HPC) is becoming a major requirement for the financial and scientific industries, and OSD, in combination with emerging transports like iSCSI Extensions for RDMA (iSER), can be of tremendous value to such computing platforms.

Before discussing how object-based storage is supported in the Solaris OS, let us examine the difference between object storage file systems and traditional block-based file systems.

Comparing Traditional File Systems and OSD

The motivation behind implementing OSD in the Solaris OS is to provide a standards-based, scalable, high-performance file system. The following diagram shows the difference between a traditional file system stack and an object-based file system stack.

Figure showing the difference between traditional file system stack and object-based file system stack

Traditional file systems have two basic components:

  • user component
  • storage component

The file system user component is responsible for:

  • directory hierarchy management and navigation
  • name resolution and file name mapping
  • user authentication and access control (authorization)
  • concurrency
  • coherency
  • data properties (metadata: size, timestamp, access time, and so on)

The file system storage component is responsible for:

  • space and capacity management
  • storage allocation for data entities (block layout)
  • block address conversion
  • caching (pre-fetching)
  • attribute interpretation (partial)
  • strong security
  • migration

On average, the storage component accounts for 90% of the file system workload. Implementing OSD in the Solaris OS means moving the traditional file system storage component to the OSD. Doing so allows:

  • Standard object interface (no proprietary locking)
  • Storing metadata along with the object instead of storing it at user component level (attributes)
  • Off-loaded data path (performance)
  • Shared access by many clients (scaling)
  • Lean file and object management (reduced complexity)
  • Data security (devices authenticate client requests; policy-based key management)
  • Data integrity (end-to-end data check within device)
  • Automatic data migration (policy-based archiving by device)
  • Parallel I/O access (scalability and performance suited to HPC deployments)

Architecture of OSD in the Solaris OS

Traditional block-based drivers (sd for disk and st for tape) are the primary SCSI target drivers in the Solaris OS. The Solaris Device Driver Interface (DDI) framework and Sun Common SCSI Architecture (SCSA) provide the necessary infrastructure for an object-based device driver to be easily integrated into the driver stack with interfaces to several transports that the Solaris OS already supports. Certain changes on the initiator side (Sun StorageTek Shared QFS enhanced for objects) and on the target side (Solaris iSCSI target and Sun StorageTek Object QFS) are also needed to support object storage in the Solaris OS.

The basic command set for OSD is simple. (For more information, see the OpenSolaris web page for OSD.) The underlying protocol is SCSI, and with a few changes for object support, it can run on any transport that the Solaris OS supports, such as Fibre Channel, parallel SCSI, iSCSI, or InfiniBand (IB).

There are two sections to be discussed in the overall architecture:

  • Initiator node architecture
  • Target node architecture

Initiator Node Architecture

The OSD OpenSolaris web page contains the new SCSI OSD (sosd) driver code along with new interfaces and SCSA layer changes. Changes in Network Storage (NWS) for the iSCSI initiator and the MPxIO layer to support BIDI and large CDB will soon be available on OpenSolaris. Object-based Shared QFS code changes for sending object commands are unavailable at this point.

The following diagram shows how OSD is implemented in the initiator stack.

Figure showing how OSD is implemented in the Solaris host initiator stack.

The sd and st drivers are traditional block-based target drivers, and sosd is the new object target driver. From an end-to-end point of view, this stack is the initiator. The object-based Shared QFS module is the component of Sun StorageTek Shared QFS that generates the OSD commands, which are processed by the sosd driver and sent down to the transport layer.

There are several transport technologies that could provide connectivity between object file system initiators and OSD targets. Sending OSD commands using iSCSI over a TCP/IP network is a convenient approach, similar to using SCSI RDMA Protocol (SRP) or SCSI over Fibre Channel (FCP). iSCSI Extensions for RDMA (iSER) is a powerful transport subsystem that replaces TCP/IP with RDMA operations; in combination with OSD, it allows for high performance and improved scalability for Solaris-supported object file systems.

The storage devices in the previous diagram could be Solaris target implementations, so many CPU-intensive operations could still be handled by Solaris servers instead of a low-end OSD target device.

The key components needed for Solaris OSD initiator support are as follows:

  • New object-based Shared QFS: This file system understands T10 OSD semantics such as objects, partitions, and attributes. It is distributed and scalable to support parallel clients and storage devices.
  • New osd_iotask(9S) data structure: This contains the OSD command information and replaces the buf(9S) structure used by block-based file systems. This is created and initialized by the file system, passed to the driver to perform the OSD operation, and returned to the file system with data or status.
  • New OSD I/O interface: This is the communication path from file system to target driver that replaces the standard bdev_strategy() for I/O requests. The most notable interface functions are the following:
    • osd_iotask_alloc(9F)
    • osd_iotask_free(9F)
    • osd_iotask_start(9F)
    • osd_iotask.ot_iodone (completion function)
    • osd_open_by_dev(9F)
    • osd_close(9F)
  • New target driver (sosd): The SCSI object storage device driver understands the T10 OSD CDB, sense data, and so on. It is derived from the Solaris sd target driver, but it uses the new OSD file system interface. It includes the uscsi(7i) command path, which uses conventional open/close/ioctl interfaces.
  • SCSA, HBA driver, and MPxIO enhancements: The enhancements are to support large CDBs (256 bytes), bidirectional data transfers (BIDI), and descriptor sense data.
  • Link generator to build Solaris dev nodes (SUNW_sosd_link.so): Currently, each OSD LUN is mapped to a single minor device. This is a character node for now and does not present partition semantics.

Target Node Architecture

The OSD OpenSolaris code contains the Solaris iSCSI target code changes to support the OSD command set. The target object QFS changes (object interface with no namespace) are unavailable at this time.

The underlying device also needs to understand OSD semantics to process the commands sent to it. The Solaris iSCSI target, in this example, supports the OSD command set. Even though the disk is traditionally a block device, object storage devices are functionally rich and autonomous by definition. A few required features must be handled in the iSCSI target emulator to make end-to-end OSD functionality work. In the case of physical object storage devices, such functionality is implemented in the disk and controller firmware.

The following diagram shows where the OSD commands are handled by the Solaris iSCSI target.

Figure showing where the OSD commands are handled by the Solaris iSCSI target.

Note the following changes:

  • iSCSI target changes: A new command handler for OSD commands is implemented in the same layer where T10 SPC commands are handled. When the command parser within the target receives an OSD command, it forwards the command to the OSD emulator module. Additionally, T10-defined OSD attributes are handled by the OSD target emulator.
  • File system target changes: Stand-alone QFS is enhanced to support a new object interface. The stand-alone QFS file system on the target has no namespace. New ioctls support creating and opening objects. An object is identified by a 64-bit inode.generation number.

Future Work

HPC, as a consumer of Solaris OSD, is the future direction for this technology. Currently, Shared QFS supports large enterprises, grid deployments, and HPC. With Shared QFS and OSD there are distinct advantages in terms of scalability and performance. Storage allocation on the target node increases horizontal scalability of parallel file systems. Traditional block allocation on the metadata server limits scaling. Hence, with OSD-enabled Shared QFS, space allocation moves to the storage node and is done in parallel. The bandwidth scales up as capacity increases.

Currently, block-based Shared QFS supports a single Metadata Server (MDS) for storage allocation and naming services. With storage allocation moving down to the storage node in object-based Shared QFS, the primary responsibility of the MDS now is naming services. Though Shared QFS supports a single MDS at this time, plans are in place to support multiple MDS nodes.

Archiving, provided by the Storage Archive Manager (SAM), is already supported in the block space. The ultimate goal of Shared QFS with OSD is to support thousands of compute nodes with a high-performance storage platform that effectively uses the features that provide value to both the storage and computing subsystems.

Shared QFS as a parallel file system has a major role to play in a multi-node cluster environment. Supporting high data rates in an HPC deployment depends heavily on each compute node's I/O transfer sizes and transport capabilities. To gain the maximum benefit from the object data path, there are some key features that the Solaris OS needs to support. The most important among them are:

  • Multi-segment DATA IN and DATA OUT buffers in most layers of the I/O path
  • BIDI (bidirectional I/O)
  • DMA breakup (support for PKT_DMA_PARTIAL on SPARC and x86 platforms)

Reference

For more details, visit the OSD project pages and communities on the OpenSolaris web site.

About the Author

Ramana Srikanth is a software engineer in the Solaris Storage Group at Sun Microsystems Inc., developing iSCSI, Fibre Channel, MPxIO and Target device drivers for the NWS and OS/Net consolidations. Ramana has an M.S. from the University of Toledo in the U.S. and has been working for Sun since December 2001 in various test and development roles.
