--------------------------------------------------------------------------
                      pNFS Functional Specification

1 Project Description

The pNFS project will deliver both client- and server-side support for the
portions of NFSv4 minor version 1 (NFSv4.1) that support Parallel NFS
(pNFS). The pNFS project plans to deliver this functionality into the ON
consolidation.

1.1 Definition

NFSv4.1 pNFS is one of the features defined in the NFSv4.1 protocol
specification, which is currently an IETF Internet Draft. The pNFS
functionality is a method of introducing data access parallelism. The
NFSv4.1 protocol defines a method of separating the metadata (names and
attributes) of a filesystem from the location of the file data; it goes
beyond simple metadata and data separation to define a method of striping
the data amongst a set of data servers and allowing clients to access the
data servers directly. This differs from the traditional NFS server, which
holds the names of files and their data under the single umbrella of the
server, where all access to the files must be done through the single
server exporting the file system.

pNFS Definitions

* Metadata Server (MDS) - Entity that provides metadata information about
  a file, including location and attributes (name, modification time,
  etc.), as well as full access to the data.

* Data Server - Entity that stores the data; it may or may not be
  co-located with the MDS.

* Client - Entity that may be capable of retrieving information about the
  location of a file from the metadata server and accessing the data on
  the data servers directly.

* Layout - Defines how a file's data is organized on one or more data
  servers.

* Layout Type - Defines the method for providing the pNFS client direct
  access to the data. The pNFS protocol currently allows three different
  layout types: files, blocks, and objects.
With files-based layouts, a client can access the data with files-based
protocols like NFSv4.1; with blocks, via iSCSI or FCP; and with objects,
via OSD over iSCSI or Fibre Channel. Support for the blocks- and
objects-based layout types is out of scope for the pNFS project.

pNFS Project Components

The pNFS project will deliver the following NFSv4.1 components:

* An NFSv4.1 server that is capable of being a pNFS Metadata Server (MDS)
  and a regular (non-pNFS) NFSv4.1 server.

* A data server which supports the files-based layout type.

* An NFSv4.1 client which supports the files-based layout type.

Again, support for the blocks- and objects-based layout types is out of
scope for the pNFS project. In addition to delivering the NFSv4.1
operations in support of pNFS, the pNFS project will address the following
additional pieces of NFSv4.1 functionality:

* Sessions - Provides a framework for client and server such that "exactly
  once semantics" can be achieved at the NFS application level. This
  functionality is mandatory to implement for NFSv4.1.

* SECINFO_NO_NAME - Modifications to the NFSv4.0 security negotiation
  mechanism.

1.2 Motivation, Goals, and Requirements

The goal of the pNFS project is to add the needed NFSv4.1 functionality in
support of pNFS. The goal of pNFS itself is to provide the capability for
clients to access data in parallel from many data servers and, in turn, to
increase throughput significantly.

1.3 Changes From the Previous Release

N/A - This is the first release for the pNFS project.

1.4 Program Plan Overview

1.4.1 Development

Logistical information for the pNFS project is captured on the NFSv4.1
pNFS OpenSolaris project page.

1.4.2 Quality Assurance/Testing

In summary, the base NFSv4.1 functionality will be tested to make sure the
pNFS implementation complies with the NFSv4.1 specification.
In addition to testing the Solaris client and server against each other,
we will be testing interoperability with other NFSv4.1 implementations at
events such as Connectathon and Bakeathons. Beyond base NFSv4.1 testing,
performance and scalability will be tested in order to verify that we meet
the related criteria specified in section 2.5, below. The pNFS test plan
has been drafted; the complete details for the plan will therefore not be
captured in this document.

1.4.3 Documentation

Documentation will be developed in the context of the NFSv4.1 OpenSolaris
project, and we will be soliciting feedback and contributions from members
of the community. Proposed documentation includes:

* Pre-release documentation for development and review by the OpenSolaris
  community

* Early access whitepapers for publication on BigAdmin and the Sun
  Developer Network

* man pages

* System Administration Guide - The current thought is that pNFS will be
  documented in the Solaris System Administration Guide: Network Services
  manual. If the pNFS content overwhelms this section, it may be moved to
  a new manual.

All documentation developed will be posted under the OpenSolaris NFSv4.1
project and reviewed by that project team. This review will be handled as
internal reviews have historically been done. All comments will be
considered for integration and verified for accuracy.

1.5 Related Projects

1.5.1 Dependencies on Other Sun Projects

The pNFS project is dependent upon NFS/RDMA - Transport version update
(PSARC/2007/347).

1.5.2 Dependencies on Non-Sun Projects

We depend on the NFSv4.1 specification effort of the NFS version 4 Working
Group (NFSv4 WG) of the IETF. The NFSv4 WG is actively working on
stabilizing the NFSv4.1 specification, and it is now considered
functionally complete. To assist in bringing the Internet Draft to closure
(working group last call), the document editors have been hosting a set of
formal reviews that will continue through the summer of 2007.
It is expected that the Working Group will complete the document around
the end of 2007. The pNFS team is very involved in the NFS version 4
Working Group and is able to manage this dependency effectively.

1.5.3 Sun Projects Depending on this Project

None

1.5.4 Projects Rendered Obsolete by this Project

None

1.5.5 Related Active Projects

Converting /etc/default/{nfs/autofs} to SMF properties (PSARC/2007/393)

PSARC/2007/393 is converting the configuration variables stored in
/etc/default/nfs to SMF properties. The pNFS project will be introducing
additional configuration variables and will be following the conventions
set forth in that project.

1.5.6 Suggested Projects to Enhance this Program

None

1.6 Competitive Analysis

There are two ways in which our implementation of pNFS will compete:

* It will compete against other products using proprietary protocols
  (e.g., Cluster File Systems' Lustre product).

* It will compete against other vendors' implementations of pNFS.

Using an open standard such as pNFS gives us many advantages. Foremost is
the fact that our product will interoperate with other vendors' products.
For example, Linux clients will be able to access data on Solaris servers.
An openly developed standard also gives us a better-tested standard to
implement to, meaning that we have a much lower risk of discovering
fundamental design flaws in the protocol. Finally, one of the advantages
of an open protocol is that customers benefit from not being locked into
any one implementation. Solaris's pNFS implementation will gain advantages
from tight integration with ZFS, SMF, and other advanced features of
Solaris.

2 Technical Description

2.1 Architecture

The Parallel NFS or pNFS functionality, as its name implies, is a method
of introducing data access parallelism.
The NFSv4.1 protocol defines a method of separating the metadata (names
and attributes) of a filesystem from the location of the file data; it
goes beyond simple name/data separation to define a method of striping the
data amongst a set of data servers. This is obviously very different from
the traditional NFS server, which holds the names of files and their data
under the single umbrella of the server. There are products in existence
that are multi-node NFS servers, but they are limited in the participation
from the client in the separation of metadata and data. The NFSv4.1 client
can be enabled to be a direct participant in the exact location of file
data and avoid sole interaction with the single NFS server when moving
data.

The NFSv4.1 pNFS server is now a collection or community of server
resources or components; these community members are assumed to be
controlled by the metadata server. The pNFS client still accesses a single
metadata server for traversal or interaction with the namespace; when the
client moves data to and from the server, it may be directly interacting
with the set of data servers belonging to the pNFS server community.

pNFS

.----------------------------------------------------------------------.
|                             pNFS Server                              |
|                                                                      |
|  .-------------.  .-------------.  .-------------.  .-------------.  |
|  | data-server |  | data-server |  | data-server |  | data-server |  |
|  `------.------'  `------.------'  `------.------'  `------.------'  |
|         |                |                |                |         |
|       .------------------------------------------------------.      |
|       |                   metadata server                    |      |
|       `--------------------------.---------------------------'      |
|                                  |                                   |
`-----------.----------------------.-----------------------.-----------'
            |                      |                       |
      .-----+-----.          .-----+-----.          .-----+-----.
      |pNFS Client|          |pNFS Client|   ....   |pNFS Client|
      `-----------'          `-----------'          `-----------'

As mentioned, the user's view of the "pNFS Server" continues to appear on
the network as a regular NFS server even though there are multiple,
distinct and addressable components of the server. There is a single
server from which a filesystem is mounted and accessed. The administrator
knows the details of the "pNFS Server" or community. The pNFS client
implementation will learn the details of the pNFS server through NFSv4.1
protocol interaction. When it comes to things like mount points and
automount maps, the look and feel of the NFS server is the same as it has
been: a single server name and its self-contained namespace.

The pNFS-enabled client determines the location of file data by directly
querying the pNFS server. In pNFS nomenclature, the client requests a file
"layout". When a file is opened, the pNFS client will ask the metadata
server for the file's layout. If it is available, the server will give the
layout to the client. When the client moves data, the layout is consulted
as to the data server(s) upon which the data resides; once the offset and
range are matched to the appropriate data server(s), the data movement is
completed with read and write requests.

The pNFS protocol's layout definition provides for straightforward
striping of data only. There is one twist to the striping -- the location
may be specified by two paths, thus allowing for a simple multi-pathing
feature. With the layout in hand, the pNFS client is then enabled to
generate read/write requests in parallel or by a method of its own choice.
The layout is thus a simple enablement for the pNFS client to increase its
overall data throughput. The pNFS server is also a beneficiary, by way of
horizontal scaling for data access along with the reduction of read/write
operations directly serviced by the metadata server.
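To make the layout-driven data path concrete, the following is a minimal
sketch of how a client might map a file offset to a data server under a
simple striping layout. The structure and helper names here are
illustrative assumptions, not the NFSv4.1 wire format or the Solaris
implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative layout description: data is striped round-robin across
 * num_ds data servers in units of stripe_unit bytes.
 */
typedef struct {
    uint32_t stripe_unit;   /* bytes per stripe unit */
    uint32_t num_ds;        /* number of data servers in the stripe */
} layout_t;

/* Index of the data server holding the given file offset. */
static uint32_t
layout_ds_index(const layout_t *lo, uint64_t offset)
{
    return ((offset / lo->stripe_unit) % lo->num_ds);
}

/* Offset within that data server's backing object. */
static uint64_t
layout_ds_offset(const layout_t *lo, uint64_t offset)
{
    uint64_t su = offset / lo->stripe_unit; /* global stripe unit number */

    return ((su / lo->num_ds) * lo->stripe_unit +
        (offset % lo->stripe_unit));
}
```

With, say, a 64 KB stripe unit over four data servers, bytes 0-64K go to
data server 0, the next 64 KB to data server 1, and so on, wrapping back
to data server 0 at offset 256 KB; the client can therefore issue the
per-server reads and writes in parallel.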
Obviously, the main purpose or intent of the NFSv4.1 pNFS feature is to
significantly improve the data throughput capabilities of NFS servers. The
NFSv4.1 protocol requires that the metadata server always be able to
service read/write requests itself. This requirement allows for NFSv4.1
clients that are not enabled for pNFS, or for cases where the available
parallelism is not required.

The NFSv4.1 protocol defines interaction between client and server. There
is no specification for the interaction between components of the pNFS
server. This interaction or coordination of the pNFS server community
members is left as a pNFS server implementation detail. Given the lack of
an open protocol definition, pNFS server components will be homogeneous in
their implementation. This isn't necessarily a bad thing, since there is a
variety of server filesystem architectures already present in the NFS
server market. The lack of a protocol definition allows for the most
effective reuse of existing filesystem and server technology. Obviously,
there is a well-defined set of requirements or expectations of the
metadata and data servers in the form of the NFSv4.1 protocol.

Maintaining the theme of inclusiveness, the pNFS protocol allows for a
variety of data movement or transfer methods between the client and pNFS
server. The NFSv4.1 layout mechanism defines layout "types". Each type is
then defined as a particular data movement or transport protocol. The
layout mechanism also allows for inclusion of newly defined types such
that the NFSv4.1 protocol can adapt to future requirements or ideas. There
are three types of layouts currently being defined for pNFS; they are
generically referred to as: files, objects, and blocks. The "files" layout
type uses the NFSv4.1 read/write operations to the data server. The
"files" type is being defined within the NFSv4.1 Internet Draft.
The "objects" layout type refers to the OSD storage protocol as defined by
the T10 "SCSI Object-Based Storage Device Commands" protocol draft. The
"blocks" layout refers to the use of SCSI (in its many forms). The pNFS
OSD and block layout definitions are defined in separate Internet Drafts.

For additional detail, the current Internet Drafts for the items mentioned
above are:

"NFSv4 Minor Version 1"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt

"Object-based pNFS Operations"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-obj-03.txt

"pNFS Block/Volume Layout"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-block-03.txt

Solaris pNFS
------------

The initial instantiation of NFSv4.1 for Solaris will deliver the Sessions
and pNFS functionality using the files layout type.

The Transports
--------------

With the introduction of NFSv4.1, there will be no change in the network
transports available to the client or server. The kernel RPC interfaces
will continue to provide TCP and RDMA (in the form of InfiniBand) network
connectivity. The current RPCSEC_GSS mechanisms will continue to be
supported as well.

The Client
----------

The Solaris NFSv4.1 client will be a straightforward implementation of the
Sessions and pNFS capabilities. The administrative model for the client
will remain the same as it is today. As mentioned in the introduction, the
client will continue to mount from a single server and provide a path
(e.g. server.domain:/path/name). Since NFSv4.1 constitutes a new version
of the protocol, the client will negotiate the use of NFSv4.1 with the
server as it has done in the past. The client will have a preference for
the highest version of NFS offered by the server. As the client accesses
filesystems, it will query the server for the availability of the pNFS
functionality.
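The highest-version-first preference just described amounts to a simple
selection over the versions both sides support. The sketch below
illustrates the idea; the bitmask encoding and the helper name are
hypothetical, not the actual Solaris negotiation code.

```c
#include <assert.h>

/*
 * Hypothetical encoding: one bit per protocol version, higher bits
 * meaning higher versions.
 */
#define NFS_V2  0x1
#define NFS_V3  0x2
#define NFS_V4  0x4
#define NFS_V41 0x8

/*
 * Given the versions the server offers and the client's configured
 * maximum (e.g. capped via the "vers=" mount option), return the
 * highest version in common, or 0 if there is none.
 */
static int
nfs_pick_version(int server_vers, int client_max)
{
    int v;

    /* Walk downward from the client's cap toward older versions. */
    for (v = client_max; v != 0; v >>= 1) {
        if (server_vers & v)
            return (v);
    }
    return (0);
}
```

So a client capped at NFSv4.1 talking to a server that only offers v3 and
v4 would settle on v4, matching the fallback behavior described above.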
Then, on OPENing a file, the client attempts to obtain a layout and uses
the layout information when READing/WRITEing and COMMITing data to the
"server" to provide data access parallelism.

The Server
----------

As mentioned in the introduction, the NFSv4.1 protocol does not define the
architecture of the pNFS server; only the outward-facing protocol and
behavior is defined. Given this flexibility, a multitude of architectures
can fit the model of pNFS service. For the Solaris pNFS server, a
straightforward model will be used. Each member of the pNFS server
community (metadata server and data servers) is to be thought of as a
self-contained storage unit.

Metadata server
---------------

For the Solaris pNFS server, there is one metadata server. The metadata
server may be configured with a high availability component that allows
for active/passive availability. The Solaris pNFS metadata server looks
and feels like a regular NFS server. There are extensions to the
management model to integrate the use of the data servers. The metadata
server is in full control of the pNFS server in the sense that it decides
which data servers to utilize (initial inclusion and allocations for new
file layouts).

The metadata server will require the use of ZFS as its underlying
filesystem for storage of the filesystem namespace and attributes.
Architecturally, other filesystems may be used, but they must provide like
functionality for use by the pNFS metadata server; the most important
features are NFSv4 ACL support and system extended attributes.

Data server
-----------

The pNFS data servers do not need a traditional filesystem namespace for
associating names to data, given that the metadata server provides this
service by definition. The data server is then free to associate the file
objects stored upon it with their filehandles (the identifier shared
between the data server and metadata server) in any fashion appropriate.
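As an illustration of this namespace-free association, the sketch below
keeps a flat table keyed only by filehandle. The table, its sizes, and
the helper names are hypothetical; a real Solaris data server would
presumably associate filehandles with backing objects (e.g. DMU objects)
rather than an in-memory table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FH_LEN  16  /* illustrative filehandle length */
#define TBL_SZ  64  /* illustrative table size */

typedef struct {
    unsigned char fh[FH_LEN];   /* filehandle shared with the MDS */
    uint64_t objid;             /* backing data object */
    int in_use;
} fh_entry_t;

static fh_entry_t fh_table[TBL_SZ];

/* Trivial hash over the filehandle bytes. */
static unsigned int
fh_hash(const unsigned char *fh)
{
    unsigned int h = 0;
    int i;

    for (i = 0; i < FH_LEN; i++)
        h = h * 31 + fh[i];
    return (h % TBL_SZ);
}

/* Associate a filehandle with an object; 0 on success, -1 if full. */
static int
fh_insert(const unsigned char *fh, uint64_t objid)
{
    int i;

    for (i = 0; i < TBL_SZ; i++) {
        fh_entry_t *e = &fh_table[(fh_hash(fh) + i) % TBL_SZ];

        if (!e->in_use) {
            (void) memcpy(e->fh, fh, FH_LEN);
            e->objid = objid;
            e->in_use = 1;
            return (0);
        }
    }
    return (-1);
}

/* Find the object backing a filehandle; 0 if found, -1 otherwise. */
static int
fh_lookup(const unsigned char *fh, uint64_t *objidp)
{
    int i;

    for (i = 0; i < TBL_SZ; i++) {
        fh_entry_t *e = &fh_table[(fh_hash(fh) + i) % TBL_SZ];

        if (e->in_use && memcmp(e->fh, fh, FH_LEN) == 0) {
            *objidp = e->objid;
            return (0);
        }
    }
    return (-1);
}
```

The point of the sketch is that no directory hierarchy appears anywhere:
the filehandle alone identifies the stored object.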
One might think that it is a requirement to have the data server mimic the
filesystem namespace of the metadata server, but this is untrue. In fact,
mimicking the metadata namespace would prove to be cumbersome for regular
NFS server operation. The Solaris pNFS data server will have an
architecture that allows for the direct use of ZFS storage pools (zpools)
for file data storage.

Network Requirement
-------------------

The pNFS diagram above implies that there is a need for two separate
networks: one internal to the pNFS community and one for client access to
the pNFS server. This is but a possibility -- a logical representation,
not a requirement. The only requirement with respect to network
configuration is that each component within the diagram above be
addressable or routable from each other. Therefore, a variety of
networking technologies or topologies can be employed for the pNFS server.
The choice of topology or interconnect will be based on the workload being
served by the pNFS server.

Coordination of the pNFS Community
----------------------------------

To this point, we have a Solaris pNFS client interacting with a pNFS
server over a flexible network configuration. The metadata server is using
ZFS as the underlying filesystem for the "regular" filesystem information
(names, directories, attributes). The data server is using the ZFS storage
pool (zpool) to organize the data for the various layouts the metadata
server is handing out to the clients. What is coordinating all of the pNFS
community members? This is the pNFS Control Protocol.

pNFS Control Protocol
---------------------

The control protocol is the piece of the pNFS server solution that is left
to the various implementations to define.
Since the Solaris pNFS solution is taking a fairly straightforward
approach to the construction of the pNFS community, this allows for the
use of ZFS's special combination of features to organize the attached
storage devices. This allows the control protocol to focus on higher-level
control of the pNFS community members. Some of the highlights of the
control protocol are:

* Metadata and data server reboot / network partition indication
* Filehandle, file state, and layout validation
* Reporting of data server resources
* Inter data server data movement
* Metadata server proxy I/O
* Data server state invalidation

In Summary
----------

NFSv4.1 will deliver a range of new features. The two initially addressed
for Solaris will be Sessions and pNFS. The Solaris pNFS client will be a
simple implementation of the protocol. The Solaris pNFS server will layer
on top of existing Solaris technologies to offer a feature-rich solution.

2.2 Interfaces

2.2.1 Exported Interfaces

+-------------+-------------------------+---------------------------------------------------------------------------+-------------------+
| Interface   | Proposed Stability      | Specified in What Document?                                               | Former Stability  |
| Name        | Classification          |                                                                           | Classification or |
|             |                         |                                                                           | Other Comments    |
+-------------+-------------------------+---------------------------------------------------------------------------+-------------------+
| NFSv4.1     | Standard                | Currently                                                                 | The next minor    |
|             |                         | http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt | version of NFSv4  |
|             |                         |                                                                           | (PSARC/1999/289). |
+-------------+-------------------------+---------------------------------------------------------------------------+-------------------+
| libnfsd.so  | Evolving, Consolidation | /usr/include/libnfsd.h                                                    |                   |
|             | Private (?)             |                                                                           |                   |
+-------------+-------------------------+---------------------------------------------------------------------------+-------------------+
| libdserv.so | Evolving, Consolidation | /usr/include/libdserv.h                                                   |                   |
|             | Private (?)             |                                                                           |                   |
+-------------+-------------------------+---------------------------------------------------------------------------+-------------------+

2.2.2 Imported Interfaces

+-------------+-----------------+----------------+------------------------------------------+
| Interface   | Stability       | Specified in   | Former Stability Classification          |
| Name        | Classification  | What Document? | or Other Comments                        |
+-------------+-----------------+----------------+------------------------------------------+
| ZFS DMU     | Consolidation   | dmu.h, dmu.c   | We will most likely need a contract for  |
|             | Private (?)     |                | the use of the DMU interfaces.           |
|             |                 |                | Fortunately, we (pNFS and ZFS) will both |
|             |                 |                | be in the ON consolidation; therefore,   |
|             |                 |                | any incompatible changes in ZFS which    |
|             |                 |                | affect pNFS are bound to be caught very  |
|             |                 |                | early on.                                |
+-------------+-----------------+----------------+------------------------------------------+
| libscf(3)   | Evolving        | libscf(3LIB)   |                                          |
|             |                 | man page       |                                          |
+-------------+-----------------+----------------+------------------------------------------+

2.2.3 Internal Interfaces

Most significant among pNFS's internal interfaces is the "control
protocol" mentioned in section 2.1. This is the protocol used between the
MDS and the data servers. Other internal interfaces will be listed in the
next subsection.

2.2.3.1 Control Protocol

A general description of the RPC-based control protocol is provided here.
This protocol will remain Project Private and subject to change as the
functionality evolves. Note that the project commits to the general
architectural tenet of using RPC versioning to allow for effective
upgrades of the various pNFS server components.

 1. DS_EXIBI: called by the data server when it comes online, to exchange
    identities with the MDS.

 2. DS_REPORTAVAIL: called by the data server when it comes online, and
    periodically thereafter, to report the zpools that it has at its
    disposal and its network addresses.

 3. DS_CHECKSTATE: called by the data server the first time that it
    receives a request from a client for a particular (filehandle, client,
    stateid, principal) tuple. The result from the MDS says whether I/O is
    to be permitted or not.

 4. DS_INVALIDATE: called by the MDS to invalidate state on the data
    server; specifically, the state received on calling DS_CHECKSTATE.

 5. DS_READ: called by the MDS to read from the data server. Used for
    proxy I/O.

 6. DS_WRITE: called by the MDS to write to the data server. Used for
    proxy I/O.

 7. DS_COMMIT: called by the MDS to commit previously written data on the
    data server. Used for proxy I/O.

 8. DS_GETATTR: called by the MDS to query attributes for an object on the
    data server. Currently, "size" is the only attribute that the data
    server can return.

 9. DS_SETATTR: called by the MDS to set attributes for an object on the
    data server. Currently, "size" is the only attribute that may be set.

10. DS_REMOVE: called by the MDS to remove an object on the data server.

2.2.3.2 Other Internal Interfaces

+-------------+-----------------+-------------------------------+---------------------------+
| Interface   | Stability       | Specified in What Document?   | Other Comments            |
| Name        | Classification  |                               |                           |
+-------------+-----------------+-------------------------------+---------------------------+
| /dev/dserv  | Project Private | /usr/include/sys/dserv_impl.h | Only used by libdserv.so  |
| ioctl()     |                 |                               | and/or libnfsd.so.        |
+-------------+-----------------+-------------------------------+---------------------------+
| _nfssys()   | Project Private | nfssys() in nfs_sys.c         | Already exists; new       |
| system call |                 |                               | functionality will be     |
|             |                 |                               | integrated into           |
|             |                 |                               | libnfsd.so and possibly   |
|             |                 |                               | libdserv.so.              |
+-------------+-----------------+-------------------------------+---------------------------+

2.3 User Interface

The pNFS project will deliver a set of new command line interfaces (CLIs)
and modifications to existing CLIs. The focus for the CLIs to be delivered
is on the system administrator and not an end user. With the exception of
the modifications to mount_nfs(1M) (additional information below), the
CLIs will be used by individuals in order to configure and manage a pNFS
installation. The pNFS project will not provide a GUI for managing pNFS.
The CLIs can be divided up into the following sub-sections:

* Extensions to existing commands

mount_nfs(1M) will be modified in the following ways:

1. Addition of a new value for the "vers=" option in order to allow a user
   to specify an NFS version of "41" (i.e., NFSv4.1). This new value will
   exist in addition to the other "vers=" values: 2, 3 and 4.

2. Addition of a "nopnfs" option. This option specifies that the client
   should disable the use of pNFS on this mount.

nfsstat(1M) will be modified to display statistics about the NFSv4.1
protocol as well as the control protocol between the metadata server and
the data servers.

zfs(1M) will be modified to allow the "list" subcommand to display the
presence of pNFS datasets in a zpool on the data server.

zfs(1M) will be modified to allow the "create" subcommand to set a new
property at the time of creating a ZFS file system for storage of file
metadata on the pNFS metadata server. This property will flag the file
system as one that is to be used for the purpose of storing file metadata.
The name of this property is still to be determined, but one possibility
is "pnfs".
Values for the property would be "on" or "off".

For diagnosability, snoop(1M) will be modified to support the NFSv4.1 and
control protocols, and mdb(1) macros, walkers and dcmds (yet to be
defined) will be added.

* New commands

npool(1M) is a new command that is used to specify the data servers that
the metadata server will use for determining the layout of a file. For
further information, refer to the draft man page for the npool(1M)
command.

dpool(1M) is a new command that is used to specify which storage (i.e.,
which ZFS storage pools or, in the future, which QFS file system) on a
data server machine should be used for the storage of pNFS file data. For
further information, refer to the draft man page for the dpool(1M)
command.

pnfsalloc(1M) is a new command that allows a user to specify rules for the
client and the server to consult when determining the layout of a newly
created file. For further information, refer to the pnfsalloc(1M) command
specification.

2.4 Compatibility and Interoperability

2.4.1 Standards Conformance

We are implementing the NFSv4.1 specification, and we are not deviating
from or extending the standard in any incompatible way.

2.4.2 Operating System and Platform Compatibility

This project will integrate into Solaris Nevada. There are no special
hardware requirements, but the NFS/RDMA (PSARC/2007/347) project will give
us the ability to exploit the RDMA capabilities of InfiniBand, which is
relevant to our target audience.

2.4.3 Interoperability with Sun Projects/Products

Consolidation-private APIs, such as libnfsd.so and libdserv.so, may give
other management frameworks the capability to manage the functionality of
this project.

2.4.4 Interoperability with External Products

A benefit of implementing the NFSv4.1 specification is that we have the
opportunity to interoperate with other vendors' NFSv4.1 implementations.
The interoperability of the Solaris implementation with other vendors'
implementations is tested at events such as Connectathon and Bakeathons.
No interoperability with any other external products is planned.

2.4.5 Coexistence with Similar Functionality

No similar functionality is currently delivered with Solaris.

2.4.6 Support for Multiple Concurrent Instances

We will support a regular NFSv4.1 server and a pNFS metadata server being
active simultaneously. This is so that a single NFSv4.1 server can support
pNFS file systems and non-pNFS file systems simultaneously. We will
support the capability of a pNFS metadata server and a pNFS data server
being active on the same machine and on the same port (2049). For the pNFS
client, multiple instances are only relevant if thought of as multiple
mounts.

2.4.7 Compatibility with Earlier and Future Releases

This is the first release, so compatibility with earlier releases is not
applicable. To accommodate future releases, the control protocol will be
versioned, as will the data server filehandles used to identify the
individual stripes.

2.5 Performance and Scalability

2.5.1 Performance Goals

The preliminary performance goals for the pNFS project are as follows:

* Data-intensive throughput to a client should be at least 3.3x the
  throughput of a single server/pipe when transferring to or from four
  server pipes, and ideally 3.75x. This specification is intended for a
  reference platform of an x2100 client on 1 Gbit Ethernet and a SPARC
  platform TBD (probably T2000). Stretch goals should be adopted for
  larger clients with more processors and especially with more bridge
  bandwidth (x4600 now).

* For single-threaded transfer to/from a client, pNFS data-intensive
  performance should be equivalent (within 2%) to standard NFSv4
  data-intensive performance on the same configuration. This should be
  measured for both 1 Gbit Ethernet and NFS-over-RDMA-over-IB.
* For single-threaded transfer to/from a data server, data-intensive
  performance of a pNFS data server should be equivalent (within 2%) to
  standard NFSv4 on the same configuration, whether using Ethernet or IB.

* For aggregate throughput to/from a data server, on small configurations
  (x2100, T2000), the server should be able to consume at least 85% of
  bridge bandwidth; this means that about 40% of bridge bandwidth should
  be supplied to the networks. For example, if bridge bandwidth is 1
  GB/sec, the server should be able to move about 400 MB/sec in large
  I/Os, since moving data from disk to memory and from memory to network
  uses the bridge twice.

* For a metadata-weighted workload, the NFSv4.1 pNFS-enabled MDS will
  perform at no less than 90% of an NFSv4.1 server without pNFS
  functionality enabled. The CREATE operation is treated differently,
  since it places policy interpretation on the performance hot path. With
  no special policy on open, the MDS should be within 10% of an NFSv4.1
  server.

* Metadata server metrics (read/write through MDS) - Read/write throughput
  done to/from the NFSv4.1 pNFS-enabled MDS will perform at no less than
  80% of a standard NFSv4.1 NFS server.

* Data server metrics - For read/write throughput, the NFSv4.1 data server
  will perform at no less than 98% of a standard NFSv4.1 NFS server.

* NFS over RDMA must comply with the standard and meet or exceed HPC
  customer expectations. (980 MB/sec reads and 650 MB/sec writes have been
  achieved x2100 -> x2100, hardware limited, on-nv to on-nv.)

2.5.2 Performance Measurement

The exact methods for doing performance measurement and analysis are still
to be determined.

2.5.3 Scalability Limits and Potential Bottlenecks

No exact scalability limits or potential bottlenecks have been identified
at this time.

2.5.4 Static System Behavior

No exact information is available at this time.

2.5.5 Dynamic System Behavior

No exact information is available at this time.
2.6 Failure and Recovery

2.6.1 Resource Exhaustion

As with other file systems, if disk space is exhausted, ENOSPC errors will
be returned to the application. If memory resources are exhausted, we may
return EAGAIN to the application, and in cases where memory allocations
are done in the kernel with the KM_SLEEP flag, system hangs are possible.
However, this can be mitigated by being conscious of the amount of memory
being allocated when doing KM_SLEEP allocations.

2.6.2 Software Failures

One of the main causes of software failures will probably be bugs in the
code. The team will reduce the risk of failures by exercising due
diligence during development, via design reviews, code reviews and test
execution. No new software failure avoidance or recovery mechanisms are
introduced by this project.

2.6.3 Network Failures

As pNFS is a network protocol, network failures are taken into account and
are well documented in the NFSv4.1 specification. The control protocol
(between the metadata server and data server) is also designed to be
resilient to network failures. The control protocol is an RPC-based
protocol, and high-level information about the messages in the protocol is
documented in section 2.2.3.1.

2.6.4 Data Integrity

Data integrity will be no different than it is for ordinary NFS.

2.6.5 State and Checkpointing

Recovery from failed pNFS components, including recovery of NFS state, is
documented in the NFSv4.1 specification.

2.6.6 Fault Detection

Our implementation will leverage existing commands (e.g., zpool status),
extend existing commands (nfsstat, snoop) and introduce new commands
(dpool status, npool status) to allow a user to detect and diagnose a
failure.

2.6.7 Fault Recovery (or Cleanup after Failure)

NFS currently leverages SMF for the management of NFS-related services.
Our pNFS implementation will continue to build on this in order to allow
for service restarting and dependency management.
2.7 Security
All minor versions of NFSv4 require support for Kerberos; therefore, this is also true for pNFS. Kerberos will be supported for all operations between the client and the metadata server, as well as for all operations between the client and the data servers. Additionally, Kerberos will be supported for the control protocol, which operates between the metadata server and the data servers. Host principals will exist for the metadata server and data servers.

In addition to the support of Kerberos by the control protocol, data servers must be approved by an administrator on the metadata server before being used within the server community. This is accomplished with the npool(1M) command as documented in section 2.3.

2.8 Software Engineering and Usability

2.8.1 Namespace Management
Commands are documented elsewhere. Configuration data will be stored in SMF.

2.8.2 Dependencies on non-Standard System Interfaces
None.

2.8.3 Year 2000 Compliance
No Year 2000 compliance issues exist.

2.8.4 Internationalization (I18N)
All new commands will produce message catalogs.

2.8.5 64-bit Issues
No issues.

2.8.6 Porting to other Platforms
This project will not be ported to other platforms.

2.8.7 Accessibility
This project will be entirely administrable with command line interfaces.

3 Release Information

3.1 Product Packaging
This product will be bundled in the ON consolidation. No new packages will be delivered beyond those that already exist: SUNWnfscr, SUNWnfscu, SUNWnfssr, and SUNWnfssu. Client-side functionality will be delivered in SUNWnfsc*; server-side functionality will be delivered in SUNWnfss*. Note that all configuration information will be stored in SMF; therefore, there will be no addition or modification of configuration files.

3.2 Installation

3.2.1 Installation procedure
pNFS will be installed as part of the Solaris installation procedure. No additional installation steps are needed.
Configuration of pNFS will be done with the command line interfaces documented in section 2.3.

3.2.2 Effects on System Files
No effect on system files.

3.2.3 Boot-Time Requirements
None.

3.2.4 Licensing
Our pre-ON putback source code and binaries (in the form of BFU archives) are being posted. You can find the source and binaries on our NFSv4.1 pNFS OpenSolaris project page under Downloads. The source code is licensed under the Common Development and Distribution License (CDDL). The pre-ON putback binaries are licensed under the OpenSolaris Binary License (OBL).

3.2.5 Upgrade

3.2.6 Software Removal
We will not provide the capability to remove or uninstall pNFS from a Solaris machine. However, a user can disable the use of pNFS in any of the following ways:
1. On the client, execute the mount_nfs(1M) command with "-o nopnfs" or "-o vers=[2|3|4]".
2. On the client, set the NFS_CLIENT_VERSMAX configuration variable to 4, 3, or 2. Clients require NFSv4.1 ("41") support in order to use pNFS; setting NFS_CLIENT_VERSMAX as specified caps the maximum version of the NFS protocol that the NFS client will use.
3. On the server, set the NFS_SERVER_VERSMAX configuration variable to 4, 3, or 2. Servers require support of NFSv4.1 ("41") in order to use pNFS; setting NFS_SERVER_VERSMAX as specified caps the maximum version of the NFS protocol that the NFS server will offer.
4. Disable the NFS server, data server, and client SMF services.

Note that the current method for setting NFS_SERVER_VERSMIN and NFS_SERVER_VERSMAX is in the /etc/default/nfs configuration file. PSARC/2007/393 is pursuing the conversion of these configuration variables into SMF properties.

3.3 System Administration
The pNFS project will not deliver a GUI for administration of a pNFS configuration. The command line interfaces listed in section 2.3 will be the sole administrative interfaces.
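For illustration, the version-cap method of disabling pNFS described in section 3.2.6 amounts to a fragment like the following in /etc/default/nfs. The variable names are the ones cited in that section; this mechanism remains valid only until PSARC/2007/393 converts these variables into SMF properties:

```shell
# /etc/default/nfs (fragment) - cap the client at NFSv4.0 so that
# NFSv4.1, and therefore pNFS, is never negotiated.
NFS_CLIENT_VERSMAX=4

# On a server, the analogous setting caps what the server will offer:
# NFS_SERVER_VERSMAX=4
```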
4 pNFS Client Architecture

4.1 Description
pNFS is tightly integrated into the existing NFS client architecture to maximize code re-use. The vast majority of the code will be common between NFSv4.0 and NFSv4.1, since the non-data-related interactions are roughly the same. The overall NFS client architecture is maintained. The code paths diverge slightly when performing an "over the wire" I/O request. In the pNFS I/O pathway, the client must evaluate the request in the context of the current file layout and split the request into pieces destined for each distinct data server. These pieces can then be dispatched to the various data servers in parallel.

4.2 Interfaces

4.2.1 User-visible
The traditional SUS file interfaces remain the primary interface by which application programs access pNFS files. No API extensions will be delivered by this project. The mount command will be extended to provide administrative control over the client's selection of version and minor version numbers. The administrator can also disable the use of pNFS with the -o nopnfs option. The administrator can also control these parameters via system-wide SMF properties (or configuration variables in /etc/default/nfs until PSARC/2007/393 is implemented). The client also has some control over how a new file is to be striped over the data servers in the community. This information is carried over the network in the NFSv4.1 protocol as hints and is strictly advisory to the metadata server. Further details of this interface are given in section 6.

4.2.2 Internal (optional for ARC review)
The internal structure of the NFS client remains largely unchanged. pNFS I/O requests are sent in parallel to the relevant data servers using mechanisms similar to the asynchronous thread model already in place. An internal interface within the NFS client is created to deal with the differences between minor version 0 and minor version 1. This private interface is not presented here.
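The request-splitting step described in section 4.1 can be sketched as follows. This is a simplified model, assuming a round-robin files layout with a fixed stripe unit; the function name and signature are illustrative, not the actual Solaris client code:

```python
# Hypothetical sketch: split one client I/O request into per-data-server
# byte ranges, assuming a simple round-robin files layout.

def split_io(offset, length, stripe_unit, num_servers):
    """Return {server_index: [(offset, length), ...]} for one I/O request."""
    pieces = {}
    end = offset + length
    cur = offset
    while cur < end:
        stripe = cur // stripe_unit          # which stripe unit this byte is in
        server = stripe % num_servers        # round-robin over data servers
        unit_end = (stripe + 1) * stripe_unit
        chunk = min(end, unit_end) - cur     # bytes remaining in this stripe unit
        pieces.setdefault(server, []).append((cur, chunk))
        cur += chunk
    return pieces

# A 192 KB read at offset 0 with a 64 KB stripe unit over 3 data servers:
# each server receives exactly one 64 KB piece, dispatchable in parallel.
print(split_io(0, 196608, 65536, 3))
```

Each entry in the result corresponds to one over-the-wire request to one data server, which is what allows the pieces to be dispatched in parallel.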
4.3 Operation
The operational characteristics of a pNFS-enabled client are largely transparent. The pNFS-enabled client will auto-negotiate the version and minor version numbers, per server, at mount time. The client will get a file's layout prior to initiating any I/O. If a given file does not have a layout, the client will fall back to doing normal NFSv4.1 I/O operations directly to the metadata server, transparently to the application. The I/O caching characteristics of the client are unchanged by pNFS. The client will use the traditional "close-to-open" semantics for flushing data back to the data servers.

5 pNFS Metadata Server (MDS) Architecture

5.1 Description
The pNFS Metadata Server is functionally very similar to a conventional Solaris NFSv4 server. The main differences are that it will store pNFS layout information in system attributes, it will not store file data locally, and it will communicate via the control protocol with pNFS data servers.

5.2 Interfaces

5.2.1 User-visible
The new npool(1M) command, outlined in section 2.3, is used to administer the resources used by the MDS. The zfs(1M) command will be modified to allow the setting of a new property at the time of creating a ZFS file system (i.e., with zfs create). This property will flag that the file system is to be used for the purpose of storing file metadata on the pNFS metadata server. The name of this property is still to be determined, but one possibility is "pnfs". Values for the property would be "on" or "off".

5.2.2 Internal (optional for ARC review)

5.3 Operation
The pNFS Metadata Server is started the same way that other NFS shares are started; that is, either by sharemgr(1M) or by ZFS itself.

6 pNFS Data Server Architecture

6.1 Description
The pNFS data server consists of a kernel module that registers itself with the Solaris NFS server.

6.2 Interfaces

6.2.1 User-visible
The dpool command is used to allocate and control resources used by the pNFS data server.
6.2.2 Internal (optional for ARC review)
The data server is controlled via libdserv.so, an internal API. libdserv.so uses SMF (libscf.so) to store configuration information and to start/stop the processes needed to function as a pNFS data server.

6.3 Operation
Requests specific to a pNFS data server are routed to the data server kernel module described in section 6.1. Changes will also be made to the NFS daemon (nfsd) to support a pNFS data server.

7 Layout Allocation Architecture

7.1 Description
Layout allocation is used on both a pNFS client and a pNFS metadata server. On a client, it governs the contents of a layouthint attribute, which a metadata server may use to influence the layout of a newly created file. On a metadata server, the policy engine governs newly created layouts directly, with provisions for taking a layouthint attribute into account.

7.2 Interfaces

7.2.1 User-visible
The pnfsalloc(1M) command line interface is the most obvious user-visible aspect of layout allocation. On a pNFS client, there will also be a yet-to-be-determined CLI that will show the layout of an existing file.

7.2.2 Internal (optional for ARC review)

7.3 Operation
The layout allocator is accessible to the other components (pNFS client and pNFS metadata server) via a project private API.
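The interplay described in section 7.1, where a client-supplied layouthint is strictly advisory and the metadata server's policy engine decides the actual layout, can be sketched as follows. The field names, default values, and clamping policy are purely illustrative assumptions; the real attribute is an XDR-encoded structure defined by the NFSv4.1 protocol:

```python
# Illustrative sketch of an MDS-side layout allocation decision: merge a
# client's advisory layout hint with server defaults. Field names and the
# policy cap are hypothetical, not the NFSv4.1 XDR structure.

SERVER_DEFAULTS = {"stripe_count": 4, "stripe_unit": 65536}
MAX_STRIPE_COUNT = 8    # assumed server-side policy limit

def allocate_layout(hint=None):
    """Return the layout actually used for a newly created file.

    The hint is strictly advisory: the server starts from its defaults,
    applies any hinted fields it recognizes, then clamps to policy limits.
    """
    layout = dict(SERVER_DEFAULTS)
    if hint:
        for key in ("stripe_count", "stripe_unit"):
            if key in hint:
                layout[key] = hint[key]
    layout["stripe_count"] = min(layout["stripe_count"], MAX_STRIPE_COUNT)
    return layout

print(allocate_layout())                        # no hint: server defaults
print(allocate_layout({"stripe_count": 16}))    # hint clamped to the policy cap
```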
--------------------------------------------------------------------------
Appendix A: Standards Supported

NFS version 4.1 - http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt

--------------------------------------------------------------------------
References

R.1 Related Projects
NFS/RDMA - Transport version update (PSARC/2007/347)
Converting /etc/default/{nfs/autofs} to SMF properties (PSARC/2007/393)

R.2 Background Information for this Project or its Product
NFS version 4.1 specification - http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt

R.3 Interface Specifications
npool(1M)
dpool(1M)
pnfsalloc(1M)

R.4 Project Details
NFSv4.1 pNFS OpenSolaris project