--------------------------------------------------------------------------
                      pNFS Functional Specification

1 Project Description

The pNFS project will deliver both client- and server-side support for the
portions of NFSv4 minor version 1 (NFSv4.1) that support Parallel NFS
(pNFS). The pNFS project plans to deliver this functionality into the ON
consolidation.

1.1 Definition

NFSv4.1 pNFS is one of the features defined in the NFSv4.1 protocol
specification, which is currently an IETF Internet Draft. The pNFS
functionality is a method of introducing data access parallelism. The
NFSv4.1 protocol defines a method of separating the metadata (names and
attributes) of a filesystem from the location of the file data; it goes
beyond simple metadata and data separation to define a method of striping
the data amongst a set of data servers and allowing clients to access the
data servers directly. This differs from the traditional NFS server, which
holds the names of files and their data under the single umbrella of the
server, where all access to the files must be done through the single
server exporting the file system.

pNFS Definitions

* Metadata Server (MDS) - Entity that provides metadata information about
  a file, including location and attributes (name, modification time,
  etc.), as well as full access to the data.

* Data Server - Entity that stores the data; it may or may not be
  co-located with the MDS.

* Client - Entity that may be capable of retrieving information about the
  location of a file from the metadata server and accessing the data on
  the data servers directly.

* Layout - Defines how a file's data is organized on one or more data
  servers.

* Layout Type - Defines the method for providing the pNFS client direct
  access to the data. The pNFS protocol currently allows three different
  layout types: files, blocks, and objects.
With files-based layouts, a client can access the data with files-based
protocols like NFSv4.1; with blocks, via iSCSI or FCP; and with objects,
via OSD over iSCSI or Fibre Channel. Support for the blocks- and
objects-based layout types is out of scope for the pNFS project.

pNFS Project Components

The pNFS project will deliver the following NFSv4.1 components:

* An NFSv4.1 server that is capable of being a pNFS Metadata Server (MDS)
  and a regular (non-pNFS) NFSv4.1 server.

* A data server which supports the files-based layout type.

* An NFSv4.1 client which supports the files-based layout type.

Again, support for the blocks- and objects-based layout types is out of
scope for the pNFS project. In addition to delivering the NFSv4.1
operations in support of pNFS, the pNFS project will address the following
additional pieces of NFSv4.1 functionality:

* Sessions - Provides a framework for client and server such that "exactly
  once semantics" can be achieved at the NFS application level. This
  functionality is mandatory to implement for NFSv4.1.

* SECINFO_NO_NAME - Modifications to the NFSv4.0 security negotiation
  mechanism.

1.2 Motivation, Goals, and Requirements

The goal of the pNFS project is to add the needed NFSv4.1 functionality in
support of pNFS. The goal of pNFS itself is to provide the capability for
clients to access data in parallel from many data servers and, in turn, to
increase throughput significantly.

1.3 Changes From the Previous Release

N/A - This is the first release for the pNFS project.

1.4 Program Plan Overview

1.4.1 Development

Logistical information for the pNFS project is captured on the NFSv4.1
pNFS OpenSolaris project page.

1.4.2 Quality Assurance/Testing

In summary, the base NFSv4.1 functionality will be tested to make sure the
pNFS implementation complies with the NFSv4.1 specification.
In addition to testing the Solaris client and server against each other,
we will be testing interoperability with other NFSv4.1 implementations at
events such as Connectathon and Bakeathons. Beyond base NFSv4.1 testing,
performance and scalability will be tested in order to verify that we meet
the related criteria specified in section 2.5, below. The pNFS test plan
has been drafted; the complete details for the plan will therefore not be
captured in this document.

1.4.3 Documentation

Documentation will be developed in the context of the NFSv4.1 OpenSolaris
project, and we will be soliciting feedback and contributions from members
of the community. Proposed documentation includes:

* Pre-release documentation for development and review by the OpenSolaris
  community

* Early access whitepapers for publication on BigAdmin and the Sun
  Developer Network

* man pages

* System Administration Guide - The current thought is that pNFS will be
  documented in the Solaris System Administration Guide: Network Services
  manual. If the pNFS content overwhelms this section, it may be moved to
  a new manual.

All documentation developed will be posted under the OpenSolaris NFSv4.1
project and reviewed by that project team. This review will be handled as
internal reviews have historically been done. All comments will be
considered for integration and verified for accuracy.

1.5 Related Projects

1.5.1 Dependencies on Other Sun Projects

The pNFS project is dependent upon NFS/RDMA - Transport version update
(PSARC/2007/347).

1.5.2 Dependencies on Non-Sun Projects

We depend on the NFSv4.1 specification effort of the NFS version 4 Working
Group (NFSv4 WG) of the IETF. The NFSv4 WG is actively working on
stabilizing the NFSv4.1 specification, and it is now considered
functionally complete. To assist in bringing the Internet Draft to closure
(working group last call), the document editors have been hosting a set of
formal reviews that will continue through the summer of 2007.
It is expected that the Working Group will complete the document around
the end of 2007. The pNFS team is very involved in the NFS version 4
Working Group and is able to manage this dependency effectively.

1.5.3 Sun Projects Depending on this Project

None

1.5.4 Projects Rendered Obsolete by this Project

None

1.5.5 Related Active Projects

Converting /etc/default/{nfs/autofs} to SMF properties (PSARC/2007/393)

PSARC/2007/393 is converting the configuration variables stored in
/etc/default/nfs to SMF properties. The pNFS project will be introducing
additional configuration variables and will be following the conventions
set forth in that project.

1.5.6 Suggested Projects to Enhance this Program

None

1.6 Competitive Analysis

There are two ways in which our implementation of pNFS will compete:

* It will compete against other products using proprietary protocols
  (e.g., Cluster File Systems' Lustre product).

* It will compete against other vendors' implementations of pNFS.

Using an open standard such as pNFS gives us many advantages. Foremost is
the fact that our product will interoperate with other vendors' products.
For example, Linux clients will be able to access data on Solaris servers.
An openly developed standard also gives us a better-tested standard to
implement to, meaning that we have a much lower risk of discovering
fundamental design flaws in the protocol. Finally, one of the advantages
of an open protocol is that customers benefit from not being locked into
any one implementation. Solaris's pNFS implementation will gain advantages
from tight integration with ZFS, SMF, and other advanced features of
Solaris.

2 Technical Description

2.1 Architecture

The Parallel NFS or pNFS functionality, as its name implies, is a method
of introducing data access parallelism.
The NFSv4.1 protocol defines a method of separating the metadata (names
and attributes) of a filesystem from the location of the file data; it
goes beyond simple name/data separation to define a method of striping the
data amongst a set of data servers. This is obviously very different from
the traditional NFS server, which holds the names of files and their data
under the single umbrella of the server. There are products in existence
that are multi-node NFS servers, but they are limited in the participation
from the client in the separation of metadata and data. The NFSv4.1 client
can be enabled to be a direct participant in the exact location of file
data and avoid sole interaction with the single NFS server when moving
data.

The NFSv4.1 pNFS server is now a collection or community of server
resources or components; these community members are assumed to be
controlled by the metadata server. The pNFS client still accesses a single
metadata server for traversal or interaction with the namespace; when the
client moves data to and from the server, it may be directly interacting
with the set of data servers belonging to the pNFS server community.

pNFS

.----------------------------------------------------------------------.
|                             pNFS Server                              |
|                                                                      |
|  .-------------.  .-------------.  .-------------.  .-------------.  |
|  | data-server |  | data-server |  | data-server |  | data-server |  |
|  `------.------'  `------.------'  `------.------'  `------.------'  |
|         |                |                |                |         |
|       .------------------------------------------------------.      |
|       |                   metadata server                    |      |
|       `--------------------------.---------------------------'      |
|                                  |                                   |
`-----------.----------------------.-----------------------.-----------'
            |                      |                       |
      .-----+-----.          .-----+-----.          .-----+-----.
      |pNFS Client|          |pNFS Client|   ....   |pNFS Client|
      `-----------'          `-----------'          `-----------'

As mentioned, the user's view of the "pNFS Server" continues to appear on
the network as a regular NFS server even though there are multiple,
distinct and addressable components of the server. There is a single
server from which a filesystem is mounted and accessed. The administrator
knows the details of the "pNFS Server" or community. The pNFS client
implementation will learn the details of the pNFS server through NFSv4.1
protocol interaction. When it comes to things like mount points and
automount maps, the look and feel of the NFS server is the same as it has
been: a single server name and its self-contained namespace.

The pNFS-enabled client determines the location of file data by directly
querying the pNFS server. In pNFS nomenclature, the client requests a file
"layout". When a file is opened, the pNFS client will ask the metadata
server for the file's layout. If it is available, the server will give the
layout to the client. When the client moves data, the layout is consulted
as to the data server(s) upon which the data resides; once the offset and
range are matched to the appropriate data server(s), the data movement is
completed with read and write requests.

The pNFS protocol's layout definition provides for straightforward
striping of data only. There is one twist to the striping -- the location
may be specified by two paths, thus allowing for a simple multi-pathing
feature. With the layout in hand, the pNFS client is then enabled to
generate read/write requests in parallel or by a method of its own choice.
The layout is thus a simple enablement for the pNFS client to increase its
overall data throughput. The pNFS server is also a beneficiary, by way of
horizontal scaling for data access along with the reduction of read/write
operations directly serviced by the metadata server.
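To make the layout-driven data path concrete, the following is a minimal
sketch of how a client might map a file offset to a data server under a
simple striping layout. The structure and helper names here are
illustrative assumptions, not the NFSv4.1 wire format or the Solaris
implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative layout description: data is striped round-robin across
 * num_ds data servers in units of stripe_unit bytes.
 */
typedef struct {
    uint32_t stripe_unit;   /* bytes per stripe unit */
    uint32_t num_ds;        /* number of data servers in the stripe */
} layout_t;

/* Index of the data server holding the given file offset. */
static uint32_t
layout_ds_index(const layout_t *lo, uint64_t offset)
{
    return ((offset / lo->stripe_unit) % lo->num_ds);
}

/* Offset within that data server's backing object. */
static uint64_t
layout_ds_offset(const layout_t *lo, uint64_t offset)
{
    uint64_t su = offset / lo->stripe_unit; /* global stripe unit number */

    return ((su / lo->num_ds) * lo->stripe_unit +
        (offset % lo->stripe_unit));
}
```

With, say, a 64 KB stripe unit over four data servers, bytes 0-64K go to
data server 0, the next 64 KB to data server 1, and so on, wrapping back
to data server 0 at offset 256 KB; the client can therefore issue the
per-server reads and writes in parallel.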
Obviously, the main purpose or intent of the NFSv4.1 pNFS feature is to
significantly improve the data throughput capabilities of NFS servers. The
NFSv4.1 protocol requires that the metadata server always be able to
service read/write requests itself. This requirement allows for NFSv4.1
clients that are not enabled for pNFS, or for cases where the available
parallelism is not required.

The NFSv4.1 protocol defines interaction between client and server. There
is no specification for the interaction between components of the pNFS
server. This interaction or coordination of the pNFS server community
members is left as a pNFS server implementation detail. Given the lack of
an open protocol definition, pNFS server components will be homogeneous in
their implementation. This isn't necessarily a bad thing, since there is a
variety of server filesystem architectures already present in the NFS
server market. The lack of a protocol definition allows for the most
effective reuse of existing filesystem and server technology. Obviously,
there is a well-defined set of requirements or expectations of the
metadata and data servers in the form of the NFSv4.1 protocol.

Maintaining the theme of inclusiveness, the pNFS protocol allows for a
variety of data movement or transfer methods between the client and pNFS
server. The NFSv4.1 layout mechanism defines layout "types". Each type is
then defined as a particular data movement or transport protocol. The
layout mechanism also allows for inclusion of newly defined types such
that the NFSv4.1 protocol can adapt to future requirements or ideas. There
are three types of layouts currently being defined for pNFS; they are
generically referred to as: files, objects, and blocks. The "files" layout
type uses the NFSv4.1 read/write operations to the data server. The
"files" type is being defined within the NFSv4.1 Internet Draft.
The "objects" layout type refers to the OSD storage protocol as defined by
the T10 "SCSI Object-Based Storage Device Commands" protocol draft. The
"blocks" layout refers to the use of SCSI (in its many forms). The pNFS
OSD and block layout definitions are defined in separate Internet Drafts.

For additional detail, the current Internet Drafts for the items mentioned
above are:

"NFSv4 Minor Version 1"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt

"Object-based pNFS Operations"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-obj-03.txt

"pNFS Block/Volume Layout"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-block-03.txt

Solaris pNFS
------------

The initial instantiation of NFSv4.1 for Solaris will deliver the Sessions
and pNFS functionality using the files layout type.

The Transports
--------------

With the introduction of NFSv4.1, there will be no change in the network
transports available to the client or server. The kernel RPC interfaces
will continue to provide TCP and RDMA (in the form of InfiniBand) network
connectivity. The current RPCSEC_GSS mechanisms will continue to be
supported as well.

The Client
----------

The Solaris NFSv4.1 client will be a straightforward implementation of the
Sessions and pNFS capabilities. The administrative model for the client
will remain the same as it is today. As mentioned in the introduction, the
client will continue to mount from a single server and provide a path
(e.g. server.domain:/path/name). Since NFSv4.1 constitutes a new version
of the protocol, the client will negotiate the use of NFSv4.1 with the
server as it has done in the past. The client will have a preference for
the highest version of NFS offered by the server. As the client accesses
filesystems, it will query the server for the availability of the pNFS
functionality.
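The highest-version-first preference just described amounts to a simple
selection over the versions both sides support. The sketch below
illustrates the idea; the bitmask encoding and the helper name are
hypothetical, not the actual Solaris negotiation code.

```c
#include <assert.h>

/*
 * Hypothetical encoding: one bit per protocol version, higher bits
 * meaning higher versions.
 */
#define NFS_V2  0x1
#define NFS_V3  0x2
#define NFS_V4  0x4
#define NFS_V41 0x8

/*
 * Given the versions the server offers and the client's configured
 * maximum (e.g. capped via the "vers=" mount option), return the
 * highest version in common, or 0 if there is none.
 */
static int
nfs_pick_version(int server_vers, int client_max)
{
    int v;

    /* Walk downward from the client's cap toward older versions. */
    for (v = client_max; v != 0; v >>= 1) {
        if (server_vers & v)
            return (v);
    }
    return (0);
}
```

So a client capped at NFSv4.1 talking to a server that only offers v3 and
v4 would settle on v4, matching the fallback behavior described above.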
Then, on OPENing a file, the client attempts to obtain a layout and uses
the layout information when READing/WRITEing and COMMITing data to the
"server" to provide data access parallelism.

The Server
----------

As mentioned in the introduction, the NFSv4.1 protocol does not define the
architecture of the pNFS server; only the outward-facing protocol and
behavior is defined. Given this flexibility, a multitude of architectures
can fit the model of pNFS service. For the Solaris pNFS server, a
straightforward model will be used. Each member of the pNFS server
community (metadata server and data servers) is to be thought of as a
self-contained storage unit.

Metadata server
---------------

For the Solaris pNFS server, there is one metadata server. The metadata
server may be configured with a high availability component that allows
for active/passive availability. The Solaris pNFS metadata server looks
and feels like a regular NFS server. There are extensions to the
management model to integrate the use of the data servers. The metadata
server is in full control of the pNFS server in the sense that it decides
which data servers to utilize (initial inclusion and allocations for new
file layouts).

The metadata server will require the use of ZFS as its underlying
filesystem for storage of the filesystem namespace and attributes.
Architecturally, other filesystems may be used, but they must provide like
functionality for use by the pNFS metadata server; the most important
features are NFSv4 ACL support and system extended attributes.

Data server
-----------

The pNFS data servers do not need a traditional filesystem namespace for
associating names to data, given that the metadata server provides this
service by definition. The data server is then free to associate the file
objects stored upon it with their filehandles (the identifier shared
between the data server and metadata server) in any fashion appropriate.
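As an illustration of this namespace-free association, the sketch below
keeps a flat table keyed only by filehandle. The table, its sizes, and
the helper names are hypothetical; a real Solaris data server would
presumably associate filehandles with backing objects (e.g. DMU objects)
rather than an in-memory table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FH_LEN  16  /* illustrative filehandle length */
#define TBL_SZ  64  /* illustrative table size */

typedef struct {
    unsigned char fh[FH_LEN];   /* filehandle shared with the MDS */
    uint64_t objid;             /* backing data object */
    int in_use;
} fh_entry_t;

static fh_entry_t fh_table[TBL_SZ];

/* Trivial hash over the filehandle bytes. */
static unsigned int
fh_hash(const unsigned char *fh)
{
    unsigned int h = 0;
    int i;

    for (i = 0; i < FH_LEN; i++)
        h = h * 31 + fh[i];
    return (h % TBL_SZ);
}

/* Associate a filehandle with an object; 0 on success, -1 if full. */
static int
fh_insert(const unsigned char *fh, uint64_t objid)
{
    int i;

    for (i = 0; i < TBL_SZ; i++) {
        fh_entry_t *e = &fh_table[(fh_hash(fh) + i) % TBL_SZ];

        if (!e->in_use) {
            (void) memcpy(e->fh, fh, FH_LEN);
            e->objid = objid;
            e->in_use = 1;
            return (0);
        }
    }
    return (-1);
}

/* Find the object backing a filehandle; 0 if found, -1 otherwise. */
static int
fh_lookup(const unsigned char *fh, uint64_t *objidp)
{
    int i;

    for (i = 0; i < TBL_SZ; i++) {
        fh_entry_t *e = &fh_table[(fh_hash(fh) + i) % TBL_SZ];

        if (e->in_use && memcmp(e->fh, fh, FH_LEN) == 0) {
            *objidp = e->objid;
            return (0);
        }
    }
    return (-1);
}
```

The point of the sketch is that no directory hierarchy appears anywhere:
the filehandle alone identifies the stored object.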
One might think that it is a requirement to have the data server mimic the
filesystem namespace of the metadata server, but this is untrue. In fact,
mimicking the metadata namespace would prove to be cumbersome for regular
NFS server operation. The Solaris pNFS data server will have an
architecture that allows for the direct use of ZFS storage pools (zpools)
for file data storage.

Network Requirement
-------------------

The pNFS diagram above implies that there is a need for two separate
networks: one internal to the pNFS community and one for client access to
the pNFS server. This is but a possibility -- a logical representation,
not a requirement. The only requirement with respect to network
configuration is that each component within the diagram above be
addressable or routable from each other. Therefore, a variety of
networking technologies or topologies can be employed for the pNFS server.
The choice of topology or interconnect will be based on the workload being
served by the pNFS server.

Coordination of the pNFS Community
----------------------------------

To this point, we have a Solaris pNFS client interacting with a pNFS
server over a flexible network configuration. The metadata server is using
ZFS as the underlying filesystem for the "regular" filesystem information
(names, directories, attributes). The data server is using the ZFS storage
pool (zpool) to organize the data for the various layouts the metadata
server is handing out to the clients. What is coordinating all of the pNFS
community members? This is the pNFS Control Protocol.

pNFS Control Protocol
---------------------

The control protocol is the piece of the pNFS server solution that is left
to the various implementations to define.
Since the Solaris pNFS solution is taking a fairly straightforward
approach to the construction of the pNFS community, this allows for the
use of ZFS's special combination of features to organize the attached
storage devices. This allows the control protocol to focus on higher-level
control of the pNFS community members. Some of the highlights of the
control protocol are:

* Metadata and data server reboot / network partition indication
* Filehandle, file state, and layout validation
* Reporting of data server resources
* Inter data server data movement
* Metadata server proxy I/O
* Data server state invalidation

In Summary
----------

NFSv4.1 will deliver a range of new features. The two initially addressed
for Solaris will be Sessions and pNFS. The Solaris pNFS client will be a
simple implementation of the protocol. The Solaris pNFS server will layer
on top of existing Solaris technologies to offer a feature-rich solution.

2.2 Interfaces

2.2.1 Exported Interfaces

+-------------+-------------------------+---------------------------------------------------------------------------+-------------------+
| Interface   | Proposed Stability      | Specified in What Document?                                               | Former Stability  |
| Name        | Classification          |                                                                           | Classification or |
|             |                         |                                                                           | Other Comments    |
+-------------+-------------------------+---------------------------------------------------------------------------+-------------------+
| NFSv4.1     | Standard                | Currently                                                                 | The next minor    |
|             |                         | http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt | version of NFSv4  |
|             |                         |                                                                           | (PSARC/1999/289). |
+-------------+-------------------------+---------------------------------------------------------------------------+-------------------+
| libnfsd.so  | Evolving, Consolidation | /usr/include/libnfsd.h                                                    |                   |
|             | Private (?)             |                                                                           |                   |
+-------------+-------------------------+---------------------------------------------------------------------------+-------------------+
| libdserv.so | Evolving, Consolidation | /usr/include/libdserv.h                                                   |                   |
|             | Private (?)             |                                                                           |                   |
+-------------+-------------------------+---------------------------------------------------------------------------+-------------------+

2.2.2 Imported Interfaces

+-------------+-----------------+----------------+------------------------------------------+
| Interface   | Stability       | Specified in   | Former Stability Classification          |
| Name        | Classification  | What Document? | or Other Comments                        |
+-------------+-----------------+----------------+------------------------------------------+
| ZFS DMU     | Consolidation   | dmu.h, dmu.c   | We will most likely need a contract for  |
|             | Private (?)     |                | the use of the DMU interfaces.           |
|             |                 |                | Fortunately, we (pNFS and ZFS) will both |
|             |                 |                | be in the ON consolidation; therefore,   |
|             |                 |                | any incompatible changes in ZFS which    |
|             |                 |                | affect pNFS are bound to be caught very  |
|             |                 |                | early on.                                |
+-------------+-----------------+----------------+------------------------------------------+
| libscf(3)   | Evolving        | libscf(3LIB)   |                                          |
|             |                 | man page       |                                          |
+-------------+-----------------+----------------+------------------------------------------+

2.2.3 Internal Interfaces

Most significant among pNFS's internal interfaces is the "control
protocol" mentioned in section 2.1. This is the protocol used between the
MDS and the data servers. Other internal interfaces will be listed in the
next subsection.

2.2.3.1 Control Protocol

A general description of the RPC-based control protocol is provided here.
This protocol will remain Project Private and subject to change as the
functionality evolves. Note that the project commits to the general
architectural tenet of using RPC versioning to allow for effective
upgrades of the various pNFS server components.

 1. DS_EXIBI: called by the data server when it comes online, to exchange
    identities with the MDS.

 2. DS_REPORTAVAIL: called by the data server when it comes online, and
    periodically thereafter, to report the zpools that it has at its
    disposal and its network addresses.

 3. DS_CHECKSTATE: called by the data server the first time that it
    receives a request from a client for a particular (filehandle, client,
    stateid, principal) tuple. The result from the MDS says whether I/O is
    to be permitted or not.

 4. DS_INVALIDATE: called by the MDS to invalidate state on the data
    server; specifically, the state received on calling DS_CHECKSTATE.

 5. DS_READ: called by the MDS to read from the data server. Used for
    proxy I/O.

 6. DS_WRITE: called by the MDS to write to the data server. Used for
    proxy I/O.

 7. DS_COMMIT: called by the MDS to commit previously written data on the
    data server. Used for proxy I/O.

 8. DS_GETATTR: called by the MDS to query attributes for an object on the
    data server. Currently, "size" is the only attribute that the data
    server can return.

 9. DS_SETATTR: called by the MDS to set attributes for an object on the
    data server. Currently, "size" is the only attribute that may be set.

10. DS_REMOVE: called by the MDS to remove an object on the data server.

2.2.3.2 Other Internal Interfaces

+-------------+-----------------+-------------------------------+---------------------------+
| Interface   | Stability       | Specified in What Document?   | Other Comments            |
| Name        | Classification  |                               |                           |
+-------------+-----------------+-------------------------------+---------------------------+
| /dev/dserv  | Project Private | /usr/include/sys/dserv_impl.h | Only used by libdserv.so  |
| ioctl()     |                 |                               | and/or libnfsd.so.        |
+-------------+-----------------+-------------------------------+---------------------------+
| _nfssys()   | Project Private | nfssys() in nfs_sys.c         | Already exists; new       |
| system call |                 |                               | functionality will be     |
|             |                 |                               | integrated into           |
|             |                 |                               | libnfsd.so and possibly   |
|             |                 |                               | libdserv.so.              |
+-------------+-----------------+-------------------------------+---------------------------+

2.3 User Interface

The pNFS project will deliver a set of new command line interfaces (CLIs)
and modifications to existing CLIs. The focus for the CLIs to be delivered
is on the system administrator and not an end user. With the exception of
the modifications to mount_nfs(1M) (additional information below), the
CLIs will be used by individuals in order to configure and manage a pNFS
installation. The pNFS project will not provide a GUI for managing pNFS.
The CLIs can be divided up into the following sub-sections:

* Extensions to existing commands

mount_nfs(1M) will be modified in the following ways:

1. Addition of a new value for the "vers=" option in order to allow a user
   to specify an NFS version of "41" (i.e., NFSv4.1). This new value will
   exist in addition to the other "vers=" values: 2, 3 and 4.

2. Addition of a "nopnfs" option. This option specifies that the client
   should disable the use of pNFS on this mount.

nfsstat(1M) will be modified to display statistics about the NFSv4.1
protocol as well as the control protocol between the metadata server and
the data servers.

zfs(1M) will be modified to allow the "list" subcommand to display the
presence of pNFS datasets in a zpool on the data server.

zfs(1M) will be modified to allow the "create" subcommand to set a new
property at the time of creating a ZFS file system for storage of file
metadata on the pNFS metadata server. This property will flag the file
system as one that is to be used for the purpose of storing file metadata.
The name of this property is still to be determined, but one possibility
is "pnfs".
Values for the property would be "on" or "off".

For diagnosability, snoop(1M) will be modified to support the NFSv4.1 and
control protocols, and mdb(1) macros, walkers and dcmds (yet to be
defined) will be added.

* New commands

npool(1M) is a new command that is used to specify the data servers that
the metadata server will use for determining the layout of a file. For
further information, refer to the draft man page for the npool(1M)
command.

dpool(1M) is a new command that is used to specify which storage (i.e.,
which ZFS storage pools or, in the future, which QFS file system) on a
data server machine should be used for the storage of pNFS file data. For
further information, refer to the draft man page for the dpool(1M)
command.

pnfsalloc(1M) is a new command that allows a user to specify rules for the
client and the server to consult when determining the layout of a newly
created file. For further information, refer to the pnfsalloc(1M) command
specification.

2.4 Compatibility and Interoperability

2.4.1 Standards Conformance

We are implementing the NFSv4.1 specification, and we are not deviating
from or extending the standard in any incompatible way.

2.4.2 Operating System and Platform Compatibility

This project will integrate into Solaris Nevada. There are no special
hardware requirements, but the NFS/RDMA (PSARC/2007/347) project will give
us the ability to exploit the RDMA capabilities of InfiniBand, which is
relevant to our target audience.

2.4.3 Interoperability with Sun Projects/Products

Consolidation-private APIs, such as libnfsd.so and libdserv.so, may give
other management frameworks the capability to manage the functionality of
this project.

2.4.4 Interoperability with External Products

A benefit of implementing the NFSv4.1 specification is that we have the
opportunity to interoperate with other vendors' NFSv4.1 implementations.
The interoperability of the Solaris implementation with other vendors'
implementations is tested at events such as Connectathon and Bakeathons.
No interoperability with any other external products is planned.

2.4.5 Coexistence with Similar Functionality

No similar functionality is currently delivered with Solaris.

2.4.6 Support for Multiple Concurrent Instances

We will support a regular NFSv4.1 server and a pNFS metadata server being
active simultaneously. This is so that a single NFSv4.1 server can support
pNFS file systems and non-pNFS file systems simultaneously. We will
support the capability of a pNFS metadata server and a pNFS data server
being active on the same machine and on the same port (2049). For the pNFS
client, multiple instances are only relevant if thought of as multiple
mounts.

2.4.7 Compatibility with Earlier and Future Releases

This is the first release, so compatibility with earlier releases is not
applicable. To accommodate future releases, the control protocol will be
versioned, as will the data server filehandles used to identify the
individual stripes.

2.5 Performance and Scalability

2.5.1 Performance Goals

The preliminary performance goals for the pNFS project are as follows:

* Data-intensive throughput to a client should be at least 3.3x the
  throughput of a single server/pipe when transferring to or from four
  server pipes, and ideally 3.75x. This specification is intended for a
  reference platform of an x2100 client on 1 Gbit Ethernet and a SPARC
  platform TBD (probably T2000). Stretch goals should be adopted for
  larger clients with more processors and especially with more bridge
  bandwidth (x4600 now).

* For single-threaded transfer to/from a client, pNFS data-intensive
  performance should be equivalent (within 2%) to standard NFSv4
  data-intensive performance on the same configuration. This should be
  measured for both 1 Gbit Ethernet and NFS-over-RDMA-over-IB.
* For single-threaded transfer to/from a data server, data-intensive
  performance of a pNFS data server should be equivalent (within 2%) to
  standard NFSv4 on the same configuration, whether using Ethernet or IB.

* For aggregate throughput to/from a data server, on small configurations
  (x2100, T2000), the server should be able to consume at least 85% of
  bridge bandwidth; this means that about 40% of bridge bandwidth should
  be supplied to the networks. For example, if bridge bandwidth is 1
  GB/sec, the server should be able to move about 400 MB/sec in large
  I/Os, since moving data from disk to memory and from memory to network
  uses the bridge twice.

* For a metadata-weighted workload, the NFSv4.1 pNFS-enabled MDS will
  perform at no less than 90% of an NFSv4.1 server without pNFS
  functionality enabled. The CREATE operation is treated differently,
  since it places policy interpretation on the performance hot path. With
  no special policy on open, the MDS should be within 10% of an NFSv4.1
  server.

* Metadata server metrics (read/write through MDS) - Read/write throughput
  done to/from the NFSv4.1 pNFS-enabled MDS will perform at no less than
  80% of a standard NFSv4.1 NFS server.

* Data server metrics - For read/write throughput, the NFSv4.1 data server
  will perform at no less than 98% of a standard NFSv4.1 NFS server.

* NFS over RDMA must comply with the standard and meet or exceed HPC
  customer expectations. (980 MB/sec reads and 650 MB/sec writes have been
  achieved x2100 -> x2100, hardware limited, on-nv to on-nv.)

2.5.2 Performance Measurement

The exact methods for doing performance measurement and analysis are still
to be determined.

2.5.3 Scalability Limits and Potential Bottlenecks

No exact scalability limits or potential bottlenecks have been identified
at this time.

2.5.4 Static System Behavior

No exact information is available at this time.

2.5.5 Dynamic System Behavior

No exact information is available at this time.
2.6 Failure and Recovery

2.6.1 Resource Exhaustion

As with other file systems, if disk space is exhausted, ENOSPC errors will
be returned to the application. If memory resources are exhausted, we may
return EAGAIN to the application, and in cases where memory allocations
are done in the kernel with the KM_SLEEP flag, system hangs are possible.
However, this can be mitigated by being conscious of the amount of memory
being allocated when doing KM_SLEEP allocations.

2.6.2 Software Failures

One of the main causes of software failures will probably be bugs in the
code. The team will reduce the risk of failures by exercising due
diligence during development, via design reviews, code reviews and test
execution. No new software failure avoidance or recovery mechanisms are
introduced by this project.

2.6.3 Network Failures

As pNFS is a network protocol, network failures are taken into account and
are well documented in the NFSv4.1 specification. The control protocol
(between the metadata server and data server) is also designed to be
resilient to network failures. The control protocol is an RPC-based
protocol, and high-level information about the messages in the protocol is
documented in section 2.2.3.1.

2.6.4 Data Integrity

Data integrity will be no different than it is for ordinary NFS.

2.6.5 State and Checkpointing

Recovery from failed pNFS components, including recovery of NFS state, is
documented in the NFSv4.1 specification.

2.6.6 Fault Detection

Our implementation will leverage existing commands (e.g., zpool status),
extend existing commands (nfsstat, snoop) and introduce new commands
(dpool status, npool status) to allow a user to detect and diagnose a
failure.

2.6.7 Fault Recovery (or Cleanup after Failure)

NFS currently leverages SMF for the management of NFS-related services.
Our pNFS implementation will continue to build on this in order to allow
for service restarting and dependency management.
2.7 Security
All minor versions of NFSv4 require support for Kerberos; therefore, this is also true for pNFS. Kerberos will be supported for all operations between the client and the metadata server, as well as for all operations between the client and the data servers. Additionally, Kerberos will be supported for the control protocol, which operates between the metadata server and the data servers. Host principals will exist for the metadata server and data servers.

In addition to the support of Kerberos by the control protocol, data servers must be approved by an administrator on the metadata server before being used within the server community. This is accomplished with the npool(1M) command as documented in section 2.3.

2.8 Software Engineering and Usability

2.8.1 Namespace Management
Commands are documented elsewhere. Configuration data will be stored in SMF.

2.8.2 Dependencies on non-Standard System Interfaces
None.

2.8.3 Year 2000 Compliance
No Year 2000 compliance issues exist.

2.8.4 Internationalization (I18N)
All new commands will produce message catalogs.

2.8.5 64-bit Issues
No issues.

2.8.6 Porting to other Platforms
This project will not be ported to other platforms.

2.8.7 Accessibility
This project will be entirely administrable with command line interfaces.

3 Release Information

3.1 Product Packaging
This product will be bundled in the ON consolidation. No new packages will be delivered beyond those that already exist: SUNWnfscr, SUNWnfscu, SUNWnfssr, and SUNWnfssu. Client-side functionality will be delivered in SUNWnfsc*; server-side functionality will be delivered in SUNWnfss*. Note that all configuration information will be stored in SMF; therefore, there will be no addition or modification of configuration files.

3.2 Installation

3.2.1 Installation procedure
pNFS will be installed as part of the Solaris installation procedure. No additional installation steps are needed.
Configuration of pNFS will be done with the command line interfaces documented in section 2.3.

3.2.2 Effects on System Files
No effect on system files.

3.2.3 Boot-Time Requirements
None.

3.2.4 Licensing
Our pre-ON putback source code and binaries (in the form of BFU archives) are being posted. You can find the source and binaries on our NFSv4.1 pNFS OpenSolaris project page under Downloads. The source code is licensed under the Common Development and Distribution License (CDDL). The pre-ON putback binaries are licensed under the OpenSolaris Binary License (OBL).

3.2.5 Upgrade

3.2.6 Software Removal
We will not provide the capability to remove or uninstall pNFS from a Solaris machine. However, a user can disable the use of pNFS in any of the following ways:
1. On the client, execute the mount_nfs(1M) command with "-o nopnfs" or "-o vers=[2|3|4]".
2. On the client, set the NFS_CLIENT_VERSMAX configuration variable to 4, 3, or 2. Clients require NFSv4.1 ("41") support in order to use pNFS; setting NFS_CLIENT_VERSMAX as specified caps the maximum version of the NFS protocol that the NFS client will use.
3. On the server, set the NFS_SERVER_VERSMAX configuration variable to 4, 3, or 2. Servers require support of NFSv4.1 ("41") in order to use pNFS; setting NFS_SERVER_VERSMAX as specified caps the maximum version of the NFS protocol that the NFS server will offer.
4. Disable the NFS server, data server, and client SMF services.

Note that the current method for setting NFS_SERVER_VERSMIN and NFS_SERVER_VERSMAX is in the /etc/default/nfs configuration file. PSARC/2007/393 is pursuing the conversion of these configuration variables into SMF properties.

3.3 System Administration
The pNFS project will not deliver a GUI for administration of a pNFS configuration. The command line interfaces listed in section 2.3 will be the sole administrative interfaces.
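For illustration, the version-cap method of disabling pNFS described in section 3.2.6 amounts to a fragment like the following in /etc/default/nfs. The variable names are the ones cited in that section; this mechanism remains valid only until PSARC/2007/393 converts these variables into SMF properties:

```shell
# /etc/default/nfs (fragment) - cap the client at NFSv4.0 so that
# NFSv4.1, and therefore pNFS, is never negotiated.
NFS_CLIENT_VERSMAX=4

# On a server, the analogous setting caps what the server will offer:
# NFS_SERVER_VERSMAX=4
```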
4 pNFS Client Architecture

4.1 Description
pNFS is tightly integrated into the existing NFS client architecture to maximize code re-use. The vast majority of the code will be common between NFSv4.0 and NFSv4.1, since the non-data-related interactions are roughly the same. The overall NFS client architecture is maintained. The code paths diverge slightly when performing an "over the wire" I/O request. In the pNFS I/O pathway, the client must evaluate the request in the context of the current file layout and split the request into pieces destined for each distinct data server. These pieces can then be dispatched to the various data servers in parallel.

4.2 Interfaces

4.2.1 User-visible
The traditional SUS file interfaces remain the primary interface by which application programs access pNFS files. No API extensions will be delivered by this project. The mount command will be extended to provide administrative control over the client's selection of version and minor version numbers. The administrator can also disable the use of pNFS with the -o nopnfs option. The administrator can also control these parameters via system-wide SMF properties (or configuration variables in /etc/default/nfs until PSARC/2007/393 is implemented). The client also has some control over how a new file is to be striped over the data servers in the community. This information is carried over the network in the NFSv4.1 protocol as hints and is strictly advisory to the metadata server. Further details of this interface are given in section 6.

4.2.2 Internal (optional for ARC review)
The internal structure of the NFS client remains largely unchanged. pNFS I/O requests are sent in parallel to the relevant data servers using mechanisms similar to the asynchronous thread model already in place. An internal interface within the NFS client is created to deal with the differences between minor version 0 and minor version 1. This private interface is not presented here.
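The request-splitting step described in section 4.1 can be sketched as follows. This is a simplified model, assuming a round-robin files layout with a fixed stripe unit; the function name and signature are illustrative, not the actual Solaris client code:

```python
# Hypothetical sketch: split one client I/O request into per-data-server
# byte ranges, assuming a simple round-robin files layout.

def split_io(offset, length, stripe_unit, num_servers):
    """Return {server_index: [(offset, length), ...]} for one I/O request."""
    pieces = {}
    end = offset + length
    cur = offset
    while cur < end:
        stripe = cur // stripe_unit          # which stripe unit this byte is in
        server = stripe % num_servers        # round-robin over data servers
        unit_end = (stripe + 1) * stripe_unit
        chunk = min(end, unit_end) - cur     # bytes remaining in this stripe unit
        pieces.setdefault(server, []).append((cur, chunk))
        cur += chunk
    return pieces

# A 192 KB read at offset 0 with a 64 KB stripe unit over 3 data servers:
# each server receives exactly one 64 KB piece, dispatchable in parallel.
print(split_io(0, 196608, 65536, 3))
```

Each entry in the result corresponds to one over-the-wire request to one data server, which is what allows the pieces to be dispatched in parallel.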
4.3 Operation
The operational characteristics of a pNFS-enabled client are largely transparent. The pNFS-enabled client will auto-negotiate the version and minor version numbers, per server, at mount time. The client will get a file's layout prior to initiating any I/O. If a given file does not have a layout, the client will fall back to doing normal NFSv4.1 I/O operations directly to the metadata server, transparently to the application. The I/O caching characteristics of the client are unchanged by pNFS. The client will use the traditional "close-to-open" semantics for flushing data back to the data servers.

5 pNFS Metadata Server (MDS) Architecture

5.1 Description
The pNFS Metadata Server is functionally very similar to a conventional Solaris NFSv4 server. The main differences are that it will store pNFS layout information in system attributes, it will not store file data locally, and it will communicate via the control protocol with pNFS data servers.

5.2 Interfaces

5.2.1 User-visible
The new npool(1M) command, outlined in section 2.3, is used to administer the resources used by the MDS. The zfs(1M) command will be modified to allow the setting of a new property at the time of creating a ZFS file system (i.e., with zfs create). This property will flag that the file system is to be used for the purpose of storing file metadata on the pNFS metadata server. The name of this property is still to be determined, but one possibility is "pnfs". Values for the property would be "on" or "off".

5.2.2 Internal (optional for ARC review)

5.3 Operation
The pNFS Metadata Server is started the same way that other NFS shares are started; that is, either by sharemgr(1M) or by ZFS itself.

6 pNFS Data Server Architecture

6.1 Description
The pNFS data server consists of a kernel module that registers itself with the Solaris NFS server.

6.2 Interfaces

6.2.1 User-visible
The dpool command is used to allocate and control resources used by the pNFS data server.
6.2.2 Internal (optional for ARC review)
The data server is controlled via libdserv.so, an internal API. libdserv.so uses SMF (libscf.so) to store configuration information and to start/stop the processes needed to function as a pNFS data server.

6.3 Operation
Requests specific to a pNFS data server are routed to the data server kernel module described in section 6.1. Changes will also be made to the NFS daemon (nfsd) to support a pNFS data server.

7 Layout Allocation Architecture

7.1 Description
Layout allocation is used on both a pNFS client and a pNFS metadata server. On a client, it governs the contents of a layouthint attribute, which a metadata server may use to influence the layout of a newly created file. On a metadata server, the policy engine governs newly created layouts directly, with provisions for taking a layouthint attribute into account.

7.2 Interfaces

7.2.1 User-visible
The pnfsalloc(1M) command line interface is the most obvious user-visible aspect of layout allocation. On a pNFS client, there will also be a yet-to-be-determined CLI that will show the layout of an existing file.

7.2.2 Internal (optional for ARC review)

7.3 Operation
The layout allocator is accessible to the other components (pNFS client and pNFS metadata server) via a project private API.
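The interplay described in section 7.1, where a client-supplied layouthint is strictly advisory and the metadata server's policy engine decides the actual layout, can be sketched as follows. The field names, default values, and clamping policy are purely illustrative assumptions; the real attribute is an XDR-encoded structure defined by the NFSv4.1 protocol:

```python
# Illustrative sketch of an MDS-side layout allocation decision: merge a
# client's advisory layout hint with server defaults. Field names and the
# policy cap are hypothetical, not the NFSv4.1 XDR structure.

SERVER_DEFAULTS = {"stripe_count": 4, "stripe_unit": 65536}
MAX_STRIPE_COUNT = 8    # assumed server-side policy limit

def allocate_layout(hint=None):
    """Return the layout actually used for a newly created file.

    The hint is strictly advisory: the server starts from its defaults,
    applies any hinted fields it recognizes, then clamps to policy limits.
    """
    layout = dict(SERVER_DEFAULTS)
    if hint:
        for key in ("stripe_count", "stripe_unit"):
            if key in hint:
                layout[key] = hint[key]
    layout["stripe_count"] = min(layout["stripe_count"], MAX_STRIPE_COUNT)
    return layout

print(allocate_layout())                        # no hint: server defaults
print(allocate_layout({"stripe_count": 16}))    # hint clamped to the policy cap
```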
--------------------------------------------------------------------------
Appendix A: Standards Supported

NFS version 4.1 - http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt

--------------------------------------------------------------------------
References

R.1 Related Projects
NFS/RDMA - Transport version update (PSARC/2007/347)
Converting /etc/default/{nfs/autofs} to SMF properties (PSARC/2007/393)

R.2 Background Information for this Project or its Product
NFS version 4.1 specification - http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt

R.3 Interface Specifications
npool(1M)
dpool(1M)
pnfsalloc(1M)

R.4 Project Details
NFSv4.1 pNFS OpenSolaris project