The pNFS project will deliver both client- and server-side support for the portions of NFSv4 minor version 1 (NFSv4.1) that support Parallel NFS (pNFS). The pNFS project plans to deliver this functionality into the ON consolidation.
pNFS is one of the features defined in the NFSv4.1 protocol specification, which is currently an IETF Internet Draft. The pNFS functionality is a method of introducing data access parallelism. The NFSv4.1 protocol defines a method of separating the metadata (names and attributes) of a filesystem from the location of the file data; it goes beyond simple metadata and data separation to define a method of striping the data amongst a set of data servers and allowing clients to access the data servers directly. This differs from the traditional NFS server, which holds the names of files and their data under the single umbrella of the server; all access to the files must go through the single server exporting the file system.
The pNFS project will deliver the following NFSv4.1 components:
Again, support for the blocks- and objects-based layout types is out of scope for the pNFS project.
In addition to delivering the NFSv4.1 operations in support of pNFS, the pNFS project will address the following additional pieces of NFSv4.1 functionality:
The goal of the pNFS project is to add the needed NFSv4.1 functionality in support of pNFS. The goal of pNFS itself is to provide the capability for clients to access data in parallel from many data servers and, in turn, increase throughput capability significantly.
N/A - This is the first release for the pNFS project.
Logistical information for the pNFS project is captured on the NFSv4.1 pNFS OpenSolaris project page.
As a summary, the base NFSv4.1 functionality will be tested to make sure the pNFS implementation complies with the NFSv4.1 specification. In addition to testing the Solaris client and server against each other we will be testing interoperability with other NFSv4.1 implementations at events such as Connectathon and Bakeathons. Beyond base NFSv4.1 testing, performance and scalability will be tested in order to verify that we meet the related criteria as specified in section 2.5, below.
The pNFS test plan has been drafted; its complete details will therefore not be repeated in this document.
Documentation will be developed in the context of the NFSv4.1 OpenSolaris project and we will be soliciting feedback and contribution from members of the community. Proposed documentation includes:
All documentation developed will be posted under the OpenSolaris NFSv4.1 project and reviewed by that project team. This review will be handled as it has historically been done with internal reviews. All comments will be considered for integration and verified for accuracy.
The pNFS project is dependent upon NFS/RDMA - Transport version update (PSARC/2007/347)
We depend on the NFS version 4 Working Group (NFSv4 WG) of the IETF and its NFSv4.1 specification effort. The NFSv4 WG is actively working on stabilizing the NFSv4.1 specification, and it is now considered functionally complete. To assist in bringing the Internet Draft to closure (working group last call), the document editors have been hosting a set of formal reviews that will continue through the summer of 2007. It is expected that the Working Group will complete the document around the end of 2007.
The pNFS team is deeply involved in the NFS version 4 Working Group and is able to manage this dependency effectively.
None
None
Converting /etc/default/{nfs/autofs} to SMF properties
(PSARC/2007/393)
PSARC/2007/393 is converting the configuration variables stored in
/etc/default/nfs to SMF properties. The pNFS project will be
introducing additional configuration variables and will be
following the conventions set forth in this project.
None
There are two ways in which our implementation of pNFS will compete:
Using an open standard such as pNFS gives us many advantages. Foremost is the fact that our product will interoperate with other vendors' products. For example, Linux clients will be able to access data on Solaris servers. An openly developed standard also gives us a better tested standard to implement to, meaning that we have a much lower risk of discovering fundamental design flaws in the protocol. And, finally, one of the advantages of an open protocol is that customers benefit from not being locked into any one implementation.
Solaris's pNFS implementation will gain advantages from tight integration with ZFS, SMF, and other advanced features of Solaris.
The Parallel NFS or pNFS functionality, as its name implies, is a method of introducing data access parallelism. The NFSv4.1 protocol defines a method of separating the metadata (names and attributes) of a filesystem from the location of the file data; it goes beyond simple name/data separation to define a method of striping the data amongst a set of data servers. This is obviously very different from the traditional NFS server, which holds the names of files and their data under the single umbrella of the server. There are products in existence that are multi-node NFS servers, but they are limited in the degree to which the client participates in the separation of metadata and data. The NFSv4.1 client can be a direct participant in locating file data exactly and can thus avoid interacting solely with the single NFS server when moving data.
The NFSv4.1 pNFS server is now a collection or community of server resources or components; these community members are assumed to be controlled by the metadata server.
The pNFS client still accesses a single metadata server for traversal or interaction with the namespace; when the client moves data to and from the server it may be directly interacting with the set of data servers belonging to the pNFS server community.
pNFS
--------------------------------------------------
| pNFS Server |
| |
| .-------------- .-------------- |
| |data-server | |data-server | |
| | | | | |
| `-.------------ `-----.-------- |
| | .-------------- | .-------------- |
| | |data-server | | |data-server | |
| | | | | | | |
| | `------.------- | `------.------- |
| _|____________|_________|_________|__________ |
| | |
| ,---------'----------- |
| | metadata server | |
| |____________________| |
`---.------.--------.-------------.-------.-------
| | | | |
____|______|________|_____________|_______|________
| | |
| | |
.-----+-----. .-----+-----. .-----+-----.
| | | | | |
|pNFS Client| |pNFS Client| .... |pNFS Client|
| | | | | |
`-----------' `-----------' `-----------'
As mentioned, the user's view of the "pNFS Server" continues to appear on the network as a regular NFS server even though there are multiple, distinct and addressable components of the server. There is a single server from which a filesystem is mounted and accessed. The administrator knows the details of the "pNFS Server" or community. The pNFS Client implementation will know the details of the pNFS server through NFSv4.1 protocol interaction. When it comes to things like mount points and automount maps, the look and feel of the NFS server is the same as it has been: single server name and its self-contained namespace.
The pNFS enabled client determines the location of file data by directly querying the pNFS server. In pNFS nomenclature, the client requests a file "layout". When a file is opened, the pNFS client will ask the metadata server for the file's layout. If one is available, the server will give the layout to the client. When the client moves data, the layout is consulted to determine the data-server(s) upon which the data resides; once the offset and range are matched to the appropriate data-server(s), the data movement is completed with read and write requests. The pNFS protocol's layout definition provides for straightforward striping of data only. There is one twist to the striping -- the location may be specified by two paths, thus allowing for a simple multi-pathing feature.
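The offset-to-data-server matching described above can be sketched as follows, assuming a simple round-robin stripe with a fixed stripe unit; the unit size, server count, and variable names here are illustrative assumptions, not values taken from the layout definition:

```shell
#!/bin/sh
# Illustrative sketch only: map a byte offset to a data server index
# under an assumed round-robin stripe.
STRIPE_UNIT=65536      # bytes per stripe unit (assumed value)
NUM_DSERVERS=4         # data servers named in the layout (assumed)

offset=196608          # byte offset of the I/O request
unit_index=$(( offset / STRIPE_UNIT ))          # which stripe unit
dserver_index=$(( unit_index % NUM_DSERVERS ))  # which data server
echo "offset $offset maps to data server $dserver_index"
```

With these assumed values, offset 196608 falls in stripe unit 3, which round-robin places on data server 3; a request spanning several stripe units would be split into one piece per data server it touches.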
With the layout in hand, the pNFS client is then enabled to generate read/write requests in parallel or by a method of its own choice. The layout is thus a simple enablement for the pNFS client to increase its overall data throughput. The pNFS server is also a beneficiary, by nature of the horizontal scaling of data access along with the reduction in read/write operations being directly serviced by the metadata server. Obviously, the main intent of the NFSv4.1 pNFS feature is to significantly improve the data throughput capabilities of NFS servers. The NFSv4.1 protocol requires that the metadata server always be able to service read/write requests itself. This requirement allows for NFSv4.1 clients that are not enabled for pNFS, or for cases where the available parallelism is not required.
The NFSv4.1 protocol defines interaction between client and server. There is no specification for the interaction between components of the pNFS server. This interaction or coordination of the pNFS server community members is left as a pNFS server implementation detail. Given the lack of an open protocol definition, pNFS server components will be homogeneous in their implementation. This isn't necessarily a bad thing since there is a variety of server filesystem architectures already present in the NFS server market. The lack of protocol definition allows for the most effective reuse of existing filesystem and server technology. Obviously there is a well-defined set of requirements or expectations of the metadata and data servers in the form of the NFSv4.1 protocol.
Maintaining the theme of inclusiveness, the pNFS protocol allows for a variety of data movement or transfer methods between the client and pNFS server. The NFSv4.1 layout mechanism defines layout "types". The types are then defined as a particular data movement or transport protocol. The layout mechanism also allows for inclusion of newly defined types such that the NFSv4.1 protocol can adapt to future requirements or ideas.
There are three types of layouts currently being defined for pNFS; they are generically referred to as: files, objects, blocks. The "files" layout type uses the NFSv4.1 read/write operations to the data-server. The "files" type is being defined within the NFSv4.1 Internet Draft. The "objects" layout type refers to the OSD storage protocol as defined by the T10 "SCSI Object-Based Storage Device Commands" protocol draft. The "blocks" layout refers to the use of SCSI (in its many forms). The pNFS OSD and block layout definitions are defined in separate Internet Drafts.
For additional detail, the current Internet Drafts for the items
mentioned above are:
"NFSv4 Minor Version 1"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt
"Object-based pNFS Operations"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-obj-03.txt
"pNFS Block/Volume Layout"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-block-03.txt
| Interface Name | Proposed Stability Classification | Specified in What Document? | Former Stability Classification or Other Comments |
| NFSv4.1 | Standard | Currently http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt | The next minor version of NFSv4 (PSARC/1999/289). |
| libnfsd.so | Evolving, Consolidation Private (?) | /usr/include/libnfsd.h | |
| libdserv.so | Evolving, Consolidation Private (?) | /usr/include/libdserv.h | |
| Interface Name | Stability Classification | Specified in What Document? | Former Stability Classification or Other Comments |
| ZFS DMU | Consolidation Private (?) | dmu.h, dmu.c | We will most likely need a contract for the use of the DMU interfaces. Fortunately, we (pNFS and ZFS) will both be in the ON consolidation, therefore, any incompatible changes in ZFS which affect pNFS are bound to be caught very early on. |
| libscf(3) | Evolving | libscf(3LIB) man page. | |
Most significant among pNFS's internal interfaces is the "control protocol" mentioned in section 2.1. This is the protocol used between the MDS and the data servers. Other internal interfaces will be listed in the next subsection.
A general description of the RPC-based control protocol is provided. This protocol will remain project private and subject to change as the functionality evolves. Note that the project commits to the general architectural tenet of using RPC versioning to allow for effective upgrades of the various pNFS server components.
| Interface Name | Stability Classification | Specified in What Document? | Other Comments |
| /dev/dserv ioctl() | Project Private | /usr/include/sys/dserv_impl.h | Only used by libdserv.so and/or libnfsd.so. |
| _nfssys() system call | Project Private | nfssys() in nfs_sys.c | Already exists; new functionality will be integrated into libnfsd.so and possibly libdserv.so. |
The pNFS project will deliver a set of new command line interfaces (CLIs) and modifications to existing CLIs. The focus of the CLIs to be delivered is on the system administrator, not the end user. With the exception of the modifications to mount_nfs(1M) (additional information below), the CLIs will be used by individuals to configure and manage a pNFS installation. The pNFS project will not provide a GUI for managing pNFS.
The CLIs can be divided up into the following sub-sections:
mount_nfs(1M) will be modified in the following
ways:
1. addition of a new value for the "vers=" option in
order to allow a user to specify an NFS version of
"41" (i.e., NFSv4.1). This new option will exist in
addition to the other "vers=" values: 2, 3 and 4.
2. addition of a "nopnfs" option. This option
specifies that the client should disable the use of
pNFS on this mount.
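As a hypothetical illustration of the two modifications above (the server and mount-point names are examples only, and the option values are as proposed, not final):

```shell
# Mount with NFSv4.1 explicitly, allowing pNFS to be used:
mount -F nfs -o vers=41 mds.example.com:/export/data /mnt/data

# Mount with NFSv4.1 but disable pNFS for this mount:
mount -F nfs -o vers=41,nopnfs mds.example.com:/export/data /mnt/data
```

These commands require superuser privileges and a running NFSv4.1 server, so they are shown only as usage fragments.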
nfsstat(1M) will be modified to display statistics about the NFSv4.1 protocol as well as the control protocol between the metadata server and the data servers.
zfs(1M) will be modified to allow the "list" subcommand to display the presence of pNFS datasets in a zpool on the data server.
zfs(1M) will be modified to allow the "create" subcommand to set a new property at the time of creating a ZFS file system for storage of file metadata on the pNFS metadata server. This property will flag the file system as one that is to be used for the purpose of storing file metadata. The name of this property is still to be determined, but one possibility is "pnfs". Values for the property would be "on" or "off".
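A hypothetical sketch of this usage, with the caveat that the property name is still to be determined (the "pnfs" name and the pool/dataset names below are assumptions):

```shell
# Create a ZFS file system flagged for pNFS metadata storage
# ("pnfs" is one possible property name; it is not final):
zfs create -o pnfs=on mdspool/meta
```

Because the property is on/off, an administrator could later disable the flag with an ordinary zfs set of the same property.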
For diagnosability, snoop(1M) will be modified to support the NFSv4.1 and control protocols, and mdb(1) macros, walkers and dcmds (yet to be defined) will be added.
npool(1M) is a new command that is used to specify the data servers that the metadata server will use for determining the layout of a file. For further information refer to the draft man page for the npool(1M) command.
dpool(1M) is a new command that is used to specify which storage (i.e., which ZFS storage pools or, in the future, which QFS file system) on a data server machine should be used for the storage of pNFS file data. For further information, refer to the draft man page for the dpool(1M) command.
pnfsalloc(1M) is a new command that allows a user to specify rules for the client and the server to consult when determining the layout of a newly created file. For further information, refer to the pnfsalloc(1M) command specification.
We are implementing the NFSv4.1 specification and we are not deviating from or extending the standard in any incompatible way.
This project will integrate into Solaris Nevada. There are no special hardware requirements, but the NFS/RDMA (PSARC/2007/347) project will give us the ability to exploit the RDMA capabilities of InfiniBand, which is relevant to our target audience.
Consolidation-private APIs, such as libnfsd.so and libdserv.so, may give other management frameworks the capability to manage the functionality of this project.
A benefit of implementing the NFSv4.1 specification is that we have the opportunity to interoperate with other vendors' NFSv4.1 implementations. The interoperability of the Solaris implementation with other vendors' implementations is tested at events such as Connectathon and Bakeathons. No interoperability with any other external products is planned.
No similar functionality is currently delivered with Solaris.
We will support a regular NFSv4.1 server and a pNFS metadata server to be active simultaneously. This is so that a single NFSv4.1 server can support pNFS file systems and non-pNFS file systems simultaneously.
We will support the capability of a pNFS metadata server and a pNFS data server to be active on the same machine and on the same port (2049).
For the pNFS client, multiple instances are only relevant if thought of as multiple mounts.
This is the first release, so earlier releases are not applicable. To accommodate future releases, the control protocol will be versioned, as will the data server filehandles used to identify the individual stripes.
The preliminary performance goals for the pNFS project are as follows:
The exact methods for doing performance measurement and analysis are still to be determined.
No exact scalability limits or potential bottlenecks have been identified at this time.
No exact information available at this time.
No exact information available at this time.
As with other file systems, if disk space is exhausted ENOSPC errors will be returned to the application.
If memory resources are exhausted, we may return EAGAIN to the application; in cases where memory allocations are done in the kernel with the KM_SLEEP flag, system hangs are possible. However, this can be mitigated by being conscious of the amount of memory being allocated when doing KM_SLEEP allocations.
One of the main causes for software failures will probably be bugs in the code. The team will reduce the risk of failures by exercising due diligence during development, via design reviews, code reviews and test execution.
No new software failure avoidance or recovery mechanisms are introduced by this project.
As pNFS is a network protocol, network failures are taken into account, and are well documented in the NFSv4.1 specification.
The control protocol (between the metadata server and data server) is also designed to be resilient to network failures. The control protocol is an RPC-based protocol, and high-level information about the messages in the protocol is documented in section 2.2.3.1.
Data integrity will be no different than it is for ordinary NFS.
Recovery from failed pNFS components, including recovery of NFS state, is documented in the NFSv4.1 specification.
Our implementation will leverage existing commands (e.g., zpool status), extend existing commands (nfsstat, snoop) and introduce new commands (dpool status, npool status) to allow a user to detect and diagnose a failure.
NFS currently leverages SMF for the management of NFS related services. Our pNFS implementation will continue to build on this in order to allow for service restarting and dependency management.
All minor versions of NFSv4 require support for Kerberos; therefore, this is also true for pNFS. Kerberos will be supported for all operations between the client and the metadata server, as well as for all operations between the client and the data servers.
Additionally, Kerberos will be supported for the control protocol, which operates between the metadata server and the data servers. Host principals will exist for the metadata server and data servers.
In addition to the support of Kerberos by the control protocol, data servers must be approved by an administrator on the metadata server before being used within the server community. This is accomplished with the npool(1M) command as documented in section 2.3.
Commands are documented elsewhere. Configuration data will be stored in SMF.
None.
No issues.
All new commands will produce message catalogs.
No issues.
This project will not be ported to other platforms.
This project will be entirely administratable with command line interfaces.
This product will be bundled in the ON consolidation. No new packages will be delivered beyond those that already exist. Those are: SUNWnfscr, SUNWnfscu, SUNWnfssr, and SUNWnfssu
Client side functionality will be delivered in SUNWnfsc*. Server side functionality will be delivered in SUNWnfss*.
Note that all configuration information will be stored in SMF, therefore, there will be no addition or modification of configuration files.
pNFS will be installed as a part of the Solaris installation procedure. No additional installation steps are needed. Configuration of pNFS will be done with the command line interfaces documented in section 2.3.
No effect on system files
None
Our pre-ON putback source code and binaries (in the form of BFU archives) are being posted. You can find the source and binaries on our NFSv4.1 pNFS OpenSolaris project page under Downloads.
The source code is licensed under the Common Development and Distribution License (CDDL).
The pre-ON putback binaries are licensed under the OpenSolaris Binary License (OBL).
We will not provide the capability to remove or uninstall pNFS from
a Solaris machine. However, a user can disable the use of pNFS in any
of the following ways:
1. On the client, execute the mount_nfs(1M) command with "-o nopnfs" or
"-o vers=[2|3|4]".
2. On the client, set the NFS_CLIENT_VERSMAX configuration variable
to 4, 3 or 2. Clients require NFSv4.1 ("41") support in order to use
pNFS. Setting NFS_CLIENT_VERSMAX as specified sets the maximum version
of the NFS protocol that the NFS client will use.
3. On the server, set the NFS_SERVER_VERSMAX configuration variable
to 4, 3 or 2. Servers require support of NFSv4.1 ("41") in
order to use pNFS. Setting NFS_SERVER_VERSMAX as specified sets the
maximum version of the NFS protocol that the NFS server will offer.
4. Disable the NFS server service, data server and client SMF
services.
Note that the current method for setting NFS_SERVER_VERSMIN and
NFS_SERVER_VERSMAX is in the /etc/default/nfs configuration file.
PSARC/2007/393 is pursuing the conversion of these configuration
variables into SMF properties.
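The disabling steps above can be sketched as follows; the server name, mount point, and the data server SMF service name are illustrative assumptions (only nfs/server is an existing service FMRI):

```shell
# 1. Per-mount, on the client, disable pNFS:
mount -F nfs -o nopnfs server:/export /mnt

# 2./3. Cap the protocol version in /etc/default/nfs
# (until PSARC/2007/393 converts these to SMF properties):
#   NFS_CLIENT_VERSMAX=4
#   NFS_SERVER_VERSMAX=4

# 4. Disable the NFS server SMF service (the data server and client
#    service FMRIs delivered by this project are not yet final):
svcadm disable svc:/network/nfs/server:default
```

These commands require superuser privileges, so they are shown only as usage fragments.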
The pNFS project will not deliver a GUI for administration of a pNFS configuration. The command line interfaces listed in section 2.3 will be the sole administrative interfaces.
pNFS is tightly integrated into the existing NFS client architecture to maximize code re-use. The vast majority of the code will be common between NFSv4.0 and NFSv4.1 since the non-data related interactions are roughly the same. The overall NFS client architecture is maintained.
The code paths will diverge slightly when performing an "over the wire" I/O request. In the pNFS I/O pathway, the client must evaluate the request in the context of the current file layout and split the request up into pieces which will go to each distinct data server. These pieces can then be dispatched to the various data servers in parallel.
The traditional SUS file interfaces remain the primary interface by which application programs access pNFS files. No API extensions will be delivered by this project.
The mount command will be extended to provide administrative control to the client's selection of version and minor version numbers. The administrator can also disable the use of pnfs with the -o nopnfs option. The administrator can also control these parameters via system wide SMF properties (or configuration variables in /etc/default/nfs until PSARC/2007/393 is implemented).
The client also has some control over how a new file is to be striped over the data servers in the community. This information is carried over the network in the NFSv4.1 protocol as hints and is strictly advisory to the metadata server. Further details of this interface are given in section 6.
The internal structure of the NFS client remains largely unchanged. pNFS I/O requests are sent in parallel to the relevant data servers using mechanisms similar to the asynchronous thread model already in place.
An internal interface within the NFS client is created to deal with the differences between minor version 0 and minor version 1. This private interface is not presented here.
The operational characteristics of a pNFS enabled client are largely transparent. The pNFS enabled client will auto-negotiate the version and minor version numbers, per server, at mount time. The client will get a file's layout prior to initiating any I/O. If any given file does not have a layout, the client will fall back to normal NFSv4.1 I/O operations directly to the metadata server, transparently to the application.
The I/O caching characteristics of the client are unchanged by pNFS. The client will use the traditional "close-to-open" semantics for flushing data back to the data servers.
The pNFS Metadata Server is functionally very similar to a conventional Solaris NFSv4 server. The main differences are that it will store pNFS layout information in system attributes, it will not store file data locally, and it will communicate via the control protocol with pNFS data servers.
The new npool(1M) command, outlined in section 2.3, is used to administer the resources used by the MDS. The zfs(1M) command will be modified to allow the setting of a new property at the time of creating a ZFS file system (i.e., with zfs create). This property will flag the file system as one to be used for storing file metadata on the pNFS metadata server. The name of this property is still to be determined, but one possibility is "pnfs". Values for the property would be "on" or "off".
The pNFS Metadata Server is started the same way that other NFS shares are started. That is, either by sharemgr(1M), or by ZFS itself.
The pNFS data server consists of a kernel module that registers itself with the Solaris NFS server.
The dpool command is used to allocate and control resources used by the pNFS data server.
The data server is controlled via libdserv.so, an internal API. libdserv.so uses SMF (libscf.so) to store configuration information and start/stop the processes needed to function as a pNFS data server.
The pNFS data server consists of a kernel module that registers itself with the Solaris NFS server. Requests specific to a pNFS data server are routed to this module.
Changes will also be made to the NFS daemon (nfsd) to support a pNFS data server.
Layout allocation is used on both a pNFS client and a pNFS metadata server. On a client, it governs the contents of a layouthint attribute, which a metadata server may use to influence the layout of a newly created file. On a metadata server, the policy engine governs the newly created layouts directly, with provisions for taking a layouthint attribute into account.
The pnfsalloc(1M) command line interface is the most obvious user-visible aspect of layout allocation. On a pNFS client, there will also be a yet to be determined CLI that will show the layout of an existing file.
The layout allocator is accessible to the other components (pNFS client and pNFS metadata server) via a project private API.
NFS version 4.1 - http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt
NFS/RDMA - Transport version update (PSARC/2007/347)
Converting /etc/default/{nfs/autofs} to SMF properties (PSARC/2007/393)
NFS version 4.1 specification - http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt