The pNFS project will deliver both client- and server-side support for the portions of NFSv4 minor version 1 (NFSv4.1) that support Parallel NFS (pNFS). The pNFS project plans to deliver this functionality into the ON consolidation.
pNFS is one of the features defined in the NFSv4.1 protocol specification, which is currently an IETF Internet Draft. The pNFS functionality is a method of introducing data access parallelism. The NFSv4.1 protocol defines a method of separating the metadata (names and attributes) of a filesystem from the location of the file data; it goes beyond simple metadata and data separation to define a method of striping the data amongst a set of data servers and allowing clients to access the data servers directly. This differs from the traditional NFS server, which holds the names of files and their data under the single umbrella of the server; all access to the files must go through the single server exporting the file system.
The pNFS project will deliver the following NFSv4.1 components:
Again, support for the blocks- and objects-based layout types is out of scope for the pNFS project.
In addition to delivering the NFSv4.1 operations in support of pNFS, the pNFS project will address the following additional pieces of NFSv4.1 functionality:
The goal of the pNFS project is to add the needed NFSv4.1 functionality in support of pNFS. The goal of pNFS itself is to provide the capability for clients to access data in parallel from many data servers and, in turn, increase throughput capability significantly.
N/A - This is the first release for the pNFS project.
Logistical information for the pNFS project is captured on the NFSv4.1 pNFS OpenSolaris project page.
As a summary, the base NFSv4.1 functionality will be tested to make sure the pNFS implementation complies with the NFSv4.1 specification. In addition to testing the Solaris client and server against each other we will be testing interoperability with other NFSv4.1 implementations at events such as Connectathon and Bakeathons. Beyond base NFSv4.1 testing, performance and scalability will be tested in order to verify that we meet the related criteria as specified in section 2.5, below.
The pNFS test plan has been drafted; its complete details will therefore not be repeated in this document.
Documentation will be developed in the context of the NFSv4.1 OpenSolaris project and we will be soliciting feedback and contribution from members of the community. Proposed documentation includes:
All documentation developed will be posted under the OpenSolaris NFSv4.1 project and reviewed by that project team. This review will be handled as it has historically been done with internal reviews. All comments will be considered for integration and verified for accuracy.
The pNFS project is dependent upon NFS/RDMA - Transport version update (PSARC/2007/347)
We depend on the NFS version 4 Working Group (NFSv4 WG) of the IETF and its NFSv4.1 specification effort. The NFSv4 WG is actively working on stabilizing the NFSv4.1 specification, and it is now considered functionally complete. To assist in bringing the Internet Draft to closure (working group last call), the document editors have been hosting a set of formal reviews that will continue through the summer of 2007. It is expected that the Working Group will complete the document around the end of 2007.
The pNFS team is deeply involved in the NFS version 4 Working Group and is able to manage this dependency effectively.
None
None
Converting /etc/default/{nfs/autofs} to SMF properties
(PSARC/2007/393)
PSARC/2007/393 is converting the configuration variables stored in
/etc/default/nfs to SMF properties. The pNFS project will be
introducing additional configuration variables and will be
following the conventions set forth in this project.
None
There are two ways in which our implementation of pNFS will compete:
Using an open standard such as pNFS gives us many advantages. Foremost is the fact that our product will interoperate with other vendors' products. For example, Linux clients will be able to access data on Solaris servers. An openly developed standard also gives us a better tested standard to implement to, meaning that we have a much lower risk of discovering fundamental design flaws in the protocol. And, finally, one of the advantages of an open protocol is that customers benefit from not being locked into any one implementation.
Solaris's pNFS implementation will gain advantages from tight integration with ZFS, SMF, and other advanced features of Solaris.
The Parallel NFS or pNFS functionality, as its name implies, is a method of introducing data access parallelism. The NFSv4.1 protocol defines a method of separating the metadata (names and attributes) of a filesystem from the location of the file data; it goes beyond simple name/data separation to define a method of striping the data amongst a set of data servers. This is obviously very different from the traditional NFS server, which holds the names of files and their data under the single umbrella of the server. There are products in existence that are multi-node NFS servers, but they are limited in the degree to which the client participates in the separation of metadata and data. The NFSv4.1 client can be a direct participant in locating file data exactly and can thus avoid interacting solely with the single NFS server when moving data.
The NFSv4.1 pNFS server is now a collection or community of server resources or components; these community members are assumed to be controlled by the metadata server.
The pNFS client still accesses a single metadata server for traversal or interaction with the namespace; when the client moves data to and from the server it may be directly interacting with the set of data servers belonging to the pNFS server community.
pNFS
--------------------------------------------------
| pNFS Server |
| |
| .-------------- .-------------- |
| |data-server | |data-server | |
| | | | | |
| `-.------------ `-----.-------- |
| | .-------------- | .-------------- |
| | |data-server | | |data-server | |
| | | | | | | |
| | `------.------- | `------.------- |
| _|____________|_________|_________|__________ |
| | |
| ,---------'----------- |
| | metadata server | |
| |____________________| |
`---.------.--------.-------------.-------.-------
| | | | |
____|______|________|_____________|_______|________
| | |
| | |
.-----+-----. .-----+-----. .-----+-----.
| | | | | |
|pNFS Client| |pNFS Client| .... |pNFS Client|
| | | | | |
`-----------' `-----------' `-----------'
As mentioned, the user's view of the "pNFS Server" continues to appear on the network as a regular NFS server even though there are multiple, distinct and addressable components of the server. There is a single server from which a filesystem is mounted and accessed. The administrator knows the details of the "pNFS Server" or community. The pNFS Client implementation will know the details of the pNFS server through NFSv4.1 protocol interaction. When it comes to things like mount points and automount maps, the look and feel of the NFS server is the same as it has been: single server name and its self-contained namespace.
The pNFS enabled client determines the location of file data by directly querying the pNFS server. In pNFS nomenclature, the client requests a file "layout". When a file is opened, the pNFS client will ask the metadata server for the file's layout. If one is available, the server will give the layout to the client. When the client moves data, the layout is consulted to determine the data-server(s) upon which the data resides; once the offset and range are matched to the appropriate data-server(s), the data movement is completed with read and write requests. The pNFS protocol's layout definition provides for straightforward striping of data only. There is one twist to the striping -- the location may be specified by two paths, thus allowing for a simple multi-pathing feature.
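The offset-to-data-server matching described above can be sketched as follows, assuming a simple round-robin stripe with a fixed stripe unit; the unit size, server count, and variable names here are illustrative assumptions, not values taken from the layout definition:

```shell
#!/bin/sh
# Illustrative sketch only: map a byte offset to a data server index
# under an assumed round-robin stripe.
STRIPE_UNIT=65536      # bytes per stripe unit (assumed value)
NUM_DSERVERS=4         # data servers named in the layout (assumed)

offset=196608          # byte offset of the I/O request
unit_index=$(( offset / STRIPE_UNIT ))          # which stripe unit
dserver_index=$(( unit_index % NUM_DSERVERS ))  # which data server
echo "offset $offset maps to data server $dserver_index"
```

With these assumed values, offset 196608 falls in stripe unit 3, which round-robin places on data server 3; a request spanning several stripe units would be split into one piece per data server it touches.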
With the layout in hand, the pNFS client is then enabled to generate read/write requests in parallel or by a method of its own choice. The layout is thus a simple enablement for the pNFS client to increase its overall data throughput. The pNFS server is also a beneficiary, by nature of the horizontal scaling of data access along with the reduction in read/write operations being directly serviced by the metadata server. Obviously, the main intent of the NFSv4.1 pNFS feature is to significantly improve the data throughput capabilities of NFS servers. The NFSv4.1 protocol requires that the metadata server always be able to service read/write requests itself. This requirement allows for NFSv4.1 clients that are not enabled for pNFS, or for cases where the available parallelism is not required.
The NFSv4.1 protocol defines interaction between client and server. There is no specification for the interaction between components of the pNFS server. This interaction or coordination of the pNFS server community members is left as a pNFS server implementation detail. Given the lack of an open protocol definition, pNFS server components will be homogeneous in their implementation. This isn't necessarily a bad thing since there is a variety of server filesystem architectures already present in the NFS server market. The lack of protocol definition allows for the most effective reuse of existing filesystem and server technology. Obviously there is a well-defined set of requirements or expectations of the metadata and data servers in the form of the NFSv4.1 protocol.
Maintaining the theme of inclusiveness, the pNFS protocol allows for a variety of data movement or transfer methods between the client and pNFS server. The NFSv4.1 layout mechanism defines layout "types". The types are then defined as a particular data movement or transport protocol. The layout mechanism also allows for inclusion of newly defined types such that the NFSv4.1 protocol can adapt to future requirements or ideas.
There are three types of layouts currently being defined for pNFS; they are generically referred to as: files, objects, blocks. The "files" layout type uses the NFSv4.1 read/write operations to the data-server. The "files" type is being defined within the NFSv4.1 Internet Draft. The "objects" layout type refers to the OSD storage protocol as defined by the T10 "SCSI Object-Based Storage Device Commands" protocol draft. The "blocks" layout refers to the use of SCSI (in its many forms). The pNFS OSD and block layout definitions are defined in separate Internet Drafts.
For additional detail, the current Internet Drafts for the items
mentioned above are:
"NFSv4 Minor Version 1"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt
"Object-based pNFS Operations"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-obj-03.txt
"pNFS Block/Volume Layout"
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-pnfs-block-03.txt
| Interface Name | Proposed Stability Classification | Specified in What Document? | Former Stability Classification or Other Comments |
| NFSv4.1 | Standard | Currently http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt | The next minor version of NFSv4 (PSARC/1999/289). |
| libnfsd.so | Evolving, Consolidation Private (?) | /usr/include/libnfsd.h | |
| libdserv.so | Evolving, Consolidation Private (?) | /usr/include/libdserv.h | |
| Interface Name | Stability Classification | Specified in What Document? | Former Stability Classification or Other Comments |
| ZFS DMU | Consolidation Private (?) | dmu.h, dmu.c | We will most likely need a contract for the use of the DMU interfaces. Fortunately, we (pNFS and ZFS) will both be in the ON consolidation, therefore, any incompatible changes in ZFS which affect pNFS are bound to be caught very early on. |
| libscf(3) | Evolving | libscf(3LIB) man page. | |
Most significant among pNFS's internal interfaces is the "control protocol" mentioned in section 2.1. This is the protocol used between the MDS and the data servers. Other internal interfaces will be listed in the next subsection.
A general description of the RPC-based control protocol is provided. This protocol will remain project private and subject to change as the functionality evolves. Note that the project commits to the general architectural tenet of using RPC versioning to allow for effective upgrades of the various pNFS server components.
| Interface Name | Stability Classification | Specified in What Document? | Other Comments |
| /dev/dserv ioctl() | Project Private | /usr/include/sys/dserv_impl.h | Only used by libdserv.so and/or libnfsd.so. |
| _nfssys() system call | Project Private | nfssys() in nfs_sys.c | Already exists; new functionality will be integrated into libnfsd.so and possibly libdserv.so. |
The pNFS project will deliver a set of new command line interfaces (CLIs) and modifications to existing CLIs. The focus of the CLIs to be delivered is on the system administrator, not the end user. With the exception of the modifications to mount_nfs(1M) (additional information below), the CLIs will be used by individuals to configure and manage a pNFS installation. The pNFS project will not provide a GUI for managing pNFS.
The CLIs can be divided up into the following sub-sections:
mount_nfs(1M) will be modified in the following
ways:
1. addition of a new value for the "vers=" option in
order to allow a user to specify an NFS version of
"41" (i.e., NFSv4.1). This new option will exist in
addition to the other "vers=" values: 2, 3 and 4.
2. addition of a "nopnfs" option. This option
specifies that the client should disable the use of
pNFS on this mount.
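As a hypothetical illustration of the two modifications above (the server and mount-point names are examples only, and the option values are as proposed, not final):

```shell
# Mount with NFSv4.1 explicitly, allowing pNFS to be used:
mount -F nfs -o vers=41 mds.example.com:/export/data /mnt/data

# Mount with NFSv4.1 but disable pNFS for this mount:
mount -F nfs -o vers=41,nopnfs mds.example.com:/export/data /mnt/data
```

These commands require superuser privileges and a running NFSv4.1 server, so they are shown only as usage fragments.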
nfsstat(1M) will be modified to display statistics about the NFSv4.1 protocol as well as the control protocol between the metadata server and the data servers.
zfs(1M) will be modified to allow the "list" subcommand to display the presence of pNFS datasets in a zpool on the data server.
zfs(1M) will be modified to allow the "create" subcommand to set a new property at the time of creating a ZFS file system for storage of file metadata on the pNFS metadata server. This property will flag the file system as one that is to be used for the purpose of storing file metadata. The name of this property is still to be determined, but one possibility is "pnfs". Values for the property would be "on" or "off".
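A hypothetical sketch of this usage, with the caveat that the property name is still to be determined (the "pnfs" name and the pool/dataset names below are assumptions):

```shell
# Create a ZFS file system flagged for pNFS metadata storage
# ("pnfs" is one possible property name; it is not final):
zfs create -o pnfs=on mdspool/meta
```

Because the property is on/off, an administrator could later disable the flag with an ordinary zfs set of the same property.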
For diagnosability, snoop(1M) will be modified to support the NFSv4.1 and control protocols, and mdb(1) macros, walkers and dcmds (yet to be defined) will be added.
npool(1M) is a new command that is used to specify the data servers that the metadata server will use for determining the layout of a file. For further information refer to the draft man page for the npool(1M) command.
dpool(1M) is a new command that is used to specify which storage (i.e., which ZFS storage pools or, in the future, which QFS file system) on a data server machine should be used for the storage of pNFS file data. For further information, refer to the draft man page for the dpool(1M) command.
pnfsalloc(1M) is a new command that allows a user to specify rules for the client and the server to consult when determining the layout of a newly created file. For further information, refer to the pnfsalloc(1M) command specification.
We are implementing the NFSv4.1 specification and we are not deviating from or extending the standard in any incompatible way.
This project will integrate into Solaris Nevada. There are no special hardware requirements, but the NFS/RDMA (PSARC/2007/347) project will give us the ability to exploit the RDMA capabilities of InfiniBand, which is relevant to our target audience.
Consolidation-private APIs, such as libnfsd.so and libdserv.so, may give other management frameworks the capability to manage the functionality of this project.
A benefit of implementing the NFSv4.1 specification is that we have the opportunity to interoperate with other vendors' NFSv4.1 implementations. The interoperability of the Solaris implementation with other vendors' implementations is tested at events such as Connectathon and Bakeathons. No interoperability with any other external products is planned.
No similar functionality is currently delivered with Solaris.
We will support a regular NFSv4.1 server and a pNFS metadata server to be active simultaneously. This is so that a single NFSv4.1 server can support pNFS file systems and non-pNFS file systems simultaneously.
We will support the capability of a pNFS metadata server and a pNFS data server to be active on the same machine and on the same port (2049).
For the pNFS client, multiple instances are only relevant if thought of as multiple mounts.
This is the first release, so earlier releases are not applicable. To accommodate future releases, the control protocol will be versioned, as will the data server filehandles used to identify the individual stripes.
The preliminary performance goals for the pNFS project are as follows:
The exact methods for doing performance measurement and analysis are still to be determined.
No exact scalability limits or potential bottlenecks have been identified at this time.
No exact information available at this time.
No exact information available at this time.
As with other file systems, if disk space is exhausted ENOSPC errors will be returned to the application.
If memory resources are exhausted, we may return EAGAIN to the application; in cases where memory allocations are done in the kernel with the KM_SLEEP flag, system hangs are possible. However, this can be mitigated by being conscious of the amount of memory being allocated when doing KM_SLEEP allocations.
One of the main causes for software failures will probably be bugs in the code. The team will reduce the risk of failures by exercising due diligence during development, via design reviews, code reviews and test execution.
No new software failure avoidance or recovery mechanisms are introduced by this project.
As pNFS is a network protocol, network failures are taken into account, and are well documented in the NFSv4.1 specification.
The control protocol (between the metadata server and data server) is also designed to be resilient to network failures. The control protocol is an RPC-based protocol, and high-level information about the messages in the protocol is documented in section 2.2.3.1.
Data integrity will be no different than it is for ordinary NFS.
Recovery from failed pNFS components, including recovery of NFS state, is documented in the NFSv4.1 specification.
Our implementation will leverage existing commands (e.g., zpool status), extend existing commands (nfsstat, snoop) and introduce new commands (dpool status, npool status) to allow a user to detect and diagnose a failure.
NFS currently leverages SMF for the management of NFS related services. Our pNFS implementation will continue to build on this in order to allow for service restarting and dependency management.
All minor versions of NFSv4 require support for Kerberos; therefore, this is also true for pNFS. Kerberos will be supported for all operations between the client and the metadata server, as well as for all operations between the client and the data servers.
Additionally, Kerberos will be supported for the control protocol, which operates between the metadata server and the data servers. Host principals will exist for the metadata server and data servers.
In addition to the support of Kerberos by the control protocol, data servers must be approved by an administrator on the metadata server before being used within the server community. This is accomplished with the npool(1M) command as documented in section 2.3.
Commands are documented elsewhere. Configuration data will be stored in SMF.
None.
No issues.
All new commands will produce message catalogs.
No issues.
This project will not be ported to other platforms.
This project will be entirely administratable with command line interfaces.
This product will be bundled in the ON consolidation. No new packages will be delivered beyond those that already exist. Those are: SUNWnfscr, SUNWnfscu, SUNWnfssr, and SUNWnfssu
Client side functionality will be delivered in SUNWnfsc*. Server side functionality will be delivered in SUNWnfss*.
Note that all configuration information will be stored in SMF, therefore, there will be no addition or modification of configuration files.
pNFS will be installed as a part of the Solaris installation procedure. No additional installation steps are needed. Configuration of pNFS will be done with the command line interfaces documented in section 2.3.
No effect on system files
None
Our pre-ON putback source code and binaries (in the form of BFU archives) are being posted. You can find the source and binaries on our NFSv4.1 pNFS OpenSolaris project page under Downloads.
The source code is licensed under the Common Development and Distribution License (CDDL).
The pre-ON putback binaries are licensed under the OpenSolaris Binary License (OBL).
We will not provide the capability to remove or uninstall pNFS from
a Solaris machine. However, a user can disable the use of pNFS in any
of the following ways:
1. On the client, execute the mount_nfs(1M) command with "-o nopnfs" or
"-o vers=[2|3|4]".
2. On the client, set the NFS_CLIENT_VERSMAX configuration variable
to 4, 3 or 2. Clients require NFSv4.1 ("41") support in order to use
pNFS. Setting NFS_CLIENT_VERSMAX as specified sets the maximum version
of the NFS protocol that the NFS client will use.
3. On the server, set the NFS_SERVER_VERSMAX configuration variable
to 4, 3 or 2. Servers require support of NFSv4.1 ("41") in
order to use pNFS. Setting NFS_SERVER_VERSMAX as specified sets the
maximum version of the NFS protocol that the NFS server will offer.
4. Disable the NFS server service, data server and client SMF
services.
Note that the current method for setting NFS_SERVER_VERSMIN and
NFS_SERVER_VERSMAX is in the /etc/default/nfs configuration file.
PSARC/2007/393 is pursuing the conversion of these configuration
variables into SMF properties.
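The disabling steps above can be sketched as follows; the server name, mount point, and the data server SMF service name are illustrative assumptions (only nfs/server is an existing service FMRI):

```shell
# 1. Per-mount, on the client, disable pNFS:
mount -F nfs -o nopnfs server:/export /mnt

# 2./3. Cap the protocol version in /etc/default/nfs
# (until PSARC/2007/393 converts these to SMF properties):
#   NFS_CLIENT_VERSMAX=4
#   NFS_SERVER_VERSMAX=4

# 4. Disable the NFS server SMF service (the data server and client
#    service FMRIs delivered by this project are not yet final):
svcadm disable svc:/network/nfs/server:default
```

These commands require superuser privileges, so they are shown only as usage fragments.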
The pNFS project will not deliver a GUI for administration of a pNFS configuration. The command line interfaces listed in section 2.3 will be the sole administrative interfaces.
pNFS is tightly integrated into the existing NFS client architecture to maximize code re-use. The vast majority of the code will be common between NFSv4.0 and NFSv4.1 since the non-data related interactions are roughly the same. The overall NFS client architecture is maintained.
The code paths will diverge slightly when performing an "over the wire" I/O request. In the pNFS I/O pathway, the client must evaluate the request in the context of the current file layout and split the request up into pieces which will go to each distinct data server. These pieces can then be dispatched to the various data servers in parallel.
The traditional SUS file interfaces remain the primary interface by which application programs access pNFS files. No API extensions will be delivered by this project.
The mount command will be extended to provide administrative control to the client's selection of version and minor version numbers. The administrator can also disable the use of pnfs with the -o nopnfs option. The administrator can also control these parameters via system wide SMF properties (or configuration variables in /etc/default/nfs until PSARC/2007/393 is implemented).
The client also has some control over how a new file is to be striped over the data servers in the community. This information is carried over the network in the NFSv4.1 protocol as hints and is strictly advisory to the metadata server. Further details of this interface are given in section 6.
The internal structure of the NFS client remains largely unchanged. pNFS I/O requests are sent in parallel to the relevant data servers using mechanisms similar to the asynchronous thread model already in place.
An internal interface within the NFS client is created to deal with the differences between minor version 0 and minor version 1. This private interface is not presented here.
The operational characteristics of a pNFS enabled client are largely transparent. The pNFS enabled client will auto-negotiate the version and minor version numbers, per server, at mount time. The client will get a file's layout prior to initiating any I/O. If any given file does not have a layout, the client will fall back to normal NFSv4.1 I/O operations directly to the metadata server, transparently to the application.
The I/O caching characteristics of the client are unchanged by pNFS. The client will use the traditional "close-to-open" semantics for flushing data back to the data servers.
The pNFS Metadata Server is functionally very similar to a conventional Solaris NFSv4 server. The main differences are that it will store pNFS layout information in system attributes, it will not store file data locally, and it will communicate via the control protocol with pNFS data servers.
The new npool(1M) command, outlined in section 2.3, is used to administer the resources used by the MDS. The zfs(1M) command will be modified to allow the setting of a new property at the time of creating a ZFS file system (i.e., with zfs create). This property will flag the file system as one to be used for storing file metadata on the pNFS metadata server. The name of this property is still to be determined, but one possibility is "pnfs". Values for the property would be "on" or "off".
The pNFS Metadata Server is started the same way that other NFS shares are started. That is, either by sharemgr(1M), or by ZFS itself.
The pNFS data server consists of a kernel module that registers itself with the Solaris NFS server.
The dpool command is used to allocate and control resources used by the pNFS data server.
The data server is controlled via libdserv.so, an internal API. libdserv.so uses SMF (libscf.so) to store configuration information and start/stop the processes needed to function as a pNFS data server.
The pNFS data server consists of a kernel module that registers itself with the Solaris NFS server. Requests specific to a pNFS data server are routed to this module.
Changes will also be made to the NFS daemon (nfsd) to support a pNFS data server.
Layout allocation is used on both a pNFS client and a pNFS metadata server. On a client, it governs the contents of a layouthint attribute, which a metadata server may use to influence the layout of a newly created file. On a metadata server, the policy engine governs the newly created layouts directly, with provisions for taking a layouthint attribute into account.
The pnfsalloc(1M) command line interface is the most obvious user-visible aspect of layout allocation. On a pNFS client, there will also be a yet to be determined CLI that will show the layout of an existing file.
The layout allocator is accessible to the other components (pNFS client and pNFS metadata server) via a project private API.
NFS version 4.1 - http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt
NFS/RDMA - Transport version update (PSARC/2007/347)
Converting /etc/default/{nfs/autofs} to SMF properties (PSARC/2007/393)
NFS version 4.1 specification - http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-11.txt