Support for Reparse Points (PSARC 2009/387)
Draft 4, July 28th, 2009

Afshin.Salek@Sun.COM
Robert.Thurlow@Sun.COM
Dai.Ngo@Sun.COM
Alan.M.Wright@Sun.COM

1. Introduction

The explosive growth of data storage and the massive proliferation of file servers and NAS appliances have created a namespace management nightmare for company data centers and storage administrators. IT administrators spend a great deal of time on file management tasks (adding file servers, rebalancing storage, setting up failover, etc.) and data movement tasks (replication, migration, consolidation, data distribution). These are tedious and time-consuming for administrators, disruptive to users, and expensive for companies [1]. Companies are looking for better ways to scale and manage their file systems, and global namespaces provide a means to ease and simplify the problem.

A namespace is a logical abstraction layer between clients (users and applications) and file systems; it provides a method of viewing and accessing files that is independent of the physical file locations. This is a powerful concept, as it means an administrator can use a namespace to logically arrange and present data to users, irrespective of where the data is located. It also gives administrators the ability to add, change, move and reconfigure physical file storage without affecting how it is viewed or accessed by users [1].

The two most popular file sharing protocols, NFS and SMB, offer a mechanism to implement such a global namespace. NFSv4.x offers it primarily for UNIX environments through referrals [2] and SMB offers it primarily for Windows environments through DFS (Distributed File System) [3]. In this context, a namespace is a group of exports or shared folders, possibly located on different servers, that is presented as a virtual directory tree - it appears to users/clients as if they are accessing a local file system on the server that is hosting the namespace.
A virtual namespace on a host server is typically composed of three components:

    - A root folder that is exported or shared by the host server so that the namespace is visible over the network
    - Regular folders
    - Links that represent exports or shares [typically on other systems] that hold the actual data behind the virtual namespace

This unified namespace is managed centrally on the server that hosts it. The links in a namespace provide the mapping between the virtual namespace and the physical backend servers, serving as a layer of abstraction that eases administration.

Solaris is in the unique position of offering both NFS and SMB services natively, which means it can offer a true heterogeneous unified namespace for these network file protocols. With the addition of reparse points and referrals, Solaris will support global namespaces that can be shared over both NFS and SMB simultaneously or separately.

This case does not deal with the specifics of NFS or DFS referrals; it introduces reparse points as the infrastructure required to support referrals.

    - Reparse points are file system objects within the directory hierarchy on a local file system on the host server.
    - Reparse points must be usable for both NFS and DFS referrals. Since NFS and DFS referrals have different formats, reparse points must support a mechanism to map between the object content and the format required by the protocols.

From a more abstract perspective, NFS and DFS referrals are examples of a more general mechanism to indicate that data is not present at a particular location but can be found at some alternate location(s). Other examples of services that could utilize this infrastructure include Hierarchical Storage Management (HSM), data migration, conventional symbolic links and mount points.

A reparse point is a generic marker for namespace redirection that contains the metadata required to determine the redirection target or destination.
Reparse points are intended to be a generic and extensible platform for location redirection and, as such, the file system in which they exist need not be cognizant of the reparse point format or content. Services that use reparse points must know how to interpret and use reparse point content.

2. Scope

This case presents an underlying infrastructure for a generic location redirection mechanism. As such, details of how consumer services, such as NFS/DFS referrals, are implemented are out of the scope of this case, as is any mechanism to follow reparse points locally. Note that there is nothing in this case that precludes support for local services following reparse points. An umbrella case, PSARC/2009/399, has been submitted to cover follow-on cases.

3. Reparse Point Representation

Various options have been considered as mechanisms to represent reparse points in the file system:

    1) introduce a new object type
    2) mount points
    3) use one of the existing types, i.e. file, directory or symbolic link
    4) use one of the existing types tagged as a reparse point

The following criteria have been used to select the implementation proposed in this case:

    a) reparse points represent specific functionality: location redirection
    b) reparse points have associated data
    c) minimum effect on existing utilities and applications
    d) ability to use existing backup utilities and software without modification

The idea of a new object type was rejected based on criteria (c) and (d) since utilities, particularly backup tools, would have to be modified to deal with a new object type.

The idea of using a mount point is also problematic because there is no obvious way to satisfy (d), i.e. to assign the reparse data in such a way that it could be backed up or restored using existing tools.

The use of an existing object type without any kind of tag or qualifier is also not feasible due to requirement (a).
Without some form of qualifier, services may be unable to determine that the object should be treated as a reparse point rather than as its native form would suggest.

Even with a tag or qualifier, attempting to use normal files or directories is problematic when it comes to defining the expected behavior for each operation on the object, depending on whether or not it has been tagged as a reparse point. For example, a directory can only be tagged as a reparse point if it is empty, otherwise its content would be inaccessible after it has been tagged, and once a directory has been tagged as a reparse point the file system must ensure that nothing can be created in it.

Automounter extensions were also considered but rejected because of the desire to create centrally administered namespaces, served by a group of file servers to near-zero-administration clients. It is expected to be easier to keep the namespaces uniform if only a small number of servers need to participate. Also, for both NFS and SMB referrals it is the client that selects the target rather than the server. The server only provides target information, which may include several possible targets, and it is up to clients to select a specific target to access data at the alternate location.

After considering the potential options for representing reparse points and taking the criteria listed above into account, the consensus is to represent reparse points in the file system via symbolic links, with a tag to mark them as reparse points. This particular object is appealing because a symbolic link already represents a form of namespace redirection and could be implemented as a form of reparse point.

To distinguish a regular symlink from a reparse point, an extensible system attribute (XAT_REPARSE) will be set on the symlink. This system attribute is a single bit that indicates whether or not a symlink contains reparse data.
The reparse data will be stored as the link target (note that symlinks do not support extended attributes).

The reparse data is not in file system path format, which is the typical format for a symlink target. In order to avoid coming up with a new format for reparse data as the link target, the proposal is to adopt the format used by magic links in BSD (http://www.daemon-systems.org/man/symlink.7.html):

    @{REPARSE@{svc-type1:svc-data} [@{svc-type2:svc-data}]...}

The data for each service will be in string format, which, typically, is expected to be a UUID string. This string can be a valid filename, which implies the following:

    - No POSIX rules have been broken.
    - Existing software that treats a reparse point as a regular symlink and follows it would behave as if it had followed a symlink with a non-existent target, so nothing should break.
    - There is an extremely remote likelihood that a file name exists in the directory in which a reparse point is being created that exactly matches the reparse point content, in which case reparse point creation will fail. There is also an extremely remote chance that such a file will be created after the reparse point has been created. No attempt will be made by the reparse point implementation to detect this situation because it seems so unlikely that it is not worth creating the infrastructure necessary to detect it.

The pattern above starts with "REPARSE" to distinguish it from magic links such as those supported by BSD. Note that this case is not a proposal to support BSD magic links; the intent is to avoid precluding the future addition of full BSD magic link support.

Multiple service entries can co-exist within the reparse point data. It is expected but not required that, normally, all entries would resolve to the same logical location, e.g. NFS and SMB clients would find the same files.

4. Interfaces

There is a need for both user space and kernel APIs to support reparse points.

4.1 Userspace API

In userspace the symlink(2) system call will be used to set a reparse point. The readlink(2) system call will be used in turn to read reparse point data.

4.2 Kernel API

In the kernel, VOP_SYMLINK and VOP_READLINK will be used to set/get reparse data. These interfaces allow all replication, archive and copy operations to preserve reparse points without further changes.

fop_symlink() will be modified to recognize the reparse @{REPARSE...} tag in the target and mark the symlink as a reparse point by passing the XAT_REPARSE extensible attribute to VOP_SYMLINK, which will set it on the symlink. Thus, when a reparse point is created, XAT_REPARSE will be set automatically by the VFS and the caller need not be concerned with or even aware of this attribute.

An important implication of this implementation is that backup tools do not need to know how to back up or restore XAT_REPARSE. Reparse points will be restored correctly as symlinks with XAT_REPARSE set by virtue of the VFS behavior.

4.3 API modifications

VFS feature registration can be used to determine whether or not a file system supports reparse points.

Two things are needed to obtain the reparse point data in the kernel:

    - The consumer needs to know that a reparse point has been encountered.
    - The consumer needs the vnode pointer to the symlink.

VOP_LOOKUP will be enhanced to return the vnode attributes. Thus, when the vnode is available, the caller can check the attributes to determine if the returned vnode is a reparse point or a regular symlink.
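As a userspace illustration of sections 4.1 and 4.2, the following minimal sketch reads a symlink body with readlink(2) and checks whether it begins with the @{REPARSE tag. The helper names looks_like_reparse() and read_link_text() are hypothetical and are not part of this case. On a system without the proposed VFS change, symlink(2) would simply create an ordinary symlink holding this text and XAT_REPARSE would not be set.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Prefix taken from the token syntax described in section 3. */
#define	REPARSE_TAG	"@{REPARSE"

/*
 * Hypothetical helper: returns 1 if the symlink text starts with the
 * reparse tag, 0 otherwise. This is a text check only; it says nothing
 * about whether XAT_REPARSE is set on the underlying object.
 */
static int
looks_like_reparse(const char *text)
{
	assert(text != NULL);
	return (strncmp(text, REPARSE_TAG, strlen(REPARSE_TAG)) == 0);
}

/*
 * Hypothetical helper: read the symlink body at 'path' into 'buf' and
 * NUL-terminate it (readlink(2) does not). Returns 0 on success or -1
 * if readlink(2) fails. Text longer than buflen - 1 bytes is truncated.
 */
static int
read_link_text(const char *path, char *buf, size_t buflen)
{
	ssize_t n;

	assert(path != NULL && buf != NULL && buflen > 0);
	n = readlink(path, buf, buflen - 1);
	if (n < 0)
		return (-1);
	buf[n] = '\0';
	return (0);
}
```

A caller might create the link with symlink("@{REPARSE@{nfs-local:abc}}", path), where "abc" is placeholder data, read it back with read_link_text(), and pass the result to looks_like_reparse(). The text prefix is only a hint; under this proposal the authoritative indicator is the XAT_REPARSE attribute.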
The current VOP_LOOKUP signature is:

    int VOP_LOOKUP(vnode_t *dvp, char *nm, vnode_t **vpp, pathname_t *pnp,
        int flags, vnode_t *rdir, cred_t *cr, caller_context_t *ct,
        int *deflags, pathname_t *ppnp)

The new VOP_LOOKUP signature is:

    int VOP_LOOKUP(vnode_t *dvp, char *nm, vnode_t **vpp, pathname_t *pnp,
        int flags, vnode_t *rdir, cred_t *cr, caller_context_t *ct,
        int *deflags, pathname_t *ppnp, vattr_t *vap)

If the new vattr_t pointer argument is non-NULL, VOP_LOOKUP will return the attributes, which avoids the need for consumers to make an additional call to VOP_GETATTR to determine how to interpret a symlink vnode.

5. Reparse Daemon and Shared Libraries

A library, /usr/lib/libreparse.so, will be introduced to support reparse point creation, manipulation and interpretation. libreparse.so will load additional, service specific, shared object plugins to provide service specific behavior. The plugin infrastructure will be modelled after the existing libshare architecture used by sharemgr(1M), with the plugin shared objects residing in /usr/lib/reparse as described in section 5.4.

Services will be identified by a service type using short strings of the form:

    nfs-local
    nfs-fedfs
    smb-local
    smb-ad

The library interfaces use nvlist_t lists in which the list entry names are the service types. The API will parse symlink text to nvlists for manipulation and unparse nvlists back to strings.

5.1 Library Interfaces

Functions returning int will return zero on success or a non-zero errno value on failure.

    nvlist_t *reparse_init(void);

Allocate an empty nvlist_t suitable for libreparse.so routines to manipulate. This routine will allocate memory, which must be freed by reparse_free().

    int reparse_parse(const char *string, nvlist_t *list);

Parse the specified string and populate the nvlist with the svc_types and data from the string. The string could be read from the reparse point symlink body. Existing or duplicate svc_type entries in the nvlist will be replaced.
This routine will allocate memory that must be freed by reparse_free().

    int reparse_add(nvlist_t *list, const char *svc_type, const char *string);

Add a service type entry to an nvlist with a copy of "string", replacing one of the same type if already present. This routine will allocate and free memory as needed.

    int reparse_remove(nvlist_t *list, const char *svc_type);

Remove a service type entry from the nvlist, if present. This routine will free memory that is no longer needed.

    int reparse_unparse(const nvlist_t *list, char **stringp);

Convert an nvlist back to a string format suitable to write to the reparse point symlink body. The string returned is in allocated memory and must be freed by the caller.

    void reparse_free(nvlist_t *list);

Frees all of the resources in the nvlist.

    int reparse_create(const char *path, const char *string);

Create a reparse point at a given pathname; the string format is validated. This function will fail if path refers to an existing file system object or an object named string already exists at the given path.

    int reparse_delete(const char *path);

Delete a reparse point at a given pathname. It will fail if a reparse point does not exist at the given path or the pathname is not a symlink.

    int reparse_deref(const char *svc_type, const char *svc_data, char *buf, size_t *bufsize);

Accept and parse the symlink data, and return a type-specific piece of data in buf. The caller specifies the size of the buffer provided via *bufsize; the routine will fail with EOVERFLOW if the results will not fit in the buffer, in which case *bufsize will contain the number of bytes needed to hold the results.

5.2 Kernel Interfaces

Functions returning int will return zero on success or a non-zero errno value on failure.

    int reparse_kderef(const char *svc_type, const char *svc_data, char *buf, size_t *bufsize);

This is the kernel version of reparse_deref() with the same signature and behaviour; it will make a door upcall to access reparse_deref().
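The buffer-size contract shared by reparse_deref() and reparse_kderef() (fail with EOVERFLOW and report the required size in *bufsize) suggests a simple retry pattern for callers. The sketch below is an illustration only: demo_deref() is a hypothetical stand-in that merely echoes svc_data, deref_alloc() and deref_fn_t are invented names, and the initial size of 16 bytes is arbitrary. Only the prototype shape and the EOVERFLOW behaviour come from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as the reparse_deref() prototype in section 5.1. */
typedef int (*deref_fn_t)(const char *, const char *, char *, size_t *);

/*
 * Hypothetical stand-in for reparse_deref(): it only copies svc_data
 * back to the caller. On a short buffer it stores the required size in
 * *bufsize and returns EOVERFLOW, as the interface description states.
 */
static int
demo_deref(const char *svc_type, const char *svc_data, char *buf,
    size_t *bufsize)
{
	size_t need = strlen(svc_data) + 1;

	(void) svc_type;
	if (*bufsize < need) {
		*bufsize = need;
		return (EOVERFLOW);
	}
	(void) memcpy(buf, svc_data, need);
	return (0);
}

/*
 * Call 'deref' with a small buffer and retry once with the size it
 * reported if the first attempt fails with EOVERFLOW. On success,
 * returns 0 and stores a malloc'd buffer in *out for the caller to free.
 */
static int
deref_alloc(deref_fn_t deref, const char *svc_type, const char *svc_data,
    char **out)
{
	size_t size = 16;	/* arbitrary first guess */
	char *buf, *nbuf;
	int rc;

	assert(deref != NULL && out != NULL);
	if ((buf = malloc(size)) == NULL)
		return (ENOMEM);

	rc = deref(svc_type, svc_data, buf, &size);
	if (rc == EOVERFLOW) {
		/* 'size' now holds the number of bytes required. */
		if ((nbuf = realloc(buf, size)) == NULL) {
			free(buf);
			return (ENOMEM);
		}
		buf = nbuf;
		rc = deref(svc_type, svc_data, buf, &size);
	}
	if (rc != 0) {
		free(buf);
		return (rc);
	}
	*out = buf;
	return (0);
}
```

Passing the dereference routine as a function pointer keeps the sketch independent of libreparse.so, which this case has not yet delivered; a real caller would presumably pass reparse_deref itself.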
    nvlist_t *reparse_init(void);
    int reparse_parse(const char *string, nvlist_t *list);
    void reparse_free(nvlist_t *list);

These routines will be available within the kernel to any service that wishes to examine reparse points without an upcall. Memory allocation is the same as previously described.

5.3 Upcall Daemon and CLI

The upcall daemon will link with libreparse.so to make use of the dereference call. A CLI, rp, will support automated tests only, with the following actions:

    rp add      Add an entry to a reparse point, creating it if necessary.
    rp remove   Remove an entry from the reparse point, removing the reparse point if the last entry was removed.
    rp list     List the type and data fields in the reparse point.
    rp deref    Show the data that would be returned for the given service type.

5.4 Service Type Delivery

Services making use of reparse points will deliver kernel code to dereference them. This kernel code will read the symlink text and call reparse_kderef() with the text to access the desired data. A svc_type argument will provide for cases where a reparse point has multiple parts (e.g. if a service intends to provide both NFS and SMB referrals). reparse_kderef() will make a door upcall to return the service data. The upcall will be implemented in a new daemon, /usr/lib/reparse/reparsed.

Services making use of reparse points will also provide a shared object plugin to provide service specific behavior. The plugin architecture is modelled after the existing implementation used by sharemgr(1M). libreparse.so will scan the /usr/lib/reparse directory for file names of the form libreparse_xxx.so, where xxx is a short name defined by the service, for example, libreparse_nfs.so and libreparse_smb.so.
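The file name convention used for plugin discovery can be sketched as a small matching routine. This is an assumption-laden illustration rather than the libreparse.so implementation: plugin_short_name() is a hypothetical helper, and the sketch only checks the libreparse_ prefix and .so suffix without restricting which characters may appear in the short name.

```c
#include <assert.h>
#include <string.h>

#define	PLUGIN_PREFIX	"libreparse_"
#define	PLUGIN_SUFFIX	".so"

/*
 * Hypothetical helper: if 'name' has the form libreparse_xxx.so with a
 * non-empty xxx that fits in 'svc' (including the terminating NUL),
 * copy xxx into 'svc' and return 1. Otherwise return 0 and leave 'svc'
 * untouched.
 */
static int
plugin_short_name(const char *name, char *svc, size_t svclen)
{
	size_t plen = strlen(PLUGIN_PREFIX);
	size_t slen = strlen(PLUGIN_SUFFIX);
	size_t nlen, mid;

	assert(name != NULL && svc != NULL);
	nlen = strlen(name);

	/* Must be longer than prefix plus suffix so that xxx is non-empty. */
	if (nlen <= plen + slen)
		return (0);
	if (strncmp(name, PLUGIN_PREFIX, plen) != 0)
		return (0);
	if (strcmp(name + nlen - slen, PLUGIN_SUFFIX) != 0)
		return (0);

	mid = nlen - plen - slen;
	if (mid >= svclen)
		return (0);
	(void) memcpy(svc, name + plen, mid);
	svc[mid] = '\0';
	return (1);
}
```

A directory scan would presumably apply such a check to each entry returned by readdir(3C) before attempting to open the candidate with dlopen(3C); that step is omitted here.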
libreparse.so will define a versioned ops table of the form:

    #define REPARSE_PLUGIN_V1 1

    typedef struct reparse_plugin_ops {
        int rpo_version; /* version number */
        int (*rpo_init)(void);
        void (*rpo_fini)(void);
        char *(*rpo_svc_types)(void);
        boolean_t (*rpo_supports_svc)(const char *);
        char *(*rpo_form)(const char *, const char *);
        int (*rpo_deref)(const char *, const char *, char *, size_t *);
    } reparse_plugin_ops_t;

Each candidate plugin will be opened and validated; a valid plugin must define an ops structure named reparse_plugin_ops, for example:

    reparse_plugin_ops_t reparse_plugin_ops = {
        REPARSE_PLUGIN_V1,
        nfs_init,
        nfs_fini,
        nfs_svc_types,
        nfs_supports_svc,
        nfs_form,
        nfs_deref
    };

The version 1 ops table supports the following operations:

    int (*rpo_init)(void);

This is a one-time initialization function that will be called by libreparse.so upon loading the plugin, prior to any other operations. This provides the plugin with an opportunity to perform service specific initialization. This function must return zero on success or a non-zero errno value to indicate an error.

    void (*rpo_fini)(void);

This is a one-time termination function that will be called by libreparse.so prior to closing the plugin. Once called, libreparse.so will not call any other operations on the plugin.

    char *(*rpo_svc_types)(void);

Returns a pointer to a string containing a list of comma separated svc_types. svc_type names are case-insensitive; white space in the returned string is irrelevant and must be ignored by parsers.

    boolean_t (*rpo_supports_svc)(const char *svc_type);

This function will return true if the plugin supports the specified service type, otherwise it must return false.

    char *(*rpo_form)(const char *svc_type, const char *string);

Returns a string with the appropriate service-specific syntax to create a reparse point of the given svc_type, using the string from the reparse_add() call as part of the string.
    int (*rpo_deref)(const char *svc_type, const char *svc_data, char *buf, size_t *bufsize);

Accepts the service-specific item from the reparse point and returns the service-specific data requested. The caller specifies the size of the buffer provided via *bufsize; the routine will fail with EOVERFLOW if the results will not fit in the buffer, in which case *bufsize will contain the number of bytes needed to hold the results.

5.5 Examples

A service would set up a reparse point this way:

    nvlist_t *nvp;
    char *text;
    int rc;

    nvp = reparse_init();
    rc = reparse_add(nvp, "smb-ad", smb_ad_data);
    rc = reparse_add(nvp, "nfs-fedfs", nfs_fedfs_data);
    rc = reparse_unparse(nvp, &text);
    rc = reparse_create(path, text);
    reparse_free(nvp);
    /* use "text" here */
    free(text);

A kernel service would use a reparse point this way:

    char *symtext, *svc_type, *svc_val;
    char *buf;
    fs_locations *fsl;
    XDR xdrs;
    int rc;
    size_t len;
    nvlist_t *nvl;
    nvpair_t *cur;

    if ((nvl = reparse_init()) == NULL)
        return (ENOMEM);

    /*
     * symlink text is present in 'symtext'
     * ex: "@{REPARSE@{nfs-fedfs:keyX}@{smb-ad:keyY}}"
     */
    rc = reparse_parse(symtext, nvl);

    cur = nvlist_next_nvpair(nvl, NULL);
    while (cur != NULL) {
        /* get service type string; "nfs-fedfs" or "smb-ad" */
        svc_type = nvpair_name(cur);
        if (strncasecmp(svc_type, "nfs", 3) == 0) {
            /* get service type data; "keyX" or "keyY" */
            rc = nvpair_value_string(cur, &svc_val);
            break;
        }
        cur = nvlist_next_nvpair(nvl, cur);
    }

    if (cur != NULL) {
        len = RESSZ;
        buf = kmem_alloc(RESSZ, KM_SLEEP);
        /* pass the value found above; the buffer size goes by reference */
        rc = reparse_kderef(svc_type, svc_val, buf, &len);
        if (rc == 0) {
            xdrmem_create(&xdrs, buf, len, XDR_DECODE);
            if (!xdr_fs_locations(&xdrs, fsl))
                rc = ENOMEM;
            xdr_destroy(&xdrs);
        }
        kmem_free(buf, RESSZ);
    }
    /* svc_type and svc_val point into the nvlist, so free it last */
    reparse_free(nvl);

5.6 Interface List

New kernel routines (Consolidation Private):

    int reparse_kderef(const char *, const char *, char *, size_t *);
    nvlist_t *reparse_init(void);
    int reparse_parse(const char *, nvlist_t *);
    void reparse_free(nvlist_t *);

New daemon (Consolidation Private):

    /usr/lib/reparse/reparsed

New library /usr/lib/libreparse.so (Consolidation Private), containing:

    nvlist_t *reparse_init(void);
    int reparse_parse(const char *, nvlist_t *);
    int reparse_add(nvlist_t *, const char *, const char *);
    int reparse_remove(nvlist_t *, const char *);
    int reparse_unparse(const nvlist_t *, char **);
    void reparse_free(nvlist_t *);
    int reparse_create(const char *, const char *);
    int reparse_delete(const char *);
    int reparse_deref(const char *, const char *, char *, size_t *);

Plugins - /usr/lib/reparse/libreparse_*.so (Consolidation Private), containing:

    typedef struct reparse_plugin_ops {
        int rpo_version;
        int (*rpo_init)(void);
        void (*rpo_fini)(void);
        char *(*rpo_svc_types)(void);
        boolean_t (*rpo_supports_svc)(const char *);
        char *(*rpo_form)(const char *, const char *);
        int (*rpo_deref)(const char *, const char *, char *, size_t *);
    } reparse_plugin_ops_t;

CLI Interface (for testing only):

    rp add
    rp remove
    rp list
    rp deref

5.7 Reparse SMF Service

The SMF service is named reparse, with an SMF manifest file installed under /var/svc/manifest/system/filesystem. The service will have a single instance of the reparsed daemon, which is a multi-threaded process.

The service does not require additional administrative intervention for configuration before it starts for the first time, as default values are specified for all properties. The SMF service instance will be delivered with a default status of 'enabled'.

The reparse SMF service does not depend on any other SMF service to start, other than the completion of filesystem-minimal.

The reparse SMF service 'start' method invokes /usr/lib/reparse/reparsed. This starts 'reparsed' as a background process. The 'start' method will specify a method context, which will define a method credential. The daemon will surrender privileges unnecessary for reparsed runtime operation after process startup, as described under RBAC Configuration.
The reparse SMF service 'stop' method calls the ':kill' builtin, which kills all processes (i.e. reparsed) started by the reparse SMF service start method. The reparse SMF service 'stop' method is run if the CIFS SMF service fails or an administrator requests a disable or restart.

The reparse SMF service 'refresh' method calls the ':kill -HUP' method, which sends a SIGHUP signal to the reparsed daemon.

There are no faults for the reparse SMF service that need to be ignored by the re-starter. The reparse SMF service will follow the default 'contract' service model.

5.8 RBAC Configuration

The reparse service RBAC configuration covers the authorization requirements for managing the service state: starting, stopping or refreshing the service.

5.8.1 Authorizations

The following authorization will be added to the auth_attr(4) file:

    solaris.smf.manage.reparse:::Manage Reparse Service States::help=SmfReparseStates.html

The solaris.smf.manage.reparse authorization covers activities that change the state of the service such as starting, stopping or refreshing the reparsed daemon. No property values have been identified for the reparse service.

5.8.2 Profiles

In order to perform reparse administration functions, for example, to start, stop or refresh the service, a user will need the reparse management profile. The following profile will be added to the prof_attr(4) file:

    Reparse Management:::Manage the reparse service:auths=solaris.smf.manage.reparse:help=RtReparseMngmnt.html

The Reparse Management profile will be added to the File System Management profile.

5.8.3 Privileges

The daemon will surrender privileges unnecessary for runtime operation after process startup, for example, PRIV_PROC_EXEC and PRIV_PROC_FORK. The reparsed daemon will run as root:sys; user root is required to write to /var/run, which is writable only by root.

5.9 Auditing

No auditing requirements have been identified for the reparse service.
5.10 Zones, Trusted Extensions and Solaris Virtualization Technologies

The reparse service will only run in the global zone. If an attempt is made to start the service in a non-global zone, the service will exit with an SMF error.

No Trusted Extensions requirements have been identified for the reparse service. There should be no impact on other Solaris virtualization technologies.

6. Security Considerations

Referrals are similar to regular symbolic links in that they are only pointers to data that could be discovered in some other way. The presence of such a pointer does not compromise the security of the target object or data; the target service or file system must still enforce security.

7. Interface Table

                        |Proposed       |Specified |
                        |Stability      |in what   |
Interface Name          |Classification |Document? | Comments
==============================================================================
XAT_REPARSE             |Consolidation  |This      |Reparse extensible
                        |Private        |Document  |attribute
                        |               |          |
VOP_LOOKUP, fop_lookup  |Contracted     |          |Added new argument:
                        |Consolidation  |          |vattr_t *vap
                        |Private        |          |
                        |               |          |
Reparse token syntax    |Committed      |          |
                        |Private        |          |
                        |               |          |
reparse_kderef          |Consolidation  |          |Reparse kernel
reparse_init            |Private        |          |routines
reparse_parse           |               |          |
reparse_free            |               |          |
                        |               |          |
reparse_init            |Consolidation  |          |Reparse library
reparse_parse           |Private        |          |routines
reparse_add             |               |          |
reparse_remove          |               |          |
reparse_unparse         |               |          |
reparse_free            |               |          |
reparse_create          |               |          |
reparse_delete          |               |          |
reparse_deref           |               |          |
                        |               |          |
reparse_plugin_ops_t    |Consolidation  |          |Reparse plugin
                        |Private        |          |ops
                        |               |          |
rp add                  |Consolidation  |          |Reparse CLI
rp remove               |Private        |          |
rp list                 |               |          |
rp deref                |               |          |

Note: There is no need to define a name in the file system to support a user space door service that is called from the kernel, and not defining a name avoids potential security concerns. Thus the door interface is private to the implementation and not mentioned in the interface table.

8. References

[1] Global namespace: The future of distributed file server management
    http://findarticles.com/p/articles/mi_m0BRZ/is_2_23/ai_98709766/

[2] Implementation Guide for Referrals in NFSv4
    http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00

[3] DFS Technical Reference
    http://technet.microsoft.com/en-us/library/cc757042%28WS.10%29.aspx