=== PROBLEM OVERVIEW

nftw(3C) (and nftw64) is a libc routine that implements the
"new file tree walk". It recursively calls walk() to traverse a
directory tree. One main consumer of it is find(1).

There are several flags that control how walk() behaves:

        FTW_MOUNT directs walk() not to cross mountpoints.
        FTW_PHYS directs walk() not to follow symbolic links.

The walk() routine uses stat() to test each component that it
encounters to ensure that it does not violate the requested behavior.
The following code snippet succinctly captures the security test and
the window of opportunity:

        struct stat     statPre;
        struct stat     statFile;
        DIR             *pdir;

        stat(szPath, &statPre);
        pdir = opendir(szPath);
        fstat(pdir->dd_fd, &statFile);

        if (statPre.st_ino != statFile.st_ino ||
            statPre.st_dev != statFile.st_dev) {
                return(EAGAIN);
        }

There is a window between the stat() and opendir() calls in which the
user might move directory contents (an innocent case we need to avoid)
or use a symlink to get outside of the directory hierarchy (a security
breach). If the results of the stat() do not match those of the
fstat(), walk() assumes that there is some problem and returns to the
caller. find(1), for example, will report:

        find: cannot open /mnt: Resource temporarily unavailable

A problem with this test occurs when the filesystem is of type
"autofs" (PSARC 1992/024). In that case, the directory entry whose
name is given by szPath is a trigger mount - a mount occurs when the
directory is entered. By definition, getting attributes on the
directory (i.e., stat()) does not constitute entering the directory,
but the opendir() does, and it triggers an autofs mount.

This leads to a false positive. The code is not able to detect that
a trigger mount occurred beneath it, so st_ino and st_dev are expected
not to match even though nothing is wrong.

As expected, if the user were to immediately retry the application,
it would now succeed. The mount has been established and the results
from the stat() will match those of the fstat().
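The stat()/opendir()/fstat() comparison described in this overview can be sketched as a standalone function. This is only an illustration, not the libc source: the name open_dir_checked() is hypothetical, error handling has been added, and the portable POSIX dirfd() is assumed in place of the Solaris-specific pdir->dd_fd member.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for dirfd() */
#endif

#include <dirent.h>
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: open szPath as a directory and verify that the
 * directory actually opened is the same object that stat() saw just
 * before the open. Returns 0 and sets *ppdir on success; otherwise
 * returns an errno value and leaves *ppdir set to NULL.
 */
int
open_dir_checked(const char *szPath, DIR **ppdir)
{
	struct stat	statPre;
	struct stat	statFile;
	DIR		*pdir;
	int		err;

	*ppdir = NULL;

	/* Attributes by name, before the directory is entered. */
	if (stat(szPath, &statPre) != 0)
		return (errno);

	/* The window of opportunity lies between stat() and opendir(). */
	pdir = opendir(szPath);
	if (pdir == NULL)
		return (errno);

	/* Attributes of the object that was actually opened. */
	if (fstat(dirfd(pdir), &statFile) != 0) {
		err = errno;
		(void) closedir(pdir);
		return (err);
	}

	/* A mismatch means the name no longer refers to the same object. */
	if (statPre.st_ino != statFile.st_ino ||
	    statPre.st_dev != statFile.st_dev) {
		(void) closedir(pdir);
		return (EAGAIN);
	}

	*ppdir = pdir;
	return (0);
}
```

A caller would treat EAGAIN from this helper the way find(1) does in the report quoted above. On a trigger mount that has not yet been mounted, this sketch would return EAGAIN on the first call and succeed on a retry.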
The current code addresses this by doing a strcmp() on st_fstype to
determine if it is an autofs filesystem (see fix 6198351). If so, then
statPre is refreshed after the opendir(). This was deemed safe in
that the kernel owns the contents of the autofs filesystem. While
the kernel does own the contents, changes can still occur, for
example in /net, that would allow a symlink-type exploit against a
directory which was a trigger mount. By unilaterally allowing the
exception, we blind ourselves to this act.

If we add the test from the current code for nftw()/walk(), the code
snippet would now look like this:

        struct stat     statPre;
        struct stat     statFile;
        DIR             *pdir;

        stat(szPath, &statPre);
        pdir = opendir(szPath);

        if (statPre.st_fstype[0] == 'a' &&
            strcmp(statPre.st_fstype, "autofs") == 0) {
                /*
                 * this dir is on autofs
                 */
                fstat(pdir->dd_fd, &statPre);
        }

        fstat(pdir->dd_fd, &statFile);

        if (statPre.st_ino != statFile.st_ino ||
            statPre.st_dev != statFile.st_dev) {
                return(EAGAIN);
        }

With the addition of mirror mounts for NFSv4 (see PSARC 2007/416), we
have another case where trigger mounts can cause a false positive.
Also note that other NFSv4 features, such as referrals and migration,
will employ trigger mounts as the integral interface to remote
filesystems.

We could once again try checking st_fstype, this time for "nfs4", to
grant an exception, but this check will fail for these reasons:

1) st_fstype for "nfs3" and "nfs4" is truncated to "nfs" for backwards
   compatibility with 3rd party applications. I.e., this would lead to
   us allowing exemptions for all directory entries on all versions of
   nfs. The problem is that we should only allow exemptions for
   directories which are "nfs4" and mirror mount trigger points.

2) Not all nfs filesystems are strictly controlled by the kernel, as
   the autofs filesystem is. I.e., it is possible for a user
   application to mangle the directory tree.
The point here is that an autofs filesystem is not directly
writable by the user. The only objects in an autofs filesystem are
automount trigger points, and they cannot be manipulated. The user
cannot move directory hierarchies around in an autofs filesystem, so
walk() can be a bit relaxed. With an nfs filesystem, walk() does not
have that luxury.

=== PROPOSED SOLUTION

The solution is to atomically force the stat() call to mount the
new filesystem and return the attributes of the root vnode rather than
those of the vnode which has been mounted on. This guarantees that
we have accurate information and that we do not introduce further
windows of opportunity.

In order to do this, we propose to add a new bit, _AT_TRIGGER, to the
flags field of the parameters to fstatat(2). This information would be
passed down through the system call and into the VOP interface to
VOP_GETATTR() using a new bit, ATTR_TRIGGER. This bit informs the file
system that, if the vnode is a trigger mount, the file system should
perform the mount before performing the operation.

In particular, we would add to sys/fcntl.h:

        #define _AT_TRIGGER     0x2

And we would add to sys/vnode.h:

        #define ATTR_TRIGGER    0x40    /* If vnode is a trigger mount, mount first */

The code snippet would now look like this:

        struct stat     statPre;
        struct stat     statFile;
        DIR             *pdir;

        fstatat(AT_FDCWD, szPath, &statPre, _AT_TRIGGER);
        pdir = opendir(szPath);
        fstat(pdir->dd_fd, &statFile);

        if (statPre.st_ino != statFile.st_ino ||
            statPre.st_dev != statFile.st_dev) {
                return(EAGAIN);
        }

Note that this proposal provides better security than what is
currently in Nevada. Resetting the stat buffer as an exception did
introduce a hole. This proposal closes that hole, but does not seek
to address any other security issue. In particular, the code will
react to a compromised server in exactly the same way as before.
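The effect of the proposed bit can be illustrated with a small user-space simulation. This is a sketch under stated assumptions, not kernel code: struct sim_vnode, struct sim_attr and every sim_*() function are hypothetical stand-ins invented for illustration, and only the ATTR_TRIGGER value comes from this proposal. The model shows that an attribute fetch without the bit sees the covered vnode while a later open sees the mounted root, whereas a fetch with the bit mounts first, so both calls see the same root.

```c
#include <stddef.h>

#define	ATTR_TRIGGER	0x40	/* value proposed for sys/vnode.h */

/* Hypothetical, much-simplified stand-in for a vnode. */
struct sim_vnode {
	unsigned long		ino;
	unsigned long		dev;
	int			is_trigger;	/* entering it causes a mount */
	struct sim_vnode	*mounted_root;	/* root mounted here, or NULL */
	struct sim_vnode	*pending_root;	/* root a trigger would mount */
};

/* Hypothetical stand-in for the st_ino/st_dev pair of struct stat. */
struct sim_attr {
	unsigned long	ino;
	unsigned long	dev;
};

/* Entering a node: perform the trigger mount if needed, return the root. */
static struct sim_vnode *
sim_enter(struct sim_vnode *vp)
{
	if (vp->is_trigger && vp->mounted_root == NULL)
		vp->mounted_root = vp->pending_root;
	return (vp->mounted_root != NULL ? vp->mounted_root : vp);
}

/* Simulated attribute fetch by name; flags may contain ATTR_TRIGGER. */
void
sim_getattr(struct sim_vnode *vp, struct sim_attr *ap, int flags)
{
	struct sim_vnode *src = vp;

	if (flags & ATTR_TRIGGER)
		src = sim_enter(vp);		/* mount first, then report */
	else if (vp->mounted_root != NULL)
		src = vp->mounted_root;		/* already mounted earlier */

	ap->ino = src->ino;
	ap->dev = src->dev;
}

/* Simulated opendir() followed by fstat(): always enters the node. */
void
sim_open_getattr(struct sim_vnode *vp, struct sim_attr *ap)
{
	struct sim_vnode *src = sim_enter(vp);

	ap->ino = src->ino;
	ap->dev = src->dev;
}

/* The comparison walk() performs on the two attribute sets. */
int
sim_same(const struct sim_attr *a, const struct sim_attr *b)
{
	return (a->ino == b->ino && a->dev == b->dev);
}
```

In this model, calling sim_getattr() with flags of 0 on an unmounted trigger node and then sim_open_getattr() yields a mismatch, which corresponds to the EAGAIN false positive. Passing ATTR_TRIGGER to the first call makes the two results agree without any exception based on filesystem type.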
=== EXPORTED INTERFACE TABLE

                |Proposed       |Specified |
                |Stability      |in what   |
 Interface Name |Classification |Document? | Comments
===============================================================================
                |               |          |
 _AT_TRIGGER    | Committed     | This     | New bit value
                |               | Document | for flag parameter
                |               |          | to fstatat()
                |               |          |
 ATTR_TRIGGER   | Consolidation |          | New bit passed to
                | Private       |          | VOP_GETATTR()
                |               |          | indicating that the
                |               |          | file system should
                |               |          | mount the new FS if
                |               |          | the vnode is a
                |               |          | trigger mount.