=== PROBLEM OVERVIEW

nftw(3C) (and nftw64) is the libc routine that implements the "new
file tree walk". It recursively calls walk() to traverse a directory
tree. One of its main consumers is find(1).

There are several flags that control how walk() behaves:

        FTW_MOUNT directs walk() not to cross mountpoints

        FTW_PHYS directs walk() not to follow symbolic links.

The walk() routine uses stat() to test each component that it
encounters to ensure that it does not violate the requested behavior.

The following code snippet succinctly captures the security test and
the window of opportunity:

        struct stat     statPre;
        struct stat     statFile;
        DIR             *pdir;

        stat(szPath, &statPre);
        pdir = opendir(szPath);
        fstat(pdir->dd_fd, &statFile);

        if (statPre.st_ino != statFile.st_ino ||
            statPre.st_dev != statFile.st_dev) {
                return(EAGAIN);
        }

There is a window between the stat() and opendir() calls when the user
might move directory contents (an innocent case we need to avoid) or
use a symlink to get outside of the directory hierarchy (a security
breach).  If the results of the stat() do not match those of the
fstat(), then assume that there is some problem and return to the
caller.

find(1) will for example report:

        find: cannot open /mnt: Resource temporarily unavailable
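
The same check can be written against portable interfaces. The
following is a minimal sketch, not the libc walk() code: the helper
name open_dir_checked() is hypothetical, and dirfd() is assumed as the
portable equivalent of reading pdir->dd_fd directly.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not part of libc): open `path` as a directory
 * only if it is still the object that stat() saw.  Returns 0 and
 * stores the stream in *out, EAGAIN if the identity changed inside
 * the window, or the errno of the call that failed.
 */
int
open_dir_checked(const char *path, DIR **out)
{
        struct stat pre, post;
        DIR *dir;
        int err;

        assert(path != NULL && out != NULL);
        *out = NULL;

        if (stat(path, &pre) != 0)
                return (errno);
        if ((dir = opendir(path)) == NULL)
                return (errno);

        /* dirfd() stands in for pdir->dd_fd in the snippet above. */
        if (fstat(dirfd(dir), &post) != 0) {
                err = errno;
                (void) closedir(dir);
                return (err);
        }

        /* Same test as walk(): inode and device must both match. */
        if (pre.st_ino != post.st_ino || pre.st_dev != post.st_dev) {
                (void) closedir(dir);
                return (EAGAIN);
        }

        *out = dir;
        return (0);
}
```

A caller would treat EAGAIN as "the directory changed underneath us"
and report it, which is where the find(1) message above comes from.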

A problem with this test occurs when the filesystem is of type "autofs"
(PSARC 1992/024). In that case, the directory entry, whose name is
given by szPath, is a trigger mount - a mount occurs when the                  |
directory is entered.  By definition, getting attributes on the
directory (i.e., stat()) does not constitute entering the directory,
but the opendir() does, which triggers an autofs mount.

This leads to a false positive. The code cannot detect that a trigger
mount occurred beneath it, so the st_ino and st_dev values are
expected not to match. If the user were to immediately retry the
application, it would succeed: the mount has been established and
the results from the stat() will match those of the fstat().

The current code addresses this by doing a strcmp() on st_fstype to
determine if it is an autofs filesystem (see fix 6198351). If so, then
statPre is refreshed after the opendir(). This was deemed safe in
that the kernel owns the contents of the autofs filesystem. While
the kernel does own the contents, changes can still occur, for
example in /net, that would allow a symlink-style exploit against a
directory which was a trigger mount. By unconditionally allowing the
exception, we blind ourselves to such an attack.

If we add the test from the current code for nftw()/walk(), the code
snippet would now look like this:

        struct stat     statPre;
        struct stat     statFile;
        DIR             *pdir;

        stat(szPath, &statPre);
        pdir = opendir(szPath);

        if (statPre.st_fstype[0] == 'a' &&
            strcmp(statPre.st_fstype, "autofs") == 0) {
                /*
                 * this dir is on autofs
                 */
                fstat(pdir->dd_fd, &statPre);
        }

        fstat(pdir->dd_fd, &statFile);

        if (statPre.st_ino != statFile.st_ino ||
            statPre.st_dev != statFile.st_dev) {
                return(EAGAIN);
        }

With the addition of mirror mounts for NFSv4 (see PSARC 2007/416), we
have another case where trigger mounts can cause a false positive.
Also note that other NFSv4 features, such as referrals and migration
will employ trigger mounts as the integral interface to remote
filesystems.

We could once again try checking the st_fstype, this time for "nfs4",
to grant an exception, but this check will fail for these reasons:

    1) st_fstype for "nfs3" and "nfs4" is truncated to "nfs" for
    backwards compatibility with 3rd party applications. I.e., this
    would lead to us allowing exemptions for all directory entries on
    all versions of nfs.

    The problem is that we should only allow exemptions for directories
    which are "nfs4" and mirror mount trigger points.

    2) nfs filesystems are not strictly controlled by the kernel the
    way the autofs filesystem is. I.e., it is possible for a user
    application to mangle the directory tree.

    The point here is that an autofs filesystem is not directly
    writable by the user. The only objects in an autofs filesystem are
    automount trigger points, and they cannot be manipulated.

    The user cannot move directory hierarchies around in an autofs
    filesystem, so walk() can be a bit relaxed. With an nfs filesystem,
    walk() does not have that luxury.


=== PROPOSED SOLUTION

The solution is to have the stat() call atomically mount the new
filesystem and return the attributes of its root vnode rather than
those of the vnode which has been mounted on. This guarantees that
we have accurate information and do not introduce further windows
of opportunity.

In order to do this, we propose to add a new bit, _AT_TRIGGER, to the          |
flags field of the parameters to fstatat(2). This information would be         |
passed down through the system call and into the VOP interface to              |
VOP_GETATTR() using a new bit, ATTR_TRIGGER.  This bit informs the file        |
system that, if the vnode is a trigger-mount, the file system should           |
mount the file system before performing the operation.                         |

In particular, we would add to sys/fcntl.h:

#define _AT_TRIGGER                      0x2

And we would add to sys/vnode.h:

#define ATTR_TRIGGER      0x40 /* If vnode is a trigger mount, mount first */

The code snippet would now look like this:

        struct stat     statPre;
        struct stat     statFile;
        DIR             *pdir;

        fstatat(AT_FDCWD, szPath, &statPre, _AT_TRIGGER);
        pdir = opendir(szPath);

        fstat(pdir->dd_fd, &statFile);

        if (statPre.st_ino != statFile.st_ino ||
            statPre.st_dev != statFile.st_dev) {
                return(EAGAIN);
        }
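
To illustrate why the flag removes the false positive, the following
is a small user-space model of the behavior described above. It is
only a sketch: every toy_* name and TOY_ATTR_TRIGGER is hypothetical,
and none of this is kernel or libc code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ATTR_TRIGGER        0x40    /* mirrors the proposed ATTR_TRIGGER */

struct toy_attr {
        unsigned long   ino;
        unsigned long   dev;
};

struct toy_vnode {
        struct toy_attr         attr;           /* attributes of this vnode */
        int                     is_trigger;     /* entering it mounts */
        struct toy_vnode        *covered_by;    /* mounted root, or NULL */
        struct toy_vnode        *mount_target;  /* root to mount on entry */
};

/* Perform the trigger mount if needed; return the vnode now visible. */
static struct toy_vnode *
toy_enter(struct toy_vnode *vp)
{
        if (vp->is_trigger && vp->covered_by == NULL)
                vp->covered_by = vp->mount_target;
        return (vp->covered_by != NULL ? vp->covered_by : vp);
}

/* Model of getattr: mounts first only when TOY_ATTR_TRIGGER is set. */
void
toy_getattr(struct toy_vnode *vp, int flags, struct toy_attr *out)
{
        assert(vp != NULL && out != NULL);
        if (flags & TOY_ATTR_TRIGGER)
                vp = toy_enter(vp);
        else if (vp->covered_by != NULL)
                vp = vp->covered_by;    /* already mounted: see the root */
        *out = vp->attr;
}

/* Model of opendir() + fstat(): entering always triggers the mount. */
void
toy_open_getattr(struct toy_vnode *vp, struct toy_attr *out)
{
        assert(vp != NULL && out != NULL);
        *out = toy_enter(vp)->attr;
}

/* The walk() comparison: nonzero if both snapshots name one object. */
int
toy_same(const struct toy_attr *a, const struct toy_attr *b)
{
        return (a->ino == b->ino && a->dev == b->dev);
}
```

In this model, calling toy_getattr() with flags of 0 on an unmounted
trigger vnode and then toy_open_getattr() yields mismatched
attributes, which is the false positive. Passing TOY_ATTR_TRIGGER
makes both calls report the mounted root, so the comparison succeeds
without any filesystem-type exception.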

Note that this proposal provides better security than what is
currently in Nevada. Resetting the stat buffer as an exception
introduced a hole. This proposal closes that hole, but does not seek
to address any other security issue. In particular, the code will
react to a compromised server in exactly the same way as before.


=== EXPORTED INTERFACE TABLE

                        |Proposed        |Specified       |                    
                        |Stability       |in what         |                    
Interface Name          |Classification  |Document?       | Comments           
===============================================================================
                        |                |                |                    |
_AT_TRIGGER             | Committed      | This           | New bit value      |
                        |                | Document       | for flag parameter |
                        |                |                | to fstatat()       |
                        |                |                |                    |
ATTR_TRIGGER            | Consolidation  |                | New bit passed to  |
                        | Private        |                | VOP_GETATTR()      |
                        |                |                | indicating that the|
                        |                |                | file system should |
                        |                |                | mount the new FS if|
                        |                |                | the vnode is a     |
                        |                |                | trigger mount.     |