Unified POSIX and Windows Credentials for Solaris
Mike Shapiro (mws@sun.com)  Draft 5 1Nov07

0. Contents

    1. Introduction
    2. Problem Overview
    3. Analysis of Samba ID Mapping
        (a) Name Mapping
        (b) Algorithmic Mapping
        (c) Centralized Identity
    4. Analysis of NetApp ID Mapping
    5. Proposed Solution Constraints
    6. Proposed Solution for Solaris
    7. Canonical SID Representation
    8. Filesystem SID Representation
    9. Credential SID Representation
    10. POSIX ID Partition and Mapping
    11. ID Mapping Service
    12. CIFS Implementation
    13. NFSv3 Implementation
    14. NFSv4 Implementation
    15. Legacy Solaris Filesystems
    16. Solaris Data Formats
    17. Backup Formats
    18. Zones Integration
    19. Case Summary and Next Steps
    20. Acknowledgements
    21. References

1. Introduction

As part of Sun's effort to enhance the capabilities of the OpenSolaris system,
the Sun Software organization is working to integrate support for CIFS,
based on the stack acquired from Procom, into Solaris.  The Procom CIFS stack
previously ran on Procom's operating system, code-named Montana, currently
the basis of the Sun StorageTek 5310 and 5320 NAS products.  In order to port
a CIFS stack to Solaris, and more generally provide support for Windows-style
RPC services, Solaris must also provide support for Windows user credentials,
which have a significantly different design than traditional POSIX uids/gids.

The objective of this paper is to discuss the underlying design issue in terms
of our long-term strategic objectives for Solaris, and to make recommendations
for a set of technical changes and enhancements that will span the CIFS effort,
underlying filesystems (particularly ZFS), authentication and ID mapping
services, and kernel and user credential representation, in order to make
the CIFS effort successful in CY07 and provide a strategic foundation for
future work.  This paper also describes the history and details of the
approaches to this problem adopted by the two most successful technologies
existing in the market: NetApp's filers, and the Samba OpenSource project,
in order to validate our approach and compare it to these existing products.

This paper stops short of actually specifying the complete project-level
details of all of the changes; these details should be specified in subsequent
PSARC cases.  A summary of the set of decisions corresponding to this case
and details that should be covered in subsequent cases is found in Section 19.

2. Problem Overview

Since the dawn of time, UNIX systems have represented users and groups using
names stored in passwd(4) and group(4) that are converted to integer uid and
gid values used in system call interfaces, the kernel, and in the filesystem.
Although the size of the type has changed over time, the fundamental design
remains the same: a single UNIX system views the set of user identifiers as a
linear sequence of integers in some range, and the ownership of files is
represented by storing the integer identifiers with the file.  The system
administrator is responsible for configuring the name service and nsswitch
so that the appropriate set of identifiers is visible for uid/gid <-> name
mapping, and to provide authentication services for establishing credentials.

However, the consequence of the UNIX/POSIX user identification model is that
when data is moved between machines over the network, by archiving files and
then restoring them, or by physically moving disk drives, the identity of the
file owner is evaluated within the single integer namespace maintained by the
name service of the destination machine.  That is, if I move a file with uid
12345 to a different machine (by NFSv3, which sends UIDs over the wire, by
cpio(1) which stores UIDs in archives, or by moving a disk with a UFS or ZFS
filesystem on it), then that file is now owned by whatever user has UID 12345
according to the new machine's name service configuration.  If the system
administrator there is using a different passwd(4) file or a different name
service, then that data may now be owned by a semantically different user.
The burden of preventing and/or solving this problem is left to administrators.
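
To make the hazard concrete, here is a minimal Python sketch of the same
stored UID being resolved on two machines.  The two passwd tables and the
user names in them are made up for illustration; they stand in for whatever
name service each machine happens to be configured with.

```python
# Hypothetical passwd(4) contents on two machines (illustrative only).
PASSWD_MACHINE_A = {0: "root", 12345: "mws"}
PASSWD_MACHINE_B = {0: "root", 12345: "jdoe"}


def owner_name(uid, passwd):
    """Resolve a stored uid against one machine's name service."""
    # Fall back to the bare number, as ls(1) does for unknown uids.
    return passwd.get(uid, str(uid))


# The file, NFSv3 request, or cpio archive carries only the integer.
stored_uid = 12345
print(owner_name(stored_uid, PASSWD_MACHINE_A))
print(owner_name(stored_uid, PASSWD_MACHINE_B))
```

The integer never changes; only its interpretation does, which is exactly
the property that makes the data's ownership depend on the destination.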

The UNIX/POSIX UID model also means that a single system and any persistent
representation derived from it can only support a flat namespace of users of
some fixed size.  For large enterprises, this can become, and historically
has become, cumbersome, leading to a series of painful incompatible changes
to the size and limits associated with the uid_t and gid_t types.  Here is a
historical view of the size and legitimate values supported by uid_t on UNIX
systems:

6th Edition UNIX: 8-bit signed, no restrictions on values
AT&T SVR3: 16-bit unsigned, no restrictions on values
4.3 BSD: 16-bit unsigned, no restrictions on values
SunOS 4.1.3: 16-bit unsigned, values ranging from [0, 65533]
Solaris 2.0: 32-bit signed, values ranging from [0, 60002]
Solaris 2.5: 32-bit signed, values ranging from [0, 60002]
Solaris 2.5.1: 32-bit signed, values ranging from [0, 0x7fffffff]
Solaris 10: 32-bit signed, values ranging from [0, 0x7fffffff]
Linux as of Jan 2007: 32-bit unsigned, no restrictions on values
FreeBSD as of Jan 2007: 32-bit unsigned, no restrictions on values

In 1995, Sun went through the pain of extending UIDs up to 0x7fffffff [A].
In 1998, Sun received the first large-scale customer escalation (Boeing) of
the need to map multiple large UID spaces on to the single UID space of a
single NFS server [B].  Amazingly, this problem had been known about and
considered as early as a decade before: the 1988 edition of the AT&T SVID
includes a description of idload(RS_CMD), a command used by administrators to
load a uid/gid translation table for Remote File Sharing into the system.
Today, Sun also wants to position Solaris and its Trusted Extensions as a way
to meet modern security and SOX compliance requirements, where enterprises
wish to assign single identities to employees to support one-click hire/fire
and implement restrictions on and auditing of information visible to certain
identities.  And we want to support CIFS as a first-class citizen in Solaris.

In Windows, both for use in CIFS and elsewhere in the operating system, user
credentials consist of Security Identifiers (SIDs) that represent users and
groups.  These are stored in the Windows equivalent of a kernel credential,
in the Windows filesystem, and in filesystem data structures such as Access
Control Entries (ACEs).  A Windows SID typically looks something like this:

S-1-5-12-7623811015-3361044348-030300820-1013

and decomposes into the following pieces:

S - The string is a SID
1 - The revision level (1 is the only value in present use)
5 - 48-bit identifier authority value (5 refers to "Windows NT")
12-7623811015-3361044348-030300820 - Identifier for domain or local computer
1013 - 32-bit Relative ID (RID) within the previously described domain
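
The decomposition above can be sketched in a few lines of Python.  This is
only an illustration of the string layout described here, not a validator
for real Windows SIDs; the Sid type and parse_sid() name are my own.

```python
from typing import NamedTuple


class Sid(NamedTuple):
    revision: int
    authority: int
    domain: str   # kept as text so the original digits are preserved
    rid: int


def parse_sid(text):
    """Split an S-<rev>-<authority>-<domain...>-<rid> string into pieces."""
    parts = text.split("-")
    if len(parts) < 4 or parts[0] != "S":
        raise ValueError("not a Windows-style SID: %r" % text)
    return Sid(revision=int(parts[1]),
               authority=int(parts[2]),
               domain="-".join(parts[3:-1]),
               rid=int(parts[-1]))


sid = parse_sid("S-1-5-12-7623811015-3361044348-030300820-1013")
print(sid.domain, sid.rid)
```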

In other words, an SID is a universally unique identifier for a user
or group.  (Some generic terms are UUID and GUID; I'll just use SID here.)
As in the UNIX system, some well-known SIDs are defined, as shown in [1].
Every Windows system computes a universally unique identifier for itself,
so it can allocate unique local user accounts, and when one or more Windows
systems form an Active Directory domain, a universally unique identifier is
allocated for the domain itself, from which Domain Controllers can allocate
unique RIDs for new user accounts.  See [2] for a good, brief AD overview.

Therefore, unlike UNIX systems, when data is moved between Windows systems,
the ownership attributes and ACEs retain their semantic meaning, regardless
of the Active Directory configuration of the destination system.  If data is
moved to a system that is not participating in the original AD domain, then
only the administrator can access the data, or perform the Windows equivalent
of a chown to change ownership to one of the users within the new domain.
Furthermore, the system effectively supports an infinite number of unique
users, as long as they are in sets of 4 billion per system or per AD domain.

Finally, in order to support CIFS, NFS, and POSIX system call semantics,
Solaris must address the issue of providing a mapping between Windows SIDs
and POSIX UIDs and GIDs.  That is, if a CIFS session is initiated and creates
a file on a ZFS filesystem, that file's attributes should be able to be read
and written using the POSIX identity model, including local stat(2), access
over NFS, and so forth.  Similarly, files created locally or over NFS using
POSIX credentials should be accessible to remote clients using CIFS.

Three well-known solutions to this problem exist in the industry:

(a) Storing SIDs in the filesystem, and providing a mapping of local SIDs on
    the server to UIDs exported to remote UNIX clients, which is the approach
    taken by Microsoft Windows as part of its Windows Services for UNIX (SFU).

(b) Storing UIDs in the filesystem, and providing a mapping of remote SIDs
    to local UIDs for use in the normal UNIX stack, which is the approach
    taken by the OpenSource Samba software (see [4], [5], and [6]).  Sun's
    Winchester project, targeting Solaris Nevada, proposes a similar approach
    (see opensolaris.org/os/project/winchester); we'll return to the details
    of the Winchester project later in this document in the Proposal section.

(c) Storing both UIDs and SIDs in the filesystem and/or a supplementary cache,
    and managing a set of mappings between them for use in the normal UNIX
    stack and for NFS, which is the approach taken by NetApp in WAFL / OnTAP
    (see [7], [8], and [9] for the evolution of NetApp's approach here).

    Apple has in effect adopted a similar approach in MacOS X 10.4, using
    128-bit GUIDs for user identity, and then using participation in a 
    directory service to locate POSIX or Windows credentials (see [E]).

Option (a) alone isn't a valid choice for Solaris since we must also address
issues of backward compatibility and we must support local POSIX system calls.
The rest of this document discusses the details and tradeoffs implied by the
existing approaches (b) and (c) and suggests a strategic direction for Solaris.

3. Analysis of Samba ID Mapping

Samba (www.samba.org) is an OpenSource SMB server that provides Windows
clients access to UNIX servers and vice-versa, both for files and for other
services including printer sharing etc.  Samba's design goal / constraint was
to operate on top of a variety of existing UNIX systems, running as a set of
userland daemons and services, without needing to modify existing kernels.
This design deserves much of the credit for Samba's widespread use.
However, one consequence is that Samba is by definition constrained to the
uid/gid model presented by the UNIX system on which it is deployed.  That is,
uid values must conform to the range supported by the underlying kernel, and
all filesystems, data formats, and UNIX network services on the system remain
unchanged and follow the POSIX rather than Windows design.  Once data is 
written to a filesystem, it contains a POSIX uid/gid pair only: if data is
moved to a different system, the POSIX rule of relying on administrative
configuration to preserve identity semantics still applies.

In order to support Windows SIDs on POSIX systems, Samba relies on the UNIX
administrator to configure a mapping service that converts incoming SIDs to
POSIX uids/gids and vice-versa.  Luke Leighton described Samba's approach to
the problem in a now-expired 1999 Internet Draft [4].  Today's Samba offers
administrators a wide variety of approaches to the mapping problem; the full
details are found in [5] and [6].  These details can be condensed to the
following basic set of approaches to the mapping problem:

(a) Name Mapping

Administrators can create name equivalence between local Windows or AD users
and users described by a UNIX name service backend.  For example, an admin
could configure both Windows AD and UNIX LDAP or NIS to contain the complete
set of user names for a given enterprise, and then configure Samba so that when
Windows user SUN\mws establishes a session on a UNIX server, Samba looks up
the UNIX user "mws" in passwd(4) (via the LDAP or NIS server) to get a UID.

Name Mapping can also be extended by a set of static rewriting rules; both
Samba and NetApp (discussed in Section 4) support static rewriting for (a).
Static rewriting rules are useful when users have different names across the
enterprise, due to disparate naming conventions, such as when companies merge.
For example, a static rewriting rule could establish that Windows user
"SUN\MikeS" is in fact the same user as NIS user "mws", and then Samba will
apply this static rewriting prior to performing the UNIX name service lookup.
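
As a rough sketch of Name Mapping with static rewriting, consider the
following Python.  The rule table, the passwd stand-in, and the UID 12345
are assumptions made for the example; neither Samba's nor NetApp's actual
configuration syntax is being modelled.

```python
# Static rewriting rules, consulted before the name service (hypothetical).
REWRITE_RULES = {"SUN\\MikeS": "mws"}

# Stand-in for passwd(4) as served by LDAP or NIS (hypothetical uid).
PASSWD = {"mws": 12345}


def map_windows_user(windows_name):
    """Return the POSIX uid for a DOMAIN\\user name, or None if unmapped."""
    unix_name = REWRITE_RULES.get(windows_name)
    if unix_name is None:
        # No static rule: assume name equivalence and drop the domain.
        unix_name = windows_name.split("\\", 1)[-1]
    return PASSWD.get(unix_name)


print(map_windows_user("SUN\\mws"))
print(map_windows_user("SUN\\MikeS"))
```

Both lookups end at the same passwd entry; the rewriting rule only changes
which name is handed to the UNIX name service.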

(b) Algorithmic Mapping

Administrators can manually partition the POSIX UID space by creating a set of
algorithmic mapping rules for SIDs (based on the encoded RID) to a portion of
the POSIX UID space.  For example, an administrator can configure:

idmap backend = idmap_rid:SUN=70000-80000

indicating that the "SUN" domain should be mapped to UIDs [70,000-80,000].
When Samba encounters the SID S-1-5-21-34567898-12529001-32973135-1234 from
this domain, the resulting POSIX UID will be 70000 + 1234 = 71234.
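
The arithmetic is simple enough to sketch directly.  The Python below
assumes only what is described here (range base plus RID, bounded by the
top of the range); the real idmap_rid backend may apply further rules, and
the function name is mine.

```python
def rid_to_uid(sid, low, high):
    """Map a SID into [low, high] by adding its RID to the range base."""
    rid = int(sid.rsplit("-", 1)[1])
    uid = low + rid
    if uid > high:
        raise ValueError("RID %d does not fit in %d-%d" % (rid, low, high))
    return uid


print(rid_to_uid("S-1-5-21-34567898-12529001-32973135-1234", 70000, 80000))
```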

(c) Centralized Identity

Since Windows AD is based on an LDAP directory, administrators can use AD as
the single name service for an entire identity domain, and supplement the LDAP
directory with information that configures the ID mapping service.  Examples
of various Samba LDAP configurations to do this are shown in [5]; the basic
idea is to provide information similar to (b) in the LDAP directory itself.
Another obvious approach here would be to define a standard for storing POSIX
UIDs directly in AD (i.e. accompany Windows accounts with RFC 2307 attributes)
and leave it up to administrators to create a single centralized user identity.

One Centralized Identity scheme in existence is Microsoft's Windows Services
for UNIX (SFU) [3].  Prior to Windows Server 2003 R2, SFU was an add-on product
and it included an LDAP schema extension with properties for UNIX accounts in
AD including "msSFU30UidNumber" and "msSFU30GidNumber".  With Windows 2003 R2,
SFU became part of the base product, and Windows user objects in AD *may* have
a posixAccount object attached to them which includes a renamed "uidNumber"
member.  However, posixAccounts are not generated by default on Windows systems
so this cannot be relied upon, and the numbers used are unique only to the
current AD domain.  Microsoft's intent seems to be to provide posixAccount
only when an admin chooses to slurp existing UNIX NIS or LDAP into AD.

Although it is clear that many customers have accepted Samba's administrative
model, there are some obvious drawbacks associated with Samba:

(i) POSIX semantics, rather than Windows semantics, apply to data owned by
UNIX servers.  If files or filesystems are moved to a different server, their
identity is only the POSIX identity, subject to the name service configuration
and Samba configuration of the destination system, which may be different.

(ii) Windows clients cannot utilize a UNIX file server unless an administrator
decides upon and successfully configures the identity mapping solution.  For
approaches (a) or (c), this requires changes to domain-wide identity services,
as well as changes to procedures and/or tools for creating user accounts.
This may be easy for large, well-planned UNIX installations, but annoying
for startups trying to quickly assemble a mix of Windows and UNIX systems.

(iii) In situations where disparate Windows and UNIX identity domains must
co-exist (e.g. a Windows shop and a UNIX shop just became the same company),
the lowest-impact choice is Algorithmic Mapping (b), but this approach leaves
administrators with the disgusting problem of managing the integer UID space.

Algorithmic Mapping also has several nasty edge conditions, all of which are
implicitly left to the administrator to avoid or remedy:

- A UNIX administrator could erroneously allocate a POSIX UID within a range
  established for SIDs.  This would result in two users owning the same data.

- A Windows domain could grow to exceed its algorithmic set boundary (this is
  likely in large companies because a 32-bit space for each SID domain is
  being mapped to a single 32-bit or smaller POSIX space).  It then may not
  be possible to grow the set boundary without conflicting with other existing
  POSIX IDs.  Such a conflict could only be resolved by renumbering POSIX IDs, 
  which would also imply renumbering (chowning) all POSIX data files as well.

- If a large number of Windows domains must be mapped into the POSIX space,
  it will become increasingly complicated and error-prone to partition uid_t.
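
The first edge condition is easy to demonstrate.  In this Python sketch the
range table mirrors the "SUN" example used earlier, and the check is the
kind of guard an administrator would have to apply by hand before handing
out a new POSIX UID; the names are hypothetical.

```python
# Ranges set aside for algorithmic SID mapping (from the earlier example).
SID_RANGES = {"SUN": (70000, 80000)}


def colliding_domain(posix_uid, ranges):
    """Return the domain whose mapped range contains posix_uid, if any."""
    for domain, (low, high) in ranges.items():
        if low <= posix_uid <= high:
            return domain
    return None


# A UNIX admin allocating uid 71234 collides with SUN RID 1234.
print(colliding_domain(71234, SID_RANGES))
print(colliding_domain(12345, SID_RANGES))
```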

Samba has other drawbacks which have prevented NFS + Samba from seriously
competing as an enterprise multi-protocol file service solution, most notably
performance.  These issues are outside the scope of this document, but are
related to Sun's strategic decision to port Procom's CIFS stack to Solaris,
and make this technology more broadly available to the OpenSolaris community.

4. Analysis of NetApp ID Mapping

NetApp is a current industry player in multi-protocol file servers, and
addresses the ID mapping problem in their OnTAP operating system and WAFL
file system.  NetApp's original approach to the problem was outlined in a
1998 USENIX paper by Dave Hitz et al [7], and has since been refined in an
evolutionary fashion; the complete latest details are found in [8] and [9].

In order to solve some of the problems described earlier in Section 3, NetApp
added support for both SIDs and POSIX IDs to the WAFL filesystem, and also
implemented an ID mapping mechanism similar to the Samba idmap component.
WAFL filesystems are divided into subtrees called "qtrees", and an explicit
file security style is assigned to each volume and qtree that it contains:
either "unix", "ntfs", or "mixed" to support multi-protocol access.  Although
NetApp's source code is private, my suspicion is that the security style not
only affects security behavior but also the on-disk format.  One public
paper mentions an "extended inode" structure, so it seems likely that based on
the security style different extended inode formats are used so as to avoid
the overhead of storing SIDs in inodes and ACEs in a UNIX-style qtree.

NetApp implements Name Mapping (a) only, including both dynamic lookups (by
default) and static rewriting rules (in a usermap.cfg(5) file), but its use
and behavior varies depending on the style associated with a given qtree in
the filesystem.  The behaviors are:

- UNIX clients accessing UNIX qtrees use POSIX semantics as usual.  POSIX
  UID and GID values are stored in the filesystem and returned over NFSv3.

- UNIX clients accessing NTFS qtrees invoke the ID mapping mechanism to obtain
  a Windows SID.  If none can be found, the client is either denied access or
  assigned a global, configurable Windows identity (wafl.default_nt_user).
  A credential cache called the "WCC" caches this mapping for performance.

- CIFS clients accessing NTFS qtrees use SID semantics as usual.  SIDs are
  stored in the filesystem and returned over SMB.  A POSIX ID mapping is also
  stored in the filesystem, but a global mapping service does not need to have
  been configured; instead a default UID is assigned (wafl.default_unix_user).

- CIFS clients accessing UNIX qtrees invoke the ID mapping mechanism to obtain
  a POSIX ID.  If none can be found, the client is either denied access or
  assigned a global, configurable POSIX identity (wafl.default_unix_user).
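
The four behaviors reduce to a small decision table.  The Python below is
my own condensation of the cases as described in this section, not NetApp's
logic; the function and its return labels are hypothetical, and the mixed
style is deliberately left out.

```python
# Which qtree style is "native" for each kind of client.
NATIVE_STYLE = {"unix": "unix", "cifs": "ntfs"}


def resolve_access(client, qtree_style, mapping_found, default_allowed=True):
    """Return how the client's identity is obtained for this qtree."""
    if qtree_style not in ("unix", "ntfs"):
        raise ValueError("mixed style is not modelled in this sketch")
    if NATIVE_STYLE[client] == qtree_style:
        return "native"        # no mapping needed
    if mapping_found:
        return "mapped"        # name mapping produced the other identity
    # No mapping: fall back to wafl.default_nt_user or
    # wafl.default_unix_user if configured, otherwise deny.
    return "default" if default_allowed else "denied"


print(resolve_access("unix", "ntfs", mapping_found=True))
print(resolve_access("cifs", "unix", mapping_found=False))
```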

Mixed-mode functions as a kind of "last-touch" security policy; the complete
details are discussed in [8].  Fundamentally only one security policy, unix
or ntfs, is in place at a time for a given file or directory.

A significant aspect of NetApp's design is that there is no persistent mapping
of POSIX IDs and SIDs that is maintained on the filer or in its filesystems.
Namely, the WAFL Credential Cache (WCC) is an in-memory database that only
caches POSIX ID-to-SID mappings for performance.  SID-to-POSIX ID mappings
are only performed at the time an SMB session is established (the moral
equivalent of a UNIX "nfs mount" operation), and are saved with the session.
Since the administrator must declare their intended security policy a priori,
filesystems store POSIX IDs or SIDs as requested.  If POSIX IDs are stored,
then the POSIX identity design constraints apply when moving data around,
and administrators are again responsible for consistency of name services.
This design has important consequences for the data replication features
offered by NetApp: no auxiliary, global data must be moved with a filesystem.

Although clearly customers have accepted NetApp's approach (in part because
they are an entrenched market leader), there are drawbacks to their design:

(i) Administrators must pre-declare their intended security policy for a given
qtree.  If unix style is selected, Windows SID identities are in effect lost
or subject to the same security and identity drawbacks as in (3i) earlier.

(ii) Administrators must pre-declare the type of data sharing between Windows
and UNIX users that will occur, which in effect must match the security policy.
And it isn't necessarily easy for true data sharing to occur: one NetApp user
told me that configuring the mixed style was "insane and not recommended."

(iii) It leads to an overly complicated filesystem implementation, in that
the filesystem is managing different on-disk structures and security modes.

As a result, it isn't necessarily trivial for Windows and UNIX clients to
successfully begin sharing files in a common location: if name mappings have
not yet been established, one set of clients either has no access or has all
of its file creates and updates assigned to a global default user identifier.
In addition, the data namespace itself must effectively reflect its degree
of protocol sharing (which may of course change over time and unexpectedly), as
opposed to reflecting only the customer's semantic organization of their data.

5. Proposed Solution Constraints

Before describing the details of the proposed solution, it's worth stating
my proposed set of constraints on any solution appropriate for Solaris, both
in terms of our approach to general-purpose Solaris, and our approach to
OpenSolaris as the basis for servers providing storage services:

(a) CIFS clients should be able to read and write data to our CIFS server and
obtain true CIFS identity semantics.  That is, we should not rely on external
POSIX name service configuration to preserve data identities as in (3i).

(b) Windows clients should be able to access a Solaris server that has access
to an AD name service without requiring admins to also set up a network-wide
SID/UID equivalence relationship.  Such a requirement seems extremely costly
for a large Windows shop and large UNIX shop that just merged with each other,
as well as for a startup trying to rapidly evolve a heterogeneous environment.

(c) The mapping mechanism used for SID/UID mapping should not require the use
of persistent global data that must be propagated as part of data migration.
If such global data existed, it would significantly complicate basic data
migration such as zfs(1M) send/recv, as well as what can be built on top.

(d) We should avoid to the maximum extent possible any sort of unnecessary
complexity for administrators, especially if it is prone to human error and/or
security consequences.  Integer UID namespace partitioning is undesirable
for this reason.  So is partitioning the data namespace according to protocol.

(e) We need to provide appropriate compatibility for existing Solaris data and
administrative configurations, at no significant penalty of space or time.
Despite the drawbacks of the POSIX ID model for data, we can't simply decree
that a new Solaris system will no longer adhere to that model, especially if
it makes no use of either CIFS or Active Directory for data or identity.

6. Proposed Solution for Solaris

The solution proposed for Solaris is fundamentally to do the following:

(a) Modify ZFS to support SIDs directly in the filesystem, using an encoding
    that can be generalized to other forms of SIDs, generalized to other on-
    disk filesystems should that be required, and efficiently encode POSIX IDs.

(b) Modify the kernel to support SIDs as part of credentials (cred_t, ucred_t)
    so that the new Solaris CIFS server can establish such credentials in a 
    generic fashion and have them be passed through the VOP layer to ZFS.

(c) Deliver an ID mapping service to perform POSIX ID <-> SID mapping, and
    make this available to both user and kernel clients (i.e. via door upcall).
    This service is the Winchester project, with some minor modifications.

(d) Change Solaris uid_t and gid_t to be unsigned 32-bit types, and partition
    the ID space into half reserved for standard POSIX identifiers (the current
    range supported by Solaris, 0-0x7fffffff), and half reserved for ephemeral
    mappings associated with SIDs (the new range 0x80000000-0xfffffffe).
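
The partition in (d) can be expressed as a simple range test.  The Python
below uses only the two ranges given above; the function name and the
"invalid" label for values outside both ranges are my own.

```python
POSIX_MAX = 0x7fffffff
EPHEMERAL_MIN = 0x80000000
EPHEMERAL_MAX = 0xfffffffe


def classify_id(value):
    """Say which half of the unsigned 32-bit ID space a uid/gid falls in."""
    if 0 <= value <= POSIX_MAX:
        return "posix"
    if EPHEMERAL_MIN <= value <= EPHEMERAL_MAX:
        return "ephemeral"
    # 0xffffffff is outside both ranges; it is the bit pattern of
    # (uid_t)-1, which chown(2) treats as "leave unchanged".
    return "invalid"


print(classify_id(60002), classify_id(0x80000000), classify_id(0xffffffff))
```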

Item (d) is obviously where the dish gets a little spicy, so we'll explore the
consequence of that design approach with respect to existing Solaris in some
detail in the remainder of this document.  That aside for the moment, the major
design change proposed here is that Solaris adopt the notion of SIDs (or some
alternative generic name we give to them such as "ZID" for "the last damn ID
encoding format we're ever going to introduce into this operating system").
My belief is that the adoption of SIDs has long-term strategic benefits for
Solaris beyond the successful integration of a CIFS server:

(i) It provides the strategic foundation for a much larger class of services
    where OpenSolaris can function as a server for a set of Windows clients.

(ii) It provides true single unique identities in the operating system instead
    of the present POSIX semantics, thereby improving our strategic foundation
    for addressing growing technology needs for Security and Compliance.

(iii) It provides the ability for Solaris filesystems to store an effectively
    infinite number of identities associated with data, without the need to
    revise or break on-disk data formats when a fixed limit is exceeded.

(iv) It provides the basis for supporting an effectively infinite number of
    unique identities on the system, as opposed to a predefined fixed limit.

7. Canonical SID Representation

The canonical representation for SIDs should be the canonical representation
used by Windows, namely a printable string consisting of a letter to indicate
the SID format, a digit to indicate its version, and then groups of digits.
Windows often keeps SIDs in binary form; we should avoid this at all costs.
Solaris code that validates and/or establishes SIDs should be written to verify
the basic form of the SID string, but SHOULD NOT encode the set of format
characters and/or known version values.  In other words, we should explicitly
permit the handling of subsequent versions of the SID format, such as S-2-*,
without the need to re-issue or patch the Solaris kernel or user libraries.

We should also explicitly permit the deployment of entirely alternate SID forms
without the need to re-issue or patch the Solaris kernel or user libraries.

One approach to solving this problem would be to introduce a new SID encoding.
For example, the encoding P-1-x-<decimal digit string> could be defined by
Solaris and reserved as the canonical representation of a standard POSIX
identifier, with the semantic that it is resolved according to the current
name service.  The authority values 1 and 2 should be reserved for POSIX uid
and POSIX gid respectively (that is, P-1-1-X is uid_t X and P-1-2-X is
gid_t X).
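
A sketch of this first approach in Python, assuming the P-1 form is adopted
exactly as written above (it is one of two options under discussion, and the
function name is hypothetical):

```python
# Authority values proposed above: 1 for POSIX uid, 2 for POSIX gid.
POSIX_AUTHORITY = {"uid": 1, "gid": 2}


def posix_sid(kind, value):
    """Encode a POSIX uid or gid in the proposed P-1-<authority>-<id> form."""
    if value < 0:
        raise ValueError("POSIX identifiers are non-negative")
    return "P-1-%d-%d" % (POSIX_AUTHORITY[kind], value)


print(posix_sid("uid", 12345))
print(posix_sid("gid", 10))
```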

Another approach would be to allocate additional identifier authority
values within the Windows SID space for POSIX IDs or any additional SID forms,
e.g. S-1-123 where 123 is defined to mean POSIX, in conjunction with efforts
towards IETF standardization of the form and/or agreement from Microsoft.
In late 2006, Samba 3.0.23c adopted a similar approach, mapping POSIX IDs
to the SID families S-1-22-1-<uid> and S-1-22-2-<gid>.

Alternate SID forms could also be used to represent generic GUIDs on systems
where globally unique unstructured identifiers are assigned to users.  Recently
in MacOS X 10.4 Apple has implemented GUIDs as an underlying unique identifier
form, storing POSIX ownership in files, GUIDs in ACEs, and providing a mapping
from GUIDs to SIDs when participating in an Active Directory.  Further details
are available in [E].  GUIDs could be represented either as G-1-<guid>-0, i.e.
using the GUID itself as the domain, or by computing a corresponding local SID
in the manner of Apple's implementation and then assigning it an ephemeral ID.

Common user and kernel APIs that store and retrieve SIDs SHOULD NOT make use
of fixed-size buffers as part of any API definition.  User APIs to retrieve
SIDs should either return a pointer to arbitrary-sized data as a string, or
should provide a run-time API call to retrieve the size of a given SID in
order to allocate space dynamically prior to retrieving the SID data itself.

Sun should most likely pursue an IETF Internet Standard for SID representation
and the current set of valid SID generation and encoding formats, independent
of use in any particular operating system, data format, or network protocol.
This can be done in parallel with but not gating our use of SIDs in Solaris.
Similarly, Sun could pursue standardizing its new APIs through IEEE or POSIX.

The canonical SID representation should provide these minimal guarantees:

(a) The SID is composed of groups of octets delimited by a hyphen ("-").
    Octets should be printed using the escape syntax of RFC 2396 Section 2.
    This parses existing Windows SIDs while providing clarity for extensions.

(b) There must be at least four groups of octets: (1) format, (2) version,
    (3) authority, (4) relative identifier.  If an SID is composed of N octet
    groups with N > 4, groups [4, N-1] indicate the domain identifier and
    group N is the relative identifier.

(c) The format and authority groups can consist of any characters, following
    the RFC 2396 escape sequence.  In particular, we should permit authorities
    that are composed of strings rather than integers.  For example, it may be
    useful to support something like "P-1-3-sun.com-12345" as an SID encoding
    of the AUTH_DES (aka AUTH_DH) identity "unix.12345@sun.com"

(d) The version group indicates the major version of the specified format.
    This group must consist solely of the ASCII characters [0-9] and represent
    the decimal integer major version number, which can be incremented when
    incompatible changes to the format encoding need to occur.  The version
    number should be limited to fit in a 32-bit unsigned integer.

(e) The final octet group indicates the relative identifier within the
    specified identity domain.  This group must consist solely of the ASCII
    characters [0-9] and represent a decimal integer indicating the RID.
    Windows RIDs are currently limited to 32-bit unsigned values, although
    I propose that the Solaris implementation extend this limit to 64-bit.
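
Guarantees (a) through (e) are enough to write a format-agnostic checker.
The Python below is a sketch under two assumptions of mine: that a group may
contain the RFC 2396 unreserved characters other than the hyphen (which is
the delimiter) plus %HH escapes, and that the RID limit is the proposed
64-bit one.  It deliberately does not test for a particular format letter
or version value.

```python
import re

# One octet group: unreserved characters (minus "-") or %HH escapes.
GROUP = re.compile(r"(?:[A-Za-z0-9_.!~*'()]|%[0-9A-Fa-f]{2})+")
DECIMAL = re.compile(r"[0-9]+")


def split_canonical_sid(text):
    """Return (format, version, authority, domain_groups, rid)."""
    groups = text.split("-")
    if len(groups) < 4:
        raise ValueError("an SID needs at least four groups")
    if not all(GROUP.fullmatch(g) for g in groups):
        raise ValueError("empty or badly escaped group")
    fmt, version, authority = groups[:3]
    domain, rid = groups[3:-1], groups[-1]
    if not DECIMAL.fullmatch(version) or int(version) > 0xffffffff:
        raise ValueError("version must be a 32-bit decimal integer")
    if not DECIMAL.fullmatch(rid) or int(rid) > 0xffffffffffffffff:
        raise ValueError("RID must be a 64-bit decimal integer")
    return fmt, int(version), authority, domain, int(rid)


print(split_canonical_sid("P-1-3-sun.com-12345"))
```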

8. Filesystem SID Representation

To preserve efficiency of space in a filesystem, we can observe that the number
of file meta-data structures (e.g. ZFS znodes) and the number of on-disk ACEs
will far exceed the number of distinct SID domains that are likely to be seen
on a given system.  Therefore it will be desirable to establish a more compact
representation of SIDs in the filesystem than an arbitrary length byte string.
The proposal is to establish a Solaris convention for filesystem representation
that we will refer to as a FUID (Filesystem Unique Identifier); FUIDs can be
used to represent users, groups, or anything else that can be named by an SID.

FUIDs will be represented by convention as unsigned 64-bit integer values,
with the upper 32-bits serving as an index into an auxiliary table of domain
identifiers, and the lower 32-bits serving as a relative identifier within
that domain.  The upper 32-bit index of all zeroes will be reserved to 
indicate that the FUID refers to a 32-bit standard POSIX UID or GID identifier.
A filesystem that uses FUIDs can therefore represent a maximum of 4 billion
distinct identity domains, each with a maximum of 4 billion users; the maximum
number of distinct identities is 18,446,744,065,119,617,025 (18 quintillion).
At the same time, since ZFS already uses a uint64_t for storing uid and gid
in its on-disk znode representation, the existing POSIX UID/GID range supported
by Solaris can be represented with no change to the znode and no space penalty.
The ZFS ace_t currently uses a 32-bit uid_t; it must be changed to uint64_t.

Although the present SID space used by Windows has a domain prefix followed
by an unsigned 32-bit RID, our FUID domain table representation should support
SIDs where the RID exceeds 32-bits.  Therefore, we also propose the convention
that the FUID table should implement an offset field, such that SIDs that have
RIDs greater than 32-bits can consume multiple 32-bit prefixes for the same
domain identifier, with each 32-bit prefix entry referring to a group of four
billion RIDs computed by adding the offset from the table to the low 32-bits.

Here is a simple example of SIDs and a possible FUID encoding:

SID                                           | FUID
----------------------------------------------+-------------------
S-1-5-12-7623811015-3361044348-123456789-1234 | 0x00000001000004d2
S-1-5-12-7623811015-3361044348-123456789-5678 | 0x000000010000162e
S-1-5-12-7623811015-3361044348-987654321-1234 | 0x00000002000004d2
P-1-1-1234                                    | 0x00000000000004d2
P-1-1-5678                                    | 0x000000000000162e

with a corresponding FUID Domain Table as follows:

FUID Index | FUID Domain                              | FUID Offset
-----------+------------------------------------------+------------
0x00000001 | S-1-5-12-7623811015-3361044348-123456789 | 0
0x00000002 | S-1-5-12-7623811015-3361044348-987654321 | 0
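
The encoding in the two tables above can be sketched in C as follows.  The
type and function names are hypothetical, and the handling of the offset field
is only one plausible reading of the convention: the full RID is taken to be
the table entry's offset plus the low 32 bits of the FUID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory view of one FUID domain table entry. */
typedef struct fuid_domain {
	const char *fd_domain;	/* domain SID prefix, e.g. "S-1-5-..." */
	uint64_t fd_offset;	/* base RID covered by this entry */
} fuid_domain_t;

/* Compose a FUID from a domain table index and the low 32 bits of a RID. */
uint64_t
fuid_encode(uint32_t index, uint32_t rid_low)
{
	return (((uint64_t)index << 32) | rid_low);
}

/* Upper 32 bits: index into the FUID domain table. */
uint32_t
fuid_index(uint64_t fuid)
{
	return ((uint32_t)(fuid >> 32));
}

/* Index zero is reserved: the low 32 bits are a plain POSIX UID or GID. */
bool
fuid_is_posix(uint64_t fuid)
{
	return (fuid_index(fuid) == 0);
}

/* Recover the full RID by adding the table offset to the low 32 bits. */
uint64_t
fuid_rid(const fuid_domain_t *table, uint64_t fuid)
{
	return (table[fuid_index(fuid)].fd_offset + (uint32_t)fuid);
}
```

With this sketch, fuid_encode(1, 1234) yields 0x00000001000004d2, matching the
first row of the example.  A domain whose RIDs exceed 32 bits would simply
occupy a second table index whose offset is 0x100000000.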

It is important to note that the FUID representation implies that the same SID
stored in two different filesystems (or pools, depending on the implementation) 
is not guaranteed to be represented using the same FUID.  As such, filesystems
that support FUIDs should provide debug tools to convert FUIDs to SIDs.  For
example, ZFS should likely provide a zdb command for this as well as mdb dcmds.
Based on the FUID design, SIDs can be encoded in any filesystem and the unique
SID identity is preserved when filesystems move, such as by ZFS send/recv,
without the need to transmit any auxiliary global information.

It is left to subsequent PSARC cases to specify the implementation changes
necessary to implement FUIDs for any given on-disk filesystem such as ZFS.
A filesystem should be permitted to define an implementation limit for the
maximum FUID domain sequence that can be represented; such implementation
limits SHOULD NOT be exposed as #defines or APIs to userland Solaris code.
Implementation notes for historic filesystems that will not be modified to
support Solaris SIDs (e.g. UFS) are explained later in Section 15.

Solaris filesystems that implement SIDs/FUIDs should also support a canonical
means of retrieving and modifying the SIDs associated with a file or directory. 
Since part of our proposal is to require mapping from SIDs to POSIX IDs, this
could be implemented as a future extension, where initially we only support
retrieval and modification of filesystem SIDs either through CIFS, or by 
using the POSIX APIs to query the uid_t and gid_t values, and then mapping
those to SIDs using the ID mapping service described in Section 11.  However,
it is also desirable to provide some means to archive files and directories
with SIDs such as by tar(1) or cpio(1), including SIDs stored in ACEs.

Therefore, Solaris should provide a canonical representation of a file or
directory's user or group SID and associated ACEs using some form of an
extended attributes mechanism.  The advantage of using an extended attributes
mechanism is that it already has a full system call interface, and existing
Solaris utilities such as tar(1) and cpio(1) have been modified to be able to
archive and restore filesystem attributes.  However, since the current
attributes namespace can be used with no name restrictions, and is limited to a
single-level hierarchy, extended attributes for so-called system attributes
would have to be placed in a new namespace.  This new namespace could be
implemented entirely as a software abstraction on top of the underlying
filesystem attributes already on-disk, and adopt the same programming model
as our existing extended attributes (namely, open with a new flag for the
namespace, O_SATTR instead of O_XATTR, and then use normal read/write/close).

By using attributes, we could defer or entirely avoid the need to implement an
SID-based equivalent for chown(2) and associated APIs by either requiring that
callers map SIDs to POSIX IDs using the mapping service, or by implementing a
filesystem feature whereby writing to the specified attributes file with an
appropriate privilege would effect the equivalent of a chown to the new SID.
The Solaris tar, cpio, and pax utilities could then be modified generically
to archive and restore any system attributes, and would never need to change
again as Solaris continued to enrich and extend the set of system attributes.

It's also worth noting that this use of filesystem attributes would allow us
to solve the other long-standing problem of being unable to reasonably extend
the stat structure to represent larger ino_t and dev_t values and other related
problems.  Rather than simply introducing SID-based attributes, we could
introduce the complete set of stat attributes at the same time, or define a
canonical Solaris attribute file 'SUNWstat' that contained an extensible
name-value pair list (i.e. encoded nvlist_t) of the known file attributes.

9. Credential SID Representation

The Solaris credentials (both cred_t and ucred_t) should be extended to include
SIDs for users and groups in addition to the existing uid_t's and gid_t's,
which will be retained for mapping purposes, described next in Section 10.
As an optimization for the POSIX representation, SID fields can be omitted or
set to NULL whenever the SID refers to a standard POSIX ID value.  Since prior
work in Solaris 10 made cred_t and ucred_t opaque data types with functional
interfaces (see PSARC 2002/188), the necessary extensions to both types can be
done compatibly.  It is left to a subsequent PSARC case to specify the new APIs
to retrieve SIDs from a ucred_t, in accordance with the rules of Section 7.
In order to support true Windows semantics, the cred_t must be extended to 
support an arbitrary number of supplementary groups specified by SIDs; these
details should also be covered in a subsequent PSARC case discussing cred_t.

The credential SID representation should by definition imply appropriate
observability features for debuggers and core files through proc(4).  Since
/proc/<pid>/cred is already defined using the fixed POSIX representation, we
should supply an extensible, self-describing /proc/<pid>/ucred file that
corresponds to the ucred_*(3C) family of APIs, including the SID extensions.
One possible representation of the ucred file would be a serialized Solaris
nvlist_t; a subsequent PSARC case should specify the encoding details.
The pcred(1) utility at minimum should be modified to report Solaris SIDs.
DTrace should also be extended to be able to observe SIDs; this can be done
with no modifications, since DTrace permits clients to dereference kernel memory
such as elements that are added to the cred_t structure.  But it may be useful
to explicitly define DTrace string inlines such as "curusid" and "curgsid".

10. POSIX ID Partition and Mapping

As stated earlier in Section 5 and our discussion of NetApp and Samba, we want
to provide a facility for administrators to map Windows identities to POSIX
identities using name equivalency or static rewriting rules, but we do not
want to require that they do so.  In particular, we want to permit Windows
clients to immediately access a Solaris CIFS server that has joined an AD
domain without the need to provide any sort of name mapping or algorithmic
UID partitioning.  Without such a feature, our solution would be either non-
competitive with existing industry products, or it would be overly complex.

To solve this problem, and also permit data to be moved around without the
need to also move a globally persistent mapping database (or require that all
Solaris servers participate in a global mapping network service), we introduce
the concept of ephemeral UIDs.  Namely, we will reserve a part of the UID
space to perform on-the-fly mappings from SIDs to UIDs as needed, when
name-based mapping is either not configured or has not found any match.
This is similar in concept to algorithmic partitioning, but will require zero
configuration by the administrator, and no mapping persistence across reboots.
Since our last effort at uid_t expansion left us with a signed 32-bit uid_t
and no values permitted above INT_MAX, we can actually take advantage of our
current state by converting uid_t to unsigned, supporting the existing values
of uid_t for POSIX identifiers, and using the range [0x80000000-0xfffffffe]
for our ephemeral mappings (0xffffffff is omitted because it has special
meaning to the POSIX chown(2) system call, and it gives us a sentinel value).
The new Solaris will thus support 2 billion POSIX identifiers exactly as it
does today with no regressions, and also 2 billion simultaneous unmapped SIDs.
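
The resulting partition of the 32-bit ID space can be written down as a few
predicates.  This is only a sketch: the macro and function names are
hypothetical, and uint32_t stands in for the proposed unsigned uid_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the proposed unsigned 32-bit uid_t (an assumption). */
typedef uint32_t xuid_t;

#define	XUID_POSIX_MAX		0x7fffffffU	/* INT_MAX, today's limit */
#define	XUID_EPHEMERAL_MIN	0x80000000U	/* first ephemeral ID */
#define	XUID_EPHEMERAL_MAX	0xfffffffeU	/* last ephemeral ID */
#define	XUID_SENTINEL		0xffffffffU	/* (uid_t)-1, special to chown(2) */

/* True for IDs in the existing POSIX range [0, INT_MAX]. */
bool
xuid_is_posix(xuid_t u)
{
	return (u <= XUID_POSIX_MAX);
}

/* True for IDs in the proposed ephemeral range. */
bool
xuid_is_ephemeral(xuid_t u)
{
	return (u >= XUID_EPHEMERAL_MIN && u <= XUID_EPHEMERAL_MAX);
}
```

Every 32-bit value falls into exactly one of three classes: POSIX, ephemeral,
or the single sentinel value 0xffffffff.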

In the rest of this section, we discuss the implications of these proposed
complementary changes to Solaris: an unsigned 32-bit uid_t and ephemeral UIDs.

For some reason, when Sun made the transition from SunOS to Solaris, uid_t
became a signed type, despite the fact that SunOS 4 had 16-bit unsigned UIDs,
and earlier versions of both Berkeley and SVR4 had used unsigned types.  So
when large UID support was added by PSARC 1995/334, there was no reason to
address the issue of the base type when extending the maximum value to INT_MAX.
The PSARC 1995/334 case materials make only passing mention of the signed
issue, and unfortunately seem to have just assumed leaving it alone was best.
Clearly extending uid_t from 32 to 64-bits would be a painful transition for
Solaris, as we would have to again provide extended versions of every system
call that exports uid_t's directly or in structures (e.g. stat, fstat, etc.)
However, it does seem possible to compatibly grow uid_t from a signed to an
unsigned type, and thereby extend the effective value range up to 0xfffffffe.

There is a lot of evidence that suggests that these types should always have
been unsigned.  Among the data points are:

(a) Use of unsigned types in earlier versions of both Berkeley and SVR4 UNIX.
    As shown earlier, unsigned was the standard in the world of 16-bit UIDs.

(b) AIX was using unsigned UIDs up to ULONG_MAX as early as 1995.  UIDs are
    declared as unsigned types in latest versions of both BSD and Linux.
    Therefore we know that portable UNIX software can cope with unsigned.

(c) The encoding of UIDs/GIDs as unsigned 32-bit types in Sun's 1988 RFC 1057
    in the definition of the auth_unix RPC credential, which then became the
    basis for UIDs in NFS (today, NFSv3 still uses this unsigned definition).

(d) Also in 1988, POSIX 1003.1 stated that "each system user is identified by a
    non-negative integer known as a user ID that can be contained in an object
    of type uid_t", which implies that a negative uid_t value is never valid.

(e) The most recent clear statement on the subject comes from POSIX 1003.1's
    2001 Rationale for System Interfaces document, section B.2.12, which says:

    "The types uid_t and gid_t are magic cookies.  There is no {UID_MAX}
    defined by POSIX.1, and no structure imposed on uid_t and gid_t other than
    that they be positive arithmetic types. (In fact, they could be unsigned
    char.)  There is no maximum or minimum specified for the number of distinct
    user or group IDs."

Item (e) clearly permits unsigned uid_t/gid_t types, since one is explicitly
named, and reinforces the original statement in (d) by further claiming that
by design POSIX does not specify a range or any range limit for UIDs, since
they are to be treated as opaque cookies that refer to an identity.  Therefore,
since we're not changing the size of the type, it would seem to be possible to
compatibly evolve these types to unsigned in a Solaris Minor release (Nevada).
Specifically, this would not require any incompatible change to Solaris
interfaces because the size of the base type does not change, nor does the
size, field offset, or alignment of any struct members of type uid_t or gid_t.

Finally, no on-disk or on-wire incompatibilities are created in filesystems:
UFS stores IDs as 32-bit unsigned and casts to uid_t/gid_t in VOP_GETATTR,
ZFS does the same but uses 64-bit unsigned values in its disk structures,
and NFSv2 and NFSv3 use 32-bit unsigned values over the wire as described
earlier. (NFSv4 converts UIDs to names; we discuss NFSv4 later in Section 14.)
ZFS and UFS rely on the Solaris kernel to propagate legitimate ID values.
As an example, if one were to experience data corruption in a UFS filesystem
that resulted in a uid_t in an inode greater than INT_MAX, UFS will happily
copy this value into a uid_t in a VOP_GETATTR call and return a negative UID.
There is at present no code in UFS preventing that from happening, nor is there
any such code in ZFS (although ZFS data checksums make this nearly impossible).

Some issues are created with respect to UIDs stored in archive formats such
as tar and cpio, but these issues turn out to be mitigated by our ephemeral
UID plan, so we defer discussion of archive utilities until Section 17.

Two other related standard issues which should be considered but I believe can
be mitigated are the definition of id_t in <sys/types.h> and the SPARC SVID
and subsequent SPARC Compliance Definition (SCD).  The relevant details are:

(a) The POSIX Base Definition for <sys/types.h> describes id_t as "Used as a
    general identifier; can be used to contain at least a pid_t, uid_t, or
    gid_t."  However, I can't find any text clarifying whether the meaning of
    the word "contain" in this context applies to size only or size and sign.
    My evaluation of this issue is that we should leave id_t alone (signed)
    and that this does conform to the definition.  Other OSes such as Linux
    and BSD have made id_t unsigned, but this change would seem to have more
    widespread consequences in that I have found lots of code comparing a value
    of type id_t < 0 where id_t is being used generically in userland code.

(b) The 1990 Generic SVID ABI document makes no mention of uid_t or gid_t,
    but the SPARC Processor Supplement has a diagram of the <sys/types.h>
    definitions on page 6-65 which indicates typedef long for uid_t and gid_t.
    Given that this ABI document was not yet aligned with POSIX, I believe
    the proposed changes here are compatible and more true to the definitions
    that POSIX eventually adopted.  The subsequent SCD has the same issue.

These documents aside, the observable consequences of changing uid_t/gid_t from
signed to unsigned in a Solaris Minor release will be the following:

(a) Code that attempts to printf uid_t using %d (or %ld in a 32-bit program)
    will produce negative values if we use IDs above INT_MAX.  However, such
    code will work correctly if it prints using %d and then scanf's back %d.
    It isn't clear that this would cause harm other than odd-looking output;
    Solaris commands such as ps(1) and pcred(1) should have this issue corrected.

(b) Code that attempts to sort uid_t by performing signed comparison will end
    up sorting IDs above INT_MAX before those IDs between zero and INT_MAX.
    Such an issue would likely be fixed by a recompile (as the use of the
    derived typedefs would now convert the code to unsigned automatically),
    but it doesn't seem like this would create any serious incompatibility.
    Binary object code that sorted uid_t's by signed comparison and then binary
    searched the result using signed comparison would still function properly.

    One variant of this comparison issue is some very old UNIX tools that
    attempt to "categorize" UIDs based on being less than or greater than 99.
    The two examples of this I can find in the source base are logins(1) and
    listusers(1), which compare pw_uid to 99 to categorize their output.  Both
    tools will work properly by virtue of a simple recompile to unsigned uid_t.
    The useradd(1M) command has a similar notion, in that it only starts
    allocating uids at 100 for new user accounts, up to UID_MAX.  Again, the
    code stays the same, but we may rename a #define for its allocation limit.

(c) Code that attempts to convert a string to an integer uid_t could fail if
    applied to a UID greater than INT_MAX.  Specifically, atol("4294967295")
    successfully returns -1 and atoi("4294967295") successfully returns -1,
    permitting uid_t u = atoi(s) to work when s is a string UID > INT_MAX.
    But strtol() applied to "4294967295" fails in a 32-bit program, returning
    LONG_MAX and setting errno to ERANGE.
    However, code that prints uid_t as a signed int (%d) or signed long (%ld)
    would be able to convert those strings back to uid_t/gid_t using strtol().
    Only code that was self-inconsistent with the underlying type would break.
    Documentation should guide programmers to resolve this issue.

(d) Code that maintains a persistent copy of a uid may store an ephemeral
    value which cannot later be mapped back to a user (e.g. an audit trail).
    However, such code had no way to reliably do so prior to ephemeral IDs,
    as nothing precludes administrators from reusing uid values, removing them
    from the name service, or moving such a file to a different name domain.
    We discuss some examples of Solaris files like this in Section 16.

(e) Any non-Sun RPC protocols that send UIDs over the wire may obtain wrong
    results when applied to ephemeral IDs, but only when those services are
    deployed in conjunction with CIFS and AD.  Such RPC protocols would already
    have required a globally consistent name service configuration in order to
    make any sense, and in the presence of such configuration would still work.

(f) C++ code that uses uid_t or gid_t in a C++ function signature will produce
    a different mangled signature when recompiled.  Therefore, if a group of .o
    or .a files are recompiled against the unsigned uid_t and linked against a
    second group that has yet to be recompiled, a link-time error will occur.
    Recompiling both groups and re-linking will solve this problem.  Existing
    C++ binaries, compiled prior to the change, will continue to work correctly
    subject to the issues described above.  System interfaces that are declared
    extern "C" do not suffer from the recompilation issue at all, such as our
    base system interfaces in usr/include/, because extern "C" interfaces do not
    encode name mangling and by extension parameter types in the symbol table.
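
Items (a) through (c) can be illustrated with a few lines of C.  The sketch
below uses fixed-width types as stand-ins for the old signed and proposed
unsigned uid_t so that it behaves identically in 32-bit and 64-bit programs;
the function names are illustrative only and do not exist in Solaris.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* (a) Format an ID the way legacy "%d" code would see it. */
int
format_uid_signed(char *buf, size_t len, uint32_t uid)
{
	/*
	 * The unsigned-to-signed conversion is implementation-defined in
	 * ISO C; it wraps on two's complement targets such as Solaris.
	 */
	return (snprintf(buf, len, "%d", (int)(int32_t)uid));
}

/* (b) Ordering as seen by old binaries that compare IDs as signed. */
bool
uid_less_signed(uint32_t a, uint32_t b)
{
	return ((int32_t)a < (int32_t)b);
}

/* (b) Ordering after a recompile against the unsigned type. */
bool
uid_less_unsigned(uint32_t a, uint32_t b)
{
	return (a < b);
}

/*
 * (c) Parse a decimal ID string without depending on the width of long.
 * Accepts 0 through 0xfffffffe; returns 0 on success, -1 on error.
 */
int
parse_uid(const char *s, uint32_t *uidp)
{
	char *end;
	unsigned long long v;

	if (*s < '0' || *s > '9')
		return (-1);	/* rejects "", "-1", and leading whitespace */
	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno != 0 || *end != '\0' || v > 0xfffffffeULL)
		return (-1);
	*uidp = (uint32_t)v;
	return (0);
}
```

An ID of 0x80000000 prints as a negative number through format_uid_signed()
and sorts before 100 under uid_less_signed(), yet the printed string converts
back to the same 32-bit pattern, which is why self-consistent signed code
keeps working.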
    
With uid_t and gid_t now extended to 32-bit unsigned types, we now propose to
partition the ID space in half, reserving the upper 2 billion values for so-
called ephemeral IDs.  These ID values would be reserved for transient mappings
of SIDs introduced into the system for which no name-based mapping rule between
the SID and a POSIX ID in the existing range [0, INT_MAX] applies.  A central
mapping service (the Winchester project, discussed further in Section 11),
will establish the reservation of an ephemeral ID and its connection to an SID,
and will hold the reservation until a Solaris instance reboots.  That is, when
the forthcoming SMB server for CIFS establishes a session, it will take the SID
over the wire, look up the Windows AD name, and contact the ID mapper to see if
a name-based mapping applies; if so, a POSIX ID in the existing range will be
assigned to the credential in addition to storing the SID there.  If not, an
ephemeral ID above INT_MAX will be assigned.  In either case, every credential
will always contain both uid_t/gid_t values and an SID simultaneously.
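
The session-time decision described above can be sketched as follows.  All
types and names here are hypothetical, the fixed-size table exists only to
keep the example short, and the AD name lookup step is skipped: the
name-based rules are keyed directly by SID string for simplicity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	EPH_UID_MIN	0x80000000U	/* first ephemeral ID */
#define	EPH_SLOTS	8		/* tiny table, for illustration only */
#define	SID_BUFLEN	64

/* Hypothetical administrator-supplied rule: SID -> existing POSIX UID. */
typedef struct name_rule {
	const char *nr_sid;
	uint32_t nr_uid;
} name_rule_t;

/* Hypothetical mapper state; rebuilt from scratch at each boot. */
typedef struct idmap {
	const name_rule_t *im_rules;
	size_t im_nrules;
	char im_eph_sid[EPH_SLOTS][SID_BUFLEN];	/* slot i <-> EPH_UID_MIN + i */
	size_t im_neph;
} idmap_t;

/* Hypothetical credential fragment: always carries both uid and SID. */
typedef struct sid_cred {
	uint32_t sc_uid;
	char sc_sid[SID_BUFLEN];
} sid_cred_t;

/* Fill in *cr for sid; returns 0 on success, -1 if it cannot be mapped. */
int
idmap_build_cred(idmap_t *im, const char *sid, sid_cred_t *cr)
{
	size_t i;

	if (strlen(sid) >= SID_BUFLEN)
		return (-1);
	strcpy(cr->sc_sid, sid);

	/* 1. A name-based rule yields an ID in the existing POSIX range. */
	for (i = 0; i < im->im_nrules; i++) {
		if (strcmp(im->im_rules[i].nr_sid, sid) == 0) {
			cr->sc_uid = im->im_rules[i].nr_uid;
			return (0);
		}
	}
	/* 2. Reuse an ephemeral reservation made earlier in this boot. */
	for (i = 0; i < im->im_neph; i++) {
		if (strcmp(im->im_eph_sid[i], sid) == 0) {
			cr->sc_uid = EPH_UID_MIN + (uint32_t)i;
			return (0);
		}
	}
	/* 3. Otherwise reserve the next ephemeral ID above INT_MAX. */
	if (im->im_neph == EPH_SLOTS)
		return (-1);	/* table full; not modeled further here */
	strcpy(im->im_eph_sid[im->im_neph], sid);
	cr->sc_uid = EPH_UID_MIN + (uint32_t)im->im_neph++;
	return (0);
}
```

The property worth noting is step 2: while the system stays up, the same SID
always receives the same ephemeral ID, so uid_t comparisons between files
remain meaningful.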

This design implies that once an SID-based service such as SMB/CIFS creates a
file, that file can be stat'd by a local UNIX process or over the wire by NFS,
and appropriate uid_t / gid_t values can be returned.  A process could then
stat other files, compare those uid_t's, and correctly determine that a file
has the same identity as another file with the same corresponding SID.  Other
POSIX system calls, such as setuid() or chown(), can be used with the ephemeral
ID values, and will have the correct semantics.  The major mental leap is that
the mapping between SIDs and ephemeral IDs is not persistent across reboots.

The first thing to realize is that we're only doing this when no UNIX name
service is being used or when no POSIX identity mapping is provided.  That is,
a Solaris system deployed exactly as it is today has no ephemeral IDs, and thus
no incompatibility issues at all.  Administrators must specifically configure
Solaris to participate in AD and utilize CIFS without name mapping, and thus
we can clearly explain the consequences as part of the documentation to do so.

Second, the notion of ephemeral IDs that are not persisted across reboots can
only cause issues when those IDs are written to disk or sent over the
wire: if they are not, then no semantic incompatibility exists at all because
the in-memory behavior of ephemeral IDs on a running system is no different.
And there is strong history behind the idea that UNIX UIDs and GIDs were not
the ideal concepts for persistence in the first place.  Sun's original RFC 1057
that defined auth_unix for RPC defined auth_des at the same time, noting that
different UIDs would have different meaning in different network domains.  The
idea of ID mapping also has been around for nearly 20 years, as mentioned
earlier.  And most recently, NFSv4 used usernames rather than the ID values
as part of the over-the-wire protocol changes from NFSv3.

Third, we propose a set of limiting constraints in terms of the kernel and
user behavior for ephemeral IDs that will help to limit their propagation.
Specifically, we propose the following non-changes to enhance compatibility:

(a) The Solaris NFSv3 server already maps UIDs above Solaris UID_MAX (INT_MAX)
    to UID_NOBODY; this code should remain in place, implying that ephemeral
    IDs are never sent or received over the wire.  See Section 13 for more.

(b) The Solaris tar, cpio, and pax utilities already do not support large UIDs
    in most data formats; this code should remain in place, implying that
    ephemeral IDs are not archived by default.  See Section 17 for more.

(c) The Solaris kernel already prevents system calls from propagating UIDs
    greater than UID_MAX (INT_MAX) to filesystems.  For example, chown returns
    EINVAL if a uid_t < -1 or > UID_MAX is specified as an argument.  To
    preserve compatibility with existing filesystems that are not converted
    to use the FUID scheme of Section 8, the Solaris VOP_* layer should convert
    ephemeral IDs to UID_NOBODY / GID_NOBODY before calling old filesystems.
    See Section 15 for more details about this proposal.
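
The conversion in (c) amounts to a small filter applied before an ID reaches
a filesystem that has not been converted to FUIDs.  The sketch below is
illustrative only: the function name and the boolean capability flag are
hypothetical, and 60001 is the value Solaris defines for UID_NOBODY.

```c
#include <stdbool.h>
#include <stdint.h>

#define	EPH_UID_MIN	0x80000000U
#define	EPH_UID_MAX	0xfffffffeU
#define	UID_NOBODY	60001U		/* Solaris "nobody" */

/*
 * Return the ID that should be handed to a filesystem.  Ephemeral IDs are
 * squashed to UID_NOBODY unless the filesystem is FUID-capable; all other
 * values pass through unchanged.
 */
uint32_t
vop_uid_for_fs(uint32_t uid, bool fs_supports_fuid)
{
	if (!fs_supports_fuid && uid >= EPH_UID_MIN && uid <= EPH_UID_MAX)
		return (UID_NOBODY);
	return (uid);
}
```

The same filter shape applies to the NFSv3 case in (a), where no peer is
treated as capable of receiving an ephemeral ID.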

One final change in behavior is that we don't want to permit the ephemeral ID
space to be exhausted by a buggy or malicious user application.  Therefore, we
define that an ephemeral ID mapping must be established by the ID mapper by a
client with appropriate privilege before an ID-based system call can use that
uid_t or gid_t as an argument.  For example, in present Solaris, it is legal
to chown(2) a file to a uid_t value that has no mapping in the name service.
If this behavior worked for ephemeral IDs, one could exhaust the ephemeral ID
space by issuing millions of chown operations to as-yet unused ephemeral IDs.
However, since chown(2) fails with EINVAL on signed uids < -1, we know no
application code can be relying on that working.  So instead, we propose that
upon chown(2) and similar calls using ephemeral IDs, the system call will
determine by use of the ID mapper or a cache if the ID is claimed; if so, the
call will succeed, otherwise it will fail and return EINVAL as it does today.
The setuid(2) family of system calls will be modified along the same lines.
So a user process cannot exhaust the space maliciously, but it can successfully
stat one file, obtain an ephemeral ID, and then chown another file to that uid.
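
A sketch of the argument check for chown(2) follows.  It assumes, purely for
illustration, that ephemeral IDs are handed out in ascending order so that a
single high-water mark identifies every claimed ID; a real implementation
would consult the ID mapper or its cache instead.  The function name is
hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define	POSIX_UID_MAX	0x7fffffffU	/* INT_MAX */
#define	UID_NO_CHANGE	0xffffffffU	/* (uid_t)-1: leave owner unchanged */

/*
 * Validate a uid argument to a chown-style call.  next_ephemeral is the
 * lowest ephemeral ID not yet reserved (assumed in-order allocation).
 * Returns 0 if the call may proceed, or EINVAL.
 */
int
check_chown_uid(uint32_t uid, uint32_t next_ephemeral)
{
	if (uid == UID_NO_CHANGE)
		return (0);		/* existing chown(2) semantics */
	if (uid <= POSIX_UID_MAX)
		return (0);		/* POSIX IDs behave as they do today */
	if (uid < next_ephemeral)
		return (0);		/* ephemeral ID already claimed */
	return (EINVAL);		/* unclaimed ephemeral ID */
}
```

Because an unprivileged caller can never cause next_ephemeral to advance,
repeated calls with unused IDs fail without consuming any of the space.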

Despite protection against malicious exhaustion of the ephemeral ID space, it
is still of course possible for the system to run out of ephemeral IDs.  My
view is that this is, by virtue of the design, no worse than the current
Solaris system behavior.  Namely, Solaris supports only 2 billion UIDs today.
A Solaris system with the proposed changes configured solely as a CIFS server
using AD and no POSIX name services would support 2 billion ephemeral IDs.
If this limit becomes insufficient in some relevant customer scenario, this
would provide the impetus to either grow uid_t to 64-bits, or implement a
larger-scale API conversion from the use of POSIX IDs to SIDs.  These are
challenges we face anyway given the current size of our POSIX ID space.  Until
such changes are made, the behavior of the ID mapping service when all
ephemeral IDs are exhausted should be to map any new SIDs to *ID_NOBODY, which
is already reserved within the POSIX ID space, and log an FMA message.
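
The exhaustion behavior can be sketched as a simple in-order allocator.  The
names are hypothetical, 60001 is the Solaris value of UID_NOBODY, and the FMA
logging step is only indicated by a comment.

```c
#include <stdint.h>

#define	EPH_UID_MIN	0x80000000U
#define	EPH_UID_MAX	0xfffffffeU
#define	UID_NOBODY	60001U

/* Hypothetical allocator state: the next ephemeral ID to hand out. */
typedef struct eph_alloc {
	uint32_t ea_next;	/* initialize to EPH_UID_MIN at boot */
} eph_alloc_t;

/*
 * Reserve the next ephemeral ID.  Once the range is exhausted, every new
 * SID maps to UID_NOBODY instead.
 */
uint32_t
eph_alloc_uid(eph_alloc_t *ea)
{
	if (ea->ea_next > EPH_UID_MAX) {
		/* A real service would also log an FMA message here. */
		return (UID_NOBODY);
	}
	return (ea->ea_next++);
}
```

After the final ID 0xfffffffe is issued, ea_next holds 0xffffffff, which is
above EPH_UID_MAX, so the counter never wraps back into the POSIX range.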

Finally, with respect to the top-level name service APIs, we propose that
getpwuid(3C), getpwuid_r(3C), getgrgid(3C), and getgrgid_r(3C) all return
failure with "not found" semantics when passed an ephemeral uid or gid in the
situation where no corresponding SID can be resolved in the SID-based name
service, or where no SID-based name service is available at all (i.e. on a
system with CIFS support before Reno and nss_ad exist or are deployed at all).
This semantic is consistent with the notion that these entities do not exist
in the UNIX name service backend, and certainly it is already possible for
one to stat(2) a file and see a uid or gid that is not retrievable from the
current name service configuration.  Another active Solaris Nevada project,
Reno (see http://opensolaris.org/os/project/reno/), proposes to extend PAM
to permit passwd(4)-style user attributes to be loaded by authentication
modules.  We propose that providing such information for ephemeral IDs can
be safely deferred until Reno and/or an nss_ad switch module are implemented.
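
Well-behaved callers already cope with the "not found" case, as in this
sketch of a helper that falls back to the numeric ID when getpwuid(3C) returns
NULL.  The helper name is hypothetical; the ID is printed as unsigned, in
keeping with the proposed change to uid_t.

```c
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Write the owner's login name into buf, or the numeric ID when the name
 * service has no entry for it (as proposed for unresolved ephemeral IDs).
 * Returns the snprintf(3C) result.
 */
int
format_owner(char *buf, size_t len, uid_t uid)
{
	struct passwd *pw = getpwuid(uid);

	if (pw != NULL)
		return (snprintf(buf, len, "%s", pw->pw_name));
	return (snprintf(buf, len, "%lu", (unsigned long)uid));
}
```

This is the same pattern that tools such as ls(1) follow today for files
owned by IDs that are absent from the current name service configuration.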

11. ID Mapping Service

The ID mapping service necessary to implement the changes described in this
document will be delivered by the Winchester project, and should provide the
following minimum capabilities:

(a) The ability to perform name-based mapping so that the CIFS/SMB session
    initiator can obtain a POSIX uid or gid corresponding to an SID.

(b) The ability to configure static rewriting rules for name-based mapping
    that are equivalent to rewriting rules offered by Samba and NetApp.

(c) The ability to allocate ephemeral IDs when an SID to POSIX ID mapping
    cannot be computed.  Ephemeral IDs must be cached across restarts of the
    ID mapping service (i.e. either in the kernel, in tmpfs, or both), and
    should not be cached persistently on disk.

(d) The ability for kernel code such as FUID-aware filesystems to upcall the
    mapping service via a door to convert SIDs to UIDs/GIDs, and some
    appropriate caching of the results, as determined by performance analysis.

(e) The ability for Sun to eventually deliver a unified identity solution,
    wherein a single directory could contain UNIX information and SIDs.
    The unified identity code should also support Microsoft SFU, as described
    in (3c), i.e. looking for msSFU30UidNumber or the new uidNumber attributes
    when AD is configured to have posixAccount objects for Windows users.

As specified, Winchester resembles the Samba ID mapping component, in that
it offers an ID mapping service with pluggable mapping models, independent
of any underlying capability of the operating system to support SIDs.
Winchester proposes to implement features beyond (a-c), including:

- Algorithmic Mapping from Section 3b, similar to Samba
- A plug-in interface for other mapping schemes, e.g. Apple Open Directory

and other features.  My intent here is only to discuss dynamic mapping of SIDs
and its impact on Winchester; the complete set of Winchester features is
described in its project documentation and should be evaluated as part of its
ARC review.  My hope is that this proposal will actually simplify the
implementation of Winchester, and also integrate it more tightly with the
base operating system as the central place for ID management going forward.

The Winchester project should consider several key implementation issues based
on the analysis in this document.  The resolution of these issues is left for
discussion of that project and its associated ARC materials:

(i) The project proposes to store its mappings in a persistent database.
    Although such mappings need to be persisted, with ephemeral IDs they only
    need to be persisted across service restart, and not reboots, implying that
    tmpfs can be used to back the cache.  This property may significantly
    simplify the design and implementation of the persistence mechanism.

(ii) For NFSv4, PSARC 2004/592 extended nfsmapid(1M) to also support user
    plug-ins to perform mapping, finally implementing the original concept
    proposed by PSARC 1998/335 in a much more sane fashion.  Given this
    interface and the existing code in Solaris to upcall nfsmapid(1M) and
    cache its results, the Winchester project should investigate whether it
    would be simpler to make ID mapping one common service as opposed to
    having both nfsmapid and a completely orthogonal service for Winchester.

(iii) Given the critical importance of not re-using ephemeral IDs once the
    system has booted, it may be dangerous to only store the allocated IDs
    in the cache from (i).  Since allocation of ephemeral IDs can be done in
    order, it would be relatively simple for the Winchester service to preserve
    the next ephemeral ID (i.e. a reservation of all previously allocated IDs)
    in a non-persistent smf(5) property group, or save it in the kernel itself.
    (Storing the entire cache in-kernel is presumed to be wasteful of memory.)
    The Winchester team should consider these options for their implementation.

With the combination of ephemeral IDs and the ID mapping service, it is not
necessary for us to deliver a complete family of setuid(2) and setgid(2)
system call equivalents that accept SIDs as part of the initial project work,
because any application that groks SIDs can contact the mapping service to
obtain the mapped POSIX ID, and then use the existing system calls.  At the
same time, these calls could be compatibly added in a future Solaris release.

12. CIFS Implementation

The new CIFS service will establish SID-based credentials as part of creating
an SMB session, since SIDs are transported over-the-wire in the SMB protocol.
When the SMB service establishes credentials, it will contact the ID mapper
directly (or indirectly as the result of some new system call) which will
result in computing the appropriate uid_t and gid_t mappings for the SIDs,
either POSIX IDs if a name mapping exists, or ephemeral IDs if no mapping is
found.  In either case, the SMB service will establish a full cred_t with
both uid and gid values and SIDs, and can then interact with the rest of
Solaris.  When CIFS writes through the VOP layer to a FUID-capable filesystem
such as ZFS, SIDs will be stored as FUIDs in the filesystem.
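
The credential setup described above can be sketched roughly as follows.
The Cred and IdMapper classes are hypothetical stand-ins for cred_t and the
ID mapping service, the ephemeral range start of 0x80000000 is an assumption,
and a single counter is used for both users and groups purely to keep the
example short.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Cred:
    """Stand-in for a cred_t carrying both POSIX IDs and SIDs."""
    uid: int
    gid: int
    user_sid: str
    group_sid: str


class IdMapper:
    """Hypothetical stand-in for the ID mapping service."""

    def __init__(self, name_map):
        self.name_map = dict(name_map)    # SID -> POSIX ID via name mapping
        self.ephemeral = {}               # SID -> ephemeral ID
        self._next = count(0x80000000)    # assumed start of ephemeral range

    def to_posix(self, sid):
        # Prefer a POSIX ID when a name mapping exists ...
        if sid in self.name_map:
            return self.name_map[sid]
        # ... otherwise fall back to a (possibly new) ephemeral ID.
        if sid not in self.ephemeral:
            self.ephemeral[sid] = next(self._next)
        return self.ephemeral[sid]


def establish_cred(mapper, user_sid, group_sid):
    """Build a full credential: uid and gid values plus the original SIDs."""
    return Cred(mapper.to_posix(user_sid), mapper.to_posix(group_sid),
                user_sid, group_sid)
```

Either way the session ends up with a complete credential, so code that only
understands uid and gid values keeps working while SID-aware code such as a
FUID-capable filesystem can use the SIDs.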

13. NFSv3 Implementation

As discussed earlier, NFSv3 sends UIDs and GIDs over the wire if AUTH_UNIX
is selected, but the Solaris code already maps values greater than INT_MAX
to *ID_NOBODY.  The code path for this is as follows: sec_svc_msg() takes
AUTH_UNIX from the wire and in _svcauth_unix() does an XDR decode of int32 to
store the uid in aup_uid, which is then used by sec_svc_getcred() to make the
cred.  This in turn calls crsetugid(), which can and does fail if the uid is
greater than UID_MAX.  This failure is then propagated back to checkauth() in
nfs_server.c which for AUTH_UNIX then resets the credentials to the anonymous
user (ex_anon).  We propose to keep this code intact, albeit using different
#defines.  Thus NFSv3 would map any ephemeral IDs that have no POSIX equivalent
to *ID_NOBODY.  One extant bug here is that I can't find any code which
precludes one from configuring ex_anon (the share(1M) anon=N setting) to be a
value greater than UID_MAX; this bug should be corrected as part of this work.

The proposal thus introduces no incompatibilities with other operating systems
for NFSv3 file sharing or with existing Solaris as an NFS v3 client or server,
and effectively means that NFSv4 must be used to share files across Windows and
UNIX clients when Solaris is configured as a CIFS server and no global POSIX
name mapping equivalence has been established.  This seems like an eminently
reasonable constraint, helps to propagate the use of NFSv4, and doesn't 
compromise any of our constraints with respect to pure Windows client support.
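
The AUTH_UNIX code path described at the start of this section can be
modelled in a few lines.  This is a simplified Python sketch, not the Solaris
source: crsetugid_ok() and auth_unix_uid() are hypothetical helpers that only
imitate the range check and the fallback to ex_anon, and check_anon_setting()
illustrates the missing share(1M) validation mentioned above.

```python
import struct

UID_MAX = 0x7fffffff   # maximum POSIX uid, per the text
UID_NOBODY = 60001     # the *ID_NOBODY value used throughout this paper


def crsetugid_ok(uid):
    """Simplified model of the crsetugid() check: refuse uids above UID_MAX."""
    return uid <= UID_MAX


def auth_unix_uid(wire_uid, ex_anon=UID_NOBODY):
    """Return the uid the server would use for a 4-byte XDR-encoded uid."""
    (uid,) = struct.unpack(">I", wire_uid)   # XDR integers are big-endian
    if crsetugid_ok(uid):
        return uid
    # The failure propagates back and the credential is reset to anonymous.
    return ex_anon


def check_anon_setting(anon):
    """Reject an anon=N setting that is itself above UID_MAX."""
    if anon > UID_MAX:
        raise ValueError("anon= value exceeds UID_MAX")
    return anon
```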

14. NFSv4 Implementation

Unlike NFSv3, NFSv4 does not send UIDs and GIDs over the wire for attributes.
Instead, nfsmapid(1M) is used to map the values to utf8 strings containing the
user and group name suffixed by the NFSv4 mapping domain (either the DNS domain
or a domain name manually configured using the NFSMAPID_DOMAIN property).
See RFC 3530 for more information on the exact NFSv4 semantics.  Fundamentally
this behavior would remain unchanged; the NFS server would continue to upcall
a mapping daemon to map a UID or GID to a name using the name service, and if
this call is successful (which it could be when Windows and POSIX name mapping
equivalence has been established), the appropriate name is sent over the wire.
If an ephemeral ID for an SID has no mapping, then the POSIX name service
lookup should fail and return *ID_NOBODY to the kernel, which NFSv4 already
has defined as a clear semantic, and it sends "nobody" back over the wire.

However, NFSv4 does not change the basic underlying RPC mechanisms for
authentication: AUTH_UNIX can still be used to authenticate, and NFSv4
clients are therefore expected to have POSIX-style credentials when using
AUTH_UNIX.  POSIX IDs exceeding the maximum Solaris POSIX ID, 0x7fffffff,
will be converted to *ID_NOBODY, as we do for NFSv4 and v3 today.  As such,
NFSv4 clients using AUTH_UNIX can only create files with POSIX IDs.
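
The attribute mapping described at the start of this section amounts to the
following.  This is a minimal sketch: nfs4_owner_string() is a hypothetical
helper, and the names_by_uid dictionary stands in for the name service lookup
performed by the mapping daemon.

```python
def nfs4_owner_string(uid, names_by_uid, domain):
    """Return the owner string sent over the wire for a uid."""
    name = names_by_uid.get(uid)
    if name is None:
        # No POSIX name exists, as for an unmapped ephemeral ID.
        return "nobody"
    # A known user is sent as the name suffixed by the mapping domain.
    return "{}@{}".format(name, domain)
```

For example, with names_by_uid = {100: "alice"} and a mapping domain of
"example.com", uid 100 is sent as "alice@example.com", while an ephemeral
uid with no name mapping is sent as "nobody".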

15. Legacy Solaris Filesystems

Other than NFSv3, historic Solaris filesystems such as UFS will not be changed
to use FUIDs.  Instead, the VOP layer should be modified to transparently
convert ephemeral IDs to *ID_NOBODY as they are passed to historic filesystems.
This is similar to the approach taken by the original EFT project [A], where
the old 16-bit UFS inode uid and gid fields were left intact, and set to the
nobody values only when the new 32-bit fields contained IDs above 0xffff.
Code should be added to UFS to prevent it from retrieving a corrupt uid or
gid from on-disk inodes (i.e. one above INT_MAX), and convert it to *ID_NOBODY.
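
Both the VOP-layer conversion and the earlier EFT behavior are instances of
the same clamp, sketched below.  The function names are hypothetical, and
60001 is used for both nobody values as in the tables in Section 16.

```python
INT_MAX = 0x7fffffff
UID_NOBODY = 60001
GID_NOBODY = 60001


def clamp_id(value, limit, nobody):
    """Return value unchanged if it fits under limit, else the nobody ID."""
    return value if value <= limit else nobody


def legacy_owner(uid, gid):
    """IDs as a historic filesystem would see them below the VOP layer."""
    return (clamp_id(uid, INT_MAX, UID_NOBODY),
            clamp_id(gid, INT_MAX, GID_NOBODY))


def eft_short_owner(uid, gid):
    """The EFT analogue: what the old 16-bit inode fields would hold."""
    return (clamp_id(uid, 0xffff, UID_NOBODY),
            clamp_id(gid, 0xffff, GID_NOBODY))
```

The same legacy_owner() check could be applied when reading an on-disk inode,
so that a corrupt value above INT_MAX is reported as nobody.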

A particularly disgusting use of UIDs is the UFS quota database, used by
quotacheck(1M), which performs a linear search over the entire UID space.
Thankfully, since we're not proposing to extend UFS's effective UID range,
the quota tools and formats do not need to be changed and continue to
behave exactly as they do today (that is, perform really, really badly).

The consequence of this approach is that any credential that is associated
with an ephemeral ID cannot be stored in a historic filesystem unless POSIX
name mapping equivalence is established.  Since our local filesystem of choice,
ZFS, will support FUIDs, and this limitation can only arise if one deploys CIFS
on top of UFS with no name mapping, this seems a reasonable behavioral choice.

The other existing filesystem of interest is tmpfs(7FS), where conversion to
FUIDs is possible since there are no persistent meta-data issues to address.
However, since our only immediate need for Solaris is to support a CIFS server,
there is no pressing need to modify tmpfs: it can be treated like any other
existing filesystem.  When additional Solaris changes permit Active Directory
users without a reserved POSIX ID to authenticate, thereby establishing local
user processes with ephemeral IDs in their credentials, then it will likely
be necessary to modify tmpfs to support FUIDs so these processes can use /tmp.

16. Solaris Data Formats

Solaris has a number of data formats where user and group identifiers are
written to files.  Some, like utmpx(4) and wtmpx(4), already use names rather
than integer identifiers.  Others, like passwd(4) files, will have no need of
supporting ephemeral IDs, but should be modified to prevent their explicit use.

The most common use of uids and gids in files is in the Solaris archive
utilities tar(1), cpio(1), and pax(1).  All of these utilities store integer
IDs in their files, but with varying degrees of support depending on the
selected format.  The current behavior of these utilities in Solaris is
summarized as follows:

Utility and Format  UID Limit   UIDs Above Limit
------------------  ----------  ----------------
tar (ustar)         2097151     stored as 60001
tar -E (xustar)     INT_MAX     stored as 60001

pax -x cpio         262143      stored as 60001
pax -x pax          2097151     stored as 60001
pax -x ustar        2097151     stored as 60001
pax -x xustar       2097151     stored as 60001

cpio (default)      65535       stored as 60001
cpio -c             0xffffffff  no restrictions
cpio -H crc         0xffffffff  no restrictions
cpio -H odc         262143      stored as 60001
cpio -H tar         2097151     stored as 60001
cpio -H ustar       2097151     stored as 60001

Since archiving an ephemeral ID and attempting to restore it is definitely
a bad idea, the proposal is to leave the behavior of all the utilities above
alone, changing only perhaps the #defines or code comments for clarity.
The only exception is the behavior of cpio -c / -H crc, which I believe
should be changed to map values greater than INT_MAX to *ID_NOBODY.  This does
not seem to cause any incompatibility other than documentation, since it is by
definition impossible for anyone to have archived such a uid using cpio on
Solaris.
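
Most of the limits listed above follow directly from the width of the uid
field in each header format, which the sketch below makes explicit.  The
format keys and the archived_uid() helper are hypothetical; the field widths
(7 octal digits for ustar, 6 octal digits for odc, a 16-bit field for binary
cpio, and 8 hexadecimal digits for cpio -c) are those of the standard header
layouts.  The clamp_ephemeral flag models the proposed change for cpio -c and
-H crc.

```python
INT_MAX = 0x7fffffff
UID_NOBODY = 60001

FIELD_LIMITS = {
    "ustar": 8 ** 7 - 1,          # 7 octal digits  -> 2097151
    "odc": 8 ** 6 - 1,            # 6 octal digits  -> 262143
    "cpio-binary": 2 ** 16 - 1,   # 16-bit field    -> 65535
    "cpio-c": 16 ** 8 - 1,        # 8 hex digits    -> 0xffffffff
}


def archived_uid(uid, fmt, clamp_ephemeral=True):
    """Return the uid an archiver would write for the given format."""
    limit = FIELD_LIMITS[fmt]
    if clamp_ephemeral:
        # Proposed behavior: never archive a value above INT_MAX.
        limit = min(limit, INT_MAX)
    return uid if uid <= limit else UID_NOBODY
```

With clamp_ephemeral=False the function reproduces the current cpio -c
behavior, in which a value such as 0x80000000 would be archived unchanged.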

Therefore, similar to historic filesystems, archive utilities will work
compatibly on all new Solaris systems with POSIX IDs only, will work properly
when POSIX name mappings exist, and will archive "nobody" for ephemeral IDs.
The suggestion, as described earlier in Section 8, is to extend FUID-based
filesystems with a set of extended attributes to report the true SID, and
to perhaps use the attributes interface to perform an SID-based chown.  This
technique would permit tar, cpio, and pax to function properly for filesystems
that support SIDs without the continuing need to modify their source code
for future changes to our POSIX UID space.  Given that the majority of the
formats described above don't even support up to the current INT_MAX, the
limitation on behavior with respect to ephemeral IDs seems very reasonable.

Solaris tar in extended mode (-E) also includes a feature which records the
user name and group name of each file or directory as a string, and will
attempt, prior to performing a chown on extract, to re-compute a new uid or
gid by looking up the original name in the name service.  This feature will
remain and work unmodified, but only for POSIX identifiers or SIDs that can
be successfully mapped to names by the name service, and speaks to the need to
get away from UIDs in tar files.  GNU tar also provides the same behavior.
In terms of the rest of its behavior, GNU tar has a more extensive set of
formats than Solaris tar.  The limits and behaviors of GNU tar are:

Format  UID Limit
------  ---------
gnu     1.8e19
oldgnu  1.8e19
v7      2097151
ustar   2097151
posix   Unlimited

with "posix" referring to the POSIX.1-2001 tar format specification (used
by the Solaris pax(1) utility).  In the current "gnu" format, a two's-
complement base-256 encoding is used for large uid values and those that are
negative if the system uid_t is signed.  Therefore, GNU tar will function
properly when deployed on a Solaris system with unsigned 32-bit uids and gids,
and it will properly handle values above INT_MAX if name mappings exist for
them.  The only case of concern is when an ephemeral ID is archived using
the gnu or posix formats: the result will be to correctly capture the current
ID, but upon extraction by root, the chown() may fail with EINVAL (by default,
archivers always continue to extract on such errors, but return non-zero).
This issue can only arise when GNU tar is used as an archiver of files when
CIFS and AD are configured on a system, so again this should be addressed
as part of our documentation for administrators on these features.  Sun should
also provide a patch or suggested change to the GNU tar developers.
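
The extraction failure described above can be illustrated with a small model.
This is a sketch under stated assumptions, not GNU tar's code: chown_errno()
imitates the proposed rule that an unmapped ephemeral ID yields EINVAL, and
extract() imitates an archiver that keeps going after such an error but
reports a non-zero status.

```python
import errno

INT_MAX = 0x7fffffff


def chown_errno(uid, mapped_ephemeral):
    """Model of the proposed chown() check: 0 on success, else an errno."""
    if uid <= INT_MAX or uid in mapped_ephemeral:
        return 0
    return errno.EINVAL


def extract(entries, mapped_ephemeral):
    """Apply ownership for each (path, uid) entry, continuing past errors."""
    owners = {}      # path -> uid for every chown that succeeded
    failed = []      # (path, errno name) for every chown that failed
    for path, uid in entries:
        err = chown_errno(uid, mapped_ephemeral)
        if err:
            failed.append((path, errno.errorcode[err]))
            continue
        owners[path] = uid
    exit_status = 1 if failed else 0
    return owners, failed, exit_status
```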

Two other Solaris data formats that store UIDs and GIDs to persistent files
are the Extended Accounting (exacct) format of PSARC 1999/119, and the Solaris
audit(1M) trails.  As both of these file formats are extensible and under Sun's
control, PSARC cases should be filed as part of this work to extend them to
support SIDs in addition to POSIX IDs.  Given the use of these data formats as
long-term archives with billing and security implications, the use of integer
IDs in these files was already dubious since the interpretation would require a
correct connection to some persistent external name service.  These are
examples of where Solaris conversion to SIDs as the underlying ID makes sense.

The legacy Solaris SVR4 accounting file /var/adm/pacct also stores 32-bit
uid_t values; this can be examined using the lastcomm(1) utility.  The existing
behavior of lastcomm(1) is to call getpwuid(3C) on its saved values, and report
either the corresponding name or the uid value printf'd as %ld.  Since we're
not increasing the size of the type, no file format incompatibility is created.
We could either leave lastcomm(1) alone but change %ld to %lu, meaning that
ephemeral IDs would simply be printed as unknown integers (as it would today
if say, the file itself was damaged and a uid value above INT_MAX were
retrieved from the filesystem), or we could change it to report *ID_NOBODY.
In either case, it seems pointless to extend the SVR4 struct acct for SIDs.
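
The difference between the two printf conversions is simply whether the saved
32-bit value is read as signed or unsigned, as the following sketch shows.
The format_saved_uid() helper is hypothetical and assumes a 32-bit value.

```python
def format_saved_uid(value, unsigned=True):
    """Format a 32-bit uid the way %lu (unsigned) or %ld (signed) would."""
    value &= 0xffffffff
    if not unsigned and value & 0x80000000:
        # Reinterpret the same bit pattern as a negative signed integer.
        value -= 1 << 32
    return str(value)
```

An ephemeral ID such as 0x80000001 therefore prints as 2147483649 with the
unsigned conversion, but as -2147483647 with the signed one.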

17. Backup Formats

Since ephemeral IDs only exist on the system when CIFS is deployed without
POSIX name mapping equivalence, and such IDs cannot be stored in existing
filesystems anyway, there is no incompatibility with existing backup software.
The only backup issues arise when trying to back up ZFS with SIDs.  ZFS already
provides its own archival format by virtue of zfs(1M) send/recv; this format
would be extended to support the ZFS FUID representation as part of this work.
Furthermore, ZFS already introduces a number of novel concepts that must be
coped with by backup software, such as extensible attributes and properties.
The ZFS team should therefore discuss SIDs and FUIDs as part of its ongoing
work on an appropriate backup software strategy for ZFS and Solaris.

Sun is also implementing an NDMP server and including this with Solaris and
shortly in OpenSolaris; NDMP is the standard protocol for backup control [D],
and can be used with any other form of backup data format (e.g. tar, cpio).
NDMP includes only one use of POSIX uids and gids, which is in the ndmp_file
structure that forms part of the NDMP File History interface (see [D]).
The file history interface is in effect only a performance optimization,
permitting one to see a history of archived files and quickly seek a tape
drive to the appropriate location of the start of some chunk of files.
The ndmp_file definition already specifies uid and gid as unsigned 32-bit
values, and with respect to CIFS compatibility section 4.3.1 of the spec says:

owner: File owner identifier. uid SHOULD be used for UNIX file system type.
This field is undefined for NT file system type. 

group: File group identifier. gid SHOULD be used for UNIX file system type.
This field is undefined for NT file system type. 

Therefore, the proposal is that we insert *ID_NOBODY tokens into these fields
when our NDMP server generates file history for any ephemeral IDs.

18. Zones Integration

Solaris Zones provide a lightweight virtualization environment that includes
virtualization of the Solaris name service switch configuration. That is, a
local zone may have its own nsswitch.conf(4) settings indicating an entirely
different name server, name service, or name service prioritization.  As such
it is already the case that POSIX identifiers do not necessarily hold the same
meaning across disparate Zones in that one zone might assign a given uid_t
value one identity in its own passwd(4) file and another zone might see a
different identity for that uid_t based upon a NIS or LDAP directory.
Consequently, Solaris will need to evaluate identity mapping rules for
non-POSIX identities differently in each zone, and the mapping of an SID to a
POSIX uid_t or ephemeral uid_t will therefore vary across zones.  Finally, a
Zone can use the BrandX technology to provide an entirely different identity
service for another OS personality.  Therefore, each Solaris Zone should have
its own instance of the ID mapping service, and maintain its own notion of
ephemeral uid_t and gid_t values.
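
A per-zone mapping service means that the same SID, and the same ephemeral
value, can mean different things in different zones.  The sketch below uses a
hypothetical ZoneIdMap class, one instance per zone, and assumes that the
ephemeral range starts at 0x80000000.

```python
from itertools import count

EPHEMERAL_MIN = 0x80000000   # assumed start of the ephemeral range


class ZoneIdMap:
    """Hypothetical per-zone instance of the ID mapping service."""

    def __init__(self, name_map=None):
        self.name_map = dict(name_map or {})   # this zone's name mappings
        self.ephemeral = {}                    # this zone's ephemeral IDs
        self._next = count(EPHEMERAL_MIN)

    def sid_to_uid(self, sid):
        """Map an SID using only this zone's rules and ephemeral state."""
        if sid in self.name_map:
            return self.name_map[sid]
        if sid not in self.ephemeral:
            self.ephemeral[sid] = next(self._next)
        return self.ephemeral[sid]
```

If one zone has a name mapping for an SID and another does not, the first
zone returns a POSIX uid while the second allocates an ephemeral one, and the
ephemeral values handed out in one zone carry no meaning in any other.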

19. Case Summary and Next Steps

This document describes a technical strategy for unified representation of
Windows and POSIX credentials in Solaris.  This document is intended to be
approved by the ARC as a strategy for addressing the underlying problems
described here, and thereby provide the basis for subsequent interface review
of the Solaris projects that will define the articulation of this strategy.

The proposal to the ARC is that the approval of this case corresponds to the
following set of strategy and interface decisions:

(a) That the types of uid_t and gid_t will be changed to unsigned 32-bit int,
    and the addressable range of the types will be extended to 0xfffffffe.

(b) That the UID and GID spaces will be partitioned into a range used for
    traditional POSIX identifiers and a range used for ephemeral mappings.

(c) That present Solaris filesystems (Section 15) and data formats (Section 16)
    will be modified such that ephemeral IDs will be mapped to *ID_NOBODY.

(d) That a global ID mapping service will be implemented to provide POSIX
    identifier mappings for SIDs and meet all the requirements of Section 11.

(e) That the set*id() and chown() system calls will report EINVAL when an
    unmapped ephemeral ID argument is specified as a POSIX user or group ID.

The following pending ARC cases will therefore be reviewed in conjunction
with the set of decisions described as part of this case:

PSARC 2006/315 Winchester: Schema Mapping and ID Mapping for AD Interoperability
PSARC 2006/715 CIFS Service
PSARC 2006/719 NDMP service

And one or more ARC cases will be brought to specify the additional project-
specific changes necessary to complete the articulation of this strategy:

(a) Interface changes and additions for cred_t and ucred_t, including functions
    to establish, retrieve, and validate SID values associated with credentials
    and appropriate observability (proc(4) ucred file, pcred(1), DTrace).

(b) File format extensions for Solaris exacct and auditing to record, extract,
    and format SIDs as part of these persistent data formats.

(c) ZFS changes to support ownership and ACEs that contain SIDs using the
    FUID representation described in Section 8.

(d) APIs to examine and modify a set of extensible system attributes, including
    SIDs, for files, and changes to the archive utilities to support them.

20. Acknowledgements

Early drafts of this proposal were reviewed by Matthew Ahrens, Jeff Bonwick,
Bryan Cantrill, Don Cragun, Casper Dik, Brendan Gregg, Adam Leventhal,
Tim Marsland, Eric Schrock, Mark Shellenbaum, Spencer Shepler, Glenn Skinner,
Keith Wesolowski, Nico Williams, Gary Winiger, and Alan Wright.  I am indebted
to all of them for taking the time to do so and providing many useful comments.

21. References

Overview of Windows SIDs and Active Directory:

[1] Well-Known Windows Security Identifiers (SIDs)
    http://www.microsoft.com/technet/prodtechnol/\
    windows2000serv/reskit/distrib/dsfe_sid_yokv.mspx?mfr=true

[2] Active Directory Architecture
    http://www.microsoft.com/technet/prodtechnol/\
    windows2000serv/technologies/activedirectory/deploy/projplan/adarch.mspx

[3] Microsoft Windows Services for UNIX (SFU)
    http://www.microsoft.com/technet/interopmigration/unix/sfu/default.mspx

Samba Technical Documentation regarding UID Mapping:

[4] Security Identifier / User Identifier Resolution System (Internet Draft)
    http://www.cb1.com/~lkcl/cifs/draft-lkcl-sidtouidmap-00.html

[5] Samba HowTo: Chapter 14. Identity Mapping (IDMAP)
    http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/idmapper.html

[6] Samba HowTo: Chapter 3. Server Types and Security Modes
    http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/ServerType.html

NetApp Technical Papers regarding NFS, CIFS, and UID Mapping:

[7] Merging NT and UNIX Filesystem Permissions
    http://www.usenix.org/publications/library/\
    proceedings/usenix-nt98/full_papers/hitz/hitz.pdf 

[8] NetApp Storage System Multiprotocol Use Guide
    http://www.netapp.com/library/tr/3490.pdf

[9] Multiprotocol Data Access: NFS, CIFS, and HTTP
    http://www.netapp.com/library/tr/3014.pdf

Sun PSARC cases for UIDs and UID Mapping:

[A] PSARC 1995/334 Large uids and gids
[B] PSARC 1998/335 UID/GID Mapping for NFS
[C] PSARC 2004/592 nfsmapid extension for UID/GID mapping

NDMPv4 Specification:

[D] NDMP Version 4 Protocol (Internet Draft)
    http://www.ndmp.org/download/sdk_v4/draft-skardal-ndmp4-04.txt

Other References:

[E] MacOS X Server User Management, Second Edition, Appendix B (pg 239)
    http://images.apple.com/server/pdfs/User_Management_Admin_v10.4B.pdf