Unified POSIX and Windows Credentials for Solaris
Mike Shapiro (mws@sun.com)
Draft 5 1Nov07

0. Contents

1. Introduction
2. Problem Overview
3. Analysis of Samba ID Mapping
   (a) Name Mapping
   (b) Algorithmic Mapping
   (c) Centralized Identity
4. Analysis of NetApp ID Mapping
5. Proposed Solution Constraints
6. Proposed Solution for Solaris
7. Canonical SID Representation
8. Filesystem SID Representation
9. Credential SID Representation
10. POSIX ID Partition and Mapping
11. ID Mapping Service
12. CIFS Implementation
13. NFSv3 Implementation
14. NFSv4 Implementation
15. Legacy Solaris Filesystems
16. Solaris Data Formats
17. Backup Formats
18. Zones Integration
19. Case Summary and Next Steps
20. Acknowledgements
21. References

1. Introduction

As part of Sun's effort to enhance the capabilities of the OpenSolaris system, the Sun Software organization is working to integrate support for CIFS, based on the stack acquired from Procom, into Solaris. The Procom CIFS stack previously ran on Procom's operating system, code-named Montana, currently the basis of the Sun StorageTek 5310 and 5320 NAS products. In order to port a CIFS stack to Solaris, and more generally provide support for Windows-style RPC services, Solaris must also provide support for Windows user credentials, which have a significantly different design than traditional POSIX uids/gids.

The objective of this paper is to discuss the underlying design issue in terms of our long-term strategic objectives for Solaris, and to make recommendations for a set of technical changes and enhancements that will span the CIFS effort, underlying filesystems (particularly ZFS), authentication and ID mapping services, and kernel and user credential representation, in order to make the CIFS effort successful in CY07 and provide a strategic foundation for future work.

This paper also describes the history and details of the approaches to this problem adopted by the two most successful technologies existing in the market: NetApp's filers, and the Samba OpenSource project, in order to validate our approach and compare it to these existing products. This paper stops short of actually specifying the complete project-level details of all of the changes; these details should be specified in subsequent PSARC cases. A summary of the set of decisions corresponding to this case and details that should be covered in subsequent cases is found in Section 19.

2. Problem Overview

Since the dawn of time, UNIX systems have represented users and groups using names stored in passwd(4) and group(4) that are converted to integer uid and gid values used in system call interfaces, the kernel, and in the filesystem. Although the size of the type has changed over time, the fundamental design remains the same: a single UNIX system views the set of user identifiers as a linear sequence of integers in some range, and the ownership of files is represented by storing the integer identifiers with the file. The system administrator is responsible for configuring the name service and nsswitch so that the appropriate set of identifiers are visible for uid/gid <-> name mapping, and to provide authentication services for establishing credentials.

However, the consequence of the UNIX/POSIX user identification model is that when data is moved between machines over the network, by archiving files and then restoring them, or by physically moving disk drives, the identity of the file owner is evaluated within the single integer namespace maintained by the name service of the destination machine.
That is, if I move a file with uid 12345 to a different machine (by NFSv3, which sends UIDs over the wire, by cpio(1) which stores UIDs in archives, or by moving a disk with a UFS or ZFS filesystem on it), then that file is now owned by whatever user has UID 12345 according to the new machine's name service configuration. If the system administrator there is using a different passwd(4) file or a different name service, then that data may now be owned by a semantically different user. The burden of preventing and/or solving this problem is left to administrators.

The UNIX/POSIX UID model also means that a single system and any persistent representation derived from it can only support a flat namespace of users of some fixed size. For large enterprises, this can become, and historically has become, cumbersome, leading to a series of painful incompatible changes to the size and limits associated with the uid_t and gid_t types. Here is a historical view of the size and legitimate values supported by uid_t on UNIX systems:

    6th Edition UNIX:        8-bit signed, no restrictions on values
    AT&T SVR3:               16-bit unsigned, no restrictions on values
    4.3 BSD:                 16-bit unsigned, no restrictions on values
    SunOS 4.1.3:             16-bit unsigned, values ranging from [0, 65533]
    Solaris 2.0:             32-bit signed, values ranging from [0, 60002]
    Solaris 2.5:             32-bit signed, values ranging from [0, 60002]
    Solaris 2.5.1:           32-bit signed, values ranging from [0, 0x7fffffff]
    Solaris 10:              32-bit signed, values ranging from [0, 0x7fffffff]
    Linux as of Jan 2007:    32-bit unsigned, no restrictions on values
    FreeBSD as of Jan 2007:  32-bit unsigned, no restrictions on values

In 1995, Sun went through the pain of extending UIDs up to 0x7fffffff [A]. In 1998, Sun received the first large-scale customer escalation (Boeing) of the need to map multiple large UID spaces on to the single UID space of a single NFS server [B].
Amazingly, this problem had been known about and considered as early as a decade before: the 1988 edition of the AT&T SVID includes a description of idload(RS_CMD), a command used by administrators to load a uid/gid translation table for Remote File Sharing into the system.

Today, Sun also wants to position Solaris and its Trusted Extensions as a way to meet modern security and SOX compliance requirements, where enterprises wish to assign single identities to employees to support one-click hire/fire and implement restrictions on and auditing of information visible to certain identities. And we want to support CIFS as a first-class citizen in Solaris.

In Windows, both for use in CIFS and elsewhere in the operating system, user credentials consist of Security Identifiers (SIDs) that represent users and groups. These are stored in the Windows equivalent of a kernel credential, in the Windows filesystem, and in filesystem data structures such as Access Control Entries (ACEs). A Windows SID typically looks something like this:

    S-1-5-12-7623811015-3361044348-030300820-1013

and decomposes into the following pieces:

    S    - The string is a SID
    1    - The revision level (1 is the only value in present use)
    5    - 48-bit identifier authority value (5 refers to "Windows NT")
    12-7623811015-3361044348-030300820
         - Identifier for domain or local computer
    1013 - 32-bit Relative ID (RID) within the previously described domain

In other words, an SID is a universally unique identifier for a user or group. (Some generic terms are UUID and GUID; I'll just use SID here.) As in the UNIX system, some well-known SIDs are defined, as shown in [1].

Every Windows system computes a universally unique identifier for itself, so it can allocate unique local user accounts, and when one or more Windows systems form an Active Directory domain, a universally unique identifier is allocated for the domain itself, from which Domain Controllers can allocate unique RIDs for new user accounts.
See [2] for a good, brief AD overview.

Therefore, unlike UNIX systems, when data is moved between Windows systems, the ownership attributes and ACEs retain their semantic meaning, regardless of the Active Directory configuration of the destination system. If data is moved to a system that is not participating in the original AD domain, then only the administrator can access the data, or perform the Windows equivalent of a chown to change ownership to one of the users in the new domain. Furthermore, the system effectively supports an infinite number of unique users, as long as they are in sets of 4 billion per system or per AD domain.

Finally, in order to support CIFS, NFS, and POSIX system call semantics, Solaris must address the issue of providing a mapping between Windows SIDs and POSIX UIDs and GIDs. That is, if a CIFS session is initiated and creates a file on a ZFS filesystem, that file's attributes should be able to be read and written using the POSIX identity model, including local stat(2), access over NFS, and so forth. Similarly, files created locally or over NFS using POSIX credentials should be accessible to remote clients using CIFS.

Three well-known solutions to this problem exist in the industry:

(a) Storing SIDs in the filesystem, and providing a mapping of local SIDs on the server to UIDs exported to remote UNIX clients, which is the approach taken by Microsoft Windows as part of its Windows Services for UNIX (SFU).

(b) Storing UIDs in the filesystem, and providing a mapping of remote SIDs to local UIDs for use in the normal UNIX stack, which is the approach taken by the OpenSource Samba software (see [4], [5], and [6]). Sun's Winchester project, targeting Solaris Nevada, proposes a similar approach (see opensolaris.org/os/project/winchester); we'll return to the details of the Winchester project later in this document in the Proposal section.

(c) Storing both UIDs and SIDs in the filesystem and/or a supplementary cache, and managing a set of mappings between them for use in the normal UNIX stack and for NFS, which is the approach taken by NetApp in WAFL / OnTAP (see [7], [8], and [9] for the evolution of NetApp's approach here). Apple has in effect adopted a similar approach in MacOS X 10.4, using 128-bit GUIDs for user identity, and then using participation in a directory service to locate POSIX or Windows credentials (see [E]).

Option (a) alone isn't a valid choice for Solaris since we must also address issues of backward compatibility and we must support local POSIX system calls. The rest of this document discusses the details and tradeoffs implied by the existing approaches (b) and (c) and suggests a strategic direction for Solaris.

3. Analysis of Samba ID Mapping

Samba (www.samba.org) is an OpenSource SMB server that provides Windows clients access to UNIX servers and vice-versa, both for files and for other services including printer sharing etc. Samba's design goal / constraint was to operate on top of a variety of existing UNIX systems, running as a set of userland daemons and services, without needing to modify existing kernels. This design is certainly to be credited in terms of Samba's widespread use.

However, one consequence is that Samba is by definition constrained to the uid/gid model presented by the UNIX system on which it is deployed. That is, uid values must conform to the range supported by the underlying kernel, and all filesystems, data formats, and UNIX network services on the system remain unchanged and follow the POSIX rather than Windows design. Once data is written to a filesystem, it contains a POSIX uid/gid pair only: if data is moved to a different system, the POSIX rule of relying on administrative configuration to preserve identity semantics still applies.

In order to support Windows SIDs on POSIX systems, Samba relies on the UNIX administrator to configure a mapping service that converts incoming SIDs to POSIX uids/gids and vice-versa. Luke Leighton described Samba's approach to the problem in a now-expired 1999 Internet Draft [4]. Today's Samba offers administrators a wide variety of approaches to the mapping problem; the full details are found in [5] and [6]. These details can be condensed to the following basic set of approaches to the mapping problem:

(a) Name Mapping

Administrators can create name equivalence between local Windows or AD users and users described by a UNIX name service backend. For example, an admin could configure both Windows AD and UNIX LDAP or NIS to contain the complete set of user names for a given enterprise, and then configure Samba so that when Windows user SUN\mws establishes a session on a UNIX server, Samba looks up the UNIX user "mws" in passwd(4) (via the LDAP or NIS server) to get a UID.

Name Mapping can also be extended by a set of static rewriting rules; both Samba and NetApp (discussed in Section 4) support static rewriting for (a). Static rewriting rules are useful when users have different names across the enterprise, due to disparate naming conventions, such as when companies merge. For example, a static rewriting rule could establish that Windows user "SUN\MikeS" is in fact the same user as NIS user "mws", and then Samba will apply this static rewriting prior to performing the UNIX name service lookup.

(b) Algorithmic Mapping

Administrators can manually partition the POSIX UID space by creating a set of algorithmic mapping rules for SIDs (based on the encoded RID) to a portion of the POSIX UID space. For example, an administrator can configure:

    idmap backend = idmap_rid:SUN=70000-80000

indicating that the "SUN" domain should be mapped to UIDs [70,000-80,000].
When Samba encounters the SID S-1-5-21-34567898-12529001-32973135-1234 from this domain, the resulting POSIX UID will be 70000 + 1234 = 71234.

(c) Centralized Identity

Since Windows AD is based on an LDAP directory, administrators can use AD as the single name service for an entire identity domain, and supplement the LDAP directory with information that configures the ID mapping service. Examples of various Samba LDAP configurations to do this are shown in [5]; the basic idea is to provide information similar to (b) in the LDAP directory itself. Another obvious approach here would be to define a standard for storing POSIX UIDs directly in AD (i.e. accompany Windows accounts with RFC 2307 attributes) and leave it up to administrators to create a single centralized user identity.

One Centralized Identity scheme in existence is Microsoft's Windows Services for UNIX (SFU) [3]. Prior to Windows Server 2003 R2, SFU was an add-on product and it included an LDAP schema extension with properties for UNIX accounts in AD including "msSFU30UidNumber" and "msSFU30GidNumber". With Windows 2003 R2, SFU became part of the base product, and Windows user objects in AD *may* have a posixAccount object attached to them which includes a renamed "uidNumber" member. However, posixAccounts are not generated by default on Windows systems so this cannot be relied upon, and the numbers used are unique only to the current AD domain. Microsoft's intent seems to be to provide posixAccount only when an admin chooses to slurp existing UNIX NIS or LDAP into AD.

Although it is clear that many customers have accepted Samba's administrative model, there are some obvious drawbacks associated with Samba:

(i) POSIX semantics, rather than Windows semantics, apply to data owned by UNIX servers. If files or filesystems are moved to a different server, their identity is only the POSIX identity, subject to the name service configuration and Samba configuration of the destination system, which may be different.

(ii) Windows clients cannot utilize a UNIX file server unless an administrator decides upon and successfully configures the identity mapping solution. For approaches (a) or (c), this requires changes to domain-wide identity services, as well as changes to procedures and/or tools for creating user accounts. This may be easy for large, well-planned UNIX installations, but annoying for startups trying to quickly assemble a mix of Windows and UNIX systems.

(iii) In situations where disparate Windows and UNIX identity domains must co-exist (e.g. a Windows shop and a UNIX shop just became the same company), the lowest-impact choice is Algorithmic Mapping (b), but this approach leaves administrators with the disgusting problem of managing the integer UID space.

Algorithmic Mapping also has several nasty edge conditions, all of which are implicitly left to the administrator to avoid or remedy:

- A UNIX administrator could erroneously allocate a POSIX UID within a range established for SIDs. This would result in two users owning the same data.

- A Windows domain could grow to exceed its algorithmic set boundary (this is likely in large companies because a 32-bit space for each SID domain is being mapped to a single 32-bit or smaller POSIX space). It then may not be possible to grow the set boundary without conflicting with other existing POSIX IDs. Such a conflict could only be resolved by renumbering POSIX IDs, which would also imply renumbering (chowning) all POSIX data files as well.

- If a large number of Windows domains must be mapped into the POSIX space, it will become increasingly complicated and error-prone to partition uid_t.

Samba has other drawbacks which have prevented NFS + Samba from seriously competing as an enterprise multi-protocol file service solution, most notably performance.
These issues are outside the scope of this document, but are related to Sun's strategic decision to port Procom's CIFS stack to Solaris, and make this technology more broadly available to the OpenSolaris community.

4. Analysis of NetApp ID Mapping

NetApp is a current industry player in multi-protocol file servers, and addresses the ID mapping problem in their OnTAP operating system and WAFL file system. NetApp's original approach to the problem was outlined in a 1998 USENIX paper by Dave Hitz et al [7], and has since been refined in an evolutionary fashion; the complete latest details are found in [8] and [9].

In order to solve some of the problems described earlier in Section 3, NetApp added support for both SIDs and POSIX IDs to the WAFL filesystem, and also implemented an ID mapping mechanism similar to the Samba idmap component. WAFL filesystems are divided into subtrees called "qtrees", and an explicit file security style is assigned to each volume and qtree that it contains: either "unix", "ntfs", or "mixed" to support multi-protocol access.

Although NetApp's source code is private, my suspicion is that the security style affects not only security behavior but also on-disk format. One public paper mentions an "extended inode" structure, so it seems likely that based on the security style different extended inode formats are used so as to avoid the overhead of storing SIDs in inodes and ACEs in a UNIX-style qtree.

NetApp implements Name Mapping (a) only, including both dynamic lookups (by default) and static rewriting rules (in a usermap.cfg(5) file), but its use and behavior varies depending on the style associated with a given qtree in the filesystem. The behaviors are:

- UNIX clients accessing UNIX qtrees use POSIX semantics as usual. POSIX UID and GID values are stored in the filesystem and returned over NFSv3.

- UNIX clients accessing NTFS qtrees invoke the ID mapping mechanism to obtain a Windows SID.
  If none can be found, the client is either denied access or assigned a global, configurable Windows identity (wafl.default_nt_user). A credential cache called the "WCC" caches this mapping for performance.

- CIFS clients accessing NTFS qtrees use SID semantics as usual. SIDs are stored in the filesystem and returned over SMB. A POSIX ID mapping is also stored in the filesystem, but a global mapping service does not need to have been configured; instead a default UID is assigned (wafl.default_unix_user).

- CIFS clients accessing UNIX qtrees invoke the ID mapping mechanism to obtain a POSIX ID. If none can be found, the client is either denied access or assigned a global, configurable POSIX identity (wafl.default_unix_user).

Mixed-mode functions as a kind of "last-touch" security policy; the complete details are discussed in [8]. Fundamentally only one security policy, unix or ntfs, is in place at a time for a given file or directory.

A significant aspect of NetApp's design is that there is no persistent mapping of POSIX IDs and SIDs that is maintained on the filer or in its filesystems. Namely, the WAFL Credential Cache (WCC) is an in-memory database that only caches POSIX ID-to-SID mappings for performance. SID-to-POSIX ID mappings are only performed at the time an SMB session is established (the moral equivalent of a UNIX "nfs mount" operation), and are saved with the session.

Since the administrator must declare their intended security policy a priori, filesystems store POSIX IDs or SIDs as requested. If POSIX IDs are stored, then the POSIX identity design constraints apply when moving data around, and administrators are again responsible for consistency of name services. This design has important consequences for the data replication features offered by NetApp: no auxiliary, global data must be moved with a filesystem.

Although clearly customers have accepted NetApp's approach (in part because they are an entrenched market leader), there are drawbacks to their design:

(i) Administrators must pre-declare their intended security policy for a given qtree. If unix style is selected, Windows SID identities are in effect lost or subject to the same security and identity drawbacks as in (3i) earlier.

(ii) Administrators must pre-declare the type of data sharing between Windows and UNIX users that will occur, which in effect must match the security policy. And it isn't necessarily easy for true data sharing to occur: one NetApp user told me that configuring the mixed style was "insane and not recommended."

(iii) It leads to an overly complicated filesystem implementation, in that the filesystem is managing different on-disk structures and security modes.

As a result, it isn't necessarily trivial for Windows and UNIX clients to successfully begin sharing files in a common location: if name mappings have not yet been established, one set of clients either has no access or has all of its file creates and updates assigned to a global default user identifier. In addition, the data namespace itself must effectively reflect its degree of protocol sharing (which may of course change over time and unexpectedly), as opposed to reflecting only the customer's semantic organization of their data.

5. Proposed Solution Constraints

Before describing the details of the proposed solution, it's worth stating my proposed set of constraints on any solution appropriate for Solaris, both in terms of our approach to general-purpose Solaris, and our approach to OpenSolaris as the basis for servers providing storage services:

(a) CIFS clients should be able to read and write data to our CIFS server and obtain true CIFS identity semantics. That is, we should not rely on external POSIX name service configuration to preserve data identities as in (3i).

(b) Windows clients should be able to access a Solaris server that has access to an AD name service without requiring admins to also set up a network-wide SID/UID equivalence relationship. Such a requirement seems extremely costly for a large Windows shop and large UNIX shop that just merged with each other, as well as for a startup trying to rapidly evolve a heterogeneous environment.

(c) The mapping mechanism used for SID/UID mapping should not require the use of persistent global data that must be propagated as part of data migration. If such global data existed, it would significantly complicate basic data migration such as zfs(1M) send/recv, as well as what can be built on top.

(d) We should avoid to the maximum extent possible any sort of unnecessary complexity for administrators, especially if it is prone to human error and/or security consequences. Integer UID namespace partitioning is undesirable for this reason. So is partitioning the data namespace according to protocol.

(e) We need to provide appropriate compatibility for existing Solaris data and administrative configurations, at no significant penalty of space or time. Despite the drawbacks of the POSIX ID model for data, we can't simply decree that a new Solaris system will no longer adhere to that model, especially if it makes no use of either CIFS or Active Directory for data or identity.

6. Proposed Solution for Solaris

The solution proposed for Solaris is fundamentally to do the following:

(a) Modify ZFS to support SIDs directly in the filesystem, using an encoding that can be generalized to other forms of SIDs, generalized to other on-disk filesystems should that be required, and efficiently encode POSIX IDs.

(b) Modify the kernel to support SIDs as part of credentials (cred_t, ucred_t) so that the new Solaris CIFS server can establish such credentials in a generic fashion and have them be passed through the VOP layer to ZFS.

(c) Deliver an ID mapping service to perform POSIX ID <-> SID mapping, and make this available to both user and kernel clients (i.e. via door upcall). This service is the Winchester project, with some minor modifications.

(d) Change Solaris uid_t and gid_t to be unsigned 32-bit types, and partition the ID space into half reserved for standard POSIX identifiers (the current range supported by Solaris, 0-0x7fffffff), and half reserved for ephemeral mappings associated with SIDs (the new range 0x80000000-0xfffffffe).

Item (d) is obviously where the dish gets a little spicy, so we'll explore the consequence of that design approach with respect to existing Solaris in some detail in the remainder of this document.

That aside for the moment, the major design change proposed here is that Solaris adopt the notion of SIDs (or some alternative generic name we give to them such as "ZID" for "the last damn ID encoding format we're ever going to introduce into this operating system"). My belief is that the adoption of SIDs has long-term strategic benefits for Solaris beyond the successful integration of a CIFS server:

(i) It provides the strategic foundation for a much larger class of services where OpenSolaris can function as a server for a set of Windows clients.

(ii) It provides true single unique identities in the operating system instead of the present POSIX semantics, thereby improving our strategic foundation for addressing growing technology needs for Security and Compliance.

(iii) It provides the ability for Solaris filesystems to store an effectively infinite number of identities associated with data, without the need to revise or break on-disk data formats when a fixed limit is exceeded.

(iv) It provides the basis for supporting an effectively infinite number of unique identities on the system, as opposed to a predefined fixed limit.

7. Canonical SID Representation

The canonical representation for SIDs should be the canonical representation used by Windows, namely a printable string consisting of a letter to indicate the SID format, a digit to indicate its version, and then groups of digits. Windows often keeps SIDs in binary form; we should avoid this at all costs.

Solaris code that validates and/or establishes SIDs should be written to verify the basic form of the SID string, but SHOULD NOT encode the set of format characters and/or known version values. In other words, we should explicitly permit the handling of subsequent versions of the SID format, such as S-2-*, without the need to re-issue or patch the Solaris kernel or user libraries.

We should also explicitly permit the deployment of entirely alternate SID forms without the need to re-issue or patch the Solaris kernel or user libraries. One approach to solving this problem would be to introduce a new SID encoding; i.e. the encoding P-1-x- could be defined by Solaris and reserved as the canonical representation of a standard POSIX identifier with the semantic that it is resolved according to the current name service. The authority values 1 and 2 should be reserved for POSIX uid and POSIX gid respectively (that is, P-1-1-X is uid_t X and P-1-2-X is gid_t X).

Another approach would be to allocate additional identifier authority values within the Windows SID space for POSIX IDs or any additional SID forms, e.g. S-1-123 where 123 is defined to mean POSIX, in conjunction with efforts towards IETF standardization of the form and/or agreement from Microsoft. In late 2006, Samba 3.0.23c adopted a similar approach, mapping POSIX IDs to the SID families S-1-22-1- and S-1-22-2-.

Alternate SID forms could also be used to represent generic GUIDs on systems where globally unique unstructured identifiers are assigned to users.
Recently in MacOS X 10.4 Apple has implemented GUIDs as an underlying unique identifier form, storing POSIX ownership in files, GUIDs in ACEs, and providing a mapping from GUIDs to SIDs when participating in an Active Directory. Further details are available in [E]. GUIDs could be represented either as G-1--0, i.e. using the GUID itself as the domain, or by computing a corresponding local SID in the manner of Apple's implementation and then assigning it an ephemeral ID.

Common user and kernel APIs that store and retrieve SIDs SHOULD NOT make use of fixed-size buffers as part of any API definition. User APIs to retrieve SIDs should either return a pointer to arbitrary-sized data as a string, or should provide a run-time API call to retrieve the size of a given SID in order to allocate space dynamically prior to retrieving the SID data itself.

Sun should most likely pursue an IETF Internet Standard for SID representation and the current set of valid SID generation and encoding formats, independent of use in any particular operating system, data format, or network protocol. This can be done in parallel with but not gating our use of SIDs in Solaris. Similarly, Sun could pursue standardizing its new APIs through IEEE or POSIX.

The canonical SID representation should provide these minimal guarantees:

(a) The SID is composed of groups of octets delimited by a hyphen ("-"). Octets should be printed using the escape syntax of RFC 2396 Section 2. This parses existing Windows SIDs while providing clarity for extensions.

(b) There must be at least four groups of octets: (1) format, (2) version, (3) authority, (4) relative identifier. If an SID is comprised of 1-N octet groups with N > 4, groups [4, N-1] indicate the domain identifier.

(c) The format and authority groups can consist of any characters, following the RFC 2396 escape sequence. In particular, we should permit authorities that are composed of strings rather than integers.
For example, it may be useful to support something like "P-1-3-sun.com-12345" as an SID encoding of the AUTH_DES (aka AUTH_DH) identity "unix.12345@sun.com".

(d) The version group indicates the major version of the specified format. This group must consist solely of the ASCII characters [0-9] and represent the decimal integer major version number, which can be incremented when incompatible changes to the format encoding need to occur. The version number should be limited to fit in a 32-bit unsigned integer.

(e) The final octet group indicates the relative identifier within the specified identity domain. This group must consist solely of the ASCII characters [0-9] and represent a decimal integer indicating the RID. Windows RIDs are currently limited to 32-bit unsigned values, although I propose that the Solaris implementation extend this limit to 64-bit.

8. Filesystem SID Representation

To preserve efficiency of space in a filesystem, we can observe that the number of file meta-data structures (e.g. ZFS znodes) and the number of on-disk ACEs will far exceed the number of distinct SID domains that are likely to be seen on a given system. Therefore it will be desirable to establish a more compact representation of SIDs in the filesystem than an arbitrary length byte string.

The proposal is to establish a Solaris convention for filesystem representation that we will refer to as a FUID (Filesystem Unique Identifier); FUIDs can be used to represent users, groups, or anything else that can be named by an SID. FUIDs will be represented by convention as unsigned 64-bit integer values, with the upper 32-bits serving as an index into an auxiliary table of domain identifiers, and the lower 32-bits serving as a relative identifier within that domain. The upper 32-bit index of all zeroes will be reserved to indicate that the FUID refers to a 32-bit standard POSIX UID or GID identifier.

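The FUID layout just described can be sketched in a few lines of Python. This is only an illustration of the bit layout under the convention stated above, not a proposed Solaris or ZFS interface: the function names (fuid_encode, fuid_index, fuid_rid, fuid_is_posix) and the constant names are hypothetical and chosen for this example only.

```python
# Illustrative sketch of the 64-bit FUID layout: the upper 32 bits are an
# index into a per-filesystem domain table, the lower 32 bits are a relative
# identifier. All names here are hypothetical, not an existing API.

FUID_LOW_MASK = 0xFFFFFFFF   # mask for the lower 32 bits (relative identifier)
FUID_INDEX_SHIFT = 32        # the domain table index occupies the upper 32 bits
FUID_POSIX_INDEX = 0         # index of all zeroes: plain POSIX UID or GID


def fuid_encode(index, rid):
    """Pack a domain table index and a 32-bit relative identifier."""
    if not 0 <= index <= FUID_LOW_MASK:
        raise ValueError("domain index does not fit in 32 bits")
    if not 0 <= rid <= FUID_LOW_MASK:
        raise ValueError("relative identifier does not fit in 32 bits")
    return (index << FUID_INDEX_SHIFT) | rid


def fuid_index(fuid):
    """Return the domain table index stored in the upper 32 bits."""
    return (fuid >> FUID_INDEX_SHIFT) & FUID_LOW_MASK


def fuid_rid(fuid):
    """Return the relative identifier stored in the lower 32 bits."""
    return fuid & FUID_LOW_MASK


def fuid_is_posix(fuid):
    """True if the FUID is a standard POSIX UID or GID (index zero)."""
    return fuid_index(fuid) == FUID_POSIX_INDEX


if __name__ == "__main__":
    # Domain index 1, RID 1234
    print(f"{fuid_encode(1, 1234):#018x}")   # prints 0x00000001000004d2
    # POSIX UID 1234 is stored unchanged, with a zero upper half
    print(fuid_is_posix(fuid_encode(0, 1234)))  # prints True
```

Because index zero is reserved for POSIX identifiers, an existing 32-bit UID or GID is already a valid FUID with no conversion, which is what allows the existing range to be stored at no space penalty.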
A filesystem that uses FUIDs can therefore represent a maximum of 4 billion distinct identity domains, each with a maximum of 4 billion users; the maximum number of distinct identities is 18,446,744,065,119,617,025 (18 quintillion). At the same time, since ZFS already uses a uint64_t for storing uid and gid in its on-disk znode representation, the existing POSIX UID/GID range supported by Solaris can be represented with no change to the znode and no space penalty. The ZFS ace_t currently uses a 32-bit uid_t; it must be changed to uint64_t.

Although the present SID space used by Windows has a domain prefix followed by an unsigned 32-bit RID, our FUID domain table representation should support SIDs where the RID exceeds 32-bits. Therefore, we also propose the convention that the FUID table should implement an offset field, such that SIDs that have RIDs greater than 32-bits can consume multiple 32-bit prefixes for the same domain identifier, with each 32-bit prefix entry referring to a group of four billion RIDs computed by adding the offset from the table to the low 32-bits. Here is a simple example of SIDs and a possible FUID encoding:

SID                                           | FUID
----------------------------------------------+-------------------
S-1-5-12-7623811015-3361044348-123456789-1234 | 0x00000001000004d2
S-1-5-12-7623811015-3361044348-123456789-5678 | 0x000000010000162e
S-1-5-12-7623811015-3361044348-987654321-1234 | 0x00000002000004d2
P-1-1-1234                                    | 0x00000000000004d2
P-1-1-5678                                    | 0x000000000000162e

with a corresponding FUID Domain Table as follows:

FUID Index | FUID Domain                              | FUID Offset
-----------+------------------------------------------+------------
0x00000001 | S-1-5-12-7623811015-3361044348-123456789 | 0
0x00000002 | S-1-5-12-7623811015-3361044348-987654321 | 0

It is important to note that the FUID representation implies that the same SID stored in two different filesystems (or pools, depending on the implementation) is not guaranteed to be represented using the same FUID.
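The table lookup and offset arithmetic above can be sketched in C as follows. This is an illustration under stated assumptions, not a proposed interface: struct fuid_domain, fuid_table, and fuid_to_sid are hypothetical names, the "P-1-1-" prefix for index zero simply follows the example rows, and the third table entry (index 3, offset 0x100000000) is invented here to show how a RID beyond 32 bits would consume a second prefix for the same domain.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory form of one FUID Domain Table row. */
struct fuid_domain {
    uint32_t    index;   /* upper 32 bits of the FUID */
    const char *domain;  /* SID domain prefix */
    uint64_t    offset;  /* added to the low 32 bits to form the RID */
};

static const struct fuid_domain fuid_table[] = {
    { 0x00000001, "S-1-5-12-7623811015-3361044348-123456789", 0 },
    { 0x00000002, "S-1-5-12-7623811015-3361044348-987654321", 0 },
    /* Invented entry: second block of 4 billion RIDs for the first domain. */
    { 0x00000003, "S-1-5-12-7623811015-3361044348-123456789", 0x100000000ULL },
};

/*
 * Format the SID named by a FUID into a caller-sized buffer.
 * Returns 0 on success, -1 if the index is unknown or the buffer is too small.
 */
static int fuid_to_sid(uint64_t fuid, char *buf, size_t len)
{
    uint32_t index = (uint32_t)(fuid >> 32);
    uint32_t low = (uint32_t)(fuid & 0xffffffffu);
    int n = -1;

    if (index == 0) {
        /* Reserved index: the low 32 bits are a POSIX ID. */
        n = snprintf(buf, len, "P-1-1-%" PRIu32, low);
    } else {
        for (size_t i = 0; i < sizeof (fuid_table) / sizeof (fuid_table[0]); i++) {
            if (fuid_table[i].index == index) {
                n = snprintf(buf, len, "%s-%" PRIu64,
                    fuid_table[i].domain, fuid_table[i].offset + low);
                break;
            }
        }
    }
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

Because the table is private to each filesystem, a conversion of this kind is only meaningful against the table of the filesystem that produced the FUID.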
As such, filesystems that support FUIDs should provide debug tools to convert FUIDs to SIDs. For example, ZFS should likely provide a zdb command for this as well as mdb dcmds. Based on the FUID design, SIDs can be encoded in any filesystem and the unique SID identity is preserved when filesystems move, such as by ZFS send/recv, without the need to transmit any auxiliary global information. It is left to subsequent PSARC cases to specify the implementation changes necessary to implement FUIDs for any given on-disk filesystem such as ZFS. A filesystem should be permitted to define an implementation limit for the maximum FUID domain sequence that can be represented; such implementation limits SHOULD NOT be exposed as #defines or APIs to userland Solaris code. Implementation notes for historic filesystems that will not be modified to support Solaris SIDs (e.g. UFS) are explained later in Section 15. Solaris filesystems that implement SIDs/FUIDs should also support a canonical means of retrieving and modifying the SIDs associated with a file or directory. Since part of our proposal is to require mapping from SIDs to POSIX IDs, this could be implemented as a future extension, where initially we only support retrieval and modification of filesystem SIDs either through CIFS, or by using the POSIX APIs to query the uid_t and gid_t values, and then mapping those to SIDs using the ID mapping service described in Section 11. However, it is also desirable to provide some means to archive files and directories with SIDs such as by tar(1) or cpio(1), including SIDs stored in ACEs. Therefore, Solaris should provide a canonical representation of a file or directory's user or group SID and associated ACEs using some form of an extended attributes mechanism. 
The advantage of using an extended attributes mechanism is that it already has a full system call interface, and existing Solaris utilities such as tar(1) and cpio(1) have been modified to be able to archive and restore filesystem attributes. However, since the current attributes namespace can be used with no name restrictions, and is limited to a single-level hierarchy, extended attributes for so-called system attributes would have to be placed in a new namespace. This new namespace could be implemented entirely as a software abstraction on top of the underlying filesystem attributes already on-disk, and adopt the same programming model as our existing extended attributes (namely, open with a new flag for the namespace, O_SATTR instead of O_XATTR, and then use normal read/write/close). By using attributes, we could defer or entirely avoid the need to implement an SID-based equivalent for chown(2) and associated APIs by either requiring that callers map SIDs to POSIX IDs using the mapping service, or by implementing a filesystem feature whereby writing to the specified attributes file with an appropriate privilege would effect the equivalent of a chown to the new SID. The Solaris tar, cpio, and pax utilities could then be modified generically to archive and restore any system attributes, and would never need to change again as Solaris continued to enrich and extend the set of system attributes. It's also worth noting that this use of filesystem attributes would allow us to solve the other long-standing problem of being unable to reasonably extend the stat structure to represent larger ino_t and dev_t values and other related problems. Rather than simply introducing SID-based attributes, we could introduce the complete set of stat attributes at the same time, or define a canonical Solaris attribute file 'SUNWstat' that contained an extensible name-value pair list (i.e. encoded nvlist_t) of the known file attributes. 9. 
Credential SID Representation The Solaris credentials (both cred_t and ucred_t) should be extended to include SIDs for users and groups in addition to the existing uid_t's and gid_t's, which will be retained for mapping purposes, described next in Section 10. As an optimization for the POSIX representation, SID fields can be omitted or set to NULL whenever the SID refers to a standard POSIX ID value. Since prior work in Solaris 10 made cred_t and ucred_t opaque data types with functional interfaces (see PSARC 2002/188), the necessary extensions to both types can be done compatibly. It is left to a subsequent PSARC case to specify the new APIs to retrieve SIDs from a ucred_t, in accordance with the rules of Section 7. In order to support true Windows semantics, the cred_t must be extended to support an arbitrary number of supplementary groups specified by SIDs; these details should also be covered in a subsequent PSARC case discussing cred_t. The credential SID representation should by definition imply appropriate observability features for debuggers and core files through proc(4). Since /proc//cred is already defined using the fixed POSIX representation, we should supply an extensible, self-describing /proc//ucred file that corresponds to the ucred_*(3C) family of APIs, including the SID extensions. One possible representation of the ucred file would be a serialized Solaris nvlist_t; a subsequent PSARC case should specify the encoding details. The pcred(1) utility at minimum should be modified to report Solaris SIDs. DTrace should also be extended to be able to observe SIDs; this can be done with no modifications, since DTrace permits clients to dereference kernel memory such as elements that are added to the cred_t structure. But it may be useful to explicitly define DTrace string inlines such as "curusid" and "curgsid". 10.
POSIX ID Partition and Mapping As stated earlier in Section 5 and our discussion of NetApp and Samba, we want to provide a facility for administrators to map Windows identities to POSIX identities using name equivalency or static rewriting rules, but we do not want to require that they do so. In particular, we want to permit Windows clients to immediately access a Solaris CIFS server that has joined an AD domain without the need to provide any sort of name mapping or algorithmic UID partitioning. Without such a feature, our solution would be either non-competitive with existing industry products, or it would be overly complex. To solve this problem, and also permit data to be moved around without the need to also move a globally persistent mapping database (or require that all Solaris servers participate in a global mapping network service), we introduce the concept of ephemeral UIDs. Namely, the idea that we will reserve a part of the UID space to perform on-the-fly mappings from SIDs to UIDs as needed, when name-based mapping is either not configured or has not found any match. This is similar in concept to algorithmic partitioning, but will require zero configuration by the administrator, and no mapping persistence across reboots. Since our last effort at uid_t expansion left us with a signed 32-bit uid_t and no values permitted above INT_MAX, we can actually take advantage of our current state by converting uid_t to unsigned, supporting the existing values of uid_t for POSIX identifiers, and using the range [0x80000000-0xfffffffe] for our ephemeral mappings (0xffffffff is omitted because it has special meaning to the POSIX chown(2) system call, and it gives us a sentinel value). The new Solaris will thus support 2 billion POSIX identifiers exactly as it does today with no regressions, and also 2 billion simultaneous unmapped SIDs.
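The partition of the 32-bit ID space described above can be sketched as a few predicates. The type xuid_t and the XUID_* names are hypothetical stand-ins chosen so the sketch does not collide with the system uid_t; only the numeric ranges come from this proposal.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed unsigned 32-bit uid_t. */
typedef uint32_t xuid_t;

#define XUID_POSIX_MAX     0x7fffffffu  /* INT_MAX: top of today's POSIX range */
#define XUID_EPHEMERAL_MIN 0x80000000u  /* first ephemeral ID */
#define XUID_EPHEMERAL_MAX 0xfffffffeu  /* last ephemeral ID */
#define XUID_SENTINEL      0xffffffffu  /* (uid_t)-1, special to chown(2) */

/* True for IDs in the existing POSIX range [0, INT_MAX]. */
static inline bool xuid_is_posix(xuid_t id)
{
    return id <= XUID_POSIX_MAX;
}

/* True for IDs in the reserved ephemeral range. */
static inline bool xuid_is_ephemeral(xuid_t id)
{
    return id >= XUID_EPHEMERAL_MIN && id <= XUID_EPHEMERAL_MAX;
}

/* True only for the single value excluded from both ranges. */
static inline bool xuid_is_sentinel(xuid_t id)
{
    return id == XUID_SENTINEL;
}
```

Every 32-bit value satisfies exactly one of the three predicates, so code that must treat ephemeral IDs specially can test for them without a table lookup.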
In the rest of this section, we discuss the implications of these proposed complementary changes to Solaris: an unsigned 32-bit uid_t and ephemeral UIDs. For some reason, when Sun made the transition from SunOS to Solaris, uid_t became a signed type, despite the fact that SunOS 4 had 16-bit unsigned UIDs, and earlier versions of both Berkeley and SVR4 had used unsigned types. So when large UID support was added by PSARC 1995/334, there was no reason to address the issue of the base type when extending the maximum value to INT_MAX. The PSARC 1995/334 case materials make only passing mention of the signed issue, and unfortunately seem to have just assumed leaving it alone was best. Clearly extending uid_t from 32 to 64-bits would be a painful transition for Solaris, as we would have to again provide extended versions of every system call that exports uid_t's directly or in structures (e.g. stat, fstat, etc.). However, it does seem possible to compatibly grow uid_t from a signed to an unsigned type, and thereby extend the effective value range up to 0xfffffffe. There is a lot of evidence that suggests that these types should always have been unsigned. Among the data points are: (a) Use of unsigned types in earlier versions of both Berkeley and SVR4 UNIX. As shown earlier, unsigned was the standard in the world of 16-bit UIDs. (b) AIX was using unsigned UIDs up to ULONG_MAX as early as 1995. UIDs are declared as unsigned types in latest versions of both BSD and Linux. Therefore we know that portable UNIX software can cope with unsigned. (c) The encoding of UIDs/GIDs as unsigned 32-bit types in Sun's 1988 RFC 1057 in the definition of the auth_unix RPC credential, which then became the basis for UIDs in NFS (today, NFSv3 still uses this unsigned definition).
(d) Also in 1988, POSIX 1003.1 stated that "each system user is identified by a non-negative integer known as a user ID that can be contained in an object of type uid_t", which implies that a negative uid_t value is never valid. (e) The most recent clear statement on the subject comes from POSIX 1003.1's 2001 Rationale for System Interfaces document, section B.2.12, which says: "The types uid_t and gid_t are magic cookies. There is no {UID_MAX} defined by POSIX.1, and no structure imposed on uid_t and gid_t other than that they be positive arithmetic types. (In fact, they could be unsigned char.) There is no maximum or minimum specified for the number of distinct user or group IDs." Item (e) clearly permits unsigned uid_t/gid_t types, since one is explicitly named, and reinforces the original statement in (d) by further claiming that by design POSIX does not specify a range or any range limit for UIDs, since they are to be treated as opaque cookies that refer to an identity. Therefore, since we're not changing the size of the type, it would seem to be possible to compatibly evolve these types to unsigned in a Solaris Minor release (Nevada). Specifically, this would not require any incompatible change to Solaris interfaces because the size of the base type does not change, nor does the size, field offset, or alignment of any struct members of type uid_t or gid_t. Finally, no on-disk or on-wire incompatibilities are created in filesystems: UFS stores IDs as 32-bit unsigned and casts to uid_t/gid_t in VOP_GETATTR, ZFS does the same but uses 64-bit unsigned values in its disk structures, and NFSv2 and NFSv3 use 32-bit unsigned values over the wire as described earlier. (NFSv4 converts UIDs to names; we discuss NFSv4 later in Section 14.) ZFS and UFS rely on the Solaris kernel to propagate legitimate ID values. 
As an example, if one were to experience data corruption in a UFS filesystem that resulted in a uid_t in an inode greater than INT_MAX, UFS will happily copy this value into a uid_t in a VOP_GETATTR call and return a negative UID. There is at present no code in UFS preventing that from happening, nor is there any such code in ZFS (although ZFS data checksums make this nearly impossible). Some issues are created with respect to UIDs stored in archive formats such as tar and cpio, but these issues turn out to be mitigated by our ephemeral UID plan, so we defer discussion of archive utilities until Section 16. Two other related standard issues which should be considered but I believe can be mitigated are the definition of id_t in and the SPARC SVID and subsequent SPARC Compliance Definition (SCD). The relevant details are: (a) The POSIX Base Definition for describes id_t as "Used as a general identifier; can be used to contain at least a pid_t, uid_t, or gid_t." However, I can't find any text clarifying whether the meaning of the word "contain" in this context applies to size only or size and sign. My evaluation of this issue is that we should leave id_t alone (signed) and that this does conform to the definition. Other OSes such as Linux and BSD have made id_t unsigned, but this change would seem to have more widespread consequences in that I have found lots of code comparing a value of type id_t < 0 where id_t is being used generically in userland code. (b) The 1990 Generic SVID ABI document makes no mention of uid_t or gid_t, but the SPARC Processor Supplement has a diagram of the definitions on page 6-65 which indicates typedef long for uid_t and gid_t. Given that this ABI document was not yet aligned with POSIX, I believe the proposed changes here are compatible and more true to the definitions that POSIX eventually adopted. The subsequent SCD has the same issue. 
These documents aside, the observable consequences of changing uid_t/gid_t from signed to unsigned in a Solaris Minor release will be the following: (a) Code that attempts to printf uid_t using %d (or %ld in a 32-bit program) will produce negative values if we use IDs above INT_MAX. However, such code will work correctly if it prints using %d and then scanf's back %d. It isn't clear that this would cause harm other than odd-looking output; Solaris commands such as ps(1) and pcred(1) should have this issue corrected. (b) Code that attempts to sort uid_t by performing signed comparison will end up sorting IDs above INT_MAX before those IDs between zero and INT_MAX. Such an issue would likely be fixed by a recompile (as the use of the derived typedefs would now convert the code to unsigned automatically), but it doesn't seem like this would create any serious incompatibility. Binary object code that sorted uid_t's by signed comparison and then binary searched the result using signed comparison would still function properly. One variant of this comparison issue is some very old UNIX tools that attempt to "categorize" UIDs based on being less than or greater than 99. The two examples of this I can find in the source base are logins(1) and listusers(1), which compare pw_uid to 99 to categorize their output. Both tools will work properly by virtue of a simple recompile to unsigned uid_t. The useradd(1M) command has a similar notion, in that it only starts allocating uids at 100 for new user accounts, up to UID_MAX. Again, the code stays the same, but we may rename a #define for its allocation limit. (c) Code that attempts to convert a string to an integer uid_t could fail if applied to a UID greater than INT_MAX. Specifically, atol("4294967295") successfully returns -1 and atoi("4294967295") successfully returns -1, permitting uid_t u = atoi(s) to work when s is a string UID > INT_MAX. But strtol("4294967295") returns failure, setting errno to ERANGE.
However, code that prints uid_t as a signed int (%d) or signed long (%ld) would be able to convert those strings back to uid_t/gid_t using strtol(). Only code that was self-inconsistent with the underlying type would break. Documentation should guide programmers to resolve this issue. (d) Code that maintains a persistent copy of a uid may store an ephemeral value which cannot later be mapped back to a user (e.g. an audit trail). However, such code had no way to reliably do so prior to ephemeral IDs, as nothing precludes administrators from reusing uid values, removing them from the name service, or moving such a file to a different name domain. We discuss some examples of Solaris files like this in Section 16. (e) Any non-Sun RPC protocols that send UIDs over the wire may obtain wrong results when applied to ephemeral IDs, but only when those services are deployed in conjunction with CIFS and AD. Such RPC protocols would already have required a globally consistent name service configuration in order to make any sense, and in the presence of such configuration would still work. (f) C++ code that uses uid_t or gid_t in a C++ function signature will produce a different mangled signature when recompiled. Therefore, if a group of .o or .a files are recompiled against the unsigned uid_t and linked against a second group that has yet to be recompiled, a link-time error will occur. Recompiling both groups and re-linking will solve this problem. Existing C++ binaries, compiled prior to the change, will continue to work correctly subject to the issues described above. System interfaces that are declared extern "C" do not suffer from the recompilation issue at all, such as our base system interfaces in usr/include/, because extern C interfaces do not encode name mangling and by extension parameter types in the symbol table. 
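Consequences (b) and (c) above can be made concrete with a small C sketch. The typedefs old_uid_t and new_uid_t and the helper names are hypothetical; the parser is one possible tolerant approach for new code, not something this proposal mandates. It accepts a UID string written either as a signed or as an unsigned decimal, so that strings produced by legacy %d output and by unsigned output both convert to the same 32-bit value.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

typedef int32_t  old_uid_t;  /* stand-in for today's signed uid_t */
typedef uint32_t new_uid_t;  /* stand-in for the proposed unsigned uid_t */

/*
 * The "is this UID below 100?" test as old binaries evaluate it.
 * The cast reinterprets values above INT_MAX as negative on the usual
 * two's-complement compilers, so every such ID looks "small".
 */
static inline int uid_below_100_signed(new_uid_t u)
{
    return (old_uid_t)u < 100;
}

/* The same test after a recompile against an unsigned uid_t. */
static inline int uid_below_100_unsigned(new_uid_t u)
{
    return u < 100;
}

/*
 * Parse a decimal UID string in either signed or unsigned form.
 * Returns 0 and stores the 32-bit value on success, -1 on malformed
 * or out-of-range input. Uses strtoll so that the result does not
 * depend on whether long is 32 or 64 bits wide.
 */
static inline int parse_uid(const char *s, new_uid_t *out)
{
    char *end;
    long long v;

    errno = 0;
    v = strtoll(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0')
        return -1;
    if (v < INT32_MIN || v > (long long)UINT32_MAX)
        return -1;
    *out = (new_uid_t)v;  /* negative input wraps modulo 2^32 */
    return 0;
}
```

For instance, parse_uid("-2", &u) and parse_uid("4294967294", &u) both yield 0xfffffffe, which is the round-trip property that item (c) relies on.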
With uid_t and gid_t now extended to 32-bit unsigned types, we now propose to partition the ID space in half, reserving the upper 2 billion values for so-called ephemeral IDs. These ID values would be reserved for transient mappings of SIDs introduced into the system for which no name-based mapping rule between the SID and a POSIX ID in the existing range [0, INT_MAX] applies. A central mapping service (the Winchester project, discussed further in Section 11), will establish the reservation of an ephemeral ID and its connection to an SID, and will hold the reservation until a Solaris instance reboots. That is, when the forthcoming SMB server for CIFS establishes a session, it will take the SID over the wire, look up the Windows AD name, and contact the ID mapper to see if a name-based mapping applies; if so, a POSIX ID in the existing range will be assigned to the credential in addition to storing the SID there. If not, an ephemeral ID above INT_MAX will be assigned. In either case, every credential will always contain both uid_t/gid_t values and an SID simultaneously. This design implies that once an SID-based service such as SMB/CIFS creates a file, that file can be stat'd by a local UNIX process or over the wire by NFS, and appropriate uid_t / gid_t values can be returned. A process could then stat other files, compare those uid_t's, and correctly determine that a file has the same identity as another file with the same corresponding SID. Other POSIX system calls, such as setuid() or chown(), can be used with the ephemeral ID values, and will have the correct semantics. The major mental leap is that the mapping between SIDs and ephemeral IDs is not persistent across reboots. The first thing to realize is that we're only doing this when no UNIX name service is being used or when no POSIX identity mapping is provided. That is, a Solaris system deployed exactly as it is today has no ephemeral IDs, and thus no incompatibility issues at all.
Administrators must specifically configure Solaris to participate in AD and utilize CIFS without name mapping, and thus we can clearly explain the consequences as part of the documentation to do so. Second, the notion of ephemeral IDs that are not persisted across reboots can only cause issues when those IDs are written to disk or sent over the wire: if they are not, then no semantic incompatibility exists at all because the in-memory behavior of ephemeral IDs on a running system is no different. And there is strong history behind the idea that UNIX UIDs and GIDs were not the ideal concepts for persistence in the first place. Sun's original RFC 1057 that defined auth_unix for RPC defined auth_des at the same time, noting that different UIDs would have different meaning in different network domains. The idea of ID mapping also has been around for nearly 20 years, as mentioned earlier. And most recently, NFSv4 used usernames rather than the ID values as part of the over-the-wire protocol changes from NFSv3. Third, we propose a set of limiting constraints in terms of the kernel and user behavior for ephemeral IDs that will help to limit their propagation. Specifically, we propose the following non-changes to enhance compatibility: (a) The Solaris NFSv3 server already maps UIDs above Solaris UID_MAX (INT_MAX) to UID_NOBODY; this code should remain in place, implying that ephemeral IDs are never sent or received over the wire. See Section 13 for more. (b) The Solaris tar, cpio, and pax utilities already do not support large UIDs in most data formats; this code should remain in place, implying that ephemeral IDs are not archived by default. See Section 16 for more. (c) The Solaris kernel already prevents system calls from propagating UIDs greater than UID_MAX (INT_MAX) to filesystems. For example, chown returns EINVAL if a uid_t < -1 or > UID_MAX is specified as an argument. 
To preserve compatibility with existing filesystems that are not converted to use the FUID scheme of Section 8, the Solaris VOP_* layer should convert ephemeral IDs to UID_NOBODY / GID_NOBODY before calling old filesystems. See Section 15 for more details about this proposal. One final change in behavior is that we don't want to permit the ephemeral ID space to be exhausted by a buggy or malicious user application. Therefore, we define that an ephemeral ID mapping must be established by the ID mapper by a client with appropriate privilege before an ID-based system call can use that uid_t or gid_t as an argument. For example, in present Solaris, it is legal to chown(2) a file to a uid_t value that has no mapping in the name service. If this behavior worked for ephemeral IDs, one could exhaust the ephemeral ID space by issuing millions of chown operations to as-yet unused ephemeral IDs. However, since chown(2) fails with EINVAL on signed uids < -1, we know no application code can be relying on that working. So instead, we propose that upon chown(2) and similar calls using ephemeral IDs, the system call will determine by use of the ID mapper or a cache if the ID is claimed; if so, the call will succeed, otherwise it will fail and return EINVAL as it does today. The setuid(2) family of system calls will be modified along the same lines. So a user process cannot exhaust the space maliciously, but it can successfully stat one file, obtain an ephemeral ID, and then chown another file to that uid. Despite protection against malicious exhaustion of the ephemeral ID space, it is still of course possible for the system to run out of ephemeral IDs. My view is that this is, by virtue of the design, no worse than the current Solaris system behavior. Namely, Solaris supports only 2 billion UIDs today. A Solaris system with the proposed changes configured solely as a CIFS server using AD and no POSIX name services would support 2 billion ephemeral IDs. 
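The claimed-ID check proposed above for chown(2) and similar calls can be sketched with a toy, in-memory model of the mapper. Everything named here (eph_claim, eph_is_claimed, check_owner_arg) is hypothetical, and the sketch assumes IDs are handed out strictly in order so that "claimed" reduces to a range test; a real implementation would consult the ID mapping service or a kernel cache.

```c
#include <errno.h>
#include <stdint.h>

#define EPH_MIN 0x80000000u  /* first ephemeral ID */
#define EPH_MAX 0xfffffffeu  /* last ephemeral ID */

/* Toy mapper state: every ID in [EPH_MIN, eph_next) has been claimed. */
static uint32_t eph_next = EPH_MIN;
static int eph_exhausted = 0;

/*
 * Claim the next ephemeral ID on behalf of a privileged client.
 * Returns 0 and stores the ID, or -1 once the space is exhausted
 * (at which point a caller would fall back to the nobody ID).
 */
static int eph_claim(uint32_t *out)
{
    if (eph_exhausted)
        return -1;
    *out = eph_next;
    if (eph_next == EPH_MAX)
        eph_exhausted = 1;
    else
        eph_next++;
    return 0;
}

/* True if id is an ephemeral ID that the mapper has already handed out. */
static int eph_is_claimed(uint32_t id)
{
    if (id < EPH_MIN || id > EPH_MAX)
        return 0;
    return id < eph_next || (eph_exhausted && id == EPH_MAX);
}

/*
 * Argument check for a chown-like call: POSIX IDs and the (uid_t)-1
 * "leave unchanged" value pass, claimed ephemeral IDs pass, and
 * unclaimed ephemeral IDs fail with EINVAL as they do today.
 */
static int check_owner_arg(uint32_t id)
{
    if (id <= 0x7fffffffu || id == 0xffffffffu)
        return 0;
    return eph_is_claimed(id) ? 0 : EINVAL;
}
```

Under this model an unprivileged process can reuse an ephemeral ID it obtained from stat(2), but it cannot cause new IDs to be allocated, since only eph_claim advances the allocation point.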
If this limit becomes insufficient in some relevant customer scenario, this would provide the impetus to either grow uid_t to 64-bits, or implement a larger-scale API conversion from the use of POSIX IDs to SIDs. These are challenges we face anyway given the current size of our POSIX ID space. Until such changes are made, the behavior of the ID mapping service when all ephemeral IDs are exhausted should be to map any new SIDs to *ID_NOBODY, which is already reserved within the POSIX ID space, and log an FMA message. Finally, with respect to the top-level name service APIs, we propose that getpwuid(3C), getpwuid_r(3C), getgrgid(3C), and getgrgid_r(3C), all return failure with "not found" semantics when passed an ephemeral uid or gid in the situation where no corresponding SID can be resolved in the SID-based name service, or where no SID-based name service is available at all (i.e. on a system with CIFS support before Reno and nss_ad exist or are deployed at all). This semantic is consistent with the notion that these entities do not exist in the UNIX name service backend, and certainly it is already possible for one to stat(2) a file and see a uid or gid that is not retrievable from the current name service configuration. Another active Solaris Nevada project, Reno (see http://opensolaris.org/os/project/reno/), proposes to extend PAM to permit passwd(4)-style user attributes to be loaded by authentication modules. We propose that providing such information for ephemeral IDs can be safely deferred until Reno and/or an nss_ad switch module are implemented. 11. ID Mapping Service The ID mapping service necessary to implement the changes described in this document will be delivered by the Winchester project, and should provide the following minimum capabilities: (a) The ability to perform name-based mapping so that the CIFS/SMB session initiator can obtain a POSIX uid or gid corresponding to an SID. 
(b) The ability to configure static rewriting rules for name-based mapping that are equivalent to rewriting rules offered by Samba and NetApp. (c) The ability to allocate ephemeral IDs when an SID to POSIX ID mapping cannot be computed. Ephemeral IDs must be cached across restarts of the ID mapping service (i.e. either in the kernel, in tmpfs, or both), and should not be cached persistently on disk. (d) The ability for kernel code such as FUID-aware filesystems to upcall the mapping service via a door to convert SIDs to UIDs/GIDs, and some appropriate caching of the results, as determined by performance analysis. (e) The ability for Sun to eventually deliver a unified identity solution, wherein a single directory could contain UNIX information and SIDs. The unified identity code should also support Microsoft SFU, as described in (3c), i.e. looking for msSFU30UidNumber or the new uidNumber attributes when AD is configured to have posixAccount objects for Windows users. As specified, Winchester resembles the Samba ID mapping component, in that it offers an ID mapping service with pluggable mapping models, independent of any underlying capability of the operating system to support SIDs. Winchester proposes to implement features beyond (a-c), including: - Algorithmic Mapping from Section 3b, similar to Samba - A plug-in interface for other mapping schemes, e.g. Apple Open Directory and other features. My intent here is only to discuss dynamic mapping of SIDs and its impact on Winchester; the complete set of Winchester features is described in its project documentation and should be evaluated as part of its ARC review. My hope is that this proposal will actually simplify the implementation of Winchester, and also integrate it more tightly with the base operating system as the central place for ID management going forward. The Winchester project should consider several key implementation issues based on the analysis in this document. 
The resolution of these issues is left for discussion of that project and its associated ARC materials: (i) The project proposes to store its mappings in a persistent database. Although such mappings need to be persisted, with ephemeral IDs they only need to be persisted across service restart, and not reboots, implying that tmpfs can be used to back the cache. This property may significantly simplify the design and implementation of the persistence mechanism. (ii) For NFSv4, PSARC 2004/592 extended nfsmapid(1M) to also support user plug-ins to perform mapping, finally implementing the original concept proposed by PSARC 1998/335 in a much more sane fashion. Given this interface and the existing code in Solaris to upcall nfsmapid(1M) and cache its results, the Winchester project should investigate whether it would be simpler to make ID mapping one common service as opposed to having both nfsmapid and a completely orthogonal service for Winchester. (iii) Given the critical importance of not re-using ephemeral IDs once the system has booted, it may be dangerous to only store the allocated IDs in the cache from (i). Since allocation of ephemeral IDs can be done in order, it would be relatively simple for the Winchester service to preserve the next ephemeral ID (i.e. a reservation of all previously allocated IDs) in a non-persistent smf(5) property group, or save it in the kernel itself. (Storing the entire cache in-kernel is presumed to be wasteful of memory.) The Winchester team should consider these options for their implementation. With the combination of ephemeral IDs and the ID mapping service, it is not necessary for us to deliver a complete family of setuid(2) and setgid(2) system call equivalents that accept SIDs as part of the initial project work, because any application that groks SIDs can contact the mapping service to obtain the mapped POSIX ID, and then use the existing system calls.
At the same time, these calls could be compatibly added in a future Solaris release. 12. CIFS Implementation The new CIFS service will establish SID-based credentials as part of creating an SMB session, since SIDs are transported over-the-wire in the SMB protocol. When the SMB service establishes credentials, it will contact the ID mapper directly (or indirectly as the result of some new system call) which will result in computing the appropriate uid_t and gid_t mappings for the SIDs, either POSIX IDs if a name mapping exists, or ephemeral IDs if no mapping is found. In either case, the SMB service will establish a full cred_t with both uid and gid values and SIDs, and can then interact with the rest of Solaris. When CIFS writes through the VOP layer to a FUID-capable filesystem such as ZFS, SIDs will be stored as FUIDs in the filesystem. 13. NFSv3 Implementation As discussed earlier, NFSv3 sends UIDs and GIDs over the wire if AUTH_UNIX is selected, but the Solaris code already maps values greater than INT_MAX to *ID_NOBODY. The code path for this is as follows: sec_svc_msg() takes AUTH_UNIX from the wire and in _svcauth_unix() does an XDR decode of int32 to store the uid in aup_uid, which is then used by sec_svc_getcred() to make the cred. This in turn calls crsetugid(), which can and does fail if the uid is greater than UID_MAX. This failure is then propagated back to checkauth() in nfs_server.c which for AUTH_UNIX then resets the credentials to the anonymous user (ex_anon). We propose to keep this code intact, albeit using different #defines. Thus NFSv3 would map any ephemeral IDs that have no POSIX equivalent to *ID_NOBODY. One extant bug here is that I can't find any code which precludes one from configuring ex_anon (the share(1M) anon=N setting) to be a value greater than UID_MAX; this bug should be corrected as part of this work. 
The proposal thus introduces no incompatibilities with other operating systems for NFSv3 file sharing or with existing Solaris as an NFSv3 client or server, and effectively means that NFSv4 must be used to share files across Windows and UNIX clients when Solaris is configured as a CIFS server and no global POSIX name mapping equivalence has been established. This seems like an eminently reasonable constraint, helps to propagate the use of NFSv4, and doesn't compromise any of our constraints with respect to pure Windows client support.

14. NFSv4 Implementation

Unlike NFSv3, NFSv4 does not send UIDs and GIDs over the wire for attributes. Instead, nfsmapid(1M) is used to map the values to utf8 strings containing the user and group name suffixed by the NFSv4 mapping domain (either the DNS domain or a domain name manually configured using the NFSMAPID_DOMAIN property). See RFC 3530 for more information on the exact NFSv4 semantics. Fundamentally this behavior would remain unchanged; the NFS server would continue to upcall a mapping daemon to map a UID or GID to a name using the name service, and if this call is successful (which it could be when Windows and POSIX name mapping equivalence has been established), the appropriate name is sent over the wire. If an ephemeral ID for an SID has no mapping, then the POSIX name service lookup should fail and return *ID_NOBODY to the kernel, which NFSv4 already has defined as a clear semantic, and it sends "nobody" back over the wire.

However, NFSv4 does not change the basic underlying RPC mechanisms for authentication; namely, AUTH_UNIX can still be used to authenticate. NFSv4 clients are therefore expected to have POSIX-style credentials when using AUTH_UNIX, and POSIX IDs exceeding the maximum Solaris POSIX ID, 0x7fffffff, will be converted to *ID_NOBODY as we do for NFSv4 and v3 today. As such, NFSv4 clients using AUTH_UNIX can only create files with POSIX IDs.

15. Legacy Solaris Filesystems

Other than NFSv3, historic Solaris filesystems such as UFS will not be changed to use FUIDs. Instead, the VOP layer should be modified to transparently convert ephemeral IDs to *ID_NOBODY as they are passed to historic filesystems. This is similar to the approach taken by the original EFT project [A], where the old 16-bit UFS inode uid and gid fields were left intact, and set to the nobody values only when the new 32-bit fields contained IDs above 0xffff. Code should be added to UFS to prevent it from retrieving a corrupt uid or gid from on-disk inodes (i.e. one above INT_MAX), and convert it to *ID_NOBODY.

A particularly disgusting use of UIDs is the UFS quota database, used by quotacheck(1M), which performs a linear search over the entire UID space. Thankfully, since we're not proposing to extend UFS's effective UID range, the quota tools and formats do not need to be changed and continue to behave exactly as they do today (that is, perform really, really badly).

The consequence of this approach is that any credential that is associated with an ephemeral ID cannot be stored in a historic filesystem unless POSIX name mapping equivalence is established. Since our local filesystem of choice, ZFS, will support FUIDs, and this can only happen if one deploys CIFS on top of UFS with no name mapping, this seems like a reasonable behavioral choice.

The other existing filesystem of interest is tmpfs(7FS), where conversion to FUIDs is possible since there are no persistent meta-data issues to address. However, since our only immediate need for Solaris is to support a CIFS server, there is no pressing need to modify tmpfs: it can be treated like any other existing filesystem.
When additional Solaris changes permit Active Directory users without a reserved POSIX ID to authenticate, thereby establishing local user processes with ephemeral IDs in their credentials, then it will likely be necessary to modify tmpfs to support FUIDs so these processes can use /tmp.

16. Solaris Data Formats

Solaris has a number of data formats where user and group identifiers are written to files. Some, like utmpx(4) and wtmpx(4), already use names rather than integer identifiers. Others, like passwd(4) files, will have no need of supporting ephemeral IDs, but should be modified to prevent their explicit use.

The most common use of uids and gids in files is in the Solaris archive utilities tar(1), cpio(1), and pax(1). All of these utilities store integer IDs in their files, but with varying degrees of support depending on the selected format. The current behavior of these utilities in Solaris is summarized as follows:

    tar (ustar)        uids up to 2097151, otherwise 60001
    tar -E (xustar)    uids up to INT_MAX, otherwise 60001

    pax -x cpio        uids up to 262143, otherwise 60001
    pax -x pax         uids up to 2097151, otherwise 60001
    pax -x ustar       uids up to 2097151, otherwise 60001
    pax -x xustar      uids up to 2097151, otherwise 60001

    cpio (default)     uids up to 65535, otherwise 60001
    cpio -c            uids up to 0xffffffff, no restrictions
    cpio -H crc        uids up to 0xffffffff, no restrictions
    cpio -H odc        uids up to 262143, otherwise 60001
    cpio -H tar        uids up to 2097151, otherwise 60001
    cpio -H ustar      uids up to 2097151, otherwise 60001

Since archiving an ephemeral ID and attempting to restore it is definitely a bad idea, the proposal is to leave the behavior of all the utilities above alone, changing only perhaps the #defines or code comments for clarity. The only exception is the behavior of cpio -c / -H crc, which I believe should be changed to map values greater than INT_MAX to *ID_NOBODY.
This does not seem to cause any incompatibility other than documentation since it is by definition impossible for anyone to have cpio archived such a uid on Solaris.

Therefore, similar to historic filesystems, archive utilities will work compatibly on all new Solaris systems with POSIX IDs only, will work properly when POSIX name mappings exist, and will archive "nobody" for ephemeral IDs. The suggestion, as described earlier in Section 8, is to extend FUID-based filesystems with a set of extended attributes to report the true SID, and to perhaps use the attributes interface to perform an SID-based chown. This technique would permit tar, cpio, and pax to function properly for filesystems that support SIDs without the continuing need to modify their source code for future changes to our POSIX UID space. Given that the majority of the formats described above don't even support up to the current INT_MAX, the limitation on behavior with respect to ephemeral IDs seems very reasonable.

Solaris tar in extended mode (-E) also includes a feature which records the user name and group name of each file or directory as a string, and will attempt, prior to performing a chown on extract, to re-compute a new uid or gid by looking up the original name in the name service. This feature will remain and work unmodified, but only for POSIX identifiers or SIDs that can be successfully mapped to names by the name service, and speaks to the need to get away from UIDs in tar files.

GNU tar also provides the same behavior. In terms of the rest of its behavior, GNU tar has a more extensive set of formats than Solaris tar. The limits and behaviors of GNU tar are:

    Format    UID Limit
    ------    ---------
    gnu       1.8e19
    oldgnu    1.8e19
    v7        2097151
    ustar     2097151
    posix     Unlimited

with "posix" referring to the POSIX.1 2001 tar format specification (used by the Solaris pax(1) utility).
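The per-format limits listed for the Solaris utilities earlier in this section, together with the proposed cpio -c change, amount to a table-driven clamp. The sketch below covers only a subset of the formats; the sketch_ names are hypothetical, the format strings are just lookup keys taken from the summary, and the cpio -c entry reflects the proposed INT_MAX limit rather than current behavior.

```c
#include <stdint.h>
#include <string.h>

#define	SKETCH_ID_NOBODY	60001U	/* value archived when a uid does not fit */

struct sketch_fmt_limit {
	const char *name;	/* lookup key, matching the summary in the text */
	uint32_t max_uid;	/* largest uid the format stores unchanged */
};

static const struct sketch_fmt_limit sketch_limits[] = {
	{ "tar (ustar)",	2097151U },
	{ "tar -E (xustar)",	0x7fffffffU },
	{ "cpio (default)",	65535U },
	{ "cpio -H odc",	262143U },
	{ "cpio -c",		0x7fffffffU },	/* proposed; today 0xffffffff */
};

/* Return the uid that would be written to the archive for this format. */
static uint32_t
sketch_archive_uid(const char *fmt, uint32_t uid)
{
	size_t i;

	for (i = 0; i < sizeof (sketch_limits) / sizeof (sketch_limits[0]); i++) {
		if (strcmp(sketch_limits[i].name, fmt) == 0) {
			if (uid > sketch_limits[i].max_uid)
				return (SKETCH_ID_NOBODY);
			return (uid);
		}
	}
	/* Unknown format: archive nobody rather than guess at a limit. */
	return (SKETCH_ID_NOBODY);
}
```

Because every limit in the table is at or below INT_MAX once cpio -c is changed, an ephemeral ID is archived as nobody in all of these formats without any per-utility special case.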
In the current "gnu" format, a two's-complement base-256 encoding is used for large uid values and those that are negative if the system uid_t is signed. Therefore, GNU tar will function properly when deployed on a Solaris system with unsigned 32-bit uids and gids, and it will properly handle values above INT_MAX if name mappings exist for them. The only case of concern is when an ephemeral ID is archived using the gnu or posix formats: the result will be to correctly capture the current ID, but upon extraction by root, the chown() may fail with EINVAL (by default, archivers always continue to extract on such errors, but return non-zero). This issue can only arise when GNU tar is used as an archiver of files when CIFS and AD are configured on a system, so again this should be addressed as part of our documentation for administrators on these features. Sun should also provide a patch or suggested change to the GNU tar developers.

Two other Solaris data formats that store UIDs and GIDs to persistent files are the Extended Accounting (exacct) format of PSARC 1999/119, and the Solaris audit(1M) trails. As both of these file formats are extensible and under Sun's control, PSARC cases should be filed as part of this work to extend them to support SIDs in addition to POSIX IDs. Given the use of these data formats as long-term archives with billing and security implications, the use of integer IDs in these files was already dubious since the interpretation would require a correct connection to some persistent external name service. These are examples of where Solaris conversion to SIDs as the underlying ID makes sense.

The legacy Solaris SVR4 accounting file /var/adm/pacct also stores 32-bit uid_t values; this can be examined using the lastcomm(1) utility. The existing behavior of lastcomm(1) is to call getpwuid(3C) on its saved values, and report either the corresponding name or the uid value printf'd as %ld.
Since we're not increasing the size of the type, no file format incompatibility is created. We could either leave lastcomm(1) alone but change %ld to %lu, meaning that ephemeral IDs would simply be printed as unknown integers (as would happen today if, say, the file itself were damaged and a uid value above INT_MAX were retrieved from the filesystem), or we could change it to report *ID_NOBODY. In either case, it seems pointless to extend the SVR4 struct acct for SIDs.

17. Backup Formats

Since ephemeral IDs only exist on the system when CIFS is deployed without POSIX name mapping equivalence, and such IDs cannot be stored in existing filesystems anyway, there is no incompatibility with existing backup software. The only backup issues arise when trying to back up ZFS with SIDs. ZFS already provides its own archival format by virtue of zfs(1M) send/recv; this format would be extended to support the ZFS FUID representation as part of this work. Furthermore, ZFS already introduces a number of novel concepts that must be coped with by backup software, such as extensible attributes and properties. The ZFS team should therefore discuss SIDs and FUIDs as part of its ongoing work on an appropriate backup software strategy for ZFS and Solaris.

Sun is also implementing an NDMP server and including this with Solaris and shortly in OpenSolaris; NDMP is the standard protocol for backup control [D], and can be used with any other form of backup data format (e.g. tar, cpio). NDMP includes only one use of POSIX uids and gids, which is in the ndmp_file structure that forms part of the NDMP File History interface (see [D]). The file history interface is in effect only a performance optimization, permitting one to see a history of archived files and quickly seek a tape drive to the appropriate location of the start of some chunk of files.
The ndmp_file definition already specifies uid and gid as unsigned 32-bit values, and with respect to CIFS compatibility section 4.3.1 of the spec says:

    owner: File owner identifier. uid SHOULD be used for UNIX file
           system type. This field is undefined for NT file system type.

    group: File group identifier. gid SHOULD be used for UNIX file
           system type. This field is undefined for NT file system type.

Therefore, the proposal is that we insert *ID_NOBODY tokens into these fields when our NDMP server generates file history for any ephemeral IDs.

18. Zones Integration

Solaris Zones provide a lightweight virtualization environment that includes virtualization of the Solaris name service switch configuration. That is, a local zone may have its own nsswitch.conf(4) settings indicating an entirely different name server, name service, or name service prioritization. As such, it is already the case that POSIX identifiers do not necessarily hold the same meaning across disparate Zones: one zone might assign a given uid_t value one identity in its own passwd(4) file, and another zone might see a different identity for that uid_t based upon a NIS or LDAP directory. Consequently, Solaris will need to evaluate identity mapping rules for non-POSIX identities differently in each zone, and therefore the mapping of an SID to a POSIX uid_t or ephemeral uid_t will vary across zones. Finally, a Zone can use the BrandX technology to provide an entirely different identity service for another OS personality. Therefore, each Solaris Zone should have its own instance of the ID mapping service, and maintain its own notion of ephemeral uid_t and gid_t values.

19. Case Summary and Next Steps

This document describes a technical strategy for unified representation of Windows and POSIX credentials in Solaris.
This document is intended to be approved by the ARC as a strategy for addressing the underlying problems described here, and thereby provide the basis for subsequent interface review of the Solaris projects that will define the articulation of this strategy.

The proposal to the ARC is that the approval of this case corresponds to the following set of strategy and interface decisions:

(a) That the types of uid_t and gid_t will be changed to unsigned 32-bit int, and the addressable range of the types will be extended to 0xfffffffe.

(b) That the UID and GID spaces will be partitioned into a range used for traditional POSIX identifiers and a range used for ephemeral mappings.

(c) That present Solaris filesystems (Section 15) and data formats (Section 16) will be modified such that ephemeral IDs will be mapped to *ID_NOBODY.

(d) That a global ID mapping service will be implemented to provide POSIX identifier mappings for SIDs and meet all the requirements of Section 11.

(e) That the set*id() and chown() system calls will report EINVAL when an unmapped ephemeral ID argument is specified as a POSIX user or group ID.

The following pending ARC cases will therefore be reviewed in conjunction with the set of decisions described as part of this case:

    PSARC 2006/315  Winchester: Schema Mapping and ID Mapping for AD Interoperability
    PSARC 2006/715  CIFS Service
    PSARC 2006/719  NDMP service

And one or more ARC cases will be brought to specify the additional project-specific changes necessary to complete the articulation of this strategy:

(a) Interface changes and additions for cred_t and ucred_t, including functions to establish, retrieve, and validate SID values associated with credentials and appropriate observability (proc(4) ucred file, pcred(1), DTrace)

(b) File format extensions for Solaris exacct and auditing to record, extract, and format SIDs as part of these persistent data formats.
(c) ZFS changes to support ownership and ACEs that contain SIDs using the FUID representation described in Section 8.

(d) APIs to examine and modify a set of extensible system attributes, including SIDs, for files, and changes to the archive utilities to support them.

20. Acknowledgements

Early drafts of this proposal were reviewed by Matthew Ahrens, Jeff Bonwick, Bryan Cantrill, Don Cragun, Casper Dik, Brendan Gregg, Adam Leventhal, Tim Marsland, Eric Schrock, Mark Shellenbaum, Spencer Shepler, Glenn Skinner, Keith Wesolowski, Nico Williams, Gary Winiger, and Alan Wright. I am indebted to all of them for taking the time to do so and providing many useful comments.

21. References

Overview of Windows SIDs and Active Directory:

[1] Well-Known Windows Security Identifiers (SIDs)
    http://www.microsoft.com/technet/prodtechnol/\
    windows2000serv/reskit/distrib/dsfe_sid_yokv.mspx?mfr=true

[2] Active Directory Architecture
    http://www.microsoft.com/technet/prodtechnol/\
    windows2000serv/technologies/activedirectory/deploy/projplan/adarch.mspx

[3] Microsoft Windows Services for UNIX (SFU)
    http://www.microsoft.com/technet/interopmigration/unix/sfu/default.mspx

Samba Technical Documentation regarding UID Mapping:

[4] Security Identifier / User Identifier Resolution System (Internet Draft)
    http://www.cb1.com/~lkcl/cifs/draft-lkcl-sidtouidmap-00.html

[5] Samba HowTo: Chapter 14. Identity Mapping (IDMAP)
    http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/idmapper.html

[6] Samba HowTo: Chapter 3. Server Types and Security Modes
    http://www.samba.org/samba/docs/man/Samba-HOWTO-Collection/ServerType.html

NetApp Technical Papers regarding NFS, CIFS, and UID Mapping:

[7] Merging NT and UNIX Filesystem Permissions
    http://www.usenix.org/publications/library/\
    proceedings/usenix-nt98/full_papers/hitz/hitz.pdf

[8] NetApp Storage System Multiprotocol Use Guide
    http://www.netapp.com/library/tr/3490.pdf

[9] Multiprotocol Data Access: NFS, CIFS, and HTTP
    http://www.netapp.com/library/tr/3014.pdf

Sun PSARC cases for UIDs and UID Mapping:

[A] PSARC 1995/334 Large uids and gids

[B] PSARC 1998/335 UID/GID Mapping for NFS

[C] PSARC 2004/592 nfsmapid extension for UID/GID mapping

NDMPv4 Specification:

[D] NDMP Version 4 Protocol (Internet Draft)
    http://www.ndmp.org/download/sdk_v4/draft-skardal-ndmp4-04.txt

Other References:

[E] MacOS X Server User Management, Second Edition, Appendix B (pg 239)
    http://images.apple.com/server/pdfs/User_Management_Admin_v10.4B.pdf