S10C: A Solaris 10 Branded Zone for Solaris Next

Gerald Jelinek, Jordan Vaughan
Solaris Virtualization Technologies

[A note on terminology: This document abbreviates "Solaris 10" as "S10" and "virtualization" as "V12N".]

Part 1: Introduction
____________________________

Each new minor release of Solaris brings with it the well-known problems of slow user adoption, slow ISV support and concerns about compatibility. The compatibility concerns will be more pronounced with the release of Solaris Next because of the expected greater-than-normal user-visible changes (e.g., the packaging system). Fortunately, V12N techniques have become widespread since the last minor release of Solaris (S10) and can be used to ease the transition to the new version of Solaris. Zones[1] combined with a brand[2] are particularly well suited for this task because the host system is running Solaris Next, which is not necessarily the case with other V12N technologies. In addition, zones are usable on any system that runs Solaris Next, which is not the case with other V12N technologies.

We already have a proven track record delivering this sort of zones/brand-based solution to enable running earlier versions of Solaris on S10 [3, 4], so in one sense this case breaks little new ground. However, the earlier solaris8 and solaris9 brands hosted releases that were mostly static, whereas S10 continues to receive update releases. In addition, Solaris Next can be expected to continue to change rapidly for the foreseeable future. Given this, a solaris10 brand for Solaris Next poses additional challenges for projects on both the S10 and Solaris Next sides of the system.

Many of these challenges are outside of the scope of an architectural review and include developer education, testing and procedural changes. However, the existence of this brand could potentially impact future projects in various ways and at a minimum will require ARC consideration during future reviews.
The existence of this brand can be seen as a potential "tax" on all projects that work on both sides of the user/kernel boundary for both S10 and Solaris Next.

The benefits of the brand are as follows:

For customers:
- Provides a solution to cope with compatibility differences between S10 and Solaris Next
- Protects investment in S10 infrastructure, training, and internal support
- Minimizes the cost of consolidating S10 systems
- Enables deployment of new technologies in Solaris Next (e.g., Crossbow) while still running applications on S10, thereby limiting risk to production environments
- Avoids or delays required application recertification

For Sun:
- Solaris Next is adopted sooner
- Provides an S10 compatibility environment for Solaris Next
- Sun is a solution provider easing the burden of getting to Solaris Next
- Provides a cross-platform V12N solution for Solaris Next (it is the only V12N solution on M-Series)

This has been identified as a required feature for Solaris Next.

=== Project Overview ===

As with the earlier solaris8 and solaris9 brands, this project delivers the following:

- A Branded Container which emulates S10 user environments using the BrandZ infrastructure provided with zones. This brand is called 'solaris10'. Only Solaris 10u8 and beyond will be supported and tested with the brand. Systems running releases earlier than S10u8 must be upgraded before being moved to Solaris Next.

- A mechanism for archiving existing S10 systems and for deploying those archives into solaris10-branded zones. This process is referred to as "physical-to-virtual migration" (p2v) and uses the same techniques as the solaris8 and solaris9 brands.

In addition, the following capabilities and characteristics, which do not exist for the solaris8 and solaris9 brands, will apply:

- This brand will be supported on all hardware architectures that run Solaris Next (sun4v, sun4u and x86).
  The specific platforms, particularly sun4u, will be the same as those certified for Solaris Next.

- A "virtual-to-virtual" (v2v) mechanism for archiving existing S10 native zones and deploying those archives into solaris10-branded zones on Solaris Next will be provided. The process will be very similar to that of the existing zone migration [5] feature except that the zone's brand must change in the process. In addition, if the zone is sparse on S10, then it must be converted to a whole-root zone during the migration.

The functionality of the solaris10 brand will be delivered in phases. The first phase will include the basic brand module, p2v, v2v, and experimental support for exclusive IP stack zones and delegated ZFS datasets. Follow-on work will complete the support for exclusive IP stack and delegated ZFS datasets and provide a way to upgrade the S10 instances within the zones to newer S10 update releases.

Part 2: solaris10 Brand
____________________________

The solaris10 brand is conceptually similar to the existing solaris8 and solaris9 brands and builds directly on the BrandZ infrastructure that was created to support the lx brand. Familiarity with BrandZ and the solaris8 and solaris9 brands is assumed.

The initial phase of the brand will only support the shared stack [6] networking model in which a zone's network configuration is managed by the global zone. The exclusive IP stack model is expected to require more complex emulation due to the networking changes in Solaris Next. Thus, exclusive IP stack zones will be considered experimental in the initial phase of the project.

The ZFS ioctls have been audited and no issues have been seen. Because much of ZFS has been backported to S10 updates earlier than the first S10 version supported by the brand (S10u8), ZFS delegated datasets appear to work fine; however, further testing needs to be done and future ZFS enhancements might require special brand changes.
Thus, delegated ZFS datasets will be considered experimental in the initial phase of the project.

=== System Call Emulation ===

This section details the system call emulation provided by the current solaris10 brand module. The following system calls are currently being emulated:

    Syscall Name        Syscall ID
    ----------------------------------------
    SYS_exec            11
    SYS_ioctl           54
    SYS_execve          59
    SYS_acctctl         71
    SYS_issetugid       75
    SYS_uname           135
    SYS_systeminfo      139
    SYS_lwp_create      159 (x86 only)
    SYS_lwp_private     166 (x86 only)
    SYS_pwrite          174
    SYS_auditsys        186
    SYS_sigqueue        190
    SYS_pwrite64        223
    SYS_zone            227

SYS_exec
SYS_execve
    The brand interposes on these system calls to provide a convenient mechanism for branded processes to spawn native processes.

SYS_ioctl
    Some device ioctls must be emulated due to parameter changes.

    - /dev/crypto
      The crypto_get_function_list_t structure returned by CRYPTO_GET_FUNCTION_LIST grew in Solaris Next.

    - /dev/zfs
      There are differences in the zfs_cmd_t structure between S10 and Solaris Next.

    - /system/contract (ctfs)
      Process contract ioctls need to be emulated for init(1M) because the ioctl parameter structure changed between S10 and Solaris Next.

SYS_acctctl
    The mode shift, mode mask and option mask for acctctl changed for Crossbow.

SYS_issetugid
    S10's issetugid() syscall is now a subcode to privsys().

SYS_uname
    The brand simply passes this through to the native kernel, then modifies the result upon return so that the system call returns "5.10" for the 'release' field and "Generic_Virtual" for the 'version' field.

SYS_systeminfo
    The emulator interposes on the sysinfo(2) commands SI_RELEASE and SI_VERSION; all others get passed through to the native kernel.

SYS_lwp_create
SYS_lwp_private
    Due to some paravirtualization changes for the xVM hypervisor, the S10 libc expects that the %fs register will be nonzero for new 64-bit x86 LWPs but the Solaris Next kernel clears %fs for such LWPs.
    The brand interposes on these syscalls in order to set %fs to the expected value. The Solaris Next kernel can safely work with nonzero %fs values because the kernel configures per-thread %fs segment descriptors so that the legacy %fs selector value will still work. See the comment in lwp_load() regarding %fs and %fsbase in 64-bit x86 processes.

    This emulation is needed due to CRs 6467491 and 6501650. We are exploring a backport of these changes to an S10 update so that this emulation will no longer be necessary.

SYS_pwrite
SYS_pwrite64
    pwrite()'s behavior differs between S10 and Solaris Next when applied to files opened with O_APPEND. The offset argument is ignored and the buffer is appended to the target file in S10, whereas the current file position is ignored in Solaris Next (i.e., pwrite() acts as though the target file wasn't opened with O_APPEND). This is a result of the fix for:

        6655660 pwrite() must ignore the O_APPEND/FAPPEND flag

    The brand emulates the old S10 pwrite() behavior by checking whether the target file was opened with O_APPEND. If it was, then the brand invokes the write() syscall instead of pwrite(); otherwise, it invokes the pwrite() syscall as usual.

SYS_auditsys
    The brand interposes on the BSM_AUDITCTL command for the A_GETPOLICY and A_SETPOLICY subcommands because the audit policy bit flags have changed due to the removal of the AUDIT_USER flag and the downward shifting of the subsequent bits. All other subcommands are passed through to the native kernel.

SYS_sigqueue
    The new block flag argument should be zero. The block flag is used by the OpenSolaris AIO implementation, which is now part of libc.

SYS_zone
    See discussion below.

=== zone(2) support ===

Zones have been part of S10 since its FCS, so in general S10 is already zone-aware and does the right thing in most cases. Commands that are zone-aware will continue to work as they do today in S10 native zones.

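
As an aside on the SYS_pwrite/SYS_pwrite64 emulation described in the System Call Emulation section, the decision logic can be sketched in user-level C as follows. This is a minimal sketch and not the brand's actual code: the function name s10_pwrite_sketch is hypothetical, and it assumes the open-file status flags can be queried with fcntl(F_GETFL).

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from the brand module): provide the old
 * S10 pwrite() semantics on a system where pwrite() always honors the
 * offset argument.
 */
ssize_t
s10_pwrite_sketch(int fd, const void *buf, size_t nbytes, off_t offset)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return (-1);	/* errno was set by fcntl(), e.g. EBADF */

	if (flags & O_APPEND) {
		/* S10 behavior: ignore the offset and append the buffer. */
		return (write(fd, buf, nbytes));
	}

	/* File was not opened with O_APPEND: positional write as usual. */
	return (pwrite(fd, buf, nbytes, offset));
}
```

A caller would use this helper exactly as it would use pwrite(); only descriptors that carry O_APPEND take the write() path.
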
One set of commands that requires emulation is the S10 SVr4 packaging and patch commands. Those commands are zone-aware and in some cases will check if they are running in the global zone and refuse to function if not. If they are running in the global zone, then they will also attempt to look for other zones on which to operate. The brand emulation interposes on the zone syscall and selectively provides emulation when the running command is one of the SVr4 package or patch commands. In these cases the emulation indicates that the current zone is the global zone (zoneid 0) and various zone attributes, such as the zone brand itself, are emulated. In all other cases the syscall is passed to the native kernel so that the other S10 commands continue to behave as they currently do.

Because the solaris10-branded zones are whole-root zones, all packaging and patch operations will succeed, although the kernel components of the package or patch will not be used. This is exactly the same behavior as in solaris8- and solaris9-branded zones.

One further consideration for zones is related to the p2v process. There may be zones on the original physical system during a p2v operation. Since zones do not nest, p2v-ing these systems means that the zones themselves will not be usable inside the branded zone. This is detected when the zone is installed and a warning is issued indicating that any nested zones will not be usable and that the disk space could be recovered. Those zones can be migrated ahead of time using the v2v feature described below. In addition, a future project is planned which will examine a system prior to p2v and report issues that might arise. Detecting zones will be part of the examination.

=== Privileges ===

If existing privileges were to be broken up or removed, then the brand would be impacted if the privilege was available to or expected within a zone.
Adding new privileges to Solaris Next should be OK because nothing in S10 should use those privileges and properly written S10 applications should be able to cope with additional privileges they do not understand. Zones already see a subset of the privileges available to the global zone and since zone privileges are configurable, it is normal for applications within zones to see varying sets of privileges.

=== What Is Not Emulated ===

This project does not make any changes to existing native zone limitations.

- TX
  TX will continue to be incompatible with branded zones. Customers using TX on S10 systems will need to transition to certified, native Solaris Next TX solutions. Discussions with the TX team indicate that this is the normal procedure for TX users because the base OS itself must be certified for TX.

- Unsupported Devices
  The /dev/sound device is currently not supported in solaris10-branded zones. This device was changed incompatibly in Solaris Next. This is an acceptable limitation because the device is rarely added to zones (less than one tenth of one percent of zones in the explorer database). Support can be added if demand changes.

=== Versioning ===

Future revisions of the solaris10 brand module are expected to provide compatibility for any release of S10u8 and beyond. That is, one zone could run S10u8, a different zone S10u9, and so on.

Because of the potential issues with compatibility of various releases of S10 hosted on different releases of Solaris Next, a basic versioning system is incorporated into the brand. This versioning system works two ways. First, the brand emulation can check which version of S10 is being hosted inside zones and dynamically adjust the emulation accordingly. Second, future S10 updates that require specific emulation can indicate that a specific version of the emulation is required. If necessary, they can also check if they are running in branded zones and, if so, determine which emulation version is available.
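
One way to picture the second direction, in which an S10 image declares the minimum emulation version it needs, is the following C sketch of the comparison the brand could make against a version file such as the one this section goes on to describe. It is illustrative only: the function name, the return-value convention, and the assumption that the file holds a single decimal integer are not taken from this case. The file path is passed in as a parameter so that the sketch stays self-contained.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical check (names and file format are assumptions).
 * Returns 1 if the brand's emulation version satisfies the image,
 * 0 if the brand is too old, and -1 if the file cannot be read or parsed.
 */
int
s10_emul_version_ok(const char *version_path, int brand_version)
{
	FILE *fp = fopen(version_path, "r");
	int required;

	if (fp == NULL) {
		/* A missing file means the initial emulation is sufficient. */
		return (errno == ENOENT ? 1 : -1);
	}

	if (fscanf(fp, "%d", &required) != 1) {
		(void) fclose(fp);
		return (-1);	/* file exists but holds no version number */
	}
	(void) fclose(fp);

	return (brand_version >= required ? 1 : 0);
}
```

In this sketch a result of 0 corresponds to the case in which the install would be refused, and a missing file is treated as compatible with any brand version.
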
The initial release of the software will not need this versioning mechanism, but it is being included to cope with possible future enhancements to either S10 or Solaris Next.

Changes made to S10 that require enhancements to the brand emulation library are expected to be delivered in S10 KU patches that provide components on both sides of the user/kernel boundary. When branded zones boot, the brand boot hook determines the minimal version of the KUs installed in the zones to verify that the zones' releases are supported. (Currently the minimal KU will be the one from S10u8.) The brand then makes the associated versions (e.g., version 0 of the emulation) available as attributes of the zones. The brand library can then use this information to provide conditional emulation if needed. Future projects that enhance the emulation for new features in S10 can add checks for different KU version numbers that would provide associated versions (e.g., 1, 2, etc.) to the brand library. If the KU version is not sufficient, future S10 projects may need to design some other version check for the brand to enable it to properly detect the S10 changes. The ability to detect the KU version is already covered by the contract on the zone "update on attach" feature [7].

The situation is more complicated for future changes within the S10 code base that will require associated enhancements to the brand emulation. There are two mechanisms being proposed.

The first mechanism is that future versions of S10 can specify that they require minimal versions of the brand emulation. They can do this by delivering a version number into the '/usr/lib/brand/solaris10/version' file on S10. When this future version of S10 is p2v-ed into a solaris10-branded zone, the solaris10 brand will check for the presence of this file and if it exists, verify that the brand's version is greater than or equal to the version specified in the S10 file.
If not, then an error will be emitted and the zone p2v will fail, leaving the zone in the configured state. If the '/usr/lib/brand/solaris10/version' file is missing on S10, that indicates that the version of S10 is still compatible with the initial release of the solaris10 brand emulation. This file must be created and the version number bumped the first time a project is backported to S10 that requires an enhancement to the emulation. This mechanism will be useful if a future S10 update will be fundamentally incompatible with an older version of the Solaris Next brand emulation.

The second mechanism allows projects that have been backported to S10 to actually be brand aware. A new zone attribute will be available indicating which version of the brand emulation is currently installed on the system. If future S10 updates deliver new features that require changes to the brand library, then the S10 updates can determine if they are running in branded zones and take appropriate action. If a newer S10 update is running in a zone on an older version of Solaris Next that does not provide the required emulation, then the S10 feature can adjust its behavior.

The existing getzoneid() and zone_getattr(ZONE_ATTR_BRAND) functions can be used by S10 code to determine if it is running in a non-global zone and if that zone is a solaris10-branded zone. A new solaris10 brand-specific zone attribute, S10_EMUL_VERSION_NUM, is defined. The S10 feature can use the zone_getattr(S10_EMUL_VERSION_NUM) function to determine if the brand emulation supports the feature. The getzoneid() and zone_getattr() functions are already used throughout the ON consolidation for code that is zone-aware. These functions will continue to be consolidation private.

Engineers backporting features to a future S10 update will need to first determine if the features require enhancements to the solaris10 brand library.
If so, then they will have to enhance the emulation in Solaris Next and bump the emulation version number. They can then either bump the minimal emulation version number in the /usr/lib/brand/solaris10/version file on S10 during the S10 backport or they can add appropriate checks to the backported S10 code so that it can determine if support is available in the brand library and change behavior accordingly.

This obviously adds complexity to projects backporting features to future S10 updates if those features require emulation to function correctly in a solaris10-branded zone. Ideally, projects requiring such enhancements to the brand emulation will not be backported. The presence of the solaris10 brand may itself discourage such backports because the brand provides S10 compatibility on Solaris Next.

Future projects that cross the user/kernel boundary and request patch binding should be reviewed by the ARCs to determine if those projects must take the solaris10 brand into account.

In addition to the above, any changes integrating into Solaris Next that might impact the solaris10 brand will need to test the supported versions of S10 in the branded zone and make any needed changes to the solaris10 emulation.

=== Upgrade ===

Because S10 update releases will continue for the foreseeable future, some form of upgrade must be available for the solaris10-branded zones. For example, an S10u8-based zone should be upgradeable to S10u9. The mechanism for upgrading an individual solaris10-branded zone will be part of a follow-on phase of the project with a separate ARC case. Something like a modified version of live-upgrade that will work within zones will probably be used.

Part 3: Archiving, Installation, p2v & v2v
____________________________

The p2v process for the solaris10 brand is the same as for the solaris8, solaris9 and native [8] brands.
A contract will be included with this case for the flar command to explicitly call out the use of flash archives for migrating system images into zones. The p2v conversion during the installation of the zone will be similar to the native p2v process [8].

The v2v process for migrating S10 native zones to solaris10-branded zones will support the same archive formats as p2v. This process will use the 'zoneadm attach' subcommand because it is the existing interface for migrating [5] zones from one system to another. The solaris10 brand's attach subcommand will be extended to accept the following options, which correspond to the same options in the install subcommand:

    -a {path}   specifies a path to an archive to unpack into the zone
    -d {path}   specifies a path to a tree of files as the source for the installation

One issue with v2v of S10 zones is that such zones can be sparse root zones but solaris10-branded zones must be whole root zones. To address this, zones must be readied on the source system. Readying a zone mounts any inherited-pkg-dirs, so archives made of the readied zones will contain all of the required file systems.

Part 4: Interface Table
____________________________

The S10C project seeks minor release binding.

Although the solaris10 brand is conceptually no different from the other brands created to date (lx, solaris8 & solaris9) in terms of the potential impact on developers crossing the user/kernel boundary, it was felt that the awareness of the impact should be emphasized with this case. As a result, a contract (contract01.txt) is included with this case which is intended to raise awareness that changes that cross the user/kernel boundary, regardless of interface taxonomy, could impact the proper functioning of the brand.

Exported Interfaces                        Stability              Comments
---------------------------------------------------------------------------
"solaris10" brand name                     Committed
"SUNWsolaris10" brand template name        Committed
For the solaris10 brand
  brand-specific install and attach
  subcommand options                       Committed              documented in this case
/usr/lib/brand/solaris10 directory         Committed
SUNWs10brandr, SUNWs10brandu packages      Committed
/usr/lib/brand/solaris10/version           Committed
getzoneid(), zone_getattr(),
  ZONE_ATTR_BRAND and
  S10_EMUL_VERSION_NUM attributes          Consolidation Private

Imported Interfaces                        Stability              Comments
---------------------------------------------------------------------------
brandz[2]                                  Project Private
Solaris Next syscall traps
  documented above                         Consolidation Private
user/kernel boundary                       Any                    contract01.txt included with this case
flar(1M)                                   Evolving               contract02.txt included with this case

REFERENCES

1. PSARC 2002/174 Virtualization and Namespace Isolation in Solaris
2. PSARC 2005/471 BrandZ: Support for non-native zones
3. PSARC 2007/350 Etude: Migration Technology
4. PSARC 2008/125 Etude Part Deux
5. PSARC 2006/030 Zone migration
6. PSARC 2006/366 Stack instances: Exclusive IP stack per zone
7. PSARC 2007/621 zone update on attach
8. PSARC 2008/766 native zones p2v