ZFS as a Root File System (PSARC/2006/370)
Author: Lori Alt (lori.alt@sun.com)
January 9, 2008

0. Contents

1. Introduction
2. Overview
3. Design Goals
4. Terminology
5. Boot Design
   5.1 Booting from ZFS on x86 platforms
   5.2 Booting from ZFS on sparc platforms
   5.3 Importing the Pool State during Boot
   5.4 Limits on ZFS Root Pools
6. ZFS Root in the Solaris Operating Environment
   6.1 Division of the Solaris Name Space into Datasets
   6.2 Naming the Datasets in the Boot Environment
       6.2.1 New "noauto" value for "canmount" property
   6.3 Mounting the Boot Environment
   6.4 System Initialization
   6.5 Swap/Dump
   6.6 Checkpoint-Restart (CPR) Boot - SPARC only
   6.7 Managing Boot Environments
       6.7.1 Format of the sparc menu.lst file
7. Installation
   7.1 Overview of Solaris Installation
   7.2 Changes to Installation to Support ZFS Boot
       7.2.1 Some Guiding Design Principles
       7.2.2 Initial Install
       7.2.3 Upgrade
       7.2.4 Servers of Diskless Clients
   7.3 Details of Changes to the Install Tools
       7.3.1 Jumpstart profile Interpretation (pfinstall)
       7.3.2 Interactive Install Programs
       7.3.3 LiveUpgrade
8. References
9. Interface Table
   9.1 Exported Interfaces
   9.2 Imported Interfaces
   9.3 Interfaces Reimplemented
Appendix A - Jumpstart Profile Keywords

1. Introduction

The ZFS file system was introduced in Update 2 of Solaris 10. One of the limitations of the initial release of ZFS was that it could not be used as a root file system. This document describes the changes to Solaris necessary to enable ZFS to be used as a root file system. The implementation of the design specified here will enable users to install systems with ZFS roots. It is intended for integration into Solaris Nevada and then into the most appropriate Solaris 10 Update release.

2. Overview

Enabling the ZFS file system to be Solaris's root file system requires the following broad tasks:

- Boot support must be added. This includes (but is not limited to) the implementation of boot loaders for both the sparc and x86 architectures.
- The operation of basic system management tasks, such as the booting of zones, or the mounting of file systems, must be defined in the context of a ZFS root. What does a system with a ZFS root look like? How is it different from a system with a UFS root? How are administrative tasks affected by the new root file system type?

- The installation software must be modified to support the creation of ZFS file systems and the installation of Solaris into them. The installation software must also support upgrades and patching of systems with ZFS root file systems.

These three broad areas--booting, system operation, and installation--will be addressed by this document.

3. Design Goals

This project is the first implementation of ZFS as a root file system for Solaris. The installation aspects of this project are for the near-term only. In the medium to long term, a new install project, Caiman, will take over installation and upgrade. Because of this, it is our goal that the installation of systems with ZFS roots be "good enough" in this project. We're not going to invest a lot of work in code that will end up being thrown away soon. The existing install code is very old (written in the early 1990's). It was written for a different world, where disks were small and expensive (along with many other differences from today, though disk size is the one that affects the install design the most).

However, this implementation of ZFS root does have to prepare for the long run. Once users move to the new model of pooled storage, they're not going to want another paradigm shift after that. We need to define the "way it works" in such a way that the transition from the initial release of ZFS root to Caiman will be smooth. Caiman will have both an initial install and an ongoing maintenance component.
Caiman's ongoing maintenance component in particular needs to be able to manage transitions from systems that were installed with the tools and processes described in this initial ZFS boot project.

With those concerns in mind, these are the goals of this design:

- Provide the tools to install systems that can make immediate use of many of the features of ZFS that are most valuable for system software. These features include:

  * pooled storage, giving users great flexibility in the setup of bootable environments.
  * snapshots and clones, which enable users to quickly and safely perform patches and upgrades.
  * robustness of storage, including mirroring.

- Enable the migration of systems with UFS root file systems to ZFS roots.

- Define the administrative aspects of ZFS root file systems for the long term, as much as is reasonably possible at this time. Whenever possible, do things "the ZFS way", so that system software management will more naturally track the developments in ZFS over the years.

- Prepare for Caiman. Set up installed systems in such a way that Caiman will be able to manage them. Define administrative procedures that are advantageous for Caiman's goals.

4. Terminology

The following terms are important in this design and will occur throughout the document:

root pool -- A ZFS storage pool that has been designated as a "root pool" by setting the value of the pool's "bootfs" property to something other than "none". When any of the devices that compose a root pool is booted (by specifying it as the boot device to the boot loader), the pool as a whole will be "booted", which means that a dataset in the pool (selected as described below) will become the root file system and will provide the necessary files for booting, such as the boot archive.

pool dataset -- The dataset at the root of a pool's dataset namespace. The pool dataset is the dataset that exists by default in every pool. The pool dataset's name is the same as the pool's name.
A pool named "tank" will have a pool dataset also called, simply, "tank". A pool dataset does not necessarily contain a root file system. In fact, usually it won't. It probably won't contain much of anything. It does have one very useful attribute: each pool has exactly one pool dataset. It cannot be deleted without deleting the pool itself. This makes the pool dataset a good place for files that contain per-pool state (as opposed to per-dataset state). The menu.lst file (the menu file for GRUB) is an example of a file that contains per-pool state and thus will be stored in the pool dataset.

bootable environment -- Often abbreviated as a "BE". A bootable environment is basically a Solaris root file system, plus everything that is subordinate to it (such as mounted file systems). There can be multiple bootable environments in a root pool. In the case of a ZFS root pool, the Solaris root file system in a BE is a ZFS dataset. There can be multiple root file system datasets, and thus multiple BEs, in a root pool.

bootable dataset -- a dataset which contains a root file system and which is part of a BE. Every BE contains exactly one bootable dataset. A BE can contain other datasets (such as a /usr or /opt file system), but those other datasets are not considered bootable. A bootable dataset must contain a boot archive.

default bootable dataset -- a dataset named by the value of the root pool's "bootfs" property. If a pool has a value of something other than "none" for its "bootfs" property, the pool is a root pool. If a device that is part of a root pool is booted, and a specific dataset to be booted is not explicitly identified in the command to the boot loader, the dataset identified by the "bootfs" property will be the root file system dataset to be booted.

5. Boot Design

The process of booting x86 platforms was revised by the Solaris Boot Architecture project (PSARC/2004/454), also called "newboot".
The booting of SPARC platforms has also been modified to adhere to the newboot architecture (PSARC/2006/525). The main aspects of newboot that characterize both the x86 and sparc implementations are:

* The use of a boot archive, which is a file system image that contains the files required for booting, and

* The use of a ramdisk as the root file system during the early stages of booting. In the case of booting for the purpose of doing an installation, the ramdisk is the root file system for the entire installation process, which eliminates the need to boot from removable media.

At this time, the file system type of the ramdisk file system can be either HSFS or UFS, but not ZFS. (ZFS is not well-suited for use as a ramdisk file system, and there is no particular reason to use it that way. UFS or HSFS is a better choice.)

ZFS boot, on both x86 and SPARC, assumes the newboot style of booting. On both the x86 and SPARC platforms, for both ufs and ZFS root file systems, the following tasks must be accomplished during the process of booting:

1. The boot device must be identified.

2. The PROM (either OBP or the BIOS) reads a boot loader from the boot device. The boot loader is loaded into memory and executed.

3. The boot loader selects a bootable environment (BE) to be mounted as the root file system. This selection can be made based on user input, variables set in NVRAM, or however else the particular boot loader has been programmed to make the choice.

4. The boot archive for the selected boot environment is loaded into memory and that memory range is accessed as a ramdisk, from which the files needed for booting are loaded and executed.

The biggest difference between booting from UFS and booting from ZFS is that with ZFS, a device identifier does not uniquely identify a root file system (and thus a BE). With ZFS, a device identifier uniquely identifies a pool, which can contain multiple bootable datasets.
So with ZFS, there must be a mechanism for identifying the dataset to be used as the root file system. The mechanism for specifying the dataset to be booted was defined in PSARC/2007/083. A pool property, "bootfs", specifies the default bootable dataset for the pool. When a device in a root pool is booted, the dataset mounted by default as the root file system is the one identified by the "bootfs" pool property. The user can override this selection, however. On x86 platforms, the GRUB menu can be used to select an alternate bootable environment. On sparc platforms, an option to the OBP "boot" command can specify the dataset to be booted.

5.1 Booting from ZFS on x86 platforms

Much of this was already defined by PSARC/2007/083 (ZFS Bootable datasets). It is summarized here for review purposes. The steps by which a ZFS root file system is booted are:

1. The BIOS reads the Master Boot Record (MBR) from the boot disk. The MBR identifies the location where the GRUB boot loader has been installed on the disk. The version of GRUB installed for the purpose of booting from ZFS has a ZFS-reader built into it.

2. The ZFS GRUB plug-in special-cases the menu.lst file. When asked to read the menu.lst file, the plug-in reads it from the pool dataset.

3. GRUB presents the menu.

4. The menu entry for a boot environment with a ZFS dataset as its root looks like this:

       title Solaris
       kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS
       module$ /platform/i86pc/$ISADIR/boot_archive

   Optionally, it can contain a command of the form:

       bootfs <dataset name>

   before the kernel$ command. The ZFS plug-in will replace the $ZFS-BOOTFS macro in the GRUB commands with the name of the dataset to be booted, which will be either the argument of the "bootfs" command (if one was specified) or the value of the "bootfs" pool property. In this way, the dataset selected for booting is passed to the kernel.

5. The kernel$ and module$ commands are executed, thereby loading the unix module and the boot archive into memory.
The unix module is now a multiboot-compliant executable as a result of the integration of direct boot (PSARC/2006/568). When the unix module is executed, it reads the remainder of the files it needs for booting from the boot archive. Eventually, zfs_mountroot is called.

6. zfs_mountroot() gets the name of the dataset to be mounted from the zfs-bootfs boot property. The vdev label from the boot device is read, which permits the reading of the pool metadata. The selected bootable dataset is then found in the pool metadata and mounted as root.

5.2 Booting from ZFS on sparc platforms

1. The boot device must have had a ZFS boot block installed on it. OBP reads the boot block into memory and executes it.

2. The ZFS boot block maps the boot device to a ZFS pool and reads enough of the pool's metadata to get the value of the "bootfs" pool property. The ZFS booter supports two features that allow an alternate dataset to be booted:

   i.  The '-L' switch to the "boot" command is passed to the booter and causes the booter to read the /boot/grub/menu.lst file in the pool dataset and to present a simple menu of BEs.

   ii. The user can either select a BE from the menu printed by the '-L' option or can boot a specific dataset by using the '-Z <dataset name>' switch to the booter.

3. The booter reads the file '/platform/`uname -m`/boot_archive' from the dataset selected for booting in the previous step (either by default or by explicit selection using '-Z'). The file is read into memory and set up as a ramdisk.

4. The booter creates a 'zfs-bootobj' property whose value is the identity of the dataset selected as the root file system. This is the same dataset from which the boot archive was loaded in the previous step. The booter also sets the value of the "fstype" property to "zfs".

5. The ramdisk set up by loading the boot archive in memory is itself booted. (This just means that blocks 1 - 15 of the device are loaded into memory and executed.)
The ramdisk will have a boot block that matches the file system type of the ramdisk contents, which will be either HSFS or UFS. (No need for ZFS here.)

6. Unix is read and booted from the ramdisk. When krtld gains control, it mounts the ramdisk and loads additional kernel modules from it. Eventually, zfs_mountroot is called (since the value of the "fstype" property was set to "zfs").

7. zfs_mountroot() imports the pool identified by the boot device and mounts the dataset identified by the "zfs-bootobj" property as root.

5.3 Importing the Pool State during Boot

Early in the kernel startup stage, before the zfs root file system can be mounted, the state of the root pool must be imported. The zpool.cache file cannot be read at this time because it isn't in the boot archive (see PSARC/2006/525 - Newboot Sparc). However, the information for the root pool is available in the metadata stored on the boot disk (which, by definition, is a vdev in the root pool). So the metadata used to make the root pool active is read directly from the boot disk.

Later in the boot process, after root is mounted, the zpool.cache file is read as usual. Once the zpool.cache file is read, all other pools listed in it can be imported. If the configuration of the root pool in the boot disk's metadata differs from the configuration of that same pool in the zpool.cache file, the data read from the boot disk takes precedence. (This could happen when booting an old BE that existed prior to some changes to the pool configuration.)

5.4 Limits on ZFS Root Pools

Initially, root pools are limited to mirrored configurations only. Striping of vdevs is not permitted. Nor is RAID-Z. The reason for this is that the firmware must be able to read everything needed for booting from a single disk. Each disk in a root pool must be fully accessible from the firmware on the system. Partly as a result of this constraint, best practice will be to have separate pools for system software and data.
Users may not want to constrain the configuration of data pools to the limits imposed on root pools. This is not the only reason for this segregation, however. In general, there are advantages to managing the "personality" of a system separately from its data. The segregation of system software and data is not mandatory, however. Users can combine them in one pool if they wish. In a future release, it is likely that booting from RAID-Z pools will be supported.

Another restriction on root pools is that the devices in the pool must have SMI labels (i.e., not EFI labels). This is a restriction imposed by OBP and the install software.

6. ZFS Root in the Solaris Operating Environment

ZFS has a very different administrative paradigm than UFS, or just about any other file system type. For the most part, these new administrative concepts, such as pooled storage, offer great advantages when ZFS is used as a root file system, but they can also mean changing the rules. When deciding how to use ZFS as a root file system, we need to make tradeoffs between the familiarity of "the way we've always done it" versus the advantages of the radical new ways that ZFS lets us manage system software. The next sections describe the ways that Solaris's interactions with root file systems will change with ZFS.

6.1 Division of the Solaris Name Space into Datasets

By default, the entire Solaris distribution will be installed into a single dataset (that is, no separate datasets for /usr, /opt or /var). There will be an option to put /var into a separate dataset. The reasons for this are:

1. It's simple and reflects current practice.

2. Without a clear plan for how Caiman would make use of additional datasets in the name space, there wasn't a good reason to have them.

3. The exception for /var is because some environments require a separate /var in order to prevent growth in /var from filling up the root file system and resulting in a Denial Of Service situation.
The space used by a separate /var dataset can be limited by a quota. In addition, it will be a documented best practice to create zone roots in their own datasets. The installation applications create /export and /export/home directories. These will be created as separate datasets and will be shared between BEs.

6.2 Naming the Datasets in the Boot Environment

The naming convention used by the installation and BE management software will be as follows: There will be a "container" dataset defined with the name:

    <rootpool>/ROOT

and a "mountpoint" property value of "legacy". (This dataset will not appear by default in any BE's file name space.) The datasets immediately under <rootpool>/ROOT will contain BEs, named as follows and with the following values for their mount point properties:

    dataset name                     mountpoint property
    ------------                     -------------------
    <rootpool>/ROOT/<BE name>        /
    <rootpool>/ROOT/<BE name>/var    inherited

where the /var dataset is optional.

6.2.1 New "noauto" value for "canmount" property.

ZFS datasets currently have a "canmount" property. If "on", the dataset can be mounted and is automatically mounted at import, creation, and system startup. If "off", the dataset can't be mounted at all. For zfs datasets in BEs, there needs to be a third option: "noauto". The "noauto" value has the following effect:

* The dataset can be mounted.

* The dataset can only be mounted and unmounted explicitly. It is not mounted automatically when the dataset is created or imported, and it is not mounted by "zfs mount -a" or unmounted by "zfs unmount -a".

This property is required for datasets in a BE because there could be multiple datasets with a mountpoint of "/". Obviously, only one of them can be mounted at a time. Furthermore, at the time that the install software creates a new bootable dataset and gives it a default mount point of "/", the creation would fail if there were no "noauto" option because there is already a file system mounted at "/".
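As an illustrative command sequence (the pool and BE names here are hypothetical, and "canmount=noauto" is the new value this section introduces), a second bootable dataset might be created like this:

```
# Create a second bootable dataset that shares the "/" mountpoint with
# the active BE; "noauto" keeps it from being mounted automatically at
# creation, at import, or by "zfs mount -a":
zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/be2
zfs get canmount rpool/ROOT/be2
```

Without "noauto", the zfs create above would fail while another dataset is mounted at "/".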
6.3 Mounting the Boot Environment

ZFS supports automatic mounting of file systems, without the need for /etc/vfstab entries. File systems that compose the BE will use zfs mounts, not legacy mounts. The fact that the root file system and the /var file system (if one exists) don't have entries in /etc/vfstab will simplify the process of cloning BEs because the /etc/vfstab file will not need to be edited to modify the name of the dataset being mounted.

6.4 System Initialization

With ZFS, there is no particular reason to do an initial read-only mount of the root file system. With UFS, root had to be mounted read-only so it could have fsck run on it. ZFS has no fsck. The first mount of a ZFS root file system will be read-write and there is no need for a later remount. Of course, the root dataset will be mounted read-only anyway if it is a read-only dataset (such as a snapshot; booting of snapshots is not supported yet, but probably will be at some point).

6.5 Swap/Dump

On systems with zfs root, swap and dump will be allocated from within the root pool. This has several advantages: it simplifies the process of formatting disks for install, and it provides the ability to re-size the swap and dump areas without having to re-partition the disk. Both swap and dump will be zvols. On systems with zfs root, there will be separate swap and dump zvols. The reasons for this are:

1. The implementation of swap and dump on zvols (not discussed here because it isn't a visible interface) does not allow the same device to be used for both.

2. The total space used for swap and dump is a relatively small fraction of a typical disk size, so the cost of having separate swap and dump devices is small.

3. It will be common for systems to need different amounts of swap and dump space. When they are separate zvols, it's easier to define the proper size of each.
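On an installed system, such zvols can be created and put into service with the standard tools; a sketch, with illustrative pool names and sizes (the actual sizes are chosen by the install software as described below):

```
# Create separate zvols for swap and dump in the root pool:
zfs create -V 2G rpool/swap
zfs create -V 2G rpool/dump

# Put them into service with the usual administrative commands:
swap -a /dev/zvol/dsk/rpool/swap
dumpadm -d /dev/zvol/dsk/rpool/dump
```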
Because swap is implemented as a zvol, features such as encryption and checksumming will automatically work for swap space (it is particularly important that encryption be supported). The swap and dump zvols for a pool will be the datasets:

    <rootpool>/swap
    <rootpool>/dump

Their sizes and configuration values will be set to default values by the installation software. The default values will be:

    dump: dump content is "kernel pages". Size is 1/4 the size of
          physical memory.
    swap: 512 MB, or 1% of the root pool size, whichever is larger.

Since it's easy for an administrator to change these sizes (by changing the size of the zvols), we no longer provide prompts in the install applications to set these sizes. In the old days, when swap and dump occupied fixed-size slices, it was important to set the size correctly when the disk was formatted by install. Now, it can be changed later as necessary. All BEs in the pool will share the swap and dump zvols.

The install program will put this entry in /etc/vfstab:

    /dev/zvol/dsk/<rootpool>/swap  -  -  swap  -  no  -

This is the only way that zfs boot still relies on the /etc/vfstab file. Install will use the dumpadm(1M) command to set up the dump device attributes on the installed system.

6.6 Checkpoint-Restart (CPR) Boot - SPARC only

The ZFS boot architecture supports the checkpoint-restart facility on SPARC platforms. This is achieved by setting the "statefile" value in the /etc/power.conf file to the device name of the dump zvol.

6.7 Managing Boot Environments

On both x86 and sparc systems, the list of BEs in a root pool will be contained in a "menu.lst" file. Up to now, menu.lst has been purely for GRUB use, but we will extend it to sparc as well. The menu.lst files will be managed as follows:

* There will be one menu.lst file per root pool and it will reside in the "pool dataset" (i.e., the dataset at the root of the dataset hierarchy) at /boot/grub/menu.lst.
* When LiveUpgrade has operated on a boot environment, the BE contains a file called /etc/lu/GRUB_slice, which specifies the slice that contains the active menu.lst. The /etc/lu/GRUB_slice file in a zfs-based BE will look like this:

      PHYS_SLICE=<device name of the boot slice>
      LOG_SLICE=<pool dataset>
      LOG_FSTYP=zfs

  The bootadm(1M) program will be modified to recognize a GRUB_slice file in this format and use it to find the active menu.lst file when an administrator performs a "bootadm list-menu" command.

* The "pool dataset" will be mounted in every BE at the location /pooldata. This means that the menu.lst file will appear in the BE's name space at /pooldata/boot/grub/menu.lst. It will NOT appear at /boot/grub/menu.lst. The "bootadm list-menu" command will show the correct location of the active GRUB menu.

* BEs residing in zfs pools (i.e., those whose root file system is a zfs dataset) will NOT have a /boot/grub/menu.lst file in their name space. The menu.lst file can be accessed at /pooldata/boot/grub/menu.lst.

* All of the disks in a mirrored root pool will share the same menu.lst file (obviously). However, if it is necessary to boot off a disk other than the primary boot disk in a mirror, only the "hd0" device in the menu.lst will be accessible. This is an artifact of how the BIOS works. Since the menu.lst file on sparc systems will not identify physical boot devices (only BE names), this will not be an issue on sparc. The complexities of booting from a backup mirror side on x86 are considerable (though not new), and will need to be documented in the context of zfs root pools.

6.7.1 Format of the sparc menu.lst file

The menu.lst file on sparc systems will contain two of the GRUB commands:

    title  - Provides a title for a boot environment.
    bootfs - The full name of a bootable dataset.

The "boot -L" OBP command, when executed on a device with a zfs boot loader, will print a menu with the "title" values for all the menu.lst entries. The user can select a menu item to boot.
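A minimal sketch of what the "boot -L" menu logic amounts to: read the pool's menu.lst and print a numbered list of BE titles. The menu.lst contents below are illustrative; on a real system the file lives in the pool dataset at /boot/grub/menu.lst.

```shell
#!/bin/sh
# Build a sample per-pool menu.lst (titles and datasets are examples only):
menulst=$(mktemp)
cat > "$menulst" <<'EOF'
title Solaris Nevada snv_78
bootfs rpool/ROOT/snv_78
title Solaris Nevada snv_79 (clone)
bootfs rpool/ROOT/snv_79
EOF

# One numbered menu line per "title" entry, as the booter's menu would show:
menu=$(awk '/^title /{ n++; sub(/^title /, ""); print n". " $0 }' "$menulst")
echo "$menu"

rm -f "$menulst"
```

Selecting an entry would then hand the corresponding "bootfs" dataset name to the booter.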
The dataset named by the "bootfs" value for the menu item will be used for all subsequent files to be read by the booter (such as the boot archive and the various configuration files in /etc) and that dataset will be mounted as root.

6.8 Failsafe Booting

Failsafe booting, as introduced in PSARC/2004/454 (Solaris Boot Architecture), will be supported on systems booted from zfs. As in the case of booting from UFS-rooted BEs, each BE will have a failsafe archive. The failsafe archive will be in the same location in the root file system as it is in a UFS-rooted BE. Each failsafe archive will have an entry in the pool-wide GRUB menu. The "default" failsafe archive will be the one in the default bootable dataset, as indicated by the value of the pool's "bootfs" property.

7. Installation

7.1 Overview of Solaris Installation

There are several ways to install and upgrade Solaris:

(a) Mini-root based installs

Some of the install procedures are run while the system is booted from a miniroot. A miniroot is an operating system image on a DVD or CD, or one downloaded to the system during a netinstall, which runs out of a ramdisk. In all of these scenarios, the system being installed is not booted off local writable storage. This is the only kind of install that can be used on an uninstalled system (i.e., a system with no bootable local disks).

(b) Live installs/upgrades

If a system has been configured with extra disk space, it is possible to do an install or upgrade of a system that is booted off a local disk. While booted off a previously-installed bootable environment on the system, a new bootable environment can be installed (on spare storage). This new bootable environment can be an upgrade of the existing bootable environment.

7.2 Changes to Installation to Support ZFS Boot

The following is an analysis of how the installation software will need to be changed to support ZFS boot.
Before looking at how the many parts of install will need to change, we need an overview of how the installation of a system with a ZFS root will work.

7.2.1 Some Guiding Design Principles

1. For now, ZFS boot disks will still have SMI labels. By default, when a zpool is created on an entire disk, it is given an EFI label. But the boot proms and the install software don't yet support EFI labels. ZFS boot must live within that limitation, for now. So when an install application specifies that a pool be created on a full disk, the disk will be formatted with an SMI label with a VTOC that dedicates all of the disk (except for the small amount of space required for the GRUB slice on x86) to slice 0, which then becomes the vdev for the root pool.

2. ZFS pools do not require entire disks. Even though best practices recommend using the entire disk for a pool, we recognize that some systems have very large disks which the administrator might not want to dedicate entirely to a root pool. So we support the splitting of disks into a slice for a root pool (or one mirror of a root pool) and the remainder of the disk for slices for other purposes, including pools that are not root pools.

7.2.2 Initial Install

1. The software to be installed will be selected (as it is now with UFS root file systems).

2. Determine whether the system being installed will end up with a UFS root or a ZFS root. It will not be possible to install part of the system software on ZFS and part on UFS. It will, of course, be possible to have ZFS file systems on a system with a UFS root and vice versa, but the software installed by Solaris install or upgrade must either be all on ZFS or all on UFS.

3. If UFS, install will work as it does now. If ZFS, the user will have the opportunity to designate multiple disks or disk slices for the root pool. In the first phase of ZFS boot support, these disks will be used to create a mirrored vdev. (RAID-Z configurations will likely be supported in the future.)
The user will have the opportunity to set the size of the root pool (the default should be the entire disk, but we can't require a whole disk).

4. The disks will be partitioned as specified in step 3.

5. The root pool will be created and the datasets composing the BE (/ and possibly /var) will be created. The swap and dump space zvols will be created within the root pool.

6. The standard "software install" part of the install backend will install all of the Solaris packages into the root file system datasets.

7. The boot "overhead" will be installed as necessary. The existence of a new bootable dataset will be recorded in the pool metadata. The menu.lst file will be created. The new boot environment will be recorded in /etc/lutab (and whatever other overhead files are required to establish a BE in LiveUpgrade). The boot archive will be created and the boot loader will be installed on each disk in the root pool.

One thing to note about the above procedure is that step 7 requires the setup of a LiveUpgrade boot environment. Currently, when a system is installed, a "BE" in the LiveUpgrade sense is not established. This will change. ZFS bootable environments will always get recorded as LiveUpgrade-compliant BEs. The overhead files that establish LiveUpgrade BEs will eventually be processed by the install tools that are part of Caiman. (Caiman might not use the same files as LiveUpgrade, but it will understand them and be able to process BEs established by LiveUpgrade.)

The above steps describe interactive installs. Jumpstart installs follow a similar series of steps, but the "questions" are answered from the profile, not by querying a user.

7.2.3 Upgrade

A system "upgrade" is the process of converting an installed instance of Solaris to a later version, while preserving all local customizations. The standard Solaris miniroot-based installation program currently supports an option to upgrade the installed Solaris instance.
This upgrade is done "in-place" (i.e., the new bits are written into the same file system where the older version was, thereby replacing the old version). This kind of "in-place" upgrade, done from the miniroot-based install program, will not be supported for zfs root file systems. There are several reasons for this:

1. "In-place" upgrade dates from the days when disks were much smaller and it was common not to have enough space for two Solaris instances. That's not true now that disks are seldom smaller than 80 GB or so, and Solaris instances are around 6 GB.

2. A "copy and upgrade" model has several advantages over an "in-place" upgrade. It can be done while the system is "live", and can be easily backed out. A "copy and upgrade" model does not have the so-called "toxic patch" problem.

3. ZFS is ideally suited for the "copy and upgrade" model of system upgrade. With ZFS, a Solaris instance can be easily cloned and modified. Because of pooled storage, the new Solaris instance doesn't require its own slice.

If an already-installed system is booted from an installation DVD/CD or a netinstall image, the install "discovery" software will detect the existence of any ZFS root pools on local storage. UFS roots are also detected. The logic for interactive installs will be: All local storage will be categorized as follows:

    Category 1: contains an upgradable ufs root file system
    Category 2: is part of a ZFS root pool
    Category 3: all other storage

If there are any upgradable ufs root file systems, the user will get an opportunity to upgrade them. If there are any root pools, the user will see a message indicating that the upgrade of any BEs in those pools must be accomplished by LiveUpgrade, not by the (currently-running) miniroot-based install. In all cases, if the user opts for an action that would destroy an existing pool, the user must be warned of that and given an opportunity to abort the install.
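The "copy and upgrade" model described above rests on ordinary ZFS operations. A hypothetical sketch (the dataset names are illustrative, not the install software's actual naming):

```
# Snapshot the running BE and clone it; the clone initially shares all of
# its blocks with the original, so it consumes almost no additional space:
zfs snapshot rpool/ROOT/be1@pre_upgrade
zfs clone rpool/ROOT/be1@pre_upgrade rpool/ROOT/be2
zfs set canmount=noauto rpool/ROOT/be2

# The upgrade tool then writes the new bits into the clone while be1
# stays live; backing out is just a matter of booting be1 again.
```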
7.2.4 Servers of Diskless Clients

Since exported services are just local directories, it will be possible to export services from zfs datasets. No special support for zfs is needed.

7.3 Details of Changes to the Install Tools

7.3.1 Jumpstart profile Interpretation (pfinstall)

New keywords are defined to support the creation of ZFS pools and datasets. A detailed description of these keywords is provided in Appendix A. Not all of these keywords will have corresponding screens in the interactive install programs. There is precedent for allowing more configuration capabilities in profiles than are supported in the interactive install programs (currently, the only way to set up a mirrored root with SVM is by a profile-driven install).

7.3.2 Interactive Install Programs

The interactive miniroot-based install programs are ttinstall (which has a character-based interface) and the install GUI. At this time, there are no plans to support the setup of zfs roots from the install GUI. Only ttinstall will support zfs root. (Naturally, the Caiman install will support ZFS fully.)

The ttinstall program has a "parade", which is a series of screens that query the user for the details of how the system is to be installed. The "parade" will need new screens to determine the following:

1. whether the system to be installed will have a zfs root
2. the disks to be added to the pool
3. the name of the root pool and the root dataset
4. how much of the disk should be dedicated to the root pool (default is "all")

In the initial (pre-Caiman) version of zfs root install, it will not be possible to set up a system with zfs root using Flash Install.

7.3.3 LiveUpgrade

The required changes to LiveUpgrade for zfs fall into two areas:

1. The changes required to enable boot environments (BEs) to be in zfs datasets.
2. Changes required to support zfs datasets as subordinate file systems in BEs with a ufs root.
This includes zfs file systems mounted under a ufs root, and support for non-global zone roots in zfs datasets. Technically, the items in (2) have nothing to do with zfs boot and should have been done as part of the original zfs integration. For whatever reason, most of them were not done. They need to be done now, since the use of zfs as a root file system necessitates full support for zfs in LiveUpgrade.

LiveUpgrade will be modified to make it possible to create a boot environment (BE) whose root is a zfs file system. These zfs-based BEs can be populated by cloning an existing BE with either a UFS or ZFS root. Cloning a UFS root will be the most common way to migrate from UFS root to ZFS root. If the source BE of a lucreate has a zfs root file system, the target BE will be created as a zfs clone of the source BE. This means that the lucreate will be very fast and will initially occupy very little space.

Currently, LiveUpgrade never partitions disks or allocates disk space for BEs. It depends on the slices to be used for BEs having already been created. This will remain true for zfs LiveUpgrade support. With zfs, there are two steps in allocating space: formatting disks and creating storage pools. Lucreate will not format disks or create pools. Both actions must have been performed before lucreate can create a BE in a zfs dataset. Note that LiveUpgrade will work for zfs and can be used to migrate systems from a ufs root to a zfs root, but the creation of the root storage pool must have been previously done by the administrator (since allocating storage has never been part of LiveUpgrade's job).

LiveUpgrade will work differently than it does now in the following ways:

* Mirroring of zfs file systems will not be directly supported by LiveUpgrade. Mirroring in zfs is fundamentally different from mirroring using SVM plus UFS. With zfs, mirroring is done at the pool level, not the file system level.
Therefore, it's not meaningful to specify that a zfs mount be mirrored. If the user wants mirrored storage for their BEs, they must create their storage pool using a mirrored vdev configuration. So the "attach" and "detach" options will not apply to zfs mounts.

* One of the purposes of having SVM-mirrored BEs (or more exactly, mirrored file systems within BEs) is to allow the fast "cloning" of a BE by detaching a side of the mirror and using that detached device as a new BE (and the basis of a patch or an upgrade). ZFS-based BEs will support a fast cloning capability even though mirroring of individual ZFS file systems is not supported. With ZFS, fast cloning of a BE will be performed by zfs's own dataset cloning capability. This is much easier to manage and plan for than SVM mirroring because it isn't necessary to pre-allocate a fixed amount of space for the clone of a ZFS file system.

There are two variants of the lucreate command that can be used with a zfs root file system:

1. Migrating a BE to a new pool. This can either be a migration from one ZFS pool to another, or from a UFS root to a zfs root. The form of the command is:

   lucreate -n <BE_name> -p <pool_name>

   This command requests that a new BE named <BE_name> be created in the pool named <pool_name>.

2. Cloning an existing ZFS-based BE to a new BE in the same pool:

   lucreate -n <BE_name>

   This command will be very easy to execute. All that is required is that the new BE be named. All datasets in the source BE (called the "PBE" in LiveUpgrade terminology) will be cloned and will appear as separate datasets in the new BE also.

All other LiveUpgrade commands, such as luactivate and luupgrade, can be used on the new BE. LiveUpgrade will take care of all the details of cloning all the datasets in the BE. The user shouldn't have to be aware that the BE is composed of multiple file systems. This is much easier than the use of LiveUpgrade with UFS, where each mounted file system in a BE must be represented by a "-m" option on the lucreate command line.

8.
References

PSARC/2004/454 Solaris Boot Architecture
PSARC/2005/198 Install Interface Changes Under New Boot
PSARC/2006/525 Newboot Sparc
PSARC/2007/083 ZFS Bootable Datasets

9. Interface Table

9.1 Exported Interfaces

   zfs properties: new "noauto" value for "canmount"   Evolving
   menu.lst for sparc                                  Evolving
   zfsboot loader for SPARC                            Evolving
   jumpstart keywords: pool                            Evolving
   /pooldata                                           Evolving

9.2 Imported Interfaces

   bootadm            Stable
   /etc/power.conf    Stable
   dumpadm            Stable

9.3 Interfaces Reimplemented

   /etc/lu/GRUB_slice    LiveUpgrade
   ttinstall
   jumpstart keywords: bootenv, rootdev

Appendix A - Jumpstart Profile Keywords for ZFS

ZFS interpretation of existing profile commands:

Basically, a profile can be used to set up a zfs root or a ufs root. If the profile is being used to set up a ufs root, all the existing profile keywords work as they do now. There is only one exception to that: the "filesys" keyword can now preserve a pool as well as an individual slice.

A profile which creates a zfs root is indicated by the presence of two keywords: a new "pool" keyword, and a "bootenv" keyword with a new "installbe" subcommand. If the profile has both of those keywords, it's a "zfs-creating" profile, and some keywords that are allowed in a ufs-creating profile will not be allowed (such as those specifying the creation of ufs mount points for parts of the Solaris namespace).

Here is a list of keywords that are permitted in zfs-creating profiles and their interpretation:

* filesys

   filesys <slice> existing ignore preserve

   This tells pfinstall to leave the specified slice untouched and work around it. The command will be extended to allow a pool name to appear in the <slice> field. This causes all vdevs in the pool to be preserved.

   filesys <slice> <size> unnamed

   where <slice> can be "any". This profile entry causes a raw slice of the specified size to be created (it can be newfs'ed or otherwise initialized by a finish script, if necessary).

   All other uses of the "filesys" keyword are prohibited.
* rootdev

   If present, specifies the device to be used for the root pool.

New keywords:

* pool

   pool <pool_name> <pool_size> <swap_size> <dump_size> [<vdev_list>]

   This command can be used either to specify a new pool, or to identify an existing pool to be used for installation.

   fields:

   pool_name - specifies the name of the pool. If the pool already exists, the install will be performed into newly created datasets in that pool. In this case, all remaining arguments are ignored. If the pool doesn't already exist, it will be created with the specified size and on the specified vdevs. If the pool already exists, but isn't a valid root pool, pfinstall will print a message and quit. If the pool doesn't exist and <pool_size> and <vdev_list> aren't supplied, we error out. So this lets us both install into an existing pool, or define a new one. If "-" is specified as the pool name, we look for the current root pool. If the system isn't already set up for zfs boot (i.e., no current root pool), error out.

   pool_size - size of the pool to be created. Can be "auto" or "existing" (meaning the boundaries of the existing slices by that name are preserved and overwritten by a zpool). Size is assumed to be in megabytes, unless terminated by "g" (gigabytes).

   swap_size - size in MB or GB of the swap zvol to be created. Can be "auto" (the default swap size will be used). (MB assumed; can specify GB by terminating the size with "g".)

   dump_size - size in MB or GB of the dump zvol to be created. Can be "auto" (the default dump size will be used). (MB assumed; can specify GB by terminating the size with "g".)

   vdev_list - list of devices to be used to create the pool. The format of the vdev list is the same as used for the "zpool create" command. At this time, only mirrored configurations are supported. In the future, we will probably support the raid-z configuration. Devices in the vdev list can either be whole devices or slices. They can also be the string "any", which means that the install software will select a suitable device.
* bootenv installbe

   Actually, the bootenv keyword already exists, but we will define a new subcommand. It will have the syntax:

   bootenv installbe <BE_name>

   which means: create a new BE called <BE_name> and install it.

----------------------

Examples:

Case 1 - new pool, mirrored

   install_type initial_install
   pool newrootpool auto auto auto mirror c0t0d0 c0t1d0
   bootenv installbe s10_u6

Case 2 - new pool, mirrored, devices to be assigned, alternate metacluster selected. Pool size specified.

   install_type initial_install
   cluster SUNWCuser
   pool newrootpool 80g 2g auto mirror any any
   bootenv installbe s10_u6

This profile creates a new pool of 80 GB with a 2 GB swap zvol and a dump zvol of the default size, on a mirror of any two available devices that are large enough to supply an 80 GB pool (if two such devices aren't available, the install will fail). A new BE called "s10_u6" will be created.

Case 3 - LiveInstall

   install_type initial_install
   pool currootpool
   bootenv installbe s10_u6

This creates a new BE called s10_u6 in the existing pool "currootpool".

Case 4 - Upgrade

   install_type upgrade

That's all you need. Since this is called by LU, the BE to be upgraded will already have been mounted. The root of it is passed to pfinstall by the -L option.
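The size syntax shared by the pool_size, swap_size, and dump_size fields above (a bare number is megabytes; a trailing "g" means gigabytes) can be sketched as a small shell helper. This is an illustrative sketch only; size_to_mb is a hypothetical function name, not part of pfinstall:

```shell
# Convert a profile size field to megabytes: a bare number is
# already MB; a trailing "g" means gigabytes (1 GB = 1024 MB).
# ("auto" and "existing" would be handled before this point.)
size_to_mb() {
  case "$1" in
    *g) echo $(( ${1%g} * 1024 )) ;;
    *)  echo "$1" ;;
  esac
}

size_to_mb 80g   # the 80 GB pool from Case 2
size_to_mb 512   # a bare number is taken as megabytes
```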