ZFS as a Root File System (PSARC/2006/370)
Author: Lori Alt (lori.alt@sun.com)
June 8, 2007

0. Contents

1. Introduction
2. Overview
3. Design Goals
4. Terminology
5. Boot Design
   5.1 Booting from ZFS on x86 platforms
   5.2 Booting from ZFS on sparc platforms
   5.3 Checkpoint-Restart (CPR) Boot - SPARC only
   5.4 Limits on ZFS Root Pools
6. ZFS Root in the Solaris Operating Environment
   6.1 Division of the Solaris Name Space into Datasets
   6.2 Mounting the Boot Environment
   6.3 System Initialization
   6.4 Swap/Dump
7. Installation
   7.1 Overview of Solaris Installation
   7.2 Changes to Installation to Support ZFS Boot
       7.2.1 Some Guiding Design Principles
       7.2.2 Initial Install
       7.2.3 Upgrade
       7.2.4 Servers of Diskless Clients
   7.3 Details of Changes to the Install Tools
       7.3.1 Jumpstart Profile Interpretation (pfinstall)
       7.3.2 Interactive Install Programs
       7.3.3 LiveUpgrade
8. Summary
9. References
Appendix A - Jumpstart Profile Keywords

1. Introduction

The ZFS file system was introduced in Update 2 of Solaris 10. One of the limitations of that initial release is that ZFS could not be used as a root file system. This document describes the changes to Solaris necessary to enable ZFS to be used as a root file system. The implementation of the design specified here will enable users to install systems with ZFS roots. It is intended for integration into Solaris Nevada and then into the most appropriate Solaris 10 Update release.

2. Overview

Enabling the ZFS file system to be Solaris's root file system requires the following broad tasks:

- Boot support must be added. This includes (but is not limited to) the implementation of boot loaders for both the sparc and x86 architectures.

- The operation of basic system management tasks, such as the booting of zones or the mounting of file systems, must be defined in the context of a ZFS root. What does a system with a ZFS root look like? How is it different from a system with a UFS root?
How are administrative tasks affected by the new root file system type?

- The installation software must be modified to support the creation of ZFS file systems and the installation of Solaris into them. The installation software must also support upgrades and patching of systems with ZFS root file systems.

These three broad areas--booting, system operation, and installation--will be addressed by this document.

3. Design Goals

This project is the first implementation of ZFS as a root file system for Solaris. The installation aspects of this project are for the near term only. In the medium to long term, a new install project, Caiman, will take over installation and upgrade. Because of this, it is our goal that the installation of systems with ZFS roots be "good enough" in this project. We're not going to invest a lot of work in code that will end up being thrown away soon. The existing install code is very old (written in the early 1990s). It was written for a different world, where disks were small and expensive (along with many other differences from today, though disk size is the one that affects the install design the most).

However, this implementation of ZFS root does have to prepare for the long run. Once users move to the new model of pooled storage, they're not going to want another paradigm shift after that. We need to define the "way it works" in such a way that the transition from the initial release of ZFS root to Caiman will be smooth. Caiman will have both an initial install and an ongoing maintenance component. Caiman's ongoing maintenance component in particular needs to be able to manage transitions from systems that were installed with the tools and processes described in this initial ZFS boot project.

With those concerns in mind, these are the goals of this design:

- Provide the tools to install systems that can make immediate use of many of the features of ZFS that are most valuable for system software.
  These features include:

  * pooled storage, giving users great flexibility in the setup of bootable environments.
  * snapshots and clones, which enable users to quickly and safely perform patches and upgrades.
  * robustness of storage, including mirroring.

- Enable the migration of systems with UFS root file systems to ZFS roots.

- Define the administrative aspects of ZFS root file systems for the long term, as much as is reasonably possible at this time. Change happens and we don't have a crystal ball, so we can't be sure that the decisions we make now will stand the test of time, but we can at least choose to do things "the ZFS way", so that system software management will more naturally track the developments in ZFS over the years.

- Prepare for Caiman. Set up installed systems in such a way that Caiman will be able to manage them. Define administrative procedures that are advantageous for Caiman's goals.

4. Terminology

The following terms are important in this design and will occur throughout the document:

root pool -- A ZFS storage pool that has been designated as a "root pool" by setting the value of the pool's "bootfs" property to something other than "none". When any of the devices that compose a root pool is booted (by specifying it as the boot device to the boot loader), the pool as a whole will be "booted", which means that a dataset in the pool (selected as described below) will become the root file system and will provide the necessary files for booting, such as the boot archive.

pool dataset -- The dataset at the root of a pool's dataset namespace. The pool dataset is the dataset that exists by default in every pool. The pool dataset's name is the same as the pool's name: a pool named "tank" will have a pool dataset also called, simply, "tank". A pool dataset does not necessarily contain a root file system. In fact, usually it won't. It probably won't contain much of anything.
It does have one very useful attribute: each pool has exactly one pool dataset, and it cannot be deleted without deleting the pool itself. This makes the pool dataset a good place for files that contain per-pool state (as opposed to per-dataset state). The menu.lst file (the menu file for GRUB) is an example of a file that contains per-pool state and thus will be stored in the pool dataset.

bootable environment -- Often abbreviated as a "BE". A bootable environment is basically a Solaris root file system, plus everything that is subordinate to it (such as mounted file systems). In the case of a ZFS root pool, the Solaris root file system in a BE is a ZFS dataset. There can be multiple root file system datasets, and thus multiple BEs, in a root pool.

bootable dataset -- A dataset which contains a root file system and which is part of a BE. Every BE contains exactly one bootable dataset. A BE can contain other datasets (such as a /usr or /opt file system), but those other datasets are not considered bootable. A bootable dataset must contain a boot archive.

default bootable dataset -- The dataset named by the value of the root pool's "bootfs" property. If a pool has a value other than "none" for its "bootfs" property, the pool is a root pool. If a device that is part of a root pool is booted, and a specific dataset to be booted is not explicitly identified in the command to the boot loader, the dataset identified by the "bootfs" property will be the root file system dataset to be booted.

5. Boot Design

The process of booting x86 platforms was revised by the Solaris Boot Architecture project (PSARC/2004/454), also called "newboot". The booting of SPARC platforms is currently being modified to adhere to the newboot architecture as well (PSARC/2006/525).
The main aspects of newboot that characterize both the x86 and sparc implementations are:

* the use of a boot archive, which is a file system image that contains the files required for booting, and
* the use of a ramdisk as the root file system during the early stages of booting.

In the case of booting for the purpose of doing an installation, the ramdisk is the root file system for the entire installation process, which eliminates the need to be booted from removable media. At this time, the file system type of the ramdisk can be either HSFS or UFS, but not ZFS. (ZFS is not well suited for use as a ramdisk file system, and there is no particular reason to use it that way; UFS or HSFS is a better choice.)

ZFS boot, on both x86 and SPARC, assumes the newboot style of booting. On both platforms, for both UFS and ZFS root file systems, the following tasks must be accomplished during the process of booting:

1. The boot device must be identified.

2. The PROM (either OBP or the BIOS) reads a boot loader from the boot device. The boot loader is loaded into memory and executed.

3. The boot loader selects a bootable environment (BE) to be mounted as the root file system. This selection can be made based on user input, variables set in NVRAM, or however else the particular boot loader has been programmed to make the choice.

4. The boot archive for the selected boot environment is loaded into memory and that memory range is accessed as a ramdisk, from which the files needed for booting are loaded and executed.

The biggest difference between booting from UFS and booting from ZFS is that with ZFS, a device identifier does not uniquely identify a root file system (and thus a BE). With ZFS, a device identifier uniquely identifies a pool, which can contain multiple bootable datasets. So with ZFS, there must be a mechanism for identifying the dataset to be used as the root file system.
The mechanism for specifying the dataset to be booted was defined in PSARC/2007/083. A pool property, "bootfs", specifies the default bootable dataset for the pool. When a device in a root pool is booted, the dataset mounted by default as the root file system is the one identified by the "bootfs" pool property. The user can override this selection, however. On x86 platforms, the GRUB menu can be used to select an alternate bootable environment. On sparc platforms, an option to the OBP "boot" command can specify the dataset to be booted.

5.1 Booting from ZFS on x86 platforms

Much of this was already defined by PSARC/2007/083 (ZFS Bootable Datasets). It is summarized here for review purposes. The steps by which a ZFS root file system is booted are:

1. The BIOS reads the Master Boot Record (MBR) from the boot disk. The MBR identifies the location where the GRUB boot loader has been installed on the disk. The version of GRUB installed for the purpose of booting from ZFS has a ZFS reader built into it.

2. The ZFS GRUB plug-in special-cases the menu.lst file. When asked to read the menu.lst file, the plug-in reads it from the pool dataset.

3. GRUB presents the menu.

4. The menu entry for a boot environment with a ZFS dataset as its root looks like this:

      title Solaris
      kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS
      module$ /platform/i86pc/$ISADIR/boot_archive

   Optionally, it can contain a command of the form

      bootfs <dataset>

   before the kernel$ command. The ZFS plug-in will replace the $ZFS-BOOTFS macro in the GRUB commands with the name of the dataset to be booted, which will be either the argument of the "bootfs" command (if one was specified) or the value of the "bootfs" pool property. In this way, the dataset selected for booting is passed to the kernel.

5. The kernel$ and module$ commands are executed, thereby loading the unix module and the boot archive into memory.
   The unix module is now a multiboot-compliant executable as a result of the integration of direct boot (PSARC/2006/568). When the unix module is executed, it reads the remainder of the files it needs for booting from the boot archive. Eventually, zfs_mountroot() is called.

6. zfs_mountroot() gets the name of the dataset to be mounted from the zfs-bootfs boot property. The vdev label from the boot device is read, which permits the reading of the pool metadata. The selected bootable dataset is then found in the pool metadata and mounted as root.

5.2 Booting from ZFS on sparc platforms

1. The boot device must have had a ZFS boot block installed on it. OBP reads the boot block into memory and executes it.

2. The ZFS boot block maps the boot device to a ZFS pool and reads enough of the pool's metadata to get the value of the "bootfs" pool property. The ZFS booter supports two features that allow an alternate dataset to be booted:

   i. The '-L' switch to the "boot" command is passed to the booter and causes the booter to read the /etc/lutab file in the pool dataset to get the list of available BEs.

   ii. The '-Z <dataset>' switch to the booter causes the specified dataset to be booted (instead of the dataset identified by the "bootfs" pool property).

3. The booter reads the file '/platform/`uname -m`/boot_archive' from the dataset selected for booting in the previous step (either by default or by explicit selection using '-Z'). The file is read into memory and set up as a ramdisk.

4. The booter creates a 'zfs-bootobj' property whose value is the identity of the dataset selected as the root file system. This is the same dataset from which the boot archive was loaded in the previous step. The booter also sets the value of the "fstype" property to "zfs".

5. The ramdisk set up by loading the boot archive into memory is itself booted. (This just means that blocks 1-15 of the device are loaded into memory and executed.)
   The ramdisk will have a boot block that matches the file system type of the ramdisk contents, which will be either HSFS or UFS. (There is no need for ZFS here.)

6. Unix is read and booted from the ramdisk. When krtld gains control, it mounts the ramdisk and loads additional kernel modules from it. Eventually, zfs_mountroot() is called (since the value of the "fstype" property was set to "zfs").

7. zfs_mountroot() imports the pool identified by the boot device and mounts the dataset identified by the "zfs-bootobj" property as root.

5.3 Checkpoint-Restart (CPR) Boot - SPARC only

The ZFS boot architecture supports the checkpoint-restart facility. A CPR boot takes place as follows. When the system was suspended prior to the CPR boot, the following things would have occurred:

1. The boot-command OBP variable would have been modified to "boot -F cprboot".

2. The contents of memory will have been written to a "state file" in the root file system. The location of the state file is specified in the /etc/power.conf file (default location: /.CPR).

3. The "boot" command will have been issued as follows:

      ok boot -F cprboot

The booter uses its file-system-specific reader to read the cprboot file from the UFS root file system (in the case of UFS) or the selected bootable dataset (in the case of ZFS). From here on, the cprboot program runs as usual. It reads its data (/etc/power.conf and the state file) from the root file system. In the case of ZFS, the boot loader loaded in the booter phase still provides the ability to read files from the selected bootable dataset, which is the ZFS root file system.

NOTE: If the dataset that is booted at the time of the checkpoint is NOT the default bootable dataset, the checkpointing code may have to store the name of the booted dataset in an OBP variable (perhaps along with the "-F cprboot" string) so that the correct dataset is searched for the state file.

5.4 Limits on ZFS Root Pools

Initially, root pools are limited to mirrored configurations only.
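For example (the pool name, BE name, and device names here are hypothetical, and whether the install tools issue exactly these commands is an assumption), a mirrored root pool built from one slice of each of two disks might be created and designated like this:

      # hypothetical devices; slice 0 of each disk holds one half of the mirror
      zpool create rootpool mirror c0t0d0s0 c0t1d0s0
      # designating the default bootable dataset makes this a root pool
      zpool set bootfs=rootpool/s10_u6 rootpool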
Striping of vdevs is not permitted, nor is RAID-Z. The reason for this is that the firmware must be able to read everything needed for booting from a single disk. Each disk in a root pool must be fully accessible from the firmware on the system.

Partly as a result of this constraint, best practice will be to have separate pools for system software and data. Users may not want to constrain the configuration of data pools to the limits imposed on root pools. This is not the only reason for this segregation, however. In general, there are advantages to managing the "personality" of a system separately from its data. The segregation of system software and data is not mandatory, however; users can combine them in one pool if they wish.

In a future release, it is likely that booting from RAID-Z pools will be supported.

Another restriction on root pools is that the devices in the pool must have SMI labels (i.e., not EFI labels). This is a restriction imposed by OBP and the install software.

6. ZFS Root in the Solaris Operating Environment

ZFS has a very different administrative paradigm from UFS, or just about any other file system type. For the most part, these new administrative concepts, such as pooled storage, offer great advantages when ZFS is used as a root file system, but they can also mean changing the rules. When deciding how to use ZFS as a root file system, we need to make tradeoffs between the familiarity of "the way we've always done it" and the advantages of the almost radically new ways that ZFS lets us manage system software. The next sections describe the ways that Solaris's interactions with root file systems will change with ZFS.

6.1 Division of the Solaris Name Space into Datasets

Back in the days of small disks, Solaris was routinely installed on multiple disk slices. The /usr, /opt and other directories in the Solaris name space were often installed in separate file systems because they wouldn't fit on one disk.
Then, as disks grew bigger, the default was to install all of Solaris in one file system. With ZFS, we might want to consider splitting the name space into separate file systems again, not for reasons of space, but because of administrative benefits. For starters, there's no strong reason NOT to install Solaris into multiple datasets. With ZFS, file systems are more like directories than what we used to call file systems: they require no pre-allocated storage and they have low overhead.

Some reasons to split the Solaris name space into separate file systems are:

1. Administrators might want to use different storage attributes for different parts of the name space. Perhaps the user would like to compress /opt, but not the root file system.

2. When cloning boot environments, some parts of the name space could be included in the new BE by reference. For example, since /var/adm/log reflects the history of the entire system, not just one BE, perhaps there should be only one of them, which is mounted into each BE. The cleanest way to do this is to make that directory its own file system.

3. In order to support booting from RAID-Z in the future, it might be advantageous to keep the root file system as small as possible so that it can be replicated on each device in the pool without excessive space use. Splitting directories such as /usr, /var, and /opt into separate file systems will help keep root small.

4. The installation of whole root zones would be simplified if the default inherited directories (/usr, /lib, /sbin, and /platform) were separate, clonable file systems. Whole root zones aren't used all that often right now, at least partly because they require a lot of disk space and time to set up. However, with ZFS cloning, whole root zones could be implemented quickly and use less space.

The exact division of the Solaris name space into datasets is largely TBD.
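As a sketch of reason 1 above, per-dataset properties let an administrator treat parts of the name space differently. The pool and BE names here ("rootpool", "s10_u6") are hypothetical:

      # illustrative only: per-dataset storage attributes within one BE
      zfs create rootpool/s10_u6                         # bootable dataset
      zfs create -o compression=on rootpool/s10_u6/opt   # compress /opt ...
      zfs create rootpool/s10_u6/usr                     # ... but not / or /usr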
The proposal at this time is to split the name space along the following lines:

   /
   /usr
   /var
   /opt
   /sbin
   /lib
   /platform
   /home
   /export

In addition, we would expect zone roots to be created in their own datasets.

One issue here is whether to further split /var into separate datasets. Some reasons for not splitting /var any further are:

1. Where do you stop? It would be nice if the various log files that you might want to share between BEs were all in one place, but they aren't. They're scattered all around /var.

2. LiveUpgrade already has a mechanism for sync'ing volatile files between BEs. Maybe Caiman's LiveUpgrade replacement will have a better solution to this problem. In the meantime, it may be best to continue to use our existing synchronization tools.

3. ZFS's copy-on-write capability allows us to clone a potentially large directory (such as /var/crash) without requiring any actual time-consuming, space-consuming copies. So there's no particular cost in having those parts of the file system cloned instead of shared.

On the face of it, having all of these new datasets looks like it greatly complicates the interfaces. But there are a couple of ways that the complication will be mitigated or hidden:

1. The tools that we provide for managing boot environments will have to shield users from the complexity of multi-dataset BEs anyway. Even if BEs are only divided into a couple of datasets, the process of creating and maintaining BEs is going to result in fairly complex scenarios of snapshots, clones, and dependency relationships. The tools for doing this (LiveUpgrade for now, some aspect of Caiman later) must allow the user to manage the transitions at the BE level, NOT the individual dataset level. Once we've done that, it becomes less of a problem to add additional datasets to the recommended or required configuration.

2. In the sections below on mounting and system initialization, we describe how the ZFS file system mount capabilities, together with some changes to the SMF methods for mounting the Solaris name space, can hide the dataset hierarchy, or at least make it something the administrator doesn't have to be aware of. In particular, ZFS makes it possible to set up hierarchies of mounts without requiring entries in /etc/vfstab.

6.2 Mounting the Boot Environment

ZFS supports automatic mounting of file systems, without the need for /etc/vfstab entries. ZFS does support the traditional method of mounting (if the dataset's "mountpoint" property is set to "legacy"), but there are real advantages in using ZFS's native mount approach. These advantages are especially useful when dealing with datasets in a BE.

Consider what would happen if legacy mounts were used for the datasets in a BE. Every dataset in a BE would have to have an entry in /etc/vfstab, with the dataset's name explicitly appearing in the "device to mount" field. If the dataset's name were changed, every one of those entries in /etc/vfstab would have to be updated. When a snapshot is made of a BE, the snapshot has an immediate internal inconsistency: the name of the dataset in the BE's /etc/vfstab entries is wrong, by definition. The name can't be corrected because snapshots are read-only. Now, granted, we don't support the booting of read-only roots yet, but we might want to do so in the future. The clone IS bootable, because it's read-write, but its /etc/vfstab entry is initially incorrect and needs to be fixed. Furthermore, the BE name in the /etc/vfstab entries is redundant information at the time the entries are read. The system already KNOWS the name of the dataset that has been booted, because that's how it got booted in the first place.
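For illustration, with legacy mounts each dataset of a BE named (hypothetically) rootpool/s10_u6 would need a vfstab line naming itself in the "device to mount" field:

      rootpool/s10_u6      -   /      zfs   -   no   -
      rootpool/s10_u6/usr  -   /usr   zfs   -   no   -

Any snapshot or clone of the BE carries these lines verbatim, so the "device to mount" fields are wrong in the copy until edited.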
The better alternative is to use ZFS's native mount method and name the datasets in such a way that the mount of the root dataset for the BE automatically results in the mount of the entire BE, thanks to ZFS mountpoint inheritance and automatic mounts. So suppose we have a BE named "s10_u6", composed of the following hierarchy of datasets:

   rootpool/s10_u6
   rootpool/s10_u6/usr
   rootpool/s10_u6/var
   rootpool/s10_u6/sbin
   rootpool/s10_u6/lib
   rootpool/s10_u6/platform
   rootpool/s10_u6/home
   rootpool/s10_u6/export

At the time the BE is installed, if the mountpoint property of rootpool/s10_u6 is set to "/a", the remainder of the datasets will automatically get mounted at /a/usr, /a/var, /a/sbin, and so on. When the BE is booted and the dataset rootpool/s10_u6 is mounted at "/", the remainder of the datasets automatically inherit the mountpoints /usr, /var, /sbin, and so on.

6.3 System Initialization

With ZFS, there is no particular reason to do an initial read-only mount of the root file system. With UFS, root had to be mounted read-only so that fsck could be run on it. ZFS has no fsck. The first mount of a ZFS root file system will be read-write, and there is no need for a later remount of either root or /usr.

Moreover, if /sbin, /lib, and /platform are separate datasets, as is desirable for whole-root zone support, those datasets will need to be mounted before the start of init(1M). The reason for this is that the very earliest userland code started by the kernel expects to be able to find necessary files and tools in /sbin and /lib, and maybe in /platform. This means a change to the meaning of "vfs_mountroot", at least for ZFS. It now means "mount the datasets that are required for init(1M)". This might seem like a rather large change, but really, we're just mounting what we always mounted in "mountroot". To the rest of the system, and to the user-level code, those mounts all happen atomically, and so userland has available what it has always counted on having.
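Because the root and its earliest datasets end up as ordinary ZFS mounts, later boot scripts can discover them from /etc/mnttab. A minimal sketch (the sample mnttab contents and dataset names below are illustrative; a real script would read /etc/mnttab directly):

```shell
# Illustrative only: detect a ZFS root from mnttab-format data and derive
# the /usr dataset name from the root dataset name. Field order in mnttab
# is: special, mount point, fstype, options, time.
mnttab_sample='rootpool/s10_u6	/	zfs	dev=2d90002	1181234567
rootpool/s10_u6/usr	/usr	zfs	dev=2d90003	1181234568'

# find the dataset mounted at "/" with fstype "zfs"
rootds=$(printf '%s\n' "$mnttab_sample" | awk '$2 == "/" && $3 == "zfs" { print $1 }')
if [ -n "$rootds" ]; then
    usrds="$rootds/usr"       # derived name; a real script would: zfs mount "$usrds"
    echo "root dataset: $rootds"
fi
```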
The rest of the startup process will have to change accordingly. This means that the SMF methods that do the local system mounts (fs-root, fs-usr, fs-minimal) will need to check for root being ZFS and perform the necessary actions for that case. The order of events will be something like this:

0. (already mounted: /, /sbin, /lib, /platform)

1. /lib/svc/method/fs-root will determine whether root is ZFS by reading /etc/mnttab, and if so, will acquire the root's dataset name from /etc/mnttab too. Then it will derive the dataset name for /usr from the dataset name for root and mount /usr using "zfs mount".

2. If root is ZFS, /lib/svc/method/fs-usr will not need to remount root or /usr.

3. If root is ZFS, /lib/svc/method/fs-minimal will look for all the file systems it normally mounts (/var, /var/adm, /var/run) in the list of the BE's datasets and mount them using ZFS mounts, if they are separate datasets.

4. The /lib/svc/method/fs-local script will need to change to look at ALL the local mounts (both those in /etc/vfstab and the ZFS native mounts), order them correctly, and perform them, so that the remainder of the file systems in the BE get mounted. This will require a change to the operation of "/sbin/mount -a". (This change to the operation of mount -a is required no matter what. It was always broken to perform all the /etc/vfstab mounts first, and then do the "zfs mount -a": if there are mounts in /etc/vfstab whose mount points are supplied in file systems mounted by ZFS, those mounts will fail. There is already a bug against this and it has nothing to do with ZFS root. But ZFS root will make the problem worse yet.)

6.4 Swap/Dump

We would like to allocate the space for swap and dump from within the pool if at all possible. This has several advantages: it simplifies the process of formatting disks for install, and it provides the ability to re-size the swap/dump area without having to re-partition the disk.
A zvol might seem like the best option for this, but the semantics of zvols aren't quite right for swap and dump. For one thing, dump requires that the space be preallocated, which zvols don't do. Moreover, the copy-on-write model for zvols isn't necessarily what you want when swapping. In general, there are performance issues and possibly deadlock issues with using zvols as swap devices. At some point, this might get resolved (see the proposed VM2 project), but it won't get resolved in the near term. So instead, this project will eventually propose a new interface to ZFS which allows the preallocation of areas of disk space for the purpose of creating a logical swap/dump device. This disk space will not be a true zvol. The details of this interface are still TBD.

7. Installation

7.1 Overview of Solaris Installation

There are several ways to install and upgrade Solaris:

(a) Mini-root based installs

Some of the install procedures are run while the system is booted from a miniroot. A miniroot is the operating system on a DVD or CD, or one downloaded to the system during a netinstall, which runs out of a ramdisk. In all of these scenarios, the system being installed is not booted off local writable storage. This is the only kind of install that can be used on an uninstalled system (i.e., a system with no bootable local disks).

Mini-root installs include initial install and upgrade. Both initial install and upgrade can be done using an interactive install program (either tty-based or GUI) or by Jumpstart, which is a hands-off install/upgrade that updates the system in accord with a specification file, called a "profile".

It is also possible to install and upgrade systems using Flash archives. A Flash archive is an image of an installed system which can be copied to a device using sequential I/O. It's much faster than a regular install and allows systems to be pre-customized and configured.
A "differential" Flash archive, which contains only a subset of the files on an installed system, can be used to upgrade a system.

(b) Live installs/upgrades

If a system has been configured with extra disk space, it is possible to do an install or upgrade of a system that is booted off a local disk. While booted off a previously-installed bootable environment on the system, a new bootable environment can be installed (on spare storage). This new bootable environment can be an upgrade of the existing bootable environment. The new bootable environment can also be built from a Flash archive.

(c) Setup of Servers of Diskless Clients

The setup of servers for diskless clients can either be done during a Jumpstart install, or by the smosservice(1M) command.

7.2 Changes to Installation to Support ZFS Boot

The following is an analysis of how the installation software will need to be changed to support ZFS boot. Before looking at how the many parts of install will need to change, we need an overview of how the installation of a system with a ZFS root will work.

7.2.1 Some Guiding Design Principles

1. For now, ZFS boot disks will still have SMI labels. By default, when a zpool is created on an entire disk, the disk is given an EFI label. But the boot PROMs and the install software don't yet support EFI labels. ZFS boot must live within that limitation, for now. So when an install application specifies that a pool be created on a full disk, the disk will be formatted with an SMI label with a VTOC that dedicates all of the disk (except for the small amount of space required for the GRUB slice on x86) to slice 0, which then becomes the vdev for the root pool.

2. ZFS pools do not require entire disks. Even though best practices recommend using the entire disk for a pool, we recognize that some systems have very large disks which the administrator might not want to dedicate entirely to a root pool.
   So we support the splitting of disks into a slice for a root pool (or one mirror of a root pool) and the remainder of the disk for slices for other purposes, including pools that are not root pools.

7.2.2 Initial Install

1. Determine whether the system being installed will end up with a UFS root or a ZFS root. It will not be possible to install part of the system software on ZFS and part on UFS. It will, of course, be possible to have ZFS file systems on a system with a UFS root and vice versa, but the software installed by Solaris install or upgrade must be either all on ZFS or all on UFS.

2. If UFS, install will work as it does now. If ZFS, the user will have the opportunity to designate multiple disks or disk slices for the root pool. In the first phase of ZFS boot support, these disks will be used to create a mirrored vdev. (Striped and RAID-Z configurations will be supported in the future.) The user will have the opportunity to set the size of the root pool (the default should be the entire disk, but we can't require a whole disk).

3. The software to be installed will be selected (as it is now with UFS root file systems).

4. There will be a default division of the Solaris name space into separate datasets (see "Namespace Divisions" below). The user will not have an opportunity to modify this default division. (TBD: It might be possible to allow the user to add additional points at which separate datasets will be created. They may not remove the default divisions, however. Those will be required.)

5. The disks will be partitioned as specified in step 2.

6. The root pool will be created and all datasets determined in step 4 will be created. The swap space will be allocated within the pool (see "Swap/Dump Implementation" below).

7. The standard "software install" part of the install backend will install all of the Solaris packages into the root file system datasets.

8. The boot "overhead" will be installed as necessary.
The existence of a new bootable dataset will be recorded in the pool metadata. On x86, the menu.lst file will be created or updated to include the new boot environment. The new boot environment will be recorded in /etc/lutab (and in whatever other overhead files are required to establish a BE in LiveUpgrade). The boot archive will be created and the boot loader will be installed on each disk in the root pool.

One thing to note about the above procedure is that step 8 requires the setup of a LiveUpgrade boot environment. Currently, when a system is installed, a "BE" in the LiveUpgrade sense is not established. This will change: ZFS bootable environments will always be recorded as LiveUpgrade-compliant BEs. The overhead files that establish LiveUpgrade BEs will eventually be processed by the install tools that are part of Caiman. (Caiman might not use the same files as LiveUpgrade, but it will understand them and be able to process BEs established by LiveUpgrade.)

The above steps describe interactive installs. Jumpstart installs follow a similar series of steps, but the "questions" are answered from the profile rather than by querying a user.

7.2.3 Upgrade

A system "upgrade" is the process of converting an installed instance of Solaris to a later version while preserving all local customizations. The standard Solaris miniroot-based installation program currently supports an option to upgrade the installed Solaris instance. This upgrade is done "in place" (i.e., the new bits are written into the same file system where the older version was, thereby replacing the old version). This kind of "in-place" upgrade, done from the miniroot-based install program, will not be supported for zfs root file systems. There are several reasons for this:

1. "In-place" upgrade dates from the days when disks were much smaller and it was common not to have enough space for two Solaris instances.
That's not true now that disks are seldom smaller than 80 GB or so, and Solaris instances are around 6 GB.

2. A "copy and upgrade" model has several advantages over an "in-place" upgrade. It can be done while the system is "live", and it can be easily backed out. A "copy and upgrade" model also does not have the so-called "toxic patch" problem.

3. ZFS is ideally suited for the "copy and upgrade" model of system upgrade. With ZFS, a Solaris instance can be easily cloned and modified. Because of pooled storage, the new Solaris instance doesn't require its own slice.

If an already-installed system is booted from an installation DVD/CD or a netinstall image, the install "discovery" software will detect the existence of any ZFS root pools on local storage. UFS roots are also detected. The logic for interactive installs will be as follows. All local storage will be categorized:

    Category 1: contains an upgradable ufs root file system
    Category 2: is part of a ZFS root pool
    Category 3: is part of a ZFS non-root pool
    Category 4: all other file systems

    if (there are any root pools present) {
        if (there are any upgradable ufs root file systems) {
            Print a message indicating that the ZFS root pool can't be
            upgraded, but one or more of the ufs root file systems can.
            Allow the user to either
                * select a ufs file system for upgrade, or
                * do an initial install
        } else {
            Print a message indicating that the ZFS root pool can't be
            upgraded.
            Allow the user to
                * do an initial install
        }
    } else {
        if (there are upgradable ufs root file systems) {
            Allow the user to either
                * select a ufs file system for upgrade, or
                * do an initial install
        } else {
            Allow the user to do an initial install
        }
    }

In all cases, if the user opts for an action that would destroy an existing pool, the user must be warned of that and given an opportunity to abort the install.

So if we don't allow a miniroot-based upgrade of a ZFS root file system, how DO we upgrade it? The answer is LiveUpgrade and, eventually, its follow-on, Caiman.
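The branching above reduces to two independent questions: are any root pools present (which only affects the warning printed), and are any upgradable ufs roots present (which only affects whether an upgrade choice is offered)? A minimal Python sketch of that logic (the function name, message text, and choice strings are assumptions for illustration, not the installer's real interface):

```python
# Illustrative sketch of the interactive-install decision logic.
# Names and strings are hypothetical; only the branching mirrors the
# design text above.

def upgrade_choices(root_pools_present, upgradable_ufs_roots_present):
    """Return (message, choices) the interactive installer would offer."""
    message = None
    if root_pools_present:
        # A ZFS root pool can never be upgraded in place by the
        # miniroot installer; only LiveUpgrade can upgrade it.
        message = "ZFS root pools cannot be upgraded here"
    choices = []
    if upgradable_ufs_roots_present:
        choices.append("select a ufs file system for upgrade")
    choices.append("do an initial install")  # always offered
    return message, choices
```

In every branch, the installer must still warn before any action that would destroy an existing pool.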
This document will discuss how upgrade will be done with LiveUpgrade, since that will be the only upgrade mechanism until Caiman is released.

7.2.4 Servers of Diskless Clients

Since exported services are just local directories, it will be possible to export services from zfs datasets. No special support for zfs is needed.

7.3 Details of Changes to the Install Tools

7.3.1 Jumpstart profile Interpretation (pfinstall)

New keywords are defined to support the creation of ZFS pools and datasets. A detailed description of these keywords is provided in Appendix A. Not all of these keywords will have corresponding screens in the interactive install programs. There is precedent for allowing more configuration capabilities in profiles than are supported in the interactive install programs (currently, the only way to set up a mirrored root with SVM is by a profile-based install).

7.3.2 Interactive Install Programs

The interactive miniroot-based install programs are ttinstall (which has a character-based interface) and the install GUI. At this time, there are no plans to support the setup of zfs roots from the install GUI. Only ttinstall will support zfs root. (Naturally, the Caiman install will support ZFS fully.) The ttinstall program has a "parade", a series of screens that query the user for the details of how the system is to be installed. The "parade" will need new screens to determine the following:

1. whether the system to be installed will have a zfs root
2. the disks to be added to the pool
3. the name of the root pool and the root dataset
4. how much of the disk should be dedicated to the root pool (the default is "all")

In the initial (pre-Caiman) version of zfs root install, it will not be possible to set up a system with a zfs root using Flash Install.

7.3.3 LiveUpgrade

The required changes to LiveUpgrade for zfs fall into two areas:

1. The changes required to enable boot environments (BEs) to be in zfs datasets.

2.
Changes required to support zfs datasets as subordinate file systems in BEs with a ufs root. This includes zfs file systems mounted under a ufs root, and support for non-global zone roots in zfs datasets.

Technically, the items in (2) have nothing to do with zfs boot and should have been done as part of the original zfs integration. For whatever reason, most of them were not. They need to be done now, since the use of zfs as a root file system necessitates full support for zfs in LiveUpgrade.

LiveUpgrade will be modified to make it possible to create a boot environment (BE) whose root is a zfs file system. These zfs-based BEs can be populated by cloning an existing BE with either a UFS or ZFS root. Cloning a UFS root will be the most common way to migrate from UFS root to ZFS root. If the source BE of a lucreate has a zfs root file system, the target BE will be created as a zfs clone of the source BE. This means that the lucreate will be very fast and the new BE will initially occupy very little space.

Currently, LiveUpgrade never partitions disks or allocates disk space for BEs. It depends on the slices to be used for BEs having already been created. This will remain true for zfs LiveUpgrade support. With zfs, there are two steps in allocating space: formatting disks and creating storage pools. Lucreate will do neither; both actions must have been performed before lucreate can create a BE in a zfs dataset. Note that LiveUpgrade will work for zfs and can be used to migrate systems from a ufs root to a zfs root, but the root storage pool must have been created beforehand by the administrator (since allocating storage has never been part of LiveUpgrade's job).

LiveUpgrade will work differently than it does now in the following ways:

* Mirroring of zfs file systems will not be directly supported by LiveUpgrade. Mirroring in zfs is fundamentally different from mirroring using SVM plus UFS.
With zfs, mirroring is done at the pool level, not the file system level. Therefore, it's not meaningful to specify that a zfs mount be mirrored. If users want mirrored storage for their BEs, they must create their storage pool using a mirrored vdev configuration. So the "attach" and "detach" options will not apply to zfs mounts.

* One of the purposes of having SVM-mirrored BEs (or more exactly, mirrored file systems within BEs) is to allow the fast "cloning" of a BE by detaching one side of the mirror and using that detached device as a new BE (and the basis of a patch or an upgrade). ZFS-based BEs will support a fast cloning capability even though mirroring of individual ZFS file systems is not supported. With ZFS, fast cloning of a BE will be performed by zfs's own dataset cloning capability. This is much easier to manage and plan for than SVM mirroring because it isn't necessary to pre-allocate a fixed amount of space for the clone of a ZFS file system.

There are two variants of the lucreate command that can be used with zfs root file systems:

1. Migrating a BE to a new pool. This can either be a migration from one ZFS pool to another, or from a UFS root to a zfs root. The form of the command is:

    lucreate -n <BE_name> -t <pool>:<mount_point>[:<size>]

This command requests that a new BE named <BE_name> be created in the pool named <pool>. At least one -t argument with a <mount_point> equal to "/" must be specified; that -t argument specifies where the new BE will be created.

2. Cloning an existing ZFS-based BE to a new BE in the same pool:

    lucreate -n <BE_name>

This command will be very easy to execute. All that is required is that the new BE be named. All datasets in the source BE (called the "PBE" in LiveUpgrade terminology) will be cloned and will appear as separate datasets in the new BE as well. All other LiveUpgrade commands, such as luactivate and luupgrade, can be used on the new BE. LiveUpgrade will take care of all the details of cloning all the datasets in the BE.
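As a hedged illustration only (the BE names and the pool name "rpool" are hypothetical, and the target pool must already exist, since lucreate never creates pools), the two variants might be invoked like this:

```
# Variant 1: migrate the current BE into the existing pool "rpool",
# with the new BE's root mounted at "/":
lucreate -n zfsBE -t rpool:/

# Variant 2: clone the current ZFS-based BE within the same pool:
lucreate -n zfsBE2
```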
The user shouldn't have to be aware that the BE is composed of multiple file systems. This is much easier than the use of LiveUpgrade with UFS, where each mounted file system in a BE must be represented by a "-m" option on the lucreate command line.

8. Summary

This project is the first step toward moving Solaris to an environment in which system software is maintained entirely within pools, and in which it is relatively easy and safe to modify, patch, or upgrade the root file system and its subordinate file systems. It is also an important step toward making ZFS the principal local file system type for Solaris.

9. References

PSARC/2004/454 Solaris Boot Architecture
PSARC/2006/525 Newboot Sparc
PSARC/2007/083 ZFS Bootable Datasets

Appendix A - Jumpstart Profile Keywords for ZFS

Keyword: install_type
Syntax: install_type <type>

Possible values for <type>: initial_install, upgrade, flash_install, flash_update

"initial_install" will continue to mean "bare-metal install", as defined above. Preservation of existing slices/devices is allowed, but modification of existing pools will not be allowed. Ditto for "flash_install". "upgrade" will NOT allow an upgrade of an existing zfs bootable dataset. As it does now, it will allow the in-place upgrade of existing ufs boot environments only.

---------------------------------------------------------------------

Keyword: pool
Purpose: specify the characteristics of a pool to be created
Syntax: pool <poolname> <size> <mount_point> <vdevlist>

This always creates a new root pool with the specified name, using the specified <vdevlist>. A <vdevlist> is either a single device name, or the keyword "mirror" followed by one or more device names. The keyword "mirror" may seem redundant, since it is the only allowed configuration when more than one device is in a pool (concatenation and RAID-Z are not supported), but it is included to give us the flexibility to implement concatenation and RAID-Z in the future, should that ever be possible.
Any of the devices in the <vdevlist> can be "any", which means that a device of adequate size will be found on the system (if there is no free disk, or none of sufficient size, the install will stop with an error). Yes, if you specify "any any", you will get a mirror of whatever two disks can be found: the first two suitable disks will be used. If there aren't two suitable disks available, the install will stop with an error.

At this time, the <mount_point> can only be "/".

<size> can be one of the following values:

    auto     - automatically select the size, based on other constraints
               in the profile and the required size of the root pool. By
               default, the entire disk will be dedicated to the pool if
               there are no other claims on the disk's space.
    existing - use the existing size of the specified slice (only works
               for an explicit slice designation).
    all      - use the entire disk for the root pool.
    free     - use the free space remaining on the designated disk.
    <num>    - size explicitly specified in megabytes.

----------------------------------------------------------------------

Keyword: dataset
Purpose: specify a dataset or BE to be created. If the mount point is "/", this command will create an entire BE, with all required subordinate file systems. That is, separate profile entries for all the subordinate file systems are not required. If provided, they can override the options for those subordinate datasets. Name space divisions other than the required ones can also be requested using this command.
Syntax: dataset <dataset_name> <size> <mnt_pnt> [<properties>]

dataset_name: must be of the form <poolname>/<dataset_path>

size: can have the value <num>[:<modifier>] or "auto", where <num> is specified in megabytes. The optional <modifier> tells the profile interpreter how to interpret the size value. Options are:

    reserve   - reserve the size in the pool for the dataset.
    quota     - establish a quota of the specified size.
    guideline - make sure there is at least this much space in the pool
                for the dataset (i.e.
use it to verify that the pool is big enough), but don't reserve the space or make it a quota.
    zvol      - make the dataset a zvol of the specified size.

The default value of the size modifier is "guideline". The entire <size> field can have the value "auto", in which case the dataset will be created with no specified size and will grow to whatever size it needs during the install. At this time, only the "auto" keyword is supported.

mnt_pnt: absolute mount point name, "default", or "none".

properties: (optional) a white-space-separated list of <property>=<value> pairings.

[TBD - need examples]
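Until official examples are supplied, here is a hypothetical profile fragment using the keywords above (the device names, pool name, and dataset name are invented for illustration, and the argument order follows the syntax sketched in this appendix):

```
install_type  initial_install
pool          rpool auto / mirror c0t0d0s0 c0t1d0s0
dataset       rpool/ROOT/be1 auto /
```

This would create a mirrored root pool on the two named slices, sized automatically, and a BE whose root dataset is mounted at "/".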