ZFS as a Root File System (PSARC/2006/370)
Author: Lori Alt (lori.alt@sun.com)
June 8, 2007

0. Contents

1. Introduction
2. Overview
3. Design Goals
4. Terminology
5. Boot Design
   5.1 Booting from ZFS on x86 platforms
   5.2 Booting from ZFS on sparc platforms
   5.3 Checkpoint-Restart (CPR) Boot - SPARC only
   5.4 Limits on ZFS Root Pools
6. ZFS Root in the Solaris Operating Environment
   6.1 Division of the Solaris Name Space into Datasets
   6.2 Mounting the Boot Environment
   6.3 System Initialization
   6.4 Swap/Dump
7. Installation
   7.1 Overview of Solaris Installation
   7.2 Changes to Installation to Support ZFS Boot
       7.2.1 Some Guiding Design Principles
       7.2.2 Initial Install
       7.2.3 Upgrade
       7.2.4 Servers of Diskless Clients
   7.3 Details of Changes to the Install Tools
       7.3.1 Jumpstart Profile Interpretation (pfinstall)
       7.3.2 Interactive Install Programs
       7.3.3 LiveUpgrade
8. Summary
9. References
Appendix A - Jumpstart Profile Keywords

1. Introduction

The ZFS file system was introduced in Update 2 of Solaris 10. One of the limitations of that initial release is that ZFS could not be used as a root file system. This document describes the changes to Solaris necessary to enable ZFS to be used as a root file system. The implementation of the design specified here will enable users to install systems with ZFS roots. It is intended for integration into Solaris Nevada and then into the most appropriate Solaris 10 Update release.

2. Overview

Enabling the ZFS file system to be Solaris's root file system requires the following broad tasks:

- Boot support must be added. This includes (but is not limited to) the implementation of boot loaders for both the sparc and x86 architectures.

- The operation of basic system management tasks, such as the booting of zones or the mounting of file systems, must be defined in the context of a ZFS root. What does a system with a ZFS root look like? How is it different from a system with a UFS root?
How are administrative tasks affected by the new root file system type?

- The installation software must be modified to support the creation of ZFS file systems and the installation of Solaris into them. The installation software must also support upgrades and patching of systems with ZFS root file systems.

These three broad areas--booting, system operation, and installation--will be addressed by this document.

3. Design Goals

This project is the first implementation of ZFS as a root file system for Solaris. The installation aspects of this project are for the near term only. In the medium to long term, a new install project, Caiman, will take over installation and upgrade. Because of this, it is our goal that the installation of systems with ZFS roots be "good enough" in this project. We're not going to invest a lot of work in code that will end up being thrown away soon. The existing install code is very old (written in the early 1990s). It was written for a different world, where disks were small and expensive (along with many other differences from today, though disk size is the one that affects the install design the most).

However, this implementation of ZFS root does have to prepare for the long run. Once users move to the new model of pooled storage, they're not going to want another paradigm shift after that. We need to define the "way it works" in such a way that the transition from the initial release of ZFS root to Caiman will be smooth. Caiman will have both an initial install and an ongoing maintenance component. Caiman's ongoing maintenance component in particular needs to be able to manage transitions from systems that were installed with the tools and processes described in this initial ZFS boot project.

With those concerns in mind, these are the goals of this design:

- Provide the tools to install systems that can make immediate use of many of the features of ZFS that are most valuable for system software.
  These features include:

  * pooled storage, giving users great flexibility in the setup of bootable environments.
  * snapshots and clones, which enable users to quickly and safely perform patches and upgrades.
  * robustness of storage, including mirroring.

- Enable the migration of systems with UFS root file systems to ZFS roots.

- Define the administrative aspects of ZFS root file systems for the long term, as much as is reasonably possible at this time. Change happens and we don't have a crystal ball, so we can't be sure that the decisions we make now will stand the test of time, but we can at least choose to do things "the ZFS way", so that system software management will more naturally track the developments in ZFS over the years.

- Prepare for Caiman. Set up installed systems in such a way that Caiman will be able to manage them. Define administrative procedures that are advantageous for Caiman's goals.

4. Terminology

The following terms are important in this design and will occur throughout the document:

root pool -- A ZFS storage pool that has been designated as a "root pool" by setting the value of the pool's "bootfs" property to something other than "none". When any of the devices that compose a root pool is booted (by specifying it as the boot device to the boot loader), the pool as a whole will be "booted", which means that a dataset in the pool (selected as described below) will become the root file system and will provide the necessary files for booting, such as the boot archive.

pool dataset -- The dataset at the root of a pool's dataset namespace. The pool dataset is the dataset that exists by default in every pool. The pool dataset's name is the same as the pool's name: a pool named "tank" will have a pool dataset also called, simply, "tank". A pool dataset does not necessarily contain a root file system. In fact, usually it won't. It probably won't contain much of anything.
It does have one very useful attribute: each pool has exactly one pool dataset, and it cannot be deleted without deleting the pool itself. This makes the pool dataset a good place for files that contain per-pool state (as opposed to per-dataset state). The menu.lst file (the menu file for GRUB) is an example of a file that contains per-pool state and thus will be stored in the pool dataset.

bootable environment -- Often abbreviated as a "BE". A bootable environment is basically a Solaris root file system, plus everything that is subordinate to it (such as mounted file systems). In the case of a ZFS root pool, the Solaris root file system in a BE is a ZFS dataset. There can be multiple root file system datasets, and thus multiple BEs, in a root pool.

bootable dataset -- A dataset which contains a root file system and which is part of a BE. Every BE contains exactly one bootable dataset. A BE can contain other datasets (such as a /usr or /opt file system), but those other datasets are not considered bootable. A bootable dataset must contain a boot archive.

default bootable dataset -- The dataset named by the value of the root pool's "bootfs" property. If a pool has a value other than "none" for its "bootfs" property, the pool is a root pool. If a device that is part of a root pool is booted, and a specific dataset to be booted is not explicitly identified in the command to the boot loader, the dataset identified by the "bootfs" property will be the root file system dataset to be booted.

5. Boot Design

The process of booting x86 platforms was revised by the Solaris Boot Architecture project (PSARC/2004/454), also called "newboot". The booting of SPARC platforms is currently being modified to adhere to the newboot architecture as well (PSARC/2006/525).
The main aspects of newboot that characterize both the x86 and sparc implementations are:

* the use of a boot archive, which is a file system image that contains the files required for booting, and
* the use of a ramdisk as the root file system during the early stages of booting.

In the case of booting for the purpose of doing an installation, the ramdisk is the root file system for the entire installation process, which eliminates the need to be booted from removable media. At this time, the file system type of the ramdisk can be either HSFS or UFS, but not ZFS. (ZFS is not well suited for use as a ramdisk file system, and there is no particular reason to use it that way; UFS or HSFS is a better choice.)

ZFS boot, on both x86 and SPARC, assumes the newboot style of booting. On both platforms, for both UFS and ZFS root file systems, the following tasks must be accomplished during the process of booting:

1. The boot device must be identified.

2. The PROM (either OBP or the BIOS) reads a boot loader from the boot device. The boot loader is loaded into memory and executed.

3. The boot loader selects a bootable environment (BE) to be mounted as the root file system. This selection can be made based on user input, variables set in NVRAM, or however else the particular boot loader has been programmed to make the choice.

4. The boot archive for the selected boot environment is loaded into memory and that memory range is accessed as a ramdisk, from which the files needed for booting are loaded and executed.

The biggest difference between booting from UFS and booting from ZFS is that with ZFS, a device identifier does not uniquely identify a root file system (and thus a BE). With ZFS, a device identifier uniquely identifies a pool, which can contain multiple bootable datasets. So with ZFS, there must be a mechanism for identifying the dataset to be used as the root file system.
The mechanism for specifying the dataset to be booted was defined in PSARC/2007/083. A pool property, "bootfs", specifies the default bootable dataset for the pool. When a device in a root pool is booted, the dataset mounted by default as the root file system is the one identified by the "bootfs" pool property. The user can override this selection, however. On x86 platforms, the GRUB menu can be used to select an alternate bootable environment. On sparc platforms, an option to the OBP "boot" command can specify the dataset to be booted.

5.1 Booting from ZFS on x86 platforms

Much of this was already defined by PSARC/2007/083 (ZFS Bootable Datasets). It is summarized here for review purposes. The steps by which a ZFS root file system is booted are:

1. The BIOS reads the Master Boot Record (MBR) from the boot disk. The MBR identifies the location where the GRUB boot loader has been installed on the disk. The version of GRUB installed for the purpose of booting from ZFS has a ZFS reader built into it.

2. The ZFS GRUB plug-in special-cases the menu.lst file. When asked to read the menu.lst file, the plug-in reads it from the pool dataset.

3. GRUB presents the menu.

4. The menu entry for a boot environment with a ZFS dataset as its root looks like this:

      title Solaris
      kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS
      module$ /platform/i86pc/$ISADIR/boot_archive

   Optionally, it can contain a command of the form

      bootfs <dataset>

   before the kernel$ command. The ZFS plug-in will replace the $ZFS-BOOTFS macro in the GRUB commands with the name of the dataset to be booted, which will be either the argument of the "bootfs" command (if one was specified) or the value of the "bootfs" pool property. In this way, the dataset selected for booting is passed to the kernel.

5. The kernel$ and module$ commands are executed, thereby loading the unix module and the boot archive into memory.
   The unix module is now a multiboot-compliant executable as a result of the integration of direct boot (PSARC/2006/568). When the unix module is executed, it reads the remainder of the files it needs for booting from the boot archive. Eventually, zfs_mountroot() is called.

6. zfs_mountroot() gets the name of the dataset to be mounted from the zfs-bootfs boot property. The vdev label from the boot device is read, which permits the reading of the pool metadata. The selected bootable dataset is then found in the pool metadata and mounted as root.

5.2 Booting from ZFS on sparc platforms

1. The boot device must have had a ZFS boot block installed on it. OBP reads the boot block into memory and executes it.

2. The ZFS boot block maps the boot device to a ZFS pool and reads enough of the pool's metadata to get the value of the "bootfs" pool property. The ZFS booter supports two features that allow an alternate dataset to be booted:

   i. The '-L' switch to the "boot" command is passed to the booter and causes the booter to read the /etc/lutab file in the pool dataset to get the list of available BEs.

   ii. The '-Z <dataset>' switch to the booter causes the specified dataset to be booted (instead of the dataset identified by the "bootfs" pool property).

3. The booter reads the file '/platform/`uname -m`/boot_archive' from the dataset selected for booting in the previous step (either by default or by explicit selection using '-Z'). The file is read into memory and set up as a ramdisk.

4. The booter creates a 'zfs-bootobj' property whose value is the identity of the dataset selected as the root file system. This is the same dataset from which the boot archive was loaded in the previous step. The booter also sets the value of the "fstype" property to "zfs".

5. The ramdisk set up by loading the boot archive into memory is itself booted. (This just means that blocks 1-15 of the device are loaded into memory and executed.)
   The ramdisk will have a boot block that matches the file system type of the ramdisk contents, which will be either HSFS or UFS. (There is no need for ZFS here.)

6. Unix is read and booted from the ramdisk. When krtld gains control, it mounts the ramdisk and loads additional kernel modules from it. Eventually, zfs_mountroot() is called (since the value of the "fstype" property was set to "zfs").

7. zfs_mountroot() imports the pool identified by the boot device and mounts the dataset identified by the "zfs-bootobj" property as root.

5.3 Checkpoint-Restart (CPR) Boot - SPARC only

The ZFS boot architecture supports the checkpoint-restart facility. A CPR boot takes place as follows. When the system was suspended prior to the CPR boot, the following things would have occurred:

1. The boot-command OBP variable would have been modified to "boot -F cprboot".

2. The contents of memory will have been written to a "state file" in the root file system. The location of the state file is specified in the /etc/power.conf file (default location: /.CPR).

3. The "boot" command will have been issued as follows:

      ok boot -F cprboot

The booter uses its file-system-specific reader to read the cprboot file from the UFS root file system (in the case of UFS) or the selected bootable dataset (in the case of ZFS). From here on, the cprboot program runs as usual. It reads its data (/etc/power.conf and the state file) from the root file system. In the case of ZFS, the boot loader loaded in the booter phase still provides the ability to read files from the selected bootable dataset, which is the ZFS root file system.

NOTE: If the dataset that is booted at the time of the checkpoint is NOT the default bootable dataset, the checkpointing code may have to store the name of the booted dataset in an OBP variable (perhaps along with the "-F cprboot" string) so that the correct dataset is searched for the state file.

5.4 Limits on ZFS Root Pools

Initially, root pools are limited to mirrored configurations only.
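For example (the pool name, BE name, and device names here are hypothetical, and whether the install tools issue exactly these commands is an assumption), a mirrored root pool built from one slice of each of two disks might be created and designated like this:

      # hypothetical devices; slice 0 of each disk holds one half of the mirror
      zpool create rootpool mirror c0t0d0s0 c0t1d0s0
      # designating the default bootable dataset makes this a root pool
      zpool set bootfs=rootpool/s10_u6 rootpool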
Striping of vdevs is not permitted, nor is RAID-Z. The reason for this is that the firmware must be able to read everything needed for booting from a single disk. Each disk in a root pool must be fully accessible from the firmware on the system.

Partly as a result of this constraint, best practice will be to have separate pools for system software and data. Users may not want to constrain the configuration of data pools to the limits imposed on root pools. This is not the only reason for this segregation, however. In general, there are advantages to managing the "personality" of a system separately from its data. The segregation of system software and data is not mandatory, however; users can combine them in one pool if they wish.

In a future release, it is likely that booting from RAID-Z pools will be supported.

Another restriction on root pools is that the devices in the pool must have SMI labels (i.e., not EFI labels). This is a restriction imposed by OBP and the install software.

6. ZFS Root in the Solaris Operating Environment

ZFS has a very different administrative paradigm from UFS, or just about any other file system type. For the most part, these new administrative concepts, such as pooled storage, offer great advantages when ZFS is used as a root file system, but they can also mean changing the rules. When deciding how to use ZFS as a root file system, we need to make tradeoffs between the familiarity of "the way we've always done it" and the advantages of the almost radically new ways that ZFS lets us manage system software. The next sections describe the ways that Solaris's interactions with root file systems will change with ZFS.

6.1 Division of the Solaris Name Space into Datasets

Back in the days of small disks, Solaris was routinely installed on multiple disk slices. The /usr, /opt and other directories in the Solaris name space were often installed in separate file systems because they wouldn't fit on one disk.
Then, as disks grew bigger, the default was to install all of Solaris in one file system. With ZFS, we might want to consider splitting the name space into separate file systems again, not for reasons of space, but because of administrative benefits. For starters, there's no strong reason NOT to install Solaris into multiple datasets. With ZFS, file systems are more like directories than what we used to call file systems: they require no pre-allocated storage and they have low overhead.

Some reasons to split the Solaris name space into separate file systems are:

1. Administrators might want to use different storage attributes for different parts of the name space. Perhaps the user would like to compress /opt, but not the root file system.

2. When cloning boot environments, some parts of the name space could be included in the new BE by reference. For example, since /var/adm/log reflects the history of the entire system, not just one BE, perhaps there should be only one of them, which is mounted into each BE. The cleanest way to do this is to make that directory its own file system.

3. In order to support booting from RAID-Z in the future, it might be advantageous to keep the root file system as small as possible so that it can be replicated on each device in the pool without excessive space use. Splitting directories such as /usr, /var, and /opt into separate file systems will help keep root small.

4. The installation of whole root zones would be simplified if the default inherited directories (/usr, /lib, /sbin, and /platform) were separate, clonable file systems. Whole root zones aren't used all that often right now, at least partly because they require a lot of disk space and time to set up. However, with ZFS cloning, whole root zones could be implemented quickly and use less space.

The exact division of the Solaris name space into datasets is largely TBD.
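As a sketch of reason 1 above, per-dataset properties let an administrator treat parts of the name space differently. The pool and BE names here ("rootpool", "s10_u6") are hypothetical:

      # illustrative only: per-dataset storage attributes within one BE
      zfs create rootpool/s10_u6                         # bootable dataset
      zfs create -o compression=on rootpool/s10_u6/opt   # compress /opt ...
      zfs create rootpool/s10_u6/usr                     # ... but not / or /usr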
The proposal at this time is to split the name space along the following lines:

   /
   /usr
   /var
   /opt
   /sbin
   /lib
   /platform
   /home
   /export

In addition, we would expect zone roots to be created in their own datasets.

One issue here is whether to further split /var into separate datasets. Some reasons for not splitting /var any further are:

1. Where do you stop? It would be nice if the various log files that you might want to share between BEs were all in one place, but they aren't. They're scattered all around /var.

2. LiveUpgrade already has a mechanism for sync'ing volatile files between BEs. Maybe Caiman's LiveUpgrade replacement will have a better solution to this problem. In the meantime, it may be best to continue to use our existing synchronization tools.

3. ZFS's copy-on-write capability allows us to clone a potentially large directory (such as /var/crash) without requiring any actual time-consuming, space-consuming copies. So there's no particular cost in having those parts of the file system cloned instead of shared.

On the face of it, having all of these new datasets looks like it greatly complicates the interfaces. But there are a couple of ways that the complication will be mitigated or hidden:

1. The tools that we provide for managing boot environments will have to shield users from the complexity of multi-dataset BEs anyway. Even if BEs are only divided into a couple of datasets, the process of creating and maintaining BEs is going to result in fairly complex scenarios of snapshots, clones, and dependency relationships. The tools for doing this (LiveUpgrade for now, some aspect of Caiman later) must allow the user to manage the transitions at the BE level, NOT the individual dataset level. Once we've done that, it becomes less of a problem to add additional datasets to the recommended or required configuration.

2. In the sections below on mounting and system initialization, we describe how the ZFS file system mount capabilities, together with some changes to the SMF methods for mounting the Solaris name space, can hide the dataset hierarchy, or at least make it something the administrator doesn't have to be aware of. In particular, ZFS makes it possible to set up hierarchies of mounts without requiring entries in /etc/vfstab.

6.2 Mounting the Boot Environment

ZFS supports automatic mounting of file systems, without the need for /etc/vfstab entries. ZFS does support the traditional method of mounting (if the dataset's "mountpoint" property is set to "legacy"), but there are real advantages in using ZFS's native mount approach. These advantages are especially useful when dealing with datasets in a BE.

Consider what would happen if legacy mounts were used for the datasets in a BE. Every dataset in a BE would have to have an entry in /etc/vfstab, with the dataset's name explicitly appearing in the "device to mount" field. If the dataset's name were changed, every one of those entries in /etc/vfstab would have to be updated. When a snapshot is made of a BE, the snapshot has an immediate internal inconsistency: the name of the dataset in the BE's /etc/vfstab entries is wrong, by definition. The name can't be corrected because snapshots are read-only. Now, granted, we don't support the booting of read-only roots yet, but we might want to do so in the future. The clone IS bootable, because it's read-write, but its /etc/vfstab entry is initially incorrect and needs to be fixed. Furthermore, the BE name in the /etc/vfstab entries is redundant information at the time the entries are read. The system already KNOWS the name of the dataset that has been booted, because that's how it got booted in the first place.
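For illustration, with legacy mounts each dataset of a BE named (hypothetically) rootpool/s10_u6 would need a vfstab line naming itself in the "device to mount" field:

      rootpool/s10_u6      -   /      zfs   -   no   -
      rootpool/s10_u6/usr  -   /usr   zfs   -   no   -

Any snapshot or clone of the BE carries these lines verbatim, so the "device to mount" fields are wrong in the copy until edited.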
The better alternative is to use ZFS's native mount method and name the datasets in such a way that the mount of the root dataset for the BE automatically results in the mount of the entire BE, thanks to ZFS mountpoint inheritance and automatic mounts. So suppose we have a BE named "s10_u6", composed of the following hierarchy of datasets:

   rootpool/s10_u6
   rootpool/s10_u6/usr
   rootpool/s10_u6/var
   rootpool/s10_u6/sbin
   rootpool/s10_u6/lib
   rootpool/s10_u6/platform
   rootpool/s10_u6/home
   rootpool/s10_u6/export

At the time the BE is installed, if the mountpoint property of rootpool/s10_u6 is set to "/a", the remainder of the datasets will automatically get mounted at /a/usr, /a/var, /a/sbin, and so on. When the BE is booted and the dataset rootpool/s10_u6 is mounted at "/", the remainder of the datasets automatically inherit the mountpoints /usr, /var, /sbin, and so on.

6.3 System Initialization

With ZFS, there is no particular reason to do an initial read-only mount of the root file system. With UFS, root had to be mounted read-only so that fsck could be run on it. ZFS has no fsck. The first mount of a ZFS root file system will be read-write, and there is no need for a later remount of either root or /usr.

Moreover, if /sbin, /lib, and /platform are separate datasets, as is desirable for whole-root zone support, those datasets will need to be mounted before the start of init(1M). The reason for this is that the very earliest userland code started by the kernel expects to be able to find necessary files and tools in /sbin and /lib, and maybe in /platform. This means a change to the meaning of "vfs_mountroot", at least for ZFS. It now means "mount the datasets that are required for init(1M)". This might seem like a rather large change, but really, we're just mounting what we always mounted in "mountroot". To the rest of the system, and to the user-level code, those mounts all happen atomically, and so userland has available what it has always counted on having.
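Because the root and its earliest datasets end up as ordinary ZFS mounts, later boot scripts can discover them from /etc/mnttab. A minimal sketch (the sample mnttab contents and dataset names below are illustrative; a real script would read /etc/mnttab directly):

```shell
# Illustrative only: detect a ZFS root from mnttab-format data and derive
# the /usr dataset name from the root dataset name. Field order in mnttab
# is: special, mount point, fstype, options, time.
mnttab_sample='rootpool/s10_u6	/	zfs	dev=2d90002	1181234567
rootpool/s10_u6/usr	/usr	zfs	dev=2d90003	1181234568'

# find the dataset mounted at "/" with fstype "zfs"
rootds=$(printf '%s\n' "$mnttab_sample" | awk '$2 == "/" && $3 == "zfs" { print $1 }')
if [ -n "$rootds" ]; then
    usrds="$rootds/usr"       # derived name; a real script would: zfs mount "$usrds"
    echo "root dataset: $rootds"
fi
```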
The rest of the startup process will have to change accordingly. This means that the SMF methods that do the local system mounts (fs-root, fs-usr, fs-minimal) will need to check for root being ZFS and perform the necessary actions for that case. The order of events will be something like this:

0. (already mounted: /, /sbin, /lib, /platform)

1. /lib/svc/method/fs-root will determine whether root is ZFS by reading /etc/mnttab, and if so, will acquire the root's dataset name from /etc/mnttab too. Then it will derive the dataset name for /usr from the dataset name for root and mount /usr using "zfs mount".

2. If root is ZFS, /lib/svc/method/fs-usr will not need to remount root or /usr.

3. If root is ZFS, /lib/svc/method/fs-minimal will look for all the file systems it normally mounts (/var, /var/adm, /var/run) in the list of the BE's datasets and mount them using ZFS mounts, if they are separate datasets.

4. The /lib/svc/method/fs-local script will need to change to look at ALL the local mounts (both those in /etc/vfstab and the ZFS native mounts), order them correctly, and perform them, so that the remainder of the file systems in the BE get mounted. This will require a change to the operation of "/sbin/mount -a". (This change to the operation of mount -a is required no matter what. It was always broken to perform all the /etc/vfstab mounts first, and then do the "zfs mount -a": if there are mounts in /etc/vfstab whose mount points are supplied in file systems mounted by ZFS, those mounts will fail. There is already a bug against this and it has nothing to do with ZFS root. But ZFS root will make the problem worse yet.)

6.4 Swap/Dump

We would like to allocate the space for swap and dump from within the pool if at all possible. This has several advantages: it simplifies the process of formatting disks for install, and it provides the ability to re-size the swap/dump area without having to re-partition the disk.
A zvol might seem like the best option for this, but the semantics of zvols aren't quite right for swap and dump. For one thing, dump requires that the space be preallocated, which zvols don't do. Moreover, the copy-on-write model for zvols isn't necessarily what you want when swapping. In general, there are performance issues and possibly deadlock issues with using zvols as swap devices. At some point, this might get resolved (see the proposed VM2 project), but it won't get resolved in the near term. So instead, this project will eventually propose a new interface to ZFS which allows the preallocation of areas of disk space for the purpose of creating a logical swap/dump device. This disk space will not be a true zvol. The details of this interface are still TBD.

7. Installation

7.1 Overview of Solaris Installation

There are several ways to install and upgrade Solaris:

(a) Mini-root based installs

Some of the install procedures are run while the system is booted from a miniroot. A miniroot is the operating system on a DVD or CD, or one downloaded to the system during a netinstall, which runs out of a ramdisk. In all of these scenarios, the system being installed is not booted off local writable storage. This is the only kind of install that can be used on an uninstalled system (i.e., a system with no bootable local disks).

Mini-root installs include initial install and upgrade. Both initial install and upgrade can be done using an interactive install program (either tty-based or GUI) or by Jumpstart, which is a hands-off install/upgrade that updates the system in accord with a specification file, called a "profile".

It is also possible to install and upgrade systems using Flash archives. A Flash archive is an image of an installed system which can be copied to a device using sequential I/O. It's much faster than a regular install and allows systems to be pre-customized and configured.
A "differential" Flash archive, which contains only a subset of the files on an installed system, can be used to upgrade a system.

(b) Live installs/upgrades

If a system has been configured with extra disk space, it is possible to do an install or upgrade of a system that is booted off a local disk. While booted off a previously-installed bootable environment on the system, a new bootable environment can be installed (on spare storage). This new bootable environment can be an upgrade of the existing bootable environment. The new bootable environment can also be built from a Flash archive.

(c) Setup of Servers of Diskless Clients

The setup of servers for diskless clients can either be done during a Jumpstart install, or by the smosservice(1M) command.

7.2 Changes to Installation to Support ZFS Boot

The following is an analysis of how the installation software will need to be changed to support ZFS boot. Before looking at how the many parts of install will need to change, we need an overview of how the installation of a system with a ZFS root will work.

7.2.1 Some Guiding Design Principles

1. For now, ZFS boot disks will still have SMI labels. By default, when a zpool is created on an entire disk, the disk is given an EFI label. But the boot PROMs and the install software don't yet support EFI labels. ZFS boot must live within that limitation, for now. So when an install application specifies that a pool be created on a full disk, the disk will be formatted with an SMI label with a VTOC that dedicates all of the disk (except for the small amount of space required for the GRUB slice on x86) to slice 0, which then becomes the vdev for the root pool.

2. ZFS pools do not require entire disks. Even though best practices recommend using the entire disk for a pool, we recognize that some systems have very large disks which the administrator might not want to dedicate entirely to a root pool.
   So we support the splitting of disks into a slice for a root pool (or one mirror of a root pool) and the remainder of the disk for slices for other purposes, including pools that are not root pools.

7.2.2 Initial Install

1. Determine whether the system being installed will end up with a UFS root or a ZFS root. It will not be possible to install part of the system software on ZFS and part on UFS. It will, of course, be possible to have ZFS file systems on a system with a UFS root and vice versa, but the software installed by Solaris install or upgrade must be either all on ZFS or all on UFS.

2. If UFS, install will work as it does now. If ZFS, the user will have the opportunity to designate multiple disks or disk slices for the root pool. In the first phase of ZFS boot support, these disks will be used to create a mirrored vdev. (Striped and RAID-Z configurations will be supported in the future.) The user will have the opportunity to set the size of the root pool (the default should be the entire disk, but we can't require a whole disk).

3. The software to be installed will be selected (as it is now with UFS root file systems).

4. There will be a default division of the Solaris name space into separate datasets (see "Namespace Divisions" below). The user will not have an opportunity to modify this default division. (TBD: It might be possible to allow the user to add additional points at which separate datasets will be created. They may not remove the default divisions, however. Those will be required.)

5. The disks will be partitioned as specified in step 2.

6. The root pool will be created and all datasets determined in step 4 will be created. The swap space will be allocated within the pool (see "Swap/Dump Implementation" below).

7. The standard "software install" part of the install backend will install all of the Solaris packages into the root file system datasets.

8. The boot "overhead" will be installed as necessary.
The existence of a new bootable dataset will be recorded in the pool metadata. On x86, the menu.lst file will be created or updated to include the new boot environment. The new boot environment will be recorded in /etc/lutab (and in whatever other overhead files are required to establish a BE in LiveUpgrade). The boot archive will be created and the boot loader will be installed on each disk in the root pool.

One thing to note about the above procedure is that step 8 requires the setup of a LiveUpgrade boot environment. Currently, when a system is installed, a "BE" in the LiveUpgrade sense is not established. This will change: ZFS bootable environments will always be recorded as LiveUpgrade-compliant BEs. The overhead files that establish LiveUpgrade BEs will eventually be processed by the install tools that are part of Caiman. (Caiman might not use the same files as LiveUpgrade, but it will understand them and be able to process BEs established by LiveUpgrade.)

The above steps describe interactive installs. Jumpstart installs follow a similar series of steps, but the "questions" are answered from the profile rather than by querying a user.

7.2.3 Upgrade

A system "upgrade" is the process of converting an installed instance of Solaris to a later version while preserving all local customizations. The standard Solaris miniroot-based installation program currently supports an option to upgrade the installed Solaris instance. This upgrade is done "in place" (i.e., the new bits are written into the same file system where the older version was, thereby replacing the old version). This kind of "in-place" upgrade, done from the miniroot-based install program, will not be supported for zfs root file systems. There are several reasons for this:

1. "In-place" upgrade dates from the days when disks were much smaller and it was common not to have enough space for two Solaris instances.
That's not true now that disks are seldom smaller than 80 GB or so, and Solaris instances are around 6 GB.

2. A "copy and upgrade" model has several advantages over an "in-place" upgrade. It can be done while the system is "live", and it can be easily backed out. A "copy and upgrade" model also does not have the so-called "toxic patch" problem.

3. ZFS is ideally suited for the "copy and upgrade" model of system upgrade. With ZFS, a Solaris instance can be easily cloned and modified. Because of pooled storage, the new Solaris instance doesn't require its own slice.

If an already-installed system is booted from an installation DVD/CD or a netinstall image, the install "discovery" software will detect the existence of any ZFS root pools on local storage. UFS roots are also detected. The logic for interactive installs will be as follows. All local storage will be categorized:

    Category 1: contains an upgradable ufs root file system
    Category 2: is part of a ZFS root pool
    Category 3: is part of a ZFS non-root pool
    Category 4: all other file systems

    if (there are any root pools present) {
        if (there are any upgradable ufs root file systems) {
            Print a message indicating that the ZFS root pool can't be
            upgraded, but one or more of the ufs root file systems can.
            Allow the user to either
                * select a ufs file system for upgrade, or
                * do an initial install
        } else {
            Print a message indicating that the ZFS root pool can't be
            upgraded.
            Allow the user to
                * do an initial install
        }
    } else {
        if (there are upgradable ufs root file systems) {
            Allow the user to either
                * select a ufs file system for upgrade, or
                * do an initial install
        } else {
            Allow the user to do an initial install
        }
    }

In all cases, if the user opts for an action that would destroy an existing pool, the user must be warned of that and given an opportunity to abort the install.

So if we don't allow a miniroot-based upgrade of a ZFS root file system, how DO we upgrade it? The answer is LiveUpgrade and, eventually, its follow-on, Caiman.
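The branching above reduces to two independent questions: are any root pools present (which only affects the warning printed), and are any upgradable ufs roots present (which only affects whether an upgrade choice is offered)? A minimal Python sketch of that logic (the function name, message text, and choice strings are assumptions for illustration, not the installer's real interface):

```python
# Illustrative sketch of the interactive-install decision logic.
# Names and strings are hypothetical; only the branching mirrors the
# design text above.

def upgrade_choices(root_pools_present, upgradable_ufs_roots_present):
    """Return (message, choices) the interactive installer would offer."""
    message = None
    if root_pools_present:
        # A ZFS root pool can never be upgraded in place by the
        # miniroot installer; only LiveUpgrade can upgrade it.
        message = "ZFS root pools cannot be upgraded here"
    choices = []
    if upgradable_ufs_roots_present:
        choices.append("select a ufs file system for upgrade")
    choices.append("do an initial install")  # always offered
    return message, choices
```

In every branch, the installer must still warn before any action that would destroy an existing pool.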
This document will discuss how upgrade will be done with LiveUpgrade, since that will be the only upgrade mechanism until Caiman is released.

7.2.4 Servers of Diskless Clients

Since exported services are just local directories, it will be possible to export services from zfs datasets. No special support for zfs is needed.

7.3 Details of Changes to the Install Tools

7.3.1 Jumpstart profile Interpretation (pfinstall)

New keywords are defined to support the creation of ZFS pools and datasets. A detailed description of these keywords is provided in Appendix A. Not all of these keywords will have corresponding screens in the interactive install programs. There is precedent for allowing more configuration capabilities in profiles than are supported in the interactive install programs (currently, the only way to set up a mirrored root with SVM is by a profile-based install).

7.3.2 Interactive Install Programs

The interactive miniroot-based install programs are ttinstall (which has a character-based interface) and the install GUI. At this time, there are no plans to support the setup of zfs roots from the install GUI. Only ttinstall will support zfs root. (Naturally, the Caiman install will support ZFS fully.) The ttinstall program has a "parade", a series of screens that query the user for the details of how the system is to be installed. The "parade" will need new screens to determine the following:

1. whether the system to be installed will have a zfs root
2. the disks to be added to the pool
3. the name of the root pool and the root dataset
4. how much of the disk should be dedicated to the root pool (the default is "all")

In the initial (pre-Caiman) version of zfs root install, it will not be possible to set up a system with a zfs root using Flash Install.

7.3.3 LiveUpgrade

The required changes to LiveUpgrade for zfs fall into two areas:

1. The changes required to enable boot environments (BEs) to be in zfs datasets.

2.
Changes required to support zfs datasets as subordinate file systems in BEs with a ufs root. This includes zfs file systems mounted under a ufs root, and support for non-global zone roots in zfs datasets.

Technically, the items in (2) have nothing to do with zfs boot and should have been done as part of the original zfs integration. For whatever reason, most of them were not. They need to be done now, since the use of zfs as a root file system necessitates full support for zfs in LiveUpgrade.

LiveUpgrade will be modified to make it possible to create a boot environment (BE) whose root is a zfs file system. These zfs-based BEs can be populated by cloning an existing BE with either a UFS or ZFS root. Cloning a UFS root will be the most common way to migrate from UFS root to ZFS root. If the source BE of a lucreate has a zfs root file system, the target BE will be created as a zfs clone of the source BE. This means that the lucreate will be very fast and the new BE will initially occupy very little space.

Currently, LiveUpgrade never partitions disks or allocates disk space for BEs. It depends on the slices to be used for BEs having already been created. This will remain true for zfs LiveUpgrade support. With zfs, there are two steps in allocating space: formatting disks and creating storage pools. Lucreate will do neither; both actions must have been performed before lucreate can create a BE in a zfs dataset. Note that LiveUpgrade will work for zfs and can be used to migrate systems from a ufs root to a zfs root, but the root storage pool must have been created beforehand by the administrator (since allocating storage has never been part of LiveUpgrade's job).

LiveUpgrade will work differently than it does now in the following ways:

* Mirroring of zfs file systems will not be directly supported by LiveUpgrade. Mirroring in zfs is fundamentally different from mirroring using SVM plus UFS.
With zfs, mirroring is done at the pool level, not the file system level. Therefore, it's not meaningful to specify that a zfs mount be mirrored. If users want mirrored storage for their BEs, they must create their storage pool using a mirrored vdev configuration. So the "attach" and "detach" options will not apply to zfs mounts.

* One of the purposes of having SVM-mirrored BEs (or more exactly, mirrored file systems within BEs) is to allow the fast "cloning" of a BE by detaching one side of the mirror and using that detached device as a new BE (and the basis of a patch or an upgrade). ZFS-based BEs will support a fast cloning capability even though mirroring of individual ZFS file systems is not supported. With ZFS, fast cloning of a BE will be performed by zfs's own dataset cloning capability. This is much easier to manage and plan for than SVM mirroring because it isn't necessary to pre-allocate a fixed amount of space for the clone of a ZFS file system.

There are two variants of the lucreate command that can be used with zfs root file systems:

1. Migrating a BE to a new pool. This can either be a migration from one ZFS pool to another, or from a UFS root to a zfs root. The form of the command is:

    lucreate -n <BE_name> -t <pool>:<mount_point>[:<size>]

This command requests that a new BE named <BE_name> be created in the pool named <pool>. At least one -t argument with a <mount_point> equal to "/" must be specified; that -t argument specifies where the new BE will be created.

2. Cloning an existing ZFS-based BE to a new BE in the same pool:

    lucreate -n <BE_name>

This command will be very easy to execute. All that is required is that the new BE be named. All datasets in the source BE (called the "PBE" in LiveUpgrade terminology) will be cloned and will appear as separate datasets in the new BE as well. All other LiveUpgrade commands, such as luactivate and luupgrade, can be used on the new BE. LiveUpgrade will take care of all the details of cloning all the datasets in the BE.
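As a hedged illustration only (the BE names and the pool name "rpool" are hypothetical, and the target pool must already exist, since lucreate never creates pools), the two variants might be invoked like this:

```
# Variant 1: migrate the current BE into the existing pool "rpool",
# with the new BE's root mounted at "/":
lucreate -n zfsBE -t rpool:/

# Variant 2: clone the current ZFS-based BE within the same pool:
lucreate -n zfsBE2
```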
The user shouldn't have to be aware that the BE is composed of multiple file systems. This is much easier than the use of LiveUpgrade with UFS, where each mounted file system in a BE must be represented by a "-m" option on the lucreate command line.

8. Summary

This project is the first step toward moving Solaris to an environment in which system software is maintained entirely within pools, and in which it is relatively easy and safe to modify, patch, or upgrade the root file system and its subordinate file systems. It is also an important step toward making ZFS the principal local file system type for Solaris.

9. References

PSARC/2004/454 Solaris Boot Architecture
PSARC/2006/525 Newboot Sparc
PSARC/2007/083 ZFS Bootable Datasets

Appendix A - Jumpstart Profile Keywords for ZFS

Keyword: install_type
Syntax: install_type <type>

Possible values for <type>: initial_install, upgrade, flash_install, flash_update

"initial_install" will continue to mean "bare-metal install", as defined above. Preservation of existing slices/devices is allowed, but modification of existing pools will not be allowed. Ditto for "flash_install". "upgrade" will NOT allow an upgrade of an existing zfs bootable dataset. As it does now, it will allow the in-place upgrade of existing ufs boot environments only.

---------------------------------------------------------------------

Keyword: pool
Purpose: specify the characteristics of a pool to be created
Syntax: pool <poolname> <size> <mount_point> <vdevlist>

This always creates a new root pool with the specified name, using the specified <vdevlist>. A <vdevlist> is either a single device name, or the keyword "mirror" followed by one or more device names. The keyword "mirror" may seem redundant, since it is the only allowed configuration when more than one device is in a pool (concatenation and RAID-Z are not supported), but it is included to give us the flexibility to implement concatenation and RAID-Z in the future, should that ever be possible.
Any of the devices in the <vdevlist> can be "any", which means that a device of adequate size will be found on the system (if there is no free disk, or none of sufficient size, the install will stop with an error). Yes, if you specify "any any", you will get a mirror of whatever two disks can be found: the first two suitable disks will be used. If there aren't two suitable disks available, the install will stop with an error.

At this time, the <mount_point> can only be "/".

<size> can be one of the following values:

    auto     - automatically select the size, based on other constraints
               in the profile and the required size of the root pool. By
               default, the entire disk will be dedicated to the pool if
               there are no other claims on the disk's space.
    existing - use the existing size of the specified slice (only works
               for an explicit slice designation).
    all      - use the entire disk for the root pool.
    free     - use the free space remaining on the designated disk.
    <num>    - size explicitly specified in megabytes.

----------------------------------------------------------------------

Keyword: dataset
Purpose: specify a dataset or BE to be created. If the mount point is "/", this command will create an entire BE, with all required subordinate file systems. That is, separate profile entries for all the subordinate file systems are not required. If provided, they can override the options for those subordinate datasets. Name space divisions other than the required ones can also be requested using this command.
Syntax: dataset <dataset_name> <size> <mnt_pnt> [<properties>]

dataset_name: must be of the form <poolname>/<dataset_path>

size: can have the value <num>[:<modifier>] or "auto", where <num> is specified in megabytes. The optional <modifier> tells the profile interpreter how to interpret the size value. Options are:

    reserve   - reserve the size in the pool for the dataset.
    quota     - establish a quota of the specified size.
    guideline - make sure there is at least this much space in the pool
                for the dataset (i.e.
use it to verify that the pool is big enough), but don't reserve the space or make it a quota.
    zvol      - make the dataset a zvol of the specified size.

The default value of the size modifier is "guideline". The entire <size> field can have the value "auto", in which case the dataset will be created with no specified size and will grow to whatever size it needs during the install. At this time, only the "auto" keyword is supported.

mnt_pnt: absolute mount point name, "default", or "none".

properties: (optional) a white-space-separated list of <property>=<value> pairings.

[TBD - need examples]
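Until official examples are supplied, here is a hypothetical profile fragment using the keywords above (the device names, pool name, and dataset name are invented for illustration, and the argument order follows the syntax sketched in this appendix):

```
install_type  initial_install
pool          rpool auto / mirror c0t0d0s0 c0t1d0s0
dataset       rpool/ROOT/be1 auto /
```

This would create a mirrored root pool on the two named slices, sized automatically, and a BE whose root dataset is mounted at "/".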