ZFS as a Root File System (PSARC/2006/370)
Author: Lori Alt (lori.alt@sun.com)
January 9, 2008

0. Contents

1. Introduction
2. Overview
3. Design Goals
4. Terminology
5. Boot Design
   5.1 Booting from ZFS on x86 platforms
   5.2 Booting from ZFS on sparc platforms
   5.3 Importing the Pool State during Boot
   5.4 Limits on ZFS Root Pools
6. ZFS Root in the Solaris Operating Environment
   6.1 Division of the Solaris Name Space into Datasets
   6.2 Naming the Datasets in the Boot Environment
       6.2.1 New "noauto" value for "canmount" property
   6.3 Mounting the Boot Environment
   6.4 System Initialization
   6.5 Swap/Dump
   6.6 Checkpoint-Restart (CPR) Boot - SPARC only
   6.7 Managing Boot Environments
       6.7.1 Format of the sparc menu.lst file
7. Installation
   7.1 Overview of Solaris Installation
   7.2 Changes to Installation to Support ZFS Boot
       7.2.1 Some Guiding Design Principles
       7.2.2 Initial Install
       7.2.3 Upgrade
       7.2.4 Servers of Diskless Clients
   7.3 Details of Changes to the Install Tools
       7.3.1 Jumpstart profile Interpretation (pfinstall)
       7.3.2 Interactive Install Programs
       7.3.3 LiveUpgrade
8. References
9. Interface Table
   9.1 Exported Interfaces
   9.2 Imported Interfaces
   9.3 Interfaces Reimplemented
Appendix A - Jumpstart Profile Keywords

1. Introduction

The ZFS file system was introduced in Update 2 of Solaris 10. One of the limitations of the initial release of ZFS was that it could not be used as a root file system. This document describes the changes to Solaris necessary to enable ZFS to be used as a root file system. The implementation of the design specified here will enable users to install systems with ZFS roots. It is intended for integration into Solaris Nevada and then into the most appropriate Solaris 10 Update release.

2. Overview

Enabling the ZFS file system to be Solaris's root file system requires the following broad tasks:

- Boot support must be added. This includes (but is not limited to) the implementation of boot loaders for both the sparc and x86 architectures.
- The operation of basic system management tasks, such as the booting of zones, or the mounting of file systems, must be defined in the context of a ZFS root. What does a system with a ZFS root look like? How is it different from a system with a UFS root? How are administrative tasks affected by the new root file system type?

- The installation software must be modified to support the creation of ZFS file systems and the installation of Solaris into them. The installation software must also support upgrades and patching of systems with ZFS root file systems.

These three broad areas--booting, system operation, and installation--will be addressed by this document.

3. Design Goals

This project is the first implementation of ZFS as a root file system for Solaris. The installation aspects of this project are for the near-term only. In the medium to long term, a new install project, Caiman, will take over installation and upgrade. Because of this, it is our goal that the installation of systems with ZFS roots be "good enough" in this project. We're not going to invest a lot of work in code that will end up being thrown away soon. The existing install code is very old (written in the early 1990's). It was written for a different world, where disks were small and expensive (along with many other differences from today, though disk size is the one that affects the install design the most).

However, this implementation of ZFS root does have to prepare for the long run. Once users move to the new model of pooled storage, they're not going to want another paradigm shift after that. We need to define the "way it works" in such a way that the transition from the initial release of ZFS root to Caiman will be smooth. Caiman will have both an initial install and an ongoing maintenance component.
Caiman's ongoing maintenance component in particular needs to be able to manage transitions from systems that were installed with the tools and processes described in this initial ZFS boot project.

With those concerns in mind, these are the goals of this design:

- Provide the tools to install systems that can make immediate use of many of the features of ZFS that are most valuable for system software. These features include:

  * pooled storage, giving users great flexibility in the setup of bootable environments.
  * snapshots and clones, which enable users to quickly and safely perform patches and upgrades.
  * robustness of storage, including mirroring.

- Enable the migration of systems with UFS root file systems to ZFS roots.

- Define the administrative aspects of ZFS root file systems for the long term, as much as is reasonably possible at this time. Whenever possible, do things "the ZFS way", so that system software management will more naturally track the developments in ZFS over the years.

- Prepare for Caiman. Set up installed systems in such a way that Caiman will be able to manage them. Define administrative procedures that are advantageous for Caiman's goals.

4. Terminology

The following terms are important in this design and will occur throughout the document:

root pool -- A ZFS storage pool that has been designated as a "root pool" by setting the value of the pool's "bootfs" property to something other than "none". When any of the devices that compose a root pool is booted (by specifying it as the boot device to the boot loader), the pool as a whole will be "booted", which means that a dataset in the pool (selected as described below) will become the root file system and will provide the necessary files for booting, such as the boot archive.

pool dataset -- The dataset at the root of a pool's dataset namespace. The pool dataset is the dataset that exists by default in every pool. The pool dataset's name is the same as the pool's name.
A pool named "tank" will have a pool dataset also called, simply, "tank". A pool dataset does not necessarily contain a root file system. In fact, usually it won't. It probably won't contain much of anything. It does have one very useful attribute: each pool has exactly one pool dataset. It cannot be deleted without deleting the pool itself. This makes the pool dataset a good place for files that contain per-pool state (as opposed to per-dataset state). The menu.lst file (the menu file for GRUB) is an example of a file that contains per-pool state and thus will be stored in the pool dataset.

bootable environment -- Often abbreviated as a "BE". A bootable environment is basically a Solaris root file system, plus everything that is subordinate to it (such as mounted file systems). There can be multiple bootable environments in a root pool. In the case of a ZFS root pool, the Solaris root file system in a BE is a ZFS dataset. There can be multiple root file system datasets, and thus multiple BEs, in a root pool.

bootable dataset -- a dataset which contains a root file system and which is part of a BE. Every BE contains exactly one bootable dataset. A BE can contain other datasets (such as a /usr or /opt file system), but those other datasets are not considered bootable. A bootable dataset must contain a boot archive.

default bootable dataset -- a dataset named by the value of the root pool's "bootfs" property. If a pool has a value of something other than "none" for its "bootfs" property, the pool is a root pool. If a device that is part of a root pool is booted, and a specific dataset to be booted is not explicitly identified in the command to the boot loader, the dataset identified by the "bootfs" property will be the root file system dataset to be booted.

5. Boot Design

The process of booting x86 platforms was revised by the Solaris Boot Architecture project (PSARC/2004/454), also called "newboot".
The booting of SPARC platforms has also been modified to adhere to the newboot architecture (PSARC/2006/525). The main aspects of newboot that characterize both the x86 and sparc implementations are:

* The use of a boot archive, which is a file system image that contains the files required for booting, and

* The use of a ramdisk as the root file system during the early stages of booting. In the case of booting for the purpose of doing an installation, the ramdisk is the root file system for the entire installation process, which eliminates the need to boot from removable media.

At this time, the file system type of the ramdisk file system can be either HSFS or UFS, but not ZFS. (ZFS is not well-suited for use as a ramdisk file system, and there is no particular reason to use it that way. UFS or HSFS is a better choice.)

ZFS boot, on both x86 and SPARC, assumes the newboot style of booting. On both the x86 and SPARC platforms, for both ufs and ZFS root file systems, the following tasks must be accomplished during the process of booting:

1. The boot device must be identified.

2. The PROM (either OBP or the BIOS) reads a boot loader from the boot device. The boot loader is loaded into memory and executed.

3. The boot loader selects a bootable environment (BE) to be mounted as the root file system. This selection can be made based on user input, variables set in NVRAM, or however else the particular boot loader has been programmed to make the choice.

4. The boot archive for the selected boot environment is loaded into memory and that memory range is accessed as a ramdisk, from which the files needed for booting are loaded and executed.

The biggest difference between booting from UFS and booting from ZFS is that with ZFS, a device identifier does not uniquely identify a root file system (and thus a BE). With ZFS, a device identifier uniquely identifies a pool, which can contain multiple bootable datasets.
So with ZFS, there must be a mechanism for identifying the dataset to be used as the root file system. The mechanism for specifying the dataset to be booted was defined in PSARC/2007/083. A pool property, "bootfs", specifies the default bootable dataset for the pool. When a device in a root pool is booted, the dataset mounted by default as the root file system is the one identified by the "bootfs" pool property. The user can override this selection, however. On x86 platforms, the GRUB menu can be used to select an alternate bootable environment. On sparc platforms, an option to the OBP "boot" command can specify the dataset to be booted.

5.1 Booting from ZFS on x86 platforms

Much of this was already defined by PSARC/2007/083 (ZFS Bootable datasets). It is summarized here for review purposes. The steps by which a ZFS root file system is booted are:

1. The BIOS reads the Master Boot Record (MBR) from the boot disk. The MBR identifies the location where the GRUB boot loader has been installed on the disk. The version of GRUB installed for the purpose of booting from ZFS has a ZFS-reader built into it.

2. The ZFS GRUB plug-in special-cases the menu.lst file. When asked to read the menu.lst file, the plug-in reads it from the pool dataset.

3. GRUB presents the menu.

4. The menu entry for a boot environment with a ZFS dataset as its root looks like this:

       title Solaris
       kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS
       module$ /platform/i86pc/$ISADIR/boot_archive

   Optionally, it can contain a command of the form:

       bootfs <dataset name>

   before the kernel$ command. The ZFS plug-in will replace the $ZFS-BOOTFS macro in the GRUB commands with the name of the dataset to be booted, which will be either the argument of the "bootfs" command (if one was specified) or the value of the "bootfs" pool property. In this way, the dataset selected for booting is passed to the kernel.

5. The kernel$ and module$ commands are executed, thereby loading the unix module and the boot archive into memory.
The unix module is now a multiboot-compliant executable as a result of the integration of direct boot (PSARC/2006/568). When the unix module is executed, it reads the remainder of the files it needs for booting from the boot archive. Eventually, zfs_mountroot is called.

6. zfs_mountroot() gets the name of the dataset to be mounted from the zfs-bootfs boot property. The vdev label from the boot device is read, which permits the reading of the pool metadata. The selected bootable dataset is then found in the pool metadata and mounted as root.

5.2 Booting from ZFS on sparc platforms

1. The boot device must have had a ZFS boot block installed on it. OBP reads the boot block into memory and executes it.

2. The ZFS boot block maps the boot device to a ZFS pool and reads enough of the pool's metadata to get the value of the "bootfs" pool property. The ZFS booter supports two features that allow an alternate dataset to be booted:

   i.  The '-L' switch to the "boot" command is passed to the booter and causes the booter to read the /boot/grub/menu.lst file in the pool dataset and to present a simple menu of BEs.

   ii. The user can either select a BE from the menu printed by the '-L' option or can boot a specific dataset by using the '-Z <dataset name>' switch to the booter.

3. The booter reads the file '/platform/`uname -m`/boot_archive' from the dataset selected for booting in the previous step (either by default or by explicit selection using '-Z'). The file is read into memory and set up as a ramdisk.

4. The booter creates a 'zfs-bootobj' property whose value is the identity of the dataset selected as the root file system. This is the same dataset from which the boot archive was loaded in the previous step. The booter also sets the value of the "fstype" property to "zfs".

5. The ramdisk set up by loading the boot archive in memory is itself booted. (This just means that blocks 1 - 15 of the device are loaded into memory and executed.)
The ramdisk will have a boot block that matches the file system type of the ramdisk contents, which will be either HSFS or UFS. (No need for ZFS here.)

6. Unix is read and booted from the ramdisk. When krtld gains control, it mounts the ramdisk and loads additional kernel modules from it. Eventually, zfs_mountroot is called (since the value of the "fstype" property was set to "zfs").

7. zfs_mountroot() imports the pool identified by the boot device and mounts the dataset identified by the "zfs-bootobj" property as root.

5.3 Importing the Pool State during Boot

Early in the kernel startup stage, before the zfs root file system can be mounted, the state of the root pool must be imported. The zpool.cache file cannot be read at this time because it isn't in the boot archive (see PSARC/2006/525 - Newboot Sparc). However, the information for the root pool is available in the metadata stored on the boot disk (which, by definition, is a vdev in the root pool). So the metadata used to make the root pool active is read directly from the boot disk.

Later in the boot process, after root is mounted, the zpool.cache file is read as usual. Once the zpool.cache file is read, all other pools listed in it can be imported. If the configuration of the root pool in the boot disk's metadata differs from the configuration of that same pool in the zpool.cache file, the data read from the boot disk takes precedence. (This could happen when booting an old BE that existed prior to some changes to the pool configuration.)

5.4 Limits on ZFS Root Pools

Initially, root pools are limited to mirrored configurations only. Striping of vdevs is not permitted. Nor is RAID-Z. The reason for this is that the firmware must be able to read everything needed for booting from a single disk. Each disk in a root pool must be fully accessible from the firmware on the system. Partly as a result of this constraint, best practice will be to have separate pools for system software and data.
Users may not want to constrain the configuration of data pools to the limits imposed on root pools. This is not the only reason for this segregation, however. In general, there are advantages to managing the "personality" of a system separately from its data. The segregation of system software and data is not mandatory, however. Users can combine them in one pool if they wish. In a future release, it is likely that booting from RAID-Z pools will be supported.

Another restriction on root pools is that the devices in the pool must have SMI labels (i.e., not EFI labels). This is a restriction imposed by OBP and the install software.

6. ZFS Root in the Solaris Operating Environment

ZFS has a very different administrative paradigm than UFS, or just about any other file system type. For the most part, these new administrative concepts, such as pooled storage, offer great advantages when ZFS is used as a root file system, but they can also mean changing the rules. When deciding how to use ZFS as a root file system, we need to make tradeoffs between the familiarity of "the way we've always done it" versus the advantages of the radical new ways that ZFS lets us manage system software. The next sections describe the ways that Solaris's interactions with root file systems will change with ZFS.

6.1 Division of the Solaris Name Space into Datasets

By default, the entire Solaris distribution will be installed into a single dataset (that is, no separate datasets for /usr, /opt or /var). There will be an option to put /var into a separate dataset. The reasons for this are:

1. It's simple and reflects current practice.

2. Without a clear plan for how Caiman would make use of additional datasets in the name space, there wasn't a good reason to have them.

3. The exception for /var is because some environments require a separate /var in order to prevent growth in /var from filling up the root file system and resulting in a Denial Of Service situation.
The space used by a separate /var dataset can be limited by a quota. In addition, it will be a documented best practice to create zone roots in their own datasets. The installation applications create /export and /export/home directories. These will be created as separate datasets and will be shared between BEs.

6.2 Naming the Datasets in the Boot Environment

The naming convention used by the installation and BE management software will be as follows: There will be a "container" dataset defined with the name:

    <rootpool>/ROOT

and a "mountpoint" property value of "legacy". (This dataset will not appear by default in any BE's file name space.) The datasets immediately under <rootpool>/ROOT will contain BEs, named as follows and with the following values for their mount point properties:

    dataset name                     mountpoint property
    ------------                     -------------------
    <rootpool>/ROOT/<BE name>        /
    <rootpool>/ROOT/<BE name>/var    inherited

where the /var dataset is optional.

6.2.1 New "noauto" value for "canmount" property.

ZFS datasets currently have a "canmount" property. If "on", the dataset can be mounted and is automatically mounted at import, creation, and system startup. If "off", the dataset can't be mounted at all. For zfs datasets in BEs, there needs to be a third option: "noauto". The "noauto" value has the following effect:

* The dataset can be mounted.

* The dataset can only be mounted and unmounted explicitly. It is not mounted automatically when the dataset is created or imported, and it is not mounted by "zfs mount -a" or unmounted by "zfs unmount -a".

This property is required for datasets in a BE because there could be multiple datasets with a mountpoint of "/". Obviously, only one of them can be mounted at a time. Furthermore, at the time that the install software creates a new bootable dataset and gives it a default mount point of "/", the creation would fail if there were no "noauto" option because there is already a file system mounted at "/".
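As an illustrative command sequence (the pool and BE names here are hypothetical, and "canmount=noauto" is the new value this section introduces), a second bootable dataset might be created like this:

```
# Create a second bootable dataset that shares the "/" mountpoint with
# the active BE; "noauto" keeps it from being mounted automatically at
# creation, at import, or by "zfs mount -a":
zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/be2
zfs get canmount rpool/ROOT/be2
```

Without "noauto", the zfs create above would fail while another dataset is mounted at "/".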
6.3 Mounting the Boot Environment

ZFS supports automatic mounting of file systems, without the need for /etc/vfstab entries. File systems that compose the BE will use zfs mounts, not legacy mounts. The fact that the root file system and the /var file system (if one exists) don't have entries in /etc/vfstab will simplify the process of cloning BEs because the /etc/vfstab file will not need to be edited to modify the name of the dataset being mounted.

6.4 System Initialization

With ZFS, there is no particular reason to do an initial read-only mount of the root file system. With UFS, root had to be mounted read-only so it could have fsck run on it. ZFS has no fsck. The first mount of a ZFS root file system will be read-write and there is no need for a later remount. Of course, the root dataset will be mounted read-only anyway if it is a read-only dataset (such as a snapshot; booting of snapshots is not supported yet, but probably will be at some point).

6.5 Swap/Dump

On systems with zfs root, swap and dump will be allocated from within the root pool. This has several advantages: it simplifies the process of formatting disks for install, and it provides the ability to re-size the swap and dump areas without having to re-partition the disk. Both swap and dump will be zvols. On systems with zfs root, there will be separate swap and dump zvols. The reasons for this are:

1. The implementation of swap and dump on zvols (not discussed here because it isn't a visible interface) does not allow the same device to be used for both.

2. The total space used for swap and dump is a relatively small fraction of a typical disk size, so the cost of having separate swap and dump devices is small.

3. It will be common for systems to need different amounts of swap and dump space. When they are separate zvols, it's easier to define the proper size of each.
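On an installed system, such zvols can be created and put into service with the standard tools; a sketch, with illustrative pool names and sizes (the actual sizes are chosen by the install software as described below):

```
# Create separate zvols for swap and dump in the root pool:
zfs create -V 2G rpool/swap
zfs create -V 2G rpool/dump

# Put them into service with the usual administrative commands:
swap -a /dev/zvol/dsk/rpool/swap
dumpadm -d /dev/zvol/dsk/rpool/dump
```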
Because swap is implemented as a zvol, features such as encryption and checksumming will automatically work for swap space (it is particularly important that encryption be supported). The swap and dump zvols for a pool will be the datasets:

    <rootpool>/swap
    <rootpool>/dump

Their sizes and configuration values will be set to default values by the installation software. The default values will be:

    dump: dump content is "kernel pages". Size is 1/4 the size of
          physical memory.
    swap: 512 MB, or 1% of the root pool size, whichever is larger.

Since it's easy for an administrator to change these sizes (by changing the size of the zvols), we no longer provide prompts in the install applications to set these sizes. In the old days, when swap and dump occupied fixed-size slices, it was important to set the size correctly when the disk was formatted by install. Now, it can be changed later as necessary. All BEs in the pool will share the swap and dump zvols.

The install program will put this entry in /etc/vfstab:

    /dev/zvol/dsk/<rootpool>/swap  -  -  swap  -  no  -

This is the only way that zfs boot still relies on the /etc/vfstab file. Install will use the dumpadm(1M) command to set up the dump device attributes on the installed system.

6.6 Checkpoint-Restart (CPR) Boot - SPARC only

The ZFS boot architecture supports the checkpoint-restart facility on SPARC platforms. This is achieved by setting the "statefile" value in the /etc/power.conf file to the device name of the dump zvol.

6.7 Managing Boot Environments

On both x86 and sparc systems, the list of BEs in a root pool will be contained in a "menu.lst" file. Up to now, menu.lst has been purely for GRUB use, but we will extend it to sparc as well. The menu.lst files will be managed as follows:

* There will be one menu.lst file per root pool and it will reside in the "pool dataset" (i.e., the dataset at the root of the dataset hierarchy) at /boot/grub/menu.lst.
* When LiveUpgrade has operated on a boot environment, the BE contains a file called /etc/lu/GRUB_slice, which specifies the slice that contains the active menu.lst. The /etc/lu/GRUB_slice file in a zfs-based BE will look like this:

      PHYS_SLICE=<device name of the boot slice>
      LOG_SLICE=<pool dataset>
      LOG_FSTYP=zfs

  The bootadm(1M) program will be modified to recognize a GRUB_slice file in this format and use it to find the active menu.lst file when an administrator performs a "bootadm list-menu" command.

* The "pool dataset" will be mounted in every BE at the location /pooldata. This means that the menu.lst file will appear in the BE's name space at /pooldata/boot/grub/menu.lst. It will NOT appear at /boot/grub/menu.lst. The "bootadm list-menu" command will show the correct location of the active GRUB menu.

* BEs residing in zfs pools (i.e., those whose root file system is a zfs dataset) will NOT have a /boot/grub/menu.lst file in their name space. The menu.lst file can be accessed at /pooldata/boot/grub/menu.lst.

* All of the disks in a mirrored root pool will share the same menu.lst file (obviously). However, if it is necessary to boot off a disk other than the primary boot disk in a mirror, only the "hd0" device in the menu.lst will be accessible. This is an artifact of how the BIOS works. Since the menu.lst file on sparc systems will not identify physical boot devices (only BE names), this will not be an issue on sparc. The complexities of booting from a backup mirror side on x86 are considerable (though not new), and will need to be documented in the context of zfs root pools.

6.7.1 Format of the sparc menu.lst file

The menu.lst file on sparc systems will contain two of the GRUB commands:

    title  - Provides a title for a boot environment.
    bootfs - The full name of a bootable dataset.

The "boot -L" OBP command, when executed on a device with a zfs boot loader, will print a menu with the "title" values for all the menu.lst entries. The user can select a menu item to boot.
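A minimal sketch of what the "boot -L" menu logic amounts to: read the pool's menu.lst and print a numbered list of BE titles. The menu.lst contents below are illustrative; on a real system the file lives in the pool dataset at /boot/grub/menu.lst.

```shell
#!/bin/sh
# Build a sample per-pool menu.lst (titles and datasets are examples only):
menulst=$(mktemp)
cat > "$menulst" <<'EOF'
title Solaris Nevada snv_78
bootfs rpool/ROOT/snv_78
title Solaris Nevada snv_79 (clone)
bootfs rpool/ROOT/snv_79
EOF

# One numbered menu line per "title" entry, as the booter's menu would show:
menu=$(awk '/^title /{ n++; sub(/^title /, ""); print n". " $0 }' "$menulst")
echo "$menu"

rm -f "$menulst"
```

Selecting an entry would then hand the corresponding "bootfs" dataset name to the booter.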
The dataset named by the "bootfs" value for the menu item will be used for all subsequent files to be read by the booter (such as the boot archive and the various configuration files in /etc) and that dataset will be mounted as root.

6.8 Failsafe Booting

Failsafe booting, as introduced in PSARC/2004/454 (Solaris Boot Architecture), will be supported on systems booted from zfs. As in the case of booting from UFS-rooted BEs, each BE will have a failsafe archive. The failsafe archive will be in the same location in the root file system as it is in a UFS-rooted BE. Each failsafe archive will have an entry in the pool-wide GRUB menu. The "default" failsafe archive will be the one in the default bootable dataset, as indicated by the value of the pool's "bootfs" property.

7. Installation

7.1 Overview of Solaris Installation

There are several ways to install and upgrade Solaris:

(a) Mini-root based installs

Some of the install procedures are run while the system is booted from a miniroot. A miniroot is an operating system image on a DVD or CD, or one downloaded to the system during a netinstall, which runs out of a ramdisk. In all of these scenarios, the system being installed is not booted off local writable storage. This is the only kind of install that can be used on an uninstalled system (i.e., a system with no bootable local disks).

(b) Live installs/upgrades

If a system has been configured with extra disk space, it is possible to do an install or upgrade of a system that is booted off a local disk. While booted off a previously-installed bootable environment on the system, a new bootable environment can be installed (on spare storage). This new bootable environment can be an upgrade of the existing bootable environment.

7.2 Changes to Installation to Support ZFS Boot

The following is an analysis of how the installation software will need to be changed to support ZFS boot.
Before looking at how the many parts of install will need to change, we need an overview of how the installation of a system with a ZFS root will work.

7.2.1 Some Guiding Design Principles

1. For now, ZFS boot disks will still have SMI labels. By default, when a zpool is created on an entire disk, it is given an EFI label. But the boot proms and the install software don't yet support EFI labels. ZFS boot must live within that limitation, for now. So when an install application specifies that a pool be created on a full disk, the disk will be formatted with an SMI label with a VTOC that dedicates all of the disk (except for the small amount of space required for the GRUB slice on x86) to slice 0, which then becomes the vdev for the root pool.

2. ZFS pools do not require entire disks. Even though best practices recommend using the entire disk for a pool, we recognize that some systems have very large disks which the administrator might not want to dedicate entirely to a root pool. So we support the splitting of disks into a slice for a root pool (or one mirror of a root pool) and the remainder of the disk for slices for other purposes, including pools that are not root pools.

7.2.2 Initial Install

1. The software to be installed will be selected (as it is now with UFS root file systems).

2. Determine whether the system being installed will end up with a UFS root or a ZFS root. It will not be possible to install part of the system software on ZFS and part on UFS. It will, of course, be possible to have ZFS file systems on a system with a UFS root and vice versa, but the software installed by Solaris install or upgrade must either be all on ZFS or all on UFS.

3. If UFS, install will work as it does now. If ZFS, the user will have the opportunity to designate multiple disks or disk slices for the root pool. In the first phase of ZFS boot support, these disks will be used to create a mirrored vdev. (RAID-Z configurations will likely be supported in the future.)
The user will have the opportunity to set the size of the root pool (the default should be the entire disk, but we can't require a whole disk).

4. The disks will be partitioned as specified in step 3.

5. The root pool will be created and the datasets composing the BE (/ and possibly /var) will be created. The swap and dump space zvols will be created within the root pool.

6. The standard "software install" part of the install backend will install all of the Solaris packages into the root file system datasets.

7. The boot "overhead" will be installed as necessary. The existence of a new bootable dataset will be recorded in the pool metadata. The menu.lst file will be created. The new boot environment will be recorded in /etc/lutab (and whatever other overhead files are required to establish a BE in LiveUpgrade). The boot archive will be created and the boot loader will be installed on each disk in the root pool.

One thing to note about the above procedure is that step 7 requires the setup of a LiveUpgrade boot environment. Currently, when a system is installed, a "BE" in the LiveUpgrade sense is not established. This will change. ZFS bootable environments will always get recorded as LiveUpgrade-compliant BEs. The overhead files that establish LiveUpgrade BEs will eventually be processed by the install tools that are part of Caiman. (Caiman might not use the same files as LiveUpgrade, but it will understand them and be able to process BEs established by LiveUpgrade.)

The above steps describe interactive installs. Jumpstart installs follow a similar series of steps, but the "questions" are answered from the profile, not by querying a user.

7.2.3 Upgrade

A system "upgrade" is the process of converting an installed instance of Solaris to a later version, while preserving all local customizations. The standard Solaris miniroot-based installation program currently supports an option to upgrade the installed Solaris instance.
This upgrade is done "in-place" (i.e., the new bits are written into the same file system where the older version was, thereby replacing the old version). This kind of "in-place" upgrade, done from the miniroot-based install program, will not be supported for zfs root file systems. There are several reasons for this:

1. "In-place" upgrade dates from the days when disks were much smaller and it was common not to have enough space for two Solaris instances. That's not true now that disks are seldom smaller than 80 GB or so, and Solaris instances are around 6 GB.

2. A "copy and upgrade" model has several advantages over an "in-place" upgrade. It can be done while the system is "live", and can be easily backed out. A "copy and upgrade" model does not have the so-called "toxic patch" problem.

3. ZFS is ideally suited for the "copy and upgrade" model of system upgrade. With ZFS, a Solaris instance can be easily cloned and modified. Because of pooled storage, the new Solaris instance doesn't require its own slice.

If an already-installed system is booted from an installation DVD/CD or a netinstall image, the install "discovery" software will detect the existence of any ZFS root pools on local storage. UFS roots are also detected. The logic for interactive installs will be: All local storage will be categorized as follows:

    Category 1: contains an upgradable ufs root file system
    Category 2: is part of a ZFS root pool
    Category 3: all other storage

If there are any upgradable ufs root file systems, the user will get an opportunity to upgrade them. If there are any root pools, the user will see a message indicating that the upgrade of any BEs in those pools must be accomplished by LiveUpgrade, not by the (currently-running) miniroot-based install. In all cases, if the user opts for an action that would destroy an existing pool, the user must be warned of that and given an opportunity to abort the install.
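The "copy and upgrade" model described above rests on ordinary ZFS operations. A hypothetical sketch (the dataset names are illustrative, not the install software's actual naming):

```
# Snapshot the running BE and clone it; the clone initially shares all of
# its blocks with the original, so it consumes almost no additional space:
zfs snapshot rpool/ROOT/be1@pre_upgrade
zfs clone rpool/ROOT/be1@pre_upgrade rpool/ROOT/be2
zfs set canmount=noauto rpool/ROOT/be2

# The upgrade tool then writes the new bits into the clone while be1
# stays live; backing out is just a matter of booting be1 again.
```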
7.2.4 Servers of Diskless Clients

Since exported services are just local directories, it will be possible to export services from zfs datasets. No special support for zfs is needed.

7.3 Details of Changes to the Install Tools

7.3.1 Jumpstart profile Interpretation (pfinstall)

New keywords are defined to support the creation of ZFS pools and datasets. A detailed description of these keywords is provided in Appendix A. Not all of these keywords will have corresponding screens in the interactive install programs. There is precedent for allowing more configuration capabilities in profiles than are supported in the interactive install programs (currently, the only way to set up a mirrored root with SVM is by a profile-driven install).

7.3.2 Interactive Install Programs

The interactive miniroot-based install programs are ttinstall (which has a character-based interface) and the install GUI. At this time, there are no plans to support the setup of zfs roots from the install GUI. Only ttinstall will support zfs root. (Naturally, the Caiman install will support ZFS fully.)

The ttinstall program has a "parade", which is a series of screens that query the user for the details of how the system is to be installed. The "parade" will need new screens to determine the following:

1. whether the system to be installed will have a zfs root
2. the disks to be added to the pool
3. the name of the root pool and the root dataset
4. how much of the disk should be dedicated to the root pool (default is "all")

In the initial (pre-Caiman) version of zfs root install, it will not be possible to set up a system with zfs root using Flash Install.

7.3.3 LiveUpgrade

The required changes to LiveUpgrade for zfs fall into two areas:

1. The changes required to enable boot environments (BEs) to be in zfs datasets.
2. Changes required to support zfs datasets as subordinate file systems in BEs with a ufs root.
This includes zfs file systems mounted under a ufs root, and support for non-global zone roots in zfs datasets. Technically, the items in (2) have nothing to do with zfs boot and should have been done as part of the original zfs integration. For whatever reason, most of them were not done. They need to be done now, since the use of zfs as a root file system necessitates full support for zfs in LiveUpgrade.

LiveUpgrade will be modified to make it possible to create a boot environment (BE) whose root is a zfs file system. These zfs-based BEs can be populated by cloning an existing BE with either a UFS or ZFS root. Cloning a UFS root will be the most common way to migrate from UFS root to ZFS root. If the source BE of a lucreate has a zfs root file system, the target BE will be created as a zfs clone of the source BE. This means that the lucreate will be very fast and will initially occupy very little space.

Currently, LiveUpgrade never partitions disks or allocates disk space for BEs. It depends on the slices to be used for BEs having already been created. This will remain true for zfs LiveUpgrade support. With zfs, there are two steps in allocating space: formatting disks and creating storage pools. Lucreate will not format disks or create pools. Both actions must have been performed before lucreate can create a BE in a zfs dataset. Note that LiveUpgrade will work for zfs and can be used to migrate systems from a ufs root to a zfs root, but the creation of the root storage pool must have been previously done by the administrator (since allocating storage has never been part of LiveUpgrade's job).

LiveUpgrade will work differently than it does now in the following ways:

* Mirroring of zfs file systems will not be directly supported by LiveUpgrade. Mirroring in zfs is fundamentally different from mirroring using SVM plus UFS. With zfs, mirroring is done at the pool level, not the file system level.
Therefore, it's not meaningful to specify that a zfs mount be mirrored. If the user wants mirrored storage for their BEs, they must create their storage pool using a mirrored vdev configuration. So the "attach" and "detach" options will not apply to zfs mounts.

* One of the purposes of having SVM-mirrored BEs (or more exactly, mirrored file systems within BEs) is to allow the fast "cloning" of a BE by detaching a side of the mirror and using that detached device as a new BE (and the basis of a patch or an upgrade). ZFS-based BEs will support a fast cloning capability even though mirroring of individual ZFS file systems is not supported. With ZFS, fast cloning of a BE will be performed by zfs's own dataset cloning capability. This is much easier to manage and plan for than SVM mirroring because it isn't necessary to pre-allocate a fixed amount of space for the clone of a ZFS file system.

There are two variants of the lucreate command that can be used with a zfs root file system:

1. Migrating a BE to a new pool. This can either be a migration from one ZFS pool to another, or from a UFS root to a zfs root. The form of the command is:

   lucreate -n <BE_name> -p <pool_name>

   This command requests that a new BE named <BE_name> be created in the pool named <pool_name>.

2. Cloning an existing ZFS-based BE to a new BE in the same pool:

   lucreate -n <BE_name>

   This command will be very easy to execute. All that is required is that the new BE be named. All datasets in the source BE (called the "PBE" in LiveUpgrade terminology) will be cloned and will appear as separate datasets in the new BE also.

All other LiveUpgrade commands, such as luactivate and luupgrade, can be used on the new BE. LiveUpgrade will take care of all the details of cloning all the datasets in the BE. The user shouldn't have to be aware that the BE is composed of multiple file systems. This is much easier than the use of LiveUpgrade with UFS, where each mounted file system in a BE must be represented by a "-m" option on the lucreate command line.

8.
References

PSARC/2004/454 Solaris Boot Architecture
PSARC/2005/198 Install Interface Changes Under New Boot
PSARC/2006/525 Newboot Sparc
PSARC/2007/083 ZFS Bootable Datasets

9. Interface Table

9.1 Exported Interfaces

   zfs properties: new "noauto" value for "canmount"   Evolving
   menu.lst for sparc                                  Evolving
   zfsboot loader for SPARC                            Evolving
   jumpstart keywords: pool                            Evolving
   /pooldata                                           Evolving

9.2 Imported Interfaces

   bootadm            Stable
   /etc/power.conf    Stable
   dumpadm            Stable

9.3 Interfaces Reimplemented

   /etc/lu/GRUB_slice    LiveUpgrade
   ttinstall
   jumpstart keywords: bootenv, rootdev

Appendix A - Jumpstart Profile Keywords for ZFS

ZFS interpretation of existing profile commands:

Basically, a profile can be used to set up a zfs root or a ufs root. If the profile is being used to set up a ufs root, all the existing profile keywords work as they do now. There is only one exception to that: the "filesys" keyword can now preserve a pool as well as an individual slice.

A profile which creates a zfs root is indicated by the presence of two keywords: a new "pool" keyword, and a "bootenv" keyword with a new "installbe" subcommand. If the profile has both of those keywords, it's a "zfs-creating" profile, and some keywords that are allowed in a ufs-creating profile will not be allowed (such as those specifying the creation of ufs mount points for parts of the Solaris namespace).

Here is a list of keywords that are permitted in zfs-creating profiles and their interpretation:

* filesys

   filesys <slice> existing ignore preserve

   This tells pfinstall to leave the specified slice untouched and work around it. The command will be extended to allow a pool name to appear in the <slice> field. This causes all vdevs in the pool to be preserved.

   filesys <slice> <size> unnamed

   where <slice> can be "any". This profile entry causes a raw slice of the specified size to be created (it can be newfs'ed or otherwise initialized by a finish script, if necessary).

   All other uses of the "filesys" keyword are prohibited.
* rootdev

   If present, specifies the device to be used for the root pool.

New keywords:

* pool

   pool <pool_name> <pool_size> <swap_size> <dump_size> [<vdev_list>]

   This command can be used either to specify a new pool, or to identify an existing pool to be used for installation.

   fields:

   pool_name - specifies the name of the pool. If the pool already exists, the install will be performed into newly created datasets in that pool. In this case, all remaining arguments are ignored. If the pool doesn't already exist, it will be created with the specified size and on the specified vdevs. If the pool already exists, but isn't a valid root pool, pfinstall will print a message and quit. If the pool doesn't exist and <pool_size> and <vdev_list> aren't supplied, we error out. So this lets us both install into an existing pool, or define a new one. If "-" is specified as the pool name, we look for the current root pool. If the system isn't already set up for zfs boot (i.e., no current root pool), error out.

   pool_size - size of the pool to be created. Can be "auto" or "existing" (meaning the boundaries of the existing slices by that name are preserved and overwritten by a zpool). Size is assumed to be in megabytes, unless terminated by "g" (gigabytes).

   swap_size - size in MB or GB of the swap zvol to be created. Can be "auto" (the default swap size will be used). (MB assumed; can specify GB by terminating the size with "g".)

   dump_size - size in MB or GB of the dump zvol to be created. Can be "auto" (the default dump size will be used). (MB assumed; can specify GB by terminating the size with "g".)

   vdev_list - list of devices to be used to create the pool. The format of the vdev list is the same as used for the "zpool create" command. At this time, only mirrored configurations are supported. In the future, we will probably support the raid-z configuration. Devices in the vdev list can either be whole devices or slices. They can also be the string "any", which means that the install software will select a suitable device.
* bootenv installbe

   Actually, the bootenv keyword already exists, but we will define a new subcommand. It will have the syntax:

   bootenv installbe <BE_name>

   which means: create a new BE called <BE_name> and install it.

----------------------

Examples:

Case 1 - new pool, mirrored

   install_type initial_install
   pool newrootpool auto auto auto mirror c0t0d0 c0t1d0
   bootenv installbe s10_u6

Case 2 - new pool, mirrored, devices to be assigned, alternate metacluster selected. Pool size specified.

   install_type initial_install
   cluster SUNWCuser
   pool newrootpool 80g 2g auto mirror any any
   bootenv installbe s10_u6

This profile creates a new pool of 80 GB with a 2 GB swap zvol and a dump zvol of the default size, on a mirror of any two available devices that are large enough to supply an 80 GB pool (if two such devices aren't available, the install will fail). A new BE called "s10_u6" will be created.

Case 3 - LiveInstall

   install_type initial_install
   pool currootpool
   bootenv installbe s10_u6

This creates a new BE called s10_u6 in the existing pool "currootpool".

Case 4 - Upgrade

   install_type upgrade

That's all you need. Since this is called by LU, the BE to be upgraded will already have been mounted. The root of it is passed to pfinstall by the -L option.
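The size syntax shared by the pool_size, swap_size, and dump_size fields above (a bare number is megabytes; a trailing "g" means gigabytes) can be sketched as a small shell helper. This is an illustrative sketch only; size_to_mb is a hypothetical function name, not part of pfinstall:

```shell
# Convert a profile size field to megabytes: a bare number is
# already MB; a trailing "g" means gigabytes (1 GB = 1024 MB).
# ("auto" and "existing" would be handled before this point.)
size_to_mb() {
  case "$1" in
    *g) echo $(( ${1%g} * 1024 )) ;;
    *)  echo "$1" ;;
  esac
}

size_to_mb 80g   # the 80 GB pool from Case 2
size_to_mb 512   # a bare number is taken as megabytes
```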