Live Upgrade Support for ZFS

1. Overview

The Solaris Live Upgrade utilities currently do not support ZFS. This document describes the limitations, proposed changes, and the plan of action to achieve support. This project extends all of the necessary Live Upgrade mechanisms to support not only upgrade to ZFS file systems and zones within those file systems, but also ZFS boot.

ZFS Live Upgrade support will be integrated into Solaris Nevada, since that is a requirement for integration into a Solaris 10 update release. It is not anticipated that these changes will be part of the Nevada FCS, since Caiman will replace Live Upgrade there. These changes should only be seen by customers in Solaris 10 update releases.

2. Definitions

PBE - Primary Boot Environment
    The boot environment that lucreate uses when building the alternate boot
    environment. By default this is the current boot environment, but it can
    be overridden with the '-s' option to lucreate.

ABE - Alternate Boot Environment
    A boot environment that has been created by lucreate and possibly updated
    by luupgrade but is not currently the active or primary boot environment.
    A BE is made active by running luactivate.

default file systems
    The root file systems that will be created by default when using Live
    Upgrade to migrate from UFS to a ZFS root. The current set of default
    file systems is:
        /
        /var (created if the BE being migrated has a separate /var file system)

shared file systems
    The set of file systems that are shared between the ABE and PBE. This
    includes file systems like swap and /export. Shared file systems may
    contain zone roots.

3. Use Cases

The following are some example use cases to test potential solutions against to ensure that they are complete and cover the problem space. Existing users of Live Upgrade will not see any difference in either the specification or the operation for currently supported configurations.
Only when upgrading to ZFS file systems will the user see a change.

3.1 UFS Upgrade

When doing a Live Upgrade that does not involve upgrading either from or to ZFS, the operation of Live Upgrade will be unchanged. Both the command line parameters and the operation will remain the same. The user runs lucreate, which copies the existing file systems to the file systems specified by the user. This is a time-consuming operation that requires approximately double the disk space of the PBE. The minimal set of operations required to do a Live Upgrade is as follows:

    - Run lucreate to copy the PBE, creating an ABE.
    - Run luupgrade to upgrade the ABE.
    - Run luactivate on the newly upgraded ABE so that when the system is
      rebooted it will be the new PBE.

For those users that have zone roots in ZFS file systems but are not migrating to a ZFS root, upgrade of the ZFS zone roots will be supported. The behavior will be to snapshot and clone the ZFS zone roots if they are datasets. If not, a new zone root dataset will be created and the original zone root's contents copied in.

Example

The initial state is a UFS based Solaris release with a non-global zone in a ZFS pool, zpool/zones/z1. The following command is run to create a UFS based ABE:

    lucreate -n newBE -m /:/dev/dsk/c0t1d0s0:ufs

In addition to the creation of the new root file system, a copy of the zone will be created at zpool/zones/z1-newBE. If the original zone root is a ZFS file system it will be cloned. If not, then z1-newBE will be created as a ZFS file system and the contents of z1 will be copied.

3.2 ZFS Upgrade

A prerequisite to using Live Upgrade to upgrade to ZFS file systems and ZFS boot is to have upgraded or done an initial install of the system to a Solaris release that supports these features. This is because Live Upgrade runs in the PBE environment, and a prerequisite to upgrading into a ZFS pool is that the pool already be created.
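The zone-root handling described in 3.1 (snapshot/clone a zone root that is already a ZFS dataset, otherwise create a new dataset and copy the contents in) amounts to a simple decision per zone. A minimal sketch of that decision in Python; all names here are illustrative, since the real logic lives inside lucreate itself:

```python
# Sketch of the section 3.1 zone-root behavior (hypothetical names).

def plan_zone_root(zone_root: str, be_name: str, is_dataset: bool):
    """Return (target name, method) for reproducing a zone root in the ABE.

    The target follows the naming shown in the example above,
    e.g. zpool/zones/z1 -> zpool/zones/z1-newBE.
    """
    target = f"{zone_root}-{be_name}"
    if is_dataset:
        # An existing ZFS zone root is snapshotted and cloned.
        return target, "clone"
    # Otherwise a new dataset is created and the contents copied in.
    return target, "create+copy"

print(plan_zone_root("zpool/zones/z1", "newBE", True))
# ('zpool/zones/z1-newBE', 'clone')
```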
The performance of Live Upgrade on a system with ZFS file systems and a ZFS root will be much improved: it will be both easier to run and much faster to execute. When lucreate is run, a clone of the PBE file systems is created. This is very fast and initially requires little additional disk space. The amount of space that will ultimately be required depends on how many of the files are replaced as part of the upgrade operation. Applying a patch will most likely use much less space than a full upgrade in which the majority of the files are replaced.

3.3 Migration Scenarios

Live Upgrade provides a path that users can use to migrate to ZFS without completely reinstalling their operating system software. The lucreate and luactivate utilities of Live Upgrade can be used to migrate system software to ZFS boot. It is not a requirement that luupgrade be run if the desire is simply to migrate.

Only migration from UFS to ZFS will be supported. There are a couple of reasons for enforcing this: we want to be moving our customers to ZFS, not away from it, and the amount of time required to do the engineering, change the manuals, and expand the tests would be quite large for the amount of benefit derived.

3.3.1 Live Upgrade Migration Matrix

The following matrix illustrates some key points in supporting ZFS with Live Upgrade. First, it should be clear that Live Upgrade will be an excellent tool for migrating current Solaris customers to ZFS. Additionally, the full advantage of combining Live Upgrade and ZFS is not derived unless the Live Upgrade is done within the same pool as the current PBE. Migrating to ZFS or migrating across ZFS pools requires the creation of each of the file systems followed by physically copying the information from the PBE into each of the ABE file systems.
                               ABE
         ---------------------------------------------------
         |    *      |    ZFS     | ZFS (different pool)   |
   ---------------------------------------------------------
 P |  *  |   copy    |    copy    |         copy           |
 B ---------------------------------------------------------
 E | ZFS |    Not    |  snapshot/ |         copy           |
   |     | Supported |   clone    |                        |
   ---------------------------------------------------------

    * - any of the existing file system (UFS & VxFS) and volume manager
        (SVM) support.

To support the situation where users will be required to update the ZFS on-disk format, migration of a BE from one pool to another is supported. Since the BE is moving to another pool it will not be possible to use cloning, and Live Upgrade will fall back to the slower copy algorithm.

3.3.2 Zones Migration

The only difference for zones users is that some have already migrated to ZFS and so now have a strong need/desire for Live Upgrade support. There will also be some existing zones customers that will want to use Live Upgrade to migrate their zones file system(s) from UFS to ZFS pools. The following zone upgrade scenarios will be supported:

    UFS root/UFS zone root -> UFS root/UFS zone root
                           -> ZFS root/ZFS zone root
                           -> ZFS root/UFS zone root

    UFS root/ZFS zone root -> ZFS root/ZFS zone root
                           -> UFS root/ZFS zone root

    ZFS root/ZFS zone root -> ZFS root/ZFS zone root

4. Proposed Solution

The command line interface to Live Upgrade will be enhanced to support upgrades to ZFS file systems. This will include only ZFS root and zone file systems. Non-root and shared file system migration will not be supported by Live Upgrade. This means that zone roots on UFS shared file systems will not be migrated to ZFS via Live Upgrade.

The FMLI (Form and Menu Language Interpreter) part of Live Upgrade, lu, will not be enhanced to provide the additional support necessary for ZFS. Since Solaris 8, only one Live Upgrade enhancement has been implemented in lu.
This was the ability to install a flash archive in an alternate boot environment. Zulu provided no support for zones upgrade. Since the plan of record is that lu will be replaced with a more up to date graphical user interface, and lu is now so far out of sync with the command line interface, the proposal is to do no additional work on it as part of this project. The one change that will be made is a check for a ZFS PBE; if one is found, an error message will be printed stating that ZFS support is unavailable through this interface. Any attempt to migrate using lu will fail because ZFS pools will not be shown in the list of upgrade destinations.

It will be necessary to change the assumption built into the code that file systems are built upon explicitly defined logical devices. ZFS file systems are instead built on pools that can be easily extended with the addition of raw disks. The 1:1 association between a logical disk device and a file system no longer holds.

4.1 User Interface Changes

4.1.1 lucreate

lucreate will be extended to accept an additional parameter. The '-p' parameter is used to specify the root pool that the newly created BE will reside in.

    -p  When migrating to ZFS from UFS, or moving a BE from one ZFS pool to
        another, this parameter is required. The specified pool must exist
        and must meet the criteria to be a root pool. These criteria are
        specified in section 6.3.4.

lucreate enables the user to exclude or include certain files from the PBE. This inclusion or exclusion is specified with the '-f', '-x', '-y', '-Y', and '-z' options. These command line options will be disabled when creating an ABE that is a clone of the PBE. If the user attempts to specify them an error message will be emitted. It is still possible to use these options in the following cases:

    UFS -> UFS
    UFS -> ZFS
    ZFS -> ZFS (different pool)

5.1 Customer View

5.1.1 Root upgrade

The goal is to make the Live Upgrade experience as straightforward as possible.
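The option screening described in 4.1.1 reduces to a set lookup: the content-filtering options are rejected only when the ABE will be a snapshot/clone of the PBE. A hedged sketch, where only the option letters come from the text above and everything else is illustrative:

```python
# Options from section 4.1.1 that filter PBE contents and therefore make
# no sense when the ABE is produced by snapshot/clone.  Sketch only; the
# real checks are inside lucreate.
CONTENT_OPTIONS = {"-f", "-x", "-y", "-Y", "-z"}

def invalid_options(options, abe_is_clone: bool):
    """Return the content-filtering options that must be rejected."""
    if not abe_is_clone:
        # Copy-based creations (UFS -> UFS, UFS -> ZFS, and
        # ZFS -> ZFS across pools) may still filter files.
        return set()
    return CONTENT_OPTIONS & set(options)

print(sorted(invalid_options(["-n", "-x", "-y"], abe_is_clone=True)))
# ['-x', '-y']
```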
ZFS provides the ability to easily split the namespace into separate file systems. The default root file systems defined above will be the set of file systems created when migrating to a ZFS pool. An upgrade from a ZFS based PBE will clone whatever file systems are in the PBE. If the user installs a ZFS root system and specifies a single monolithic file system, then the ABE that gets created will be monolithic as well.

To migrate a UFS BE to a ZFS root pool it will be necessary to provide the name of the pool. The default file systems will be created in the named pool and the non-shared file systems will then be copied into the root pool. An example of the command to do this is:

    lucreate [-c ufsBE] -n zfsBE -p rpool

In the situation where the user is upgrading from a ZFS pool it will only be necessary to specify the name of the ABE. The specification of '-m' parameters will no longer be necessary, or even supported, for ZFS file systems. An example of the lucreate to create a new ABE from a ZFS PBE would be:

    lucreate [-c curBE] -n upgradeBE

This command would define the current BE as 'curBE' and create an ABE, 'upgradeBE', by cloning the file systems defined in 'curBE'. It is only necessary to specify the '-c' parameter if the PBE is not already defined. Currently this is done by running lucreate, but there is a proposal to populate the LU database with an initial PBE when a ZFS based Solaris system is installed. If this is implemented then the above command could be simplified to:

    lucreate -n upgradeBE

To simplify the user model and the resulting implementation in Live Upgrade, the user is restricted to upgrading the Solaris distribution into a single pool. The root and subordinate file systems can then be uniquely identified by specifying the pool name and the BE name. Live Upgrade can be used to migrate a BE to a ZFS pool, do upgrades within that pool, and migrate BEs between root pools, but it does not support migration to another file system type.
For instance, migrating back to UFS is not supported.

5.1.2 Swap & Dump

In the current implementation of Live Upgrade, all swap partitions on a PBE are shared with the ABE. The '-m' option can be used to specify an additional or new set of swap partitions on a PBE for sharing with an ABE. This will remain true until migrating to a ZFS root pool. The migration will result in the creation of both swap and dump zvols. The sizes will be the same as the space used in the PBE. If desired, the user can change the size of either the swap or dump zvols with the zfs command. When a PBE is cloned to create a new ABE the existing zvols will be used. In the case where a BE is being migrated from one root pool into another, swap and dump zvols of the same size as the PBE's will be created if they do not already exist. If the zvols do exist then they will be used.

5.1.3 Optional file systems

Optional file systems are those that do not contain any installed software. Optional file systems (e.g. /export, /mount, ...) will not automatically be migrated unless they are part of the root file system rather than separately mounted. For example, if / is on c0t0d0s0 and /export is on c0t1d0s0, then the following command would migrate only the root file system, leaving the optional file system as is:

    lucreate -n newBE -p rpool

The /export file system will be treated as a shared file system. Live Upgrade will not provide a mechanism for migrating this file system to ZFS. If desired, this can be done manually.

5.1.4 Zones upgrade

The ZULU project changed the behavior of Live Upgrade so that zones on shared file systems were copied in place to a unique name in that file system to prevent BEs from sharing the same non-global zones. The zonepath in the ABE was the original zonepath with the BE name appended (for example, z1-newBE). Zone roots will be migrated to ZFS when the root file systems are migrated, if they are part of the root file system.
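The swap and dump rules in 5.1.2 boil down to: reuse zvols that already exist in the target pool, otherwise create them sized like the space used in the PBE. A sketch of that decision in Python; the function and parameter names are hypothetical, not part of Live Upgrade:

```python
# Sketch of the section 5.1.2 swap/dump zvol rules (hypothetical names).

def plan_zvol(name: str, pbe_size: int, target_pool_zvols: dict):
    """Decide how the ABE gets a swap or dump zvol.

    target_pool_zvols maps existing zvol names to their sizes (the
    same-pool clone case, or a cross-pool target that already has them).
    """
    if name in target_pool_zvols:
        # Existing zvols are simply reused.
        return "reuse", target_pool_zvols[name]
    # Otherwise create one sized like the space used in the PBE.
    return "create", pbe_size

print(plan_zvol("swap", 2048, {}))              # ('create', 2048)
print(plan_zvol("swap", 2048, {"swap": 4096}))  # ('reuse', 4096)
```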
Zones will not be migrated to ZFS until the user specifically requests that the root file system be migrated using a command of the form:

    lucreate -n newBE -p dpool

Until this is done the algorithm introduced by ZULU will be unchanged. When zone roots that are in the root file system are migrated to ZFS they are created as datasets and then copied. Doing this means that all subsequent lucreates will simply require a snapshot and clone operation for the zone roots. It will not be possible either to migrate shared UFS file systems to ZFS or to copy shared ZFS file systems as part of the Live Upgrade process. Similarly, within zones it will not be possible to migrate UFS file systems that have been added via 'add fs' to ZFS.

Example

The initial state is a UFS based Solaris release with non-global zones in both /zoneroots (z1, in the root file system) and /export/zones (z2, in a shared file system). The UFS BE does not have a separate /var file system. The following command is run to migrate the system to a ZFS root pool:

    lucreate -n newBE -p dpool

This will create the defined set of file systems within root. Since /zoneroots is part of the root file system it will be created as well. To enable the snapshot/clone capability of zone cloning, the zones in /zoneroots will be created as datasets. The zone roots in /export/zones will not be migrated to ZFS. When lucreate is run in this environment the following datasets will be created:

    dpool/ROOT
    dpool/ROOT/newBE
    dpool/ROOT/newBE/zoneroots
    dpool/ROOT/newBE/zoneroots/z1

Since zones users have already started migrating to ZFS it will be necessary for Live Upgrade to support the case where zone roots exist in a different pool than root. Live Upgrade will determine whether a zone root is a ZFS file system or simply a directory. The reason for this is that Live Upgrade has no control over how the zones are created.
Zones are automatically created on ZFS datasets, but it is possible for this to be overridden with the '-x nodataset' option. Since only datasets can be cloned, in that case a dataset will first be created and the contents of the zone copied into the dataset. Once this is done, subsequent upgrades will be able to clone the zone rather than copying it.

6.0 Enhancements

6.1 Documentation

The appropriate manual pages, system admin guide, and best practices documentation will have to be updated to reflect these changes. The changes will be substantial since this represents a fundamental change in how systems are maintained and upgraded.

6.2 Global enhancements to Live Upgrade code

There are several classes of change required in the Live Upgrade code: removing the implicit assumption that file systems are built upon logical devices; support for the use of dataset paths instead of mount points, both internally and in the intermediate files used by Live Upgrade to pass information between phases; and the changes necessary to take advantage of the ability to clone ZFS datasets.

6.2.1 Intermediate Files

There are several file system table formats dealt with by Live Upgrade: vfstab, ICF (Internal Configuration File), fstbl, and lutab. Live Upgrade is a consumer of the first and both producer and consumer of the others. The pre-ZFS file formats are given below with the changes required to support ZFS.

1. vfstab entry format

This is defined by vfstab(4); elements are whitespace-separated.

    ---1----  ---2---  --3--  --4-  --5-  ---6---  ---7---
    device    device   mount  FS    fsck  mount    mount
    to mount  to fsck  point  type  pass  at boot  options

By default, all ZFS file systems are mounted by ZFS at boot by the SMF svc://system/filesystem/local service. If desired, file systems can also be explicitly managed through legacy mount interfaces by setting the mountpoint property to legacy using 'zfs set'.
Doing so prevents ZFS from automatically mounting and managing the file system; legacy tools, including the mount and umount commands and the /etc/vfstab file, must be used instead. Live Upgrade will not support legacy mounts. ZFS file systems within a BE will use the ZFS mount facility.

2. ICF entry format

This is an internal file format used by LU to save boot environment definitions.

    --1---  ---2---  -----3-----  --4---  --5-  --6-
    beName  mountPt  blockDevice  fsType  size  zone

There is an ICF file for each BE and an entry for each file system within the BE. The format will need to be expanded as follows:

    --1---  ---2---  -----3-----  --4-  --5-  --6-
    beName  mountPt  ZFS dataset  zfs   size  zone

3. Internal fstbl entry format

This is an internal format containing all the information needed to represent file systems used in a boot environment.

    ---1----  ---2---  --3--  --4-  --5-  ---6---  ---7---  --8-
    device    device   mount  FS    fsck  mount    mount    zone
    to mount  to fsck  point  type  pass  at boot  options  name

This is the format created by lu_beoGetBeFstbl and consumed in the "basic environment" plug-in. The format will be expanded as follows to support ZFS:

    ---1----  ---2---  --3--  --4-  --5-  ---6---  ---7---  --8-
    ZFS       -        mount  zfs   -     mount    mount    zone
    dataset            point              at boot  options  name

4. Internal lutab entry format

This is an internal format containing all the information needed to define a boot environment. Format of each entry in the lutab file:

    beId:field-1:field-2:fieldType

Three entries will exist for each BE, all with identical BE ids, but each with a unique field type.
Lutab field type entries:

    0 - BE name/BE status:
        beId:beName:beStatus:0
    1 - root slice (mount to get to /):
        beId:/:datasetPath:1
    2 - boot device (used to change boot prom):
        beId:boot-device:bootDevice:2

This file contains entries for each BE that is defined. The beId can be used to access both the ICF (/etc/lu/ICF.) and INODE (/etc/lu/INODE.) files. The only extension to this file required for ZFS support is changing the field type 2 entry to allow specification of a ZFS dataset.

6.2.2 Changes to support ZFS clones

One of the major advantages of supporting ZFS with Live Upgrade is the ability to create a clone of the file systems that are being upgraded. This can be done quickly and only requires a fraction of the disk space. There is a problem here that should at least be acknowledged, and solved if possible. The problem stems from the fact that you cannot destroy a file system from which a clone has been created. With a simple algorithm of snapshot/clone of the PBE to create the ABE, the user could end up with a very deep clone tree as shown below:

    BE-081506
        |
    BE-091606
        |
    BE-111506
        .
        .
        .

This is a situation that could quickly eat up the available disk space on a system that is Live Upgraded fairly often, as would be the case for either an internal or external OpenSolaris developer. To solve this problem, clone promotion will be used to promote the file system clones of the ABE. This will allow any of the inactive BEs to be deleted. This will be done as part of luactivate. Unless the '-s' option to lucreate (which specifies the source BE to create from) is used, there will only ever be a single level, and the implementation of this approach is fairly straightforward. When the '-s' option is used it may be necessary to promote a clone several times in order to get it to the top of the clone hierarchy for the file system. There is a behavior of ZFS that can cause this to break down.
The clone promotion will fail if there is a namespace collision between the snapshots in the two clones. The example below illustrates:

    # zfs list
    NAME                  USED  AVAIL  REFER  MOUNTPOINT
    mypool               1.09M  10.8G   976K  /export
    mypool/tst1            18K  10.8G    18K  /export/tst1
    mypool/tst1@today        0      -    18K  -
    mypool/tst2              0  10.8G    18K  /export/tst2
    mypool/tst2@tomorrow     0      -    18K  -
    mypool/tst3              0  10.8G    18K  /export/tst3
    mypool/tst3@today        0      -    18K  -
    # zfs promote mypool/tst3
    # zfs list
    NAME                  USED  AVAIL  REFER  MOUNTPOINT
    mypool               1.10M  10.8G   976K  /export
    mypool/tst1            18K  10.8G    18K  /export/tst1
    mypool/tst1@today        0      -    18K  -
    mypool/tst2              0  10.8G    18K  /export/tst2
    mypool/tst3              0  10.8G    18K  /export/tst3
    mypool/tst3@tomorrow     0      -    18K  -
    mypool/tst3@today        0      -    18K  -
    # zfs promote mypool/tst3
    cannot promote 'mypool/tst3': snapshot name 'mypool/tst1@today' from
    origin conflicts with 'mypool/tst3@today' from target

As can be seen, as the clone is promoted it takes ownership of the snapshots. If there is a namespace collision the promotion will fail. This project will need either to ensure that there is no possibility of a namespace collision on BE file system snapshots or to have some way of dealing with one. luactivate will detect this situation and print an error message specifying the conflicting snapshot. The user will have to take corrective action before proceeding.

Another complicating factor in the algorithm to promote clones as part of luactivate is the '-s' option of the lucreate command. The option enables users to use a BE other than the current BE as the source for creation of an ABE. There are two variations, both of which are relevant to this project. The first is '-s -'. If the hyphen is specified as the argument, lucreate will create the new BE but does not populate it. It is intended for installation of a flash archive, so cloning the existing IBE doesn't make sense. Instead a dataset is created with all of the required file systems.
Nothing is known at this point about the size of the flash archive, so no space checking is done. The second case allows users to specify by name another BE to use as the source for creation. Without this variation of the '-s' option the BE clone tree is only ever one level deep, so the clone promotion algorithm is straightforward. In this case the clone structure always has the form shown below:

         IBE
        / | \
       /  |  \
      /   |   \
    BE1  BE2  ...

With the '-s' option it is possible to create a clone structure that looks like:

    IBE
     |
    BE1
     |
    BE2
     |
     .
     .
     .

When a BE is activated it is promoted to the top of the BE clone structure. In order to do this it is necessary to know the relationship between the BEs so that the clone can be promoted to the top of the BE clone tree but no higher.

6.3 lucreate Enhancements

6.3.1 Interface changes

lucreate will have to be changed to recognize the new command line parameter, '-p', explained earlier in this document. The default operation will be changed when the user migrates to or upgrades from a ZFS root. When migrating to a ZFS root it is only possible to specify the root pool that the BE is being migrated to. All of the default file systems will be created automatically.

6.3.2 Space calculation

The current implementation checks that the size of the file systems in the ABE is at least as large as in the PBE and that the number of inodes is 1.2 times that of the PBE. These computations take into account split and joined file systems. This guarantees that lucreate will not fail because of size issues. The algorithm will need to be changed as follows:

- ZFS has no inodes, so this calculation is unnecessary when migrating to ZFS.
- ZFS reports the amount of space available to each file system as the amount of space available to the pool. The current algorithm assumes that each file system is on a distinct logical device.
This change requires that when multiple file systems are being migrated to the same pool, that fact be taken into account when computing the amount of space available.
- In the case of upgrading a ZFS file system within the same pool, it is not necessary to verify that there is enough space to copy the PBE to the ABE. A snapshot and clone are done, which is both very fast and requires no additional space. Therefore in that situation it is not necessary to do any of the space calculations done in this script.

6.3.3 Space usage enhancements

lucreate, as it is now implemented, copies the file systems specified with the '-m' parameter to a different device. When file systems are migrated to ZFS they will still have to be copied. Once a file system is in ZFS, it is possible for lucreate to snapshot and clone the PBE file system to create the ABE. This is much faster, but requires that lucreate detect that this is the case in order to take advantage of it. Cloning will only be possible if the ABE is in the same pool as the PBE.

6.3.4 Boot Pool Minimal Requirement Check

When migrating root file systems from UFS to a ZFS pool it is not only necessary for the pool to exist and have enough space available to hold the software. The pool also has other requirements so that the new BE is bootable when it is activated. lucreate will ensure that these requirements are met before attempting to copy the file systems. The requirements are:

    - The pool must exist on a single slice or disk, or
    - The pool can be mirrored, but not RAID-Z or striped; each submirror
      must be on a single slice or disk.
    - The root pool disk must have an SMI label.
    - On x86 systems the root pool disk must contain an fdisk table.

6.4 lumake enhancements

lumake is used to recreate boot environments. In the current implementation this means copying file systems from the PBE to the ABE. All file systems on the ABE are re-created. It will now be necessary to determine, for every file system in the ABE, whether it is a clone.
If so, rather than re-creating and copying the file system, the snapshot and clone used to create the file system are destroyed and recreated.

6.5 luupgrade enhancements

There are no interface changes to luupgrade. The application will have to be expanded to generate a ZFS compatible profile. Although not part of this project, the space calculation code in pfinstall will have to be enhanced to deal with ZFS file systems. Because clones will be used wherever possible, it will have to calculate the incremental size needed for each file system. It will also be necessary to understand and correctly compute file system sizes when upgrading compressed ZFS file systems.

6.6 luactivate enhancements

There are no interface changes to luactivate. The application will now do clone promotion on the ZFS file systems in the ABE, as well as recognize that a ZFS root has been created and perform the appropriate boot environment creation.

6.7 ludelete enhancements

There are no interface changes to ludelete. ludelete will be enhanced not only to update the Live Upgrade metadata but also to destroy the snapshots and clones that comprised the deleted BE.

6.8 lufslist enhancements

There are no interface changes to lufslist. lufslist will be enhanced to output ZFS file system data.

Example:

    # lufslist -n zfsBE
    boot environment name: zfsBE

    Filesystem            fstype  size     Mount Point  Mount Options
    --------------------  ------  -------  -----------  -------------
    rpool/ROOT/zfsBE      zfs     123456   /            -
    rpool/ROOT/zfsBE/var  zfs     123456   /var         -
    .
    .
    .
    /dev/dsk/c0t1d0s0     ufs     1234567  /export      -

6.9 lumount enhancements

There are no interface changes to lumount. lumount enables the user to mount all of the file systems in a BE so that they can be inspected or modified. In the current implementation all that is needed to mount the file systems is the /etc/vfstab in the BE. lumount will now be required to look at the pools, /etc/vfstab, and the Live Upgrade metadata to mount all of the file systems.
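Whatever sources the mount list is assembled from, lumount must still mount parent file systems before their children. A sketch of that ordering step; the helper is illustrative and not part of the lumount implementation, which would gather entries from the pools, /etc/vfstab, and the LU metadata:

```python
# Sketch: order mount points so parents mount before children, e.g.
# / before /export before /export/zones, regardless of input order.

def mount_order(mount_points):
    """Return mount points sorted parent-first.

    Sorting by path depth (then name, for determinism) guarantees a
    parent directory is always mounted before anything beneath it.
    """
    def depth(mp: str) -> int:
        return 0 if mp == "/" else mp.count("/")
    return sorted(mount_points, key=lambda mp: (depth(mp), mp))

print(mount_order(["/var", "/", "/export/zones", "/export"]))
# ['/', '/export', '/var', '/export/zones']
```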