1. Introduction
    1.1. Project/Component Working Name:
         Snapshot manager
    1.2. Name of Document Author/Supplier:
         Author: Erwann Chenede, Tim Foster, Niall Power
    1.3. Date of This Document:
         Sep 9 2008

4. Technical Description

The snapshot manager's primary purpose is to enable automatic ZFS
snapshots to be taken for the user. The goal is to have this "just work"
and require minimal initial user configuration apart from deciding
whether they want it on or off. The snapshot manager provides a sensible
default configuration which the user is not required to configure in any
way.

This project is made up of three separate parts:

- auto-snapshot : an SMF service which allows the admin to perform
  regular, periodic snapshots of user/administrator-specified ZFS
  filesystems. It is loosely coupled with the ZFS codebase, using only
  the ZFS CLI, cron and SMF to perform its functionality.
- time-slider : an SMF service which aggregates auto-snapshot
  functionality to provide:
    - a desktop-user-oriented default snapshot schedule.
    - automatic cleanup to avoid running out of disk space due to the
      periodic snapshots.
- time-slider-setup : a Python GNOME GUI application which lets the user:
    - enable the snapshot scheduler service.
    - customize which ZFS filesystems to back up automatically.
    - customize the automatic cleanup policy.

This architecture was chosen to provide a sensible default configuration
and yet let power users fine-tune the scheduler as needed via the command
line.

4.1 Details:

4.1.1 auto-snapshot

The service uses a separate service instance for each group of
filesystems, each instance denoting a separate schedule of periodic
snapshots. The SMF method script is responsible for adding/removing the
snapshot cron job, which corresponds to enabling and disabling the
service. The method script is also called directly from cron according
to the crontab entries, in which case it is responsible for taking the
snapshot.
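
As a rough illustration of the enable/disable half of the method script's
job, the sketch below edits crontab text in memory. It is written in
Python for readability only. The helper names, the script path and the
convention that the instance FMRI is the last argument of the cron job
are all hypothetical and are not taken from the actual method script.

```python
# Placeholder path: this document does not name the method script.
METHOD_SCRIPT = "/path/to/auto-snapshot-method"


def cron_line(schedule, fmri):
    """Build the crontab line that re-invokes the method script for one instance."""
    return f"{schedule} {METHOD_SCRIPT} {fmri}"


def disable_instance(crontab, fmri):
    """Return the crontab text with any job for this instance removed."""
    # Assumption: the job for an instance is identified by the FMRI being
    # the last whitespace-separated field on the line.
    kept = [line for line in crontab.splitlines()
            if not line.rstrip().endswith(" " + fmri)]
    return "\n".join(kept) + ("\n" if kept else "")


def enable_instance(crontab, schedule, fmri):
    """Return the crontab text with exactly one job for this instance."""
    # Remove any stale entry first so that enabling twice is harmless.
    kept = disable_instance(crontab, fmri).splitlines()
    kept.append(cron_line(schedule, fmri))
    return "\n".join(kept) + "\n"
```

Keeping both helpers as pure functions over the crontab text means the
enable path can reuse the disable path, so repeated enables never leave
duplicate jobs behind.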
Filesystems are grouped together either by setting their names as a
space-separated list in an SMF instance property, or dynamically, by the
method script searching all ZFS filesystems for an instance-specific ZFS
user property. With ZFS Delegated Administration (PSARC 2006/465), users
can set this property on their own filesystems and need not reconfigure
the SMF service.

The service can also be responsible for destroying older snapshots taken
by the service, allowing the administrator to keep a given number of
snapshots into the past.

The service can perform a backup command at each invocation of the cron
job: the admin specifies what command to run at the end of a pipe that
starts with "zfs send @", with the option of sending an incremental
stream from the previous periodic snapshot.

What does this offer that a simple "zfs snapshot @snap" entry in crontab
doesn't? Using SMF allows the administrator to easily see when snapshots
fail for some reason, allows them to easily enable/disable snapshots for
groupings of filesystems and adds additional features, like performing
backups of their filesystems.

In the default configuration, we have daily, weekly, hourly, monthly and
yearly snapshots, each managed under a different SMF instance. The
administrator could add instances to take more frequent snapshots for
some filesystems and less frequent snapshots for others, and have the
service manage the complexity of dealing with cron for them.

4.1.2 time-slider :

The default snapshot policy is:

- Frequent snapshots, taken every 15 minutes, keeping the 4 most recent
  frequent snapshots.
- Hourly snapshots, taken once every hour, keeping the 24 most recent
  hourly snapshots.
- Daily snapshots, taken once every 24 hours, keeping the 7 most recent
  daily snapshots.
- Weekly snapshots, taken once every 7 days, keeping the 4 most recent
  weekly snapshots.
- Monthly snapshots, taken on the first day of every month, keeping the
  12 most recent monthly snapshots.

The above schedule aims to retain snapshots over a reasonable duration
without keeping an unreasonably large number of them, with more
granularity and frequency for recent snapshots. Apple's Time Machine
adopts a similar backup scheduling policy.

Power users who wish to customise the default policy may do so by
modifying the SMF service properties for the hourly, daily, weekly or
monthly schedules.

4.1.3 time-slider-setup :

The time-slider-setup GUI displays the current configuration of the
corresponding time-slider SMF service instance:

    svc:/application/time-slider:default

and the filesystems the user has optionally selected to be automatically
snapshotted.

The status of the service and that of its dependencies is queried using
the svcs(1) command. The GUI enables and disables the service by invoking
the svcadm(1M) command with the appropriate arguments. Configuration of
the service is stored using SMF's native property group and property
value mechanisms, which are queried and modified by the svcprop(1) and
svccfg(1M) commands respectively.

Configuration options include:

- Whether the default filesystem selection is used or the user has opted
  for a customised selection. The SMF property for this is
  "zfs/custom-selection", a boolean with a default value of false,
  indicating that the default filesystem selection should be used. A
  true value indicates that the user has opted for customised filesystem
  selection.
- The threshold level at which to reduce the number of snapshots kept on
  the system when a storage pool's usage exceeds the specified percentage
  value. This is an initial warning level that the user is allowed to
  specify in the range between 70 and 90 percent. The default value is
  80%.
  The SMF property value associated with this is "zpool/warning-level".

Custom filesystem selection is based on zfs(1M)'s user properties. The
"com.sun:auto-snapshot" boolean property indicates whether the filesystem
is included in or excluded from automatic snapshotting by the
svc:/system/filesystem/zfs/auto-snapshot SMF instances. The user can
select or deselect a filesystem from a list view in the time-slider-setup
GUI. The selection status of a filesystem in the GUI is applied as a true
or false value to the "com.sun:auto-snapshot" property of the ZFS
filesystem when the user clicks OK to complete the configuration of the
time-slider service.

4.1.4 time-slider-cleanup :

This is a helper program executed at regular intervals via crond. It is
not intended for direct user invocation; user attempts to invoke the
command directly result in immediate exit of the program. The program is
added to root's crontab file whenever the time-slider service is enabled
and is removed when the service is disabled.

The purpose of the program is to perform additional housekeeping by
purging older snapshots before their natural expiry as defined in
section 4.1.2. In extreme situations, the user's ZFS pool may become very
low on available space due to the growth of a ZFS filesystem or of one or
more ZFS snapshots taken by the auto-snapshot service. In such cases,
these snapshots need to be purged faster than normal in order to maintain
a minimum level of available space on the ZFS storage pool.

The cleanup program takes a minimal approach to this task and deletes
filesystem snapshots progressively. It defines three levels of severity
regarding space availability on a storage pool: warning, critical and
emergency. These three levels are specified in the time-slider SMF
instance's "zpool" property group. The three property values are
"warning-level", "critical-level" and "emergency-level" respectively.
The default values of these properties are "80", "90" and "95"
respectively, each an integer percentage of total disk capacity in use on
a ZFS storage pool. The warning level has limited customisation ability
by the user from the time-slider-setup GUI. The lower limit of this value
is 70 and the upper limit is 90, or the value of the critical-level if it
has been customised. The GUI does not offer the ability to customise the
critical and emergency level values.

When time-slider-cleanup is invoked by crond, it checks the capacity
level on each storage pool and compares this with the warning, critical
and emergency levels. If the capacity exceeds one of these values it
takes action appropriate to the severity of the problem:

Warning level exceeded:
    Hourly, then daily snapshots are deleted, oldest first, until the
    storage pool capacity comes back down below the warning level value.

Critical level exceeded:
    In addition to deletion of hourly and daily snapshots, weekly
    snapshots are also deleted, oldest first, until the storage pool
    capacity comes back down below the critical level.

Emergency level exceeded:
    In addition to deletion of hourly, daily and weekly snapshots,
    monthly snapshots are also deleted, oldest first, until the storage
    pool capacity comes back down below the emergency level value.

If deletion of these snapshots fails to bring the pool's capacity back
down below the emergency level then, as a final measure, frequent
(15-minute) snapshots are also deleted. This results in all automatically
taken snapshots being deleted in order to try to keep the storage pool
and the system usable.
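
The escalation described above can be sketched as follows. This is a
minimal Python sketch under stated assumptions, not the actual
time-slider-cleanup code: the function names, the capacity_of/destroy
callbacks and the snapshot dictionary are hypothetical, the levels are
the documented defaults, and reading "back down below" as "no longer
exceeds" is an interpretation.

```python
# Documented default thresholds, in percent of pool capacity in use.
LEVELS = {"warning": 80, "critical": 90, "emergency": 95}

# Schedules eligible for deletion at each severity, in deletion order.
ELIGIBLE = {
    "warning": ["hourly", "daily"],
    "critical": ["hourly", "daily", "weekly"],
    "emergency": ["hourly", "daily", "weekly", "monthly"],
}


def severity(capacity, levels=LEVELS):
    """Return the highest level that capacity exceeds, or None."""
    for name in ("emergency", "critical", "warning"):
        if capacity > levels[name]:
            return name
    return None


def cleanup(capacity_of, destroy, snapshots, levels=LEVELS):
    """Delete snapshots oldest-first until the exceeded level is cleared.

    capacity_of() returns the pool's current percent used, destroy(name)
    deletes one snapshot, and snapshots maps a schedule name to a list of
    snapshot names ordered oldest first (all hypothetical interfaces).
    Returns the names deleted, in order.
    """
    level = severity(capacity_of(), levels)
    if level is None:
        return []
    target = levels[level]
    order = list(ELIGIBLE[level])
    if level == "emergency":
        # Final measure: frequent snapshots go last. Deleting them
        # oldest-first is an assumption; the text does not say.
        order.append("frequent")
    deleted = []
    for schedule in order:
        for name in snapshots.get(schedule, []):
            if capacity_of() <= target:
                return deleted
            destroy(name)
            deleted.append(name)
    return deleted
```

Re-reading the capacity before every deletion keeps the sketch faithful
to the "minimal approach": it stops as soon as the pool no longer exceeds
the level that triggered the cleanup.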
The frequent snapshots are deleted last as they are typically the
smallest in size, having the least deviation from the main filesystem
associated with them, and also because they still provide the user with
some protection from the most common human errors, such as accidental
file deletion or corruption.

time-slider-cleanup logs all remedial actions taken via syslog.
Additionally it sends a desktop notification to any authorised users
logged into the system and running a GNOME desktop session. Authorised
users are considered to be those possessing either the "Primary
Administrator" or "ZFS Filesystem Management" profiles, or the root user.

4.1.5 time-slider-notify :

This is a helper program for time-slider-cleanup that displays warnings
in the notification area of the specified user's desktop. It is a wrapper
for the /usr/bin/notify-send(1) command line utility.

Please refer to:
http://jdswiki.ireland.sun.com/twiki/bin/view/JDS/SnapshotManager

4.2 Bug/RFE Number(s):
    6738645

4.4 Out of scope:
    A graphical means to explore ZFS snapshots is provided via a new
    feature in the GNOME file manager (nautilus); see RFE 6738643.

4.5 Interfaces:

Imported Interfaces:
--------------------
ZFS CLI      (PSARC/2002/240)
SMF          (PSARC/2002/547)
GNOME 2.24   (LSARC/2008/510)

Exported Interfaces:
--------------------
auto-snapshot :
    svc:/system/filesystem/zfs/auto-snapshot:daily      Uncommitted
    svc:/system/filesystem/zfs/auto-snapshot:default    Uncommitted
    svc:/system/filesystem/zfs/auto-snapshot:frequent   Uncommitted
    svc:/system/filesystem/zfs/auto-snapshot:hourly     Uncommitted
    svc:/system/filesystem/zfs/auto-snapshot:monthly    Uncommitted
    svc:/system/filesystem/zfs/auto-snapshot:weekly     Uncommitted
time-slider :
    svc:/application/time-slider:default                Uncommitted
time-slider-setup :
    /usr/bin/time-slider-setup                          Uncommitted
    /usr/lib/time-slider-notify                         Uncommitted
    /usr/lib/time-slider-cleanup                        Uncommitted
    SUNWtime-slider-root                                Uncommitted
    SUNWtime-slider                                     Uncommitted
    TIMFauto-snapshot                                   Uncommitted

5. Documentation
    snapshot manager :
        http://jdswiki.ireland.sun.com/twiki/bin/view/JDS/SnapshotManager
    auto snapshot service :
        http://blogs.sun.com/timf/

6. Resources and Schedule
    6.1. Projected Availability
         OpenSolaris 2008.11
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               Desktop C-Team
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open