1. Introduction

   1.1 Portfolio Name: ZFS FMA Phase 1
   1.2 Portfolio Author: Eric Schrock
   1.3 Submission Date: 02/24/06
   1.4 Project Team Aliases: zfs-team@sun.com
   1.5 Interest List: zfs-team@sun.com

2. Portfolio description

This document describes the first phase of ZFS FMA integration, which includes the ability to generate comprehensive ereports for checksum, I/O, device, and pool errors. A simple DE will be provided that is capable of diagnosing pool failure and complete device failure. The DE will not include the ability to perform predictive analysis of checksum and I/O errors, nor will it include an agent capable of taking proactive actions based on any faults. These facilities will be provided as part of a future project.

3. Fault Boundary Analysis (FBA)

3.1 ASRU and FRU (see RAS glossary for definitions) boundaries for the systems, subsystems, components or services that make up this portfolio.

A ZFS fault exists at the pool or device level. The ASRU for one of these faults describes either a complete pool or a single device within a pool. There is no FRU associated with these faults, as they are a purely software abstraction.

zfs FMRI name-value pairs:

    Member Name   Data Type   Comment
    -----------   ---------   -------
    scheme        string      value="zfs"
    version       uint8       Version of this FMRI specification.
                              Initially set to 1.
    authority     authority   System identifier
    pool_guid     uint64      zfs pool global identifier
    vdev_guid     uint64      zfs vdev global identifier

The zfs scheme specifies a zfs pool and/or vdev associated with a particular fault event.

zfs FMRI string syntax:

    zfs://[<authority>]/pool=<pool>/vdev=<vdev>

For example, the FMRI zfs://pool=test/vdev=ced7b92ef29e7fb5 describes the zfs pool 'test' and the virtual device ced7b92ef29e7fb5 within that pool. Future work may tie together the notion of a ZFS 'virtual device' with the underlying physical device, once a more comprehensive strategy for I/O FMA is developed.

3.2 Diagrams or a description of the faults that may be present in the subsystem.
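Both faults described in this section are named by the zfs FMRI scheme defined in section 3.1. As an illustrative sketch only of that string syntax, here are hypothetical helpers for composing and parsing the string form; they are not part of ZFS or the fault manager:

```python
# Illustrative sketch only: composing and parsing the zfs FMRI string
# form from section 3.1.  These helpers are hypothetical and are not
# an actual ZFS or fmd interface.

def zfs_fmri(pool, vdev=None, authority=""):
    """Build a zfs scheme FMRI; authority and vdev are optional."""
    auth = authority + "/" if authority else ""
    fmri = "zfs://%spool=%s" % (auth, pool)
    if vdev:
        fmri += "/vdev=%s" % vdev
    return fmri

def parse_zfs_fmri(fmri):
    """Split a zfs FMRI back into its name-value pairs."""
    prefix = "zfs://"
    assert fmri.startswith(prefix)
    pairs = {"scheme": "zfs"}
    for field in fmri[len(prefix):].split("/"):
        if "=" in field:
            name, value = field.split("=", 1)
            pairs[name] = value
        elif field:
            pairs["authority"] = field
    return pairs
```

For instance, zfs_fmri("test", "ced7b92ef29e7fb5") yields the example FMRI from section 3.1, zfs://pool=test/vdev=ced7b92ef29e7fb5.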
There are two faults which can be present on the system: a device fault and a pool fault.

A faulted pool is one which cannot be opened due to corrupted metadata and/or unavailable devices. If enough devices cannot be opened that a top-level vdev is unavailable, the pool cannot be opened. Similarly, if a piece of metadata needed to open the pool is unreadable, the pool cannot be opened.

A device fault indicates that a device which was previously open is no longer openable. In the future, this will be expanded to include devices explicitly faulted by the DE after a certain number of errors has occurred. A faulted device remains in the ZFS configuration but is not used for any I/O.

Provide a pointer to the latest version of your fault tree (.esc):

    N/A

Provide a pointer to a summary of your changes to the Event Registry, as summarized by the "ercheck" tool. The HTML report from ercheck is the preferred format:

    http://nome.eng/u/ws/dilpreet/zfs-events/report.html

4. Diagnosis Strategy

4.1 How are the faults described in section 2 diagnosed?

I/O, checksum, and data errors not associated with pool open activity are used to update internal counters, but are otherwise discarded. Future work will enable predictive analysis based on these error patterns.

All device, I/O, checksum, and data errors associated with an attempt to open a pool are grouped into a single case. If a pool-wide ereport is seen, an appropriate pool fault is generated. If no pool ereport is seen after a certain period of time, the case is closed and no fault is generated. If a device ereport is received independent of any pool open attempt, a device fault is immediately generated for the corresponding ZFS device.

4.2 If your fault management activity (error handling, diagnosis or recovery) spans multiple fault manager regions, explain how each activity is coordinated between regions.

N/A.

5. Error Handling Strategy

5.1 How are errors handled?
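The answer below notes that related ereports are chained together using a common ENA. As a toy illustration only of that grouping (all names here are hypothetical; real ENA generation happens in the kernel and is far more involved):

```python
# Illustrative sketch only: ereports generated while handling one
# logical failure (e.g. a pool open) all carry the same ENA, so the
# DE can group them into one case.  Names are hypothetical.

import itertools

_ena_counter = itertools.count(1)

def new_ena():
    """Hand out a fresh Error Numeric Association (toy version)."""
    return next(_ena_counter)

def post_ereports(classes, ena=None):
    """Post a chain of ereports that all share one ENA."""
    if ena is None:
        ena = new_ena()
    return [{"class": c, "ena": ena} for c in classes]
```

Under this model, every ereport posted for a single failed pool open shares one ENA, while an unrelated error posted later carries a different one.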
An I/O or checksum error results in an immediate retry for the majority of I/Os. Certain types of I/O (including speculative I/Os that are expected to fail) do not go through the same retry logic and do not generate ereports. If the retry fails, an appropriate ereport is generated. The system also attempts to re-open the device to detect whether it is still available on the system. If this re-open fails, a device-wide ereport is generated. If the I/O failure results in a logical piece of data being unavailable due to a lack of available replicas, a subsequent data ereport is generated.

The ZFS stack (DMU, DSL, ZPL, and SPA) has been hardened to survive read failures for any data on the system. Write failures (or reads needed in order to write, such as determining available space) result in a system panic. Future work will enable the system to retry writes on different devices, and we ultimately hope to survive write failures while syncing a transaction group.

If a series of errors results in a pool being unopenable, a pool-wide ereport is generated. The ereports are chained together using a common ENA. Any I/O or device failures incurred as part of a pool open use the same ENA. During normal operation, ereports belonging to the same logical piece of data use the same ENA.

5.2 What new error report (ereport) events will be defined and registered with the SMI event registry?

    http://nome.eng/u/ws/dilpreet/zfs-events/report.html

5.3 If you are *not* using a reference fault manager (fmd(1M)) on your system, how are you persisting ereports and communicating them to Sun Services?

N/A.

6. Recovery/Reaction

6.1 Are you introducing any new recovery agents?

No.

6.2 What existing fma modules will be used in response to your faults?

    [ ] cpumem-retire

6.3 Are you modifying any existing (Section 6.2) recovery agents?

No.

7. Test

7.1 Describe the fault and error injection and error simulation coverage for error handling and diagnosis engines.
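The description below covers the two injection types supported by zinject(1M): device errors, which span an entire device, and data errors, which target a logical object range. As a purely illustrative model of that matching (all names and structures here are hypothetical, not zinject's actual interface; the real framework is in-kernel):

```python
# Illustrative sketch only: a toy model of the two injection types.
# Handler names and fields are hypothetical; zinject(1M) is the
# user-level front end to an in-kernel framework.

handlers = []

def inject_device(vdev_guid, error):
    """Register a device error: all I/O to this vdev fails."""
    handlers.append({"type": "device", "vdev": vdev_guid, "error": error})

def inject_data(dataset, obj, start, end, error):
    """Register a data error: I/O to one logical object range fails."""
    handlers.append({"type": "data", "dataset": dataset, "object": obj,
                     "range": (start, end), "error": error})

def check_injection(io):
    """Return the injected error matching this I/O, or None."""
    for h in handlers:
        if h["type"] == "device" and io.get("vdev") == h["vdev"]:
            return h["error"]
        if (h["type"] == "data"
                and io.get("dataset") == h["dataset"]
                and io.get("object") == h["object"]
                and h["range"][0] <= io.get("offset", 0) < h["range"][1]):
            return h["error"]
    return None
```

Because data errors match on the logical object rather than a physical block, an injected error follows the data regardless of how it is stored on the underlying devices.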
A new tool, zinject(1M), is introduced (but not shipped with Solaris). A corresponding in-kernel framework allows two types of errors to be injected: device errors and data errors. Device errors are injected across an entire device and can simulate continued I/O failure as well as the inability to re-open the device. Data errors allow I/O and checksum errors to be injected into a particular logical piece of data, regardless of how it is stored on the underlying devices. This allows simulated failure of any block in the pool, including files, directories, and various types of pool metadata. A translation scheme is provided so that blocks can be referred to by name rather than by numeric identifier.

7.2 How are error handling code paths tested?

Using the above error injection tool, a variety of errors are injected into active pools, verifying that the correct ereports are generated and the appropriate diagnosis is made. The output of 'zpool status' is also verified to be correct, using the existing faults from FMA Portfolio 2005/019. The zinject tool can control certain behaviors, such as flushing the ZFS cache and reloading a pool, so that different scenarios can be tested. For example, after injecting an I/O error into a particular file and flushing the cache, no fault is generated, but 'zpool status' reports the following:

      pool: test
     state: ONLINE
    status: One or more devices has experienced an error resulting in data
            corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
            entire pool from backup.
       see: http://www.sun.com/msg/ZFS-8000-8A
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            test        ONLINE       1     0     0
              mirror    ONLINE       1     0     0
                c0t0d0  ONLINE       2     0     0
                c0t1d0  ONLINE       2     0     0

    errors: The following persistent errors have been detected:

              DATASET  OBJECT  RANGE
              test     4       0-512

8. Gaps

8.1 List any gaps that prevent a full FMA feature set.
This portfolio provides a complete set of ereports from ZFS, as well as a simplistic DE capable of identifying the most obvious faults. Future work will cover the following gaps to complete the feature set:

  - The ability to perform predictive analysis of checksum and I/O errors
    to determine if a device is failing or likely to fail.
  - An automated agent that is capable of proactively offlining or
    replacing failing devices, coordinated with a ZFS hot sparing strategy.
  - Integration with hotplug events to better diagnose when devices become
    (un)available on the system.
  - Integration with a larger I/O FMA strategy to coordinate device failure
    with the underlying physical hardware.

9. Dependencies

9.1 List all project and other portfolio dependencies to fully realize the targeted FMA feature set for this portfolio.

N/A.

10. References

http://www.opensolaris.org/os/community/zfs