1. Introduction

1.1 Portfolio Name: ZFS FMA Phase 2
1.2 Portfolio Author: Eric Schrock
1.3 Submission Date
1.4 Project Team Aliases: zfs-eng@sun.com
1.5 Interest List: zfs-eng@sun.com
1.6 List of Reviewers

2. Portfolio description

FMA portfolio 2005/019 (ZFS FMA Phase 0) established basic knowledge articles for use by 'zpool status'. This was later extended by 2006/005 (ZFS FMA Phase 1) to include diagnosis of failed pools and devices. One of the major remaining gaps for ZFS FMA is predictive diagnosis of I/O and checksum errors. ZFS currently generates ereports for I/O and checksum errors, but these ereports are not diagnosed and no action is taken in response. ZFS is able to identify devices that fail completely (those that cannot be re-opened) but will continue to use devices that generate persistent I/O errors. The result is that a bad drive experiencing I/O errors (and associated timeouts) can slow all pool I/O to a crawl, with no notification to the administrator other than a stream of ereports and an increasing error count in 'zpool status' output. This project enhances the ZFS diagnosis engine to diagnose these ereports and generate faults, and adds an agent that communicates these faults to ZFS.

3. Fault Boundary Analysis (FBA)

3.1 For systems, subsystems, components or services that make up this portfolio, list all resources that will be diagnosed and all the ASRUs and FRUs (see RAS glossary for definitions) associated with each diagnosis in which the resource may be a suspect.

This project will perform additional diagnosis on the FRUs and ASRUs introduced by 2006/005. In particular, it will operate on individual vdevs within a pool, such as:

    zfs://pool=test/vdev=fdf22317eb1ed115

No new FRUs or ASRUs are introduced.

3.2 Diagrams or a description of the faults that may be present in the subsystem.
A suitable format for this information is an Eversholt Fault Tree (see http://eversholt.central) that describes the ASRU and FRU boundaries, the faults that can be present within those boundaries, and the error propagation telemetry for those faults.

This project does not change any of the ereports introduced by 2006/005. Two new faults are introduced, generated in response to I/O and checksum ereports:

(1) ereport.fs.zfs.io
            |
            V
    fault.fs.zfs.vdev.io

(2) ereport.fs.zfs.checksum
            |
            V
    fault.fs.zfs.vdev.checksum

4. Diagnosis Strategy

4.1 Provide a diagnosis philosophy document or a pointer to a portfolio that describes the algorithm used to diagnose the faults described in Section 3.2 and the reasons for using said strateg(y/ies).

The ZFS diagnosis engine is enhanced to include per-vdev I/O and checksum SERD engines. These SERD engines will be fed I/O and checksum ereports and will generate the corresponding fault when the SERD engine fires.

This diagnosis is overly simplistic in the face of many disk failure pathologies. Regions of devices (blocks) are not identified, so a single bad block in an unreplicated pool could generate a series of checksum errors when accessed repeatedly. In this case, the entire disk may be diagnosed as faulty, when in reality it is only a single bad block (perhaps due to a phantom write or administrative error). A more comprehensive diagnosis strategy is desirable, but there is currently insufficient data to design one, and even the simplistic diagnosis presented here is a vast improvement over what is available today.

By the same token, the initial N and T values for the SERD engines will be only estimates. A vdev can represent an arbitrary device (disk, file, iSCSI target, etc.), and the rate at which I/O is performed is proportional to the amount of activity within the ZFS pool. This makes it difficult to define a 'one size fits all' set of SERD values.
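The SERD mechanism relied on here follows the usual fmd model: an engine fires once more than N events are observed within a sliding window of T seconds. A minimal illustrative sketch of that firing rule (class and method names are invented for this document, not fmd interfaces):

```python
from collections import deque

class SerdSketch:
    """Toy soft-error-rate discriminator: fires when more than N
    events fall within a sliding window of T seconds.  This sketches
    the firing rule only; it is not fmd's implementation."""

    def __init__(self, n, t):
        self.n = n             # event count threshold (N)
        self.t = t             # window length in seconds (T)
        self.events = deque()  # timestamps of recent ereports

    def record(self, timestamp):
        """Record one ereport; return True if the engine fires."""
        self.events.append(timestamp)
        # Discard events that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) > self.n
```

With intentionally pessimistic values (say, N=10 and T=600), ten errors spread across a day never fire, while a burst of eleven within ten minutes would diagnose the vdev as faulted.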
This decision is also hampered by a lack of data, particularly for checksum errors, which before ZFS would have gone undetected. The values will be intentionally pessimistic, intended to diagnose only truly faulted devices regardless of type.

Finally, this diagnosis does not correlate the vdev to the underlying device. Virtually all of the I/O errors can be attributed to the underlying device (modulo any transport errors). Without any underlying I/O diagnosis, it is impossible to perform coordinated diagnosis of the underlying device. More sophisticated analysis beyond the simple catch-all SERD engines should likely be done by the underlying I/O framework and coordinated with ZFS. Along these lines, the ZFS diagnosis engine will not consume the disk faults introduced by 2006/012 until a more complete strategy is defined.

4.2 If your fault management activity (error handling, diagnosis or recovery) spans multiple fault manager regions, explain how each activity is coordinated between regions. For example, a Service Processor and Solaris domain may need to coordinate common error telemetry for diagnosis or provide interfaces to effect recovery operations.

N/A

5. Error Handling Strategy

5.1 How are errors handled? Include a description of the immediate error reactions taken to capture error state and keep the system available without compromising the integrity of the rest of the system or user data. In the case of a device driver being hardened, describe the recovery/retry behavior, if any.

N/A

5.2 What new error report (ereport) events will be defined and registered with the SMI event registry? Include all FMA Protocol ereport specifications. Provide a pointer to your ercheck output.

No new ereports are being generated; however, new faults are being introduced as described in Section 3.2.
ercheck output at:

    http://zday.sfbay/export/eschrock/zfs-events/summary.html

5.3 If you are *not* using a reference fault manager (fmd(1M)) on your system, how are you persisting ereports and communicating them to Sun Services?

N/A

5.4 For more complex system portfolios (like Niagara2), provide a comprehensive error handling philosophy document that describes how errors are handled by all components involved in error handling (including Service Processors, LDOMs, etc.) [As an example, for sun4v platforms this may include specs for reset/config, POST, hypervisor, Solaris, and service processor software components.]

N/A

6. Recovery/Reaction

6.1 Are you introducing any new recovery agent(s)? If so, please provide a description of the recovery agent(s).

N/A

6.2 What existing fma modules will be used in response to your faults?

zfs-agent

6.3 Are you modifying any existing (Section 6.2) recovery agents? If so, please indicate the agents below, with a brief description of how they will be modified.

The zfs-agent module will respond to these faults by notifying ZFS (via libzfs) of the new vdev state. This will update the in-kernel ZFS state to match the diagnosis, preventing any further I/O from being issued to a faulted device. This will also trigger a hot spare replacement, if one is available.

6.4 Describe any immediate (e.g. offlining) and long-term (e.g. black-listing) recovery.

For I/O faults (fault.fs.zfs.vdev.io), the device will be offlined and marked FAULTED in the pool configuration. This information will be persistently recorded on disk to eliminate the window between pool open and fmd fault replay.

For checksum faults (fault.fs.zfs.vdev.checksum), the device will be marked DEGRADED in the pool configuration and stored persistently on disk. This will not cause any change in ZFS behavior, although in the future ZFS may be enhanced to avoid allocating new blocks from such devices when possible.
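These reactions amount to a small mapping from fault class to new vdev state. A hypothetical sketch of that mapping (the table and function name are invented here for illustration; the real zfs-agent effects the transitions through libzfs):

```python
# Assumed mapping, restating the reactions described in 6.4: I/O
# faults take the device out of service, while checksum faults only
# degrade it, since the device still completes I/O successfully.
FAULT_TO_VDEV_STATE = {
    "fault.fs.zfs.vdev.io": "FAULTED",         # offlined; hot spare may kick in
    "fault.fs.zfs.vdev.checksum": "DEGRADED",  # still in use; data suspect
}

def vdev_state_for(fault_class):
    """Return the new vdev state for a diagnosed fault, or None when
    the fault class is not one the agent subscribes to."""
    return FAULT_TO_VDEV_STATE.get(fault_class)
```

Only the FAULTED transition stops further I/O to the device and makes it eligible for hot spare replacement.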
This is preferred over faulting the device because the device itself is still functional (I/O can be completed successfully), but the data being read from the device does not match the expected values. Faulting the device is unnecessarily drastic, and may do more harm than good (such as faulting the entire pool).

The persistent on-disk state will be ignored when importing a pool, as the diagnosis may only be applicable to the system on which it was originally made. If the underlying devices are truly at fault, then FMA will quickly diagnose the vdevs as faulty on the new system.

Either of these states can be cleared through 'fmadm repair' as well as 'zpool clear' (already a documented ZFS repair procedure). Both of these actions will communicate between ZFS and FMA to ensure that the in-kernel ZFS state and the fmd resource cache stay in sync.

6.5 Provide pointers to dictionary/po entries and knowledge articles.

Updated dictionary/po entries at:

    http://zday.sfbay/export/eschrock/zfs-events/generated/dicts/ZFS.dict
    http://zday.sfbay/export/eschrock/zfs-events/generated/dicts/ZFS.po

New knowledge articles can be found at:

    http://zday.sfbay/export/eschrock/zfs-events/generated/msgdoc/.articles/ZFS/8000-EY
    http://zday.sfbay/export/eschrock/zfs-events/generated/msgdoc/.articles/ZFS/8000-FD

7. FRUID Implementation

7.1 Complete this section if you're submitting a portfolio for a platform.

N/A

8. Test

8.1 Provide a pointer to your test plan(s) and specification(s). Make sure to list all FMA functionalities that are/are not covered by the test plan(s) and specification(s).

The ZFS test suite has an existing tool, zinject, which is capable of simulating faults for vdevs. The test suite will be expanded to verify that the appropriate faults are diagnosed, the correct actions are taken, and any repair actions are correctly reflected in the system state.

8.2 Explain the risks associated with the test gaps, if any.
Due to the wide variety of devices that can serve as a ZFS vdev, the myriad ways that devices can be connected to the system, and the general difficulty of obtaining devices in the process of failing, it will be impossible to test all scenarios on real hardware. Every attempt will be made to procure failing devices, but the SERD values used may not reflect realistic failure modes.

9. Gaps

9.1 List any gaps that prevent a full FMA feature set. This includes but is not limited to insufficient error detectors, error reporting, and software infrastructure.

As described in Section 4.1, the current diagnosis does not account for block-level information, nor does it coordinate diagnosis with the underlying I/O framework.

9.2 Provide a risk assessment of the gaps listed in Section 9.1. Describe the customer and/or service impact if said gaps are not addressed.

Since no I/O diagnosis framework exists today (SCSI, SATA, or otherwise), it is difficult to theorize how one would interact with the existing ZFS system. Once the first such framework is developed, some of the design decisions made here may need to be revisited. This also requires coordination with the planned io-retire agent. If any such project is introduced without coordination with ZFS, the resulting user experience would be suboptimal (such as multiple diagnoses for a single underlying fault).

The lack of any block-level diagnosis may result in poor diagnosis, particularly for persistent checksum errors within an unreplicated pool. Given that even this simplistic diagnosis is better than the nonexistent diagnosis available today, this does not pose a significant risk.

9.3 List future projects/get-well plans to address the gaps listed in Section 9.1. Provide target date and/or release information as to when these gaps will be addressed.

There are no current plans to address these gaps in the near future.
Once the initial SCSI FMA and/or io-retire agent portfolio is completed, the issue of how to correlate faults between devices and ZFS can be reexamined.

10. Dependencies

10.1 List all project and other portfolio dependencies to fully realize the targeted FMA feature set for this portfolio. A portfolio may have dependencies on infrastructure projects. For example, the "Sun4u PCI hostbridge" and "PCI-X" projects have a dependency on the events/ereports defined within the "PCI Local Bus" portfolio.

This portfolio has the following dependencies:

    2005/019 ZFS FMA Phase 0
    2006/005 ZFS FMA Phase 1

11. References

11.1 Provide pointers to all documents referenced in previous sections (for example, list pointers to error handling and diagnosis philosophy documents, test plans, etc.)

Prior FMA portfolios:

    http://fma.eng/documents/engineering/portfolios/2005/019.zfs/
    http://fma.eng/documents/engineering/portfolios/2006/005.zfs-phase1/

Ercheck output:

    http://zday.sfbay/export/eschrock/zfs-events/summary.html

Events workspace:

    /net/zday.sfbay/export/eschrock/zfs-events

ON project workspace (includes unrelated work):

    /net/zday.sfbay/export/eschrock/zfs-fma