#
# ident	"@(#)portfolio	1.1	06/12/13 SMI"
#

/*
 * Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

# ident	"@(#)portfolio_template.txt	1.10	06/12/08 SMI"

1. Introduction

1.1 Portfolio Name

	ZFS failmode

1.2 Portfolio Author

	eric.kustarz@sun.com

1.3 Submission Date

	3/31/08

1.4 Project Team Aliases:

	zfs-eng@sun.com

1.5 Interest List

1.6 List of Reviewers

2. Portfolio description

	Currently, ZFS triggers a cmn_err() message under certain
	conditions when an I/O failure happens (see PSARC 2007/567
	zpool failmode property).  This project will replace the
	cmn_err() message with an ereport and fault events.

3. Fault Boundary Analysis (FBA)

3.1 For systems, subsystems, components or services that make up this
    portfolio, list all resources that will be diagnosed and all the
    ASRUs and FRUs (see RAS glossary for definitions) associated with
    each diagnosis in which the resource may be a suspect.

    o Provide a pointer to your topology.  The output of `fmtopo -v`
      (from a real system or a mock-up) will be the best way to
      present this information.  (See example below):

	No new FRUs or ASRUs are introduced.

3.2 Diagrams or a description of the faults that may be present in the
    subsystem.  A suitable format for this information is an Eversholt
    Fault Tree (see http://eversholt.central) that describes the ASRU
    and FRU boundaries, the faults that can be present within those
    boundaries and the error propagation telemetry for those faults.

    o Provide a pointer to the latest version of your fault tree which
      lists each fault and how errors are propagated both internally
      within the subsystem and between subsystems.  Note that for more
      complex subsystems, it is a requirement to present a block
      diagram showing the fault boundaries.
	Two new faults are introduced, which are generated in response
	to I/O failures:

	(1)	ereport.fs.zfs.io_failure
			|
			V
		fault.fs.zfs.io_failure_wait

	(2)	ereport.fs.zfs.io_failure
			|
			V
		fault.fs.zfs.io_failure_continue

	Which fault is generated via "ereport.fs.zfs.io_failure"
	depends on the ZFS pool's 'failmode' property setting.  The
	property can be set to one of three values: panic, wait, or
	continue.  If it is set to panic, then no fault is generated,
	as the system immediately panics.  If it is set to wait, then
	all I/Os (reads and writes) are blocked until manual
	intervention occurs.  If it is set to continue, then all write
	I/Os (but not reads) are blocked until manual intervention
	occurs.

    o Provide a pointer to a summary of your changes to the Event
      Registry, as summarized by the "ercheck" tool.  The HTML report
      from ercheck is the preferred format:

	ercheck -H summary.html

      (NOTE: supply "summary.html" file with your portfolio, see
      http://events.central/process/change.html for details)

	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/summary.html

4. Diagnosis Strategy

4.1 Provide a diagnosis philosophy document or a pointer to a
    portfolio that describes the algorithm used to diagnose the faults
    described in Section 3.2 and the reasons for using said
    strategy(ies).  If you are using the Eversholt diagnosis system,
    please provide a pointer to the propagation rules.

	If an I/O failure causes a cmn_err() message today, we will
	instead issue an ereport ("ereport.fs.zfs.io_failure").  The
	ZFS diagnosis engine has a simple 1:1 mapping between receiving
	this ereport and issuing a fault (either
	"fault.fs.zfs.io_failure_wait" or
	"fault.fs.zfs.io_failure_continue", based on the ZFS pool's
	'failmode' property setting).

4.2 If your fault management activity (error handling, diagnosis or
    recovery) spans multiple fault manager regions, explain how each
    activity is coordinated between regions.
    For example, a Service Processor and Solaris domain may need to
    coordinate common error telemetry for diagnosis or provide
    interfaces to effect recovery operations.

	N/A

5. Error Handling Strategy

5.1 How are errors handled?  Include a description of the immediate
    error reactions taken to capture error state and keep the system
    available without compromising the integrity of the rest of the
    system or user data.  In the case of a device driver being
    hardened, describe the recovery/retry behavior, if any.

	See Section 4.1 and PSARC case 2007/567 zpool failmode
	property.

5.2 What new error report (ereport) events will be defined and
    registered with the SMI event registry?  Include all FMA Protocol
    ereport specifications.  Provide a pointer to your ercheck output.

	ereport.fs.zfs.io_failure

	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/summary.html

5.3 If you are *not* using a reference fault manager (fmd(1M)) on your
    system, how are you persisting ereports and communicating them to
    Sun Services?

	N/A

5.4 For more complex system portfolios (like Niagara2), provide a
    comprehensive error handling philosophy document that describes
    how errors are handled by all components involved in error
    handling (including Service Processors, LDOMs, etc.)

	N/A

6. Recovery/Reaction

6.1 Are you introducing any new recovery agent(s)?  If so, please
    provide a description of the recovery agent(s).

	No new recovery agents.

6.2 What existing fma modules will be used in response to your faults?

	N/A

6.3 Are you modifying any existing (Section 6.2) recovery agents?  If
    so, please indicate the agents below, with a brief description of
    how they will be modified.

	N/A

6.4 Describe any immediate (e.g. offlining) and long-term (e.g.
    black-listing) recovery.

	vdevs that experience I/O failures will be marked as FAULTED.
	Depending on which vdevs are affected and the ZFS pool
	configuration, this could cause the pool to become DEGRADED,
	UNAVAIL, or FAULTED.
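	The failmode-to-fault mapping described in Sections 3.2 and 4.1
	can be sketched as follows.  This is an illustrative
	simplification only, not the actual ZFS diagnosis engine code
	(which is written in C); the function and variable names here
	are hypothetical.

	```python
	# Sketch of the 1:1 mapping between the new ereport and the new
	# faults, keyed off the pool's 'failmode' property.  Event and
	# fault class names are taken from Sections 3.2 and 4.1; the
	# diagnose() helper itself is hypothetical.

	IO_FAILURE_EREPORT = "ereport.fs.zfs.io_failure"

	def diagnose(ereport, failmode):
	    """Return the fault class to issue for an ereport, or None."""
	    if ereport != IO_FAILURE_EREPORT:
	        return None  # this sketch only covers the new ereport
	    if failmode == "panic":
	        # System panics immediately; no fault is generated.
	        return None
	    if failmode == "wait":
	        # All I/Os (reads and writes) block until manual
	        # intervention.
	        return "fault.fs.zfs.io_failure_wait"
	    if failmode == "continue":
	        # Write I/Os (but not reads) block until manual
	        # intervention.
	        return "fault.fs.zfs.io_failure_continue"
	    raise ValueError("unknown failmode: %s" % failmode)
	```

	For example, a pool with failmode=wait that emits
	"ereport.fs.zfs.io_failure" would map to
	"fault.fs.zfs.io_failure_wait".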
6.5 Provide pointers to dictionary/po entries and knowledge articles.

	Updated dictionary/po entries at:

	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/generated/dicts/ZFS.dict
	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/generated/dicts/ZFS.po

	New knowledge articles can be found at:

	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/generated/msgdoc/.articles/ZFS/8000-HC
	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/generated/msgdoc/.articles/ZFS/8000-JQ

7. FRUID Implementation

7.1 Complete this section if you're submitting a portfolio for a
    platform.

	N/A

    (Refer to http://webhome.sfbay/FRUID/ for additional information
    on FRU ID requirements and reference material.)

7.1.1 Summarize the platform's level of conformance to the policies
      described in "The Policies and Best Practices for the Recording
      of FMA Status and Event Data in FRUID Storage Devices".
      [Refer to http://fma.eng.sun.com/developer/psh_tech/psh-tech.html
      for a copy of this document.]

7.1.2 Indicate which FRUs listed in Section 3.1 comply with the
      policies & best practices and which FRUs do not.

7.1.3 Provide a link to the document describing the component map for
      each FRU.  An example can be found in Appendix C of the FRUID
      Common Dynamic Data Definition Version 1.2.3.
      (Refer to http://fruid.sfbay/externalspecs/fruiddyn1)

7.1.4 Provide a link to the document describing what platform specific
      event information, if any, will be recorded in the "diagdata"
      field of the Status_EventsR record for each message id.

8. Test

8.1 Provide a pointer to your test plan(s) and specification(s).  Make
    sure to list all FMA functionalities that are/are not covered by
    the test plan(s) and specification(s).

	Unit testing via 'zinject' as described in:

	http://monaco.sfbay.sun.com/detail.jsf?cr=6623234

8.2 Explain the risks associated with the test gaps, if any.

	N/A

9. Gaps

9.1 List any gaps that prevent a full FMA feature set.
    This includes but is not limited to insufficient error detectors,
    error reporting, and software infrastructure.

	Currently, ZFS diagnoses are done to a vdev, not to an actual
	device or FRU or ASRU.  This is a known limitation, and the
	functionality is part of the overall unified disk diagnosis
	portfolio.  See RFE: 6683960.

9.2 Provide a risk assessment of the gaps listed in Section 9.1.
    Describe the customer and/or service impact if said gaps are not
    addressed.

	The customer/admin may not be able to determine which
	particular device to replace or fix, and may have trouble
	locating where that physical piece actually is.

9.3 List future projects/get-well plans to address the gaps listed in
    Section 9.1.  Provide target date and/or release information as to
    when these gaps will be addressed.

	See RFE: 6683960.

10. Dependencies

10.1 List all project and other portfolio dependencies to fully
     realize the targeted FMA feature set for this portfolio.  A
     portfolio may have dependencies on infrastructure projects.  For
     example, the "Sun4u PCI hostbridge" and "PCI-X" projects have a
     dependency on the events/ereports defined within the "PCI Local
     Bus" portfolio.

	2005/019 ZFS FMA Phase 0
	2006/005 ZFS FMA Phase 1
	2007/006 ZFS FMA Phase 2

11. References

11.1 Provide pointers to all documents referenced in previous sections
     (for example, list pointers to error handling and diagnosis
     philosophy documents, test plans, etc.)

	CR:
	http://monaco.sfbay.sun.com/detail.jsf?cr=6623234

	Prior FMA portfolios:
	http://fma.eng/documents/engineering/portfolios/2005/019.zfs/
	http://fma.eng/documents/engineering/portfolios/2006/005.zfs-phase1/
	http://fma.eng/documents/engineering/portfolios/2007/006.ZFS-P2/portfolio.txt

	Ercheck output:
	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/summary.html

	Events workspace:
	/net/zday.sfbay/export/ekstarz/fma_failmode

	ON project workspace:
	/net/zday.sfbay/export/ekstarz/minor_things