#
# ident	"@(#)portfolio	1.1	06/12/13 SMI"
#

/*
 * Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

# ident	"@(#)portfolio_template.txt	1.10	06/12/08 SMI"

1. Introduction

1.1 Portfolio Name

	ZFS failmode

1.2 Portfolio Author

	eric.kustarz@sun.com

1.3 Submission Date

	3/31/08

1.4 Project Team Aliases:

	zfs-eng@sun.com

1.5 Interest List

1.6 List of Reviewers

2. Portfolio description

	Currently, ZFS triggers a cmn_err() message under certain
	conditions when an I/O failure happens (see PSARC 2007/567
	zpool failmode property).  This project will replace the
	cmn_err() message with an ereport and fault events.

3. Fault Boundary Analysis (FBA)

3.1 For systems, subsystems, components or services that make up this
    portfolio, list all resources that will be diagnosed and all the
    ASRUs and FRUs (see RAS glossary for definitions) associated with
    each diagnosis in which the resource may be a suspect.

    o Provide a pointer to your topology.  The output of `fmtopo -v`
      (from a real system or a mock-up) will be the best way to
      present this information.  (See example below):

	No new FRUs or ASRUs are introduced.

3.2 Diagrams or a description of the faults that may be present in the
    subsystem.  A suitable format for this information is an Eversholt
    Fault Tree (see http://eversholt.central) that describes the ASRU
    and FRU boundaries, the faults that can be present within those
    boundaries and the error propagation telemetry for those faults.

    o Provide a pointer to the latest version of your fault tree which
      lists each fault and how errors are propagated both internally
      within the subsystem and between subsystems.  Note that for more
      complex subsystems, it is a requirement to present a block
      diagram showing the fault boundaries.
	Two new faults are introduced, which are generated in response
	to I/O failures:

	(1)	ereport.fs.zfs.io_failure
			|
			V
		fault.fs.zfs.io_failure_wait

	(2)	ereport.fs.zfs.io_failure
			|
			V
		fault.fs.zfs.io_failure_continue

	Which fault is generated via "ereport.fs.zfs.io_failure"
	depends on the ZFS pool's 'failmode' property setting.  The
	property can be set to one of three values: panic, wait, or
	continue.  If it is set to panic, then no fault is generated,
	as the system immediately panics.  If it is set to wait, then
	all I/Os (reads and writes) are blocked until manual
	intervention occurs.  If it is set to continue, then all write
	I/Os (but not reads) are blocked until manual intervention
	occurs.

    o Provide a pointer to a summary of your changes to the Event
      Registry, as summarized by the "ercheck" tool.  The HTML report
      from ercheck is the preferred format:

	ercheck -H summary.html

      (NOTE: supply "summary.html" file with your portfolio, see
      http://events.central/process/change.html for details)

	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/summary.html

4. Diagnosis Strategy

4.1 Provide a diagnosis philosophy document or a pointer to a
    portfolio that describes the algorithm used to diagnose the faults
    described in Section 3.2 and the reasons for using said
    strategy(ies).  If you are using the Eversholt diagnosis system,
    please provide a pointer to the propagation rules.

	If an I/O failure causes a cmn_err() message today, we will
	instead issue an ereport ("ereport.fs.zfs.io_failure").  The
	ZFS diagnosis engine has a simple 1:1 mapping between receiving
	this ereport and issuing a fault (either
	"fault.fs.zfs.io_failure_wait" or
	"fault.fs.zfs.io_failure_continue", based on the ZFS pool's
	'failmode' property setting).

4.2 If your fault management activity (error handling, diagnosis or
    recovery) spans multiple fault manager regions, explain how each
    activity is coordinated between regions.
    For example, a Service Processor and Solaris domain may need to
    coordinate common error telemetry for diagnosis or provide
    interfaces to effect recovery operations.

	N/A

5. Error Handling Strategy

5.1 How are errors handled?  Include a description of the immediate
    error reactions taken to capture error state and keep the system
    available without compromising the integrity of the rest of the
    system or user data.  In the case of a device driver being
    hardened, describe the recovery/retry behavior, if any.

	See Section 4.1 and PSARC case 2007/567 zpool failmode
	property.

5.2 What new error report (ereport) events will be defined and
    registered with the SMI event registry?  Include all FMA Protocol
    ereport specifications.  Provide a pointer to your ercheck output.

	ereport.fs.zfs.io_failure

	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/summary.html

5.3 If you are *not* using a reference fault manager (fmd(1M)) on your
    system, how are you persisting ereports and communicating them to
    Sun Services?

	N/A

5.4 For more complex system portfolios (like Niagara2), provide a
    comprehensive error handling philosophy document that describes
    how errors are handled by all components involved in error
    handling (including Service Processors, LDOMs, etc.)

	N/A

6. Recovery/Reaction

6.1 Are you introducing any new recovery agent(s)?  If so, please
    provide a description of the recovery agent(s).

	No new recovery agents.

6.2 What existing fma modules will be used in response to your faults?

	N/A

6.3 Are you modifying any existing (Section 6.2) recovery agents?  If
    so, please indicate the agents below, with a brief description of
    how they will be modified.

	N/A

6.4 Describe any immediate (e.g. offlining) and long-term (e.g.
    black-listing) recovery.

	vdevs that experience I/O failures will be marked as FAULTED.
	Depending on which vdevs are affected and the ZFS pool
	configuration, this could cause the pool to become DEGRADED,
	UNAVAIL, or FAULTED.
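	The failmode-to-fault mapping described in Sections 3.2 and 4.1
	can be sketched as follows.  This is an illustrative
	simplification only, not the actual ZFS diagnosis engine code
	(which is written in C); the function and variable names here
	are hypothetical.

	```python
	# Sketch of the 1:1 mapping between the new ereport and the new
	# faults, keyed off the pool's 'failmode' property.  Event and
	# fault class names are taken from Sections 3.2 and 4.1; the
	# diagnose() helper itself is hypothetical.

	IO_FAILURE_EREPORT = "ereport.fs.zfs.io_failure"

	def diagnose(ereport, failmode):
	    """Return the fault class to issue for an ereport, or None."""
	    if ereport != IO_FAILURE_EREPORT:
	        return None  # this sketch only covers the new ereport
	    if failmode == "panic":
	        # System panics immediately; no fault is generated.
	        return None
	    if failmode == "wait":
	        # All I/Os (reads and writes) block until manual
	        # intervention.
	        return "fault.fs.zfs.io_failure_wait"
	    if failmode == "continue":
	        # Write I/Os (but not reads) block until manual
	        # intervention.
	        return "fault.fs.zfs.io_failure_continue"
	    raise ValueError("unknown failmode: %s" % failmode)
	```

	For example, a pool with failmode=wait that emits
	"ereport.fs.zfs.io_failure" would map to
	"fault.fs.zfs.io_failure_wait".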
6.5 Provide pointers to dictionary/po entries and knowledge articles.

	Updated dictionary/po entries at:

	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/generated/dicts/ZFS.dict
	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/generated/dicts/ZFS.po

	New knowledge articles can be found at:

	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/generated/msgdoc/.articles/ZFS/8000-HC
	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/generated/msgdoc/.articles/ZFS/8000-JQ

7. FRUID Implementation

7.1 Complete this section if you're submitting a portfolio for a
    platform.

	N/A

    (Refer to http://webhome.sfbay/FRUID/ for additional information
    on FRU ID requirements and reference material.)

7.1.1 Summarize the platform's level of conformance to the policies
      described in "The Policies and Best Practices for the Recording
      of FMA Status and Event Data in FRUID Storage Devices".
      [Refer to http://fma.eng.sun.com/developer/psh_tech/psh-tech.html
      for a copy of this document.]

7.1.2 Indicate which FRUs listed in Section 3.1 comply with the
      policies & best practices and which FRUs do not.

7.1.3 Provide a link to the document describing the component map for
      each FRU.  An example can be found in Appendix C of the FRUID
      Common Dynamic Data Definition Version 1.2.3.
      (Refer to http://fruid.sfbay/externalspecs/fruiddyn1)

7.1.4 Provide a link to the document describing what platform specific
      event information, if any, will be recorded in the "diagdata"
      field of the Status_EventsR record for each message id.

8. Test

8.1 Provide a pointer to your test plan(s) and specification(s).  Make
    sure to list all FMA functionalities that are/are not covered by
    the test plan(s) and specification(s).

	Unit testing via 'zinject' as described in:

	http://monaco.sfbay.sun.com/detail.jsf?cr=6623234

8.2 Explain the risks associated with the test gaps, if any.

	N/A

9. Gaps

9.1 List any gaps that prevent a full FMA feature set.
    This includes but is not limited to insufficient error detectors,
    error reporting, and software infrastructure.

	Currently, ZFS diagnoses are done to a vdev, not to an actual
	device or FRU or ASRU.  This is a known limitation, and the
	functionality is part of the overall unified disk diagnosis
	portfolio.  See RFE: 6683960.

9.2 Provide a risk assessment of the gaps listed in Section 9.1.
    Describe the customer and/or service impact if said gaps are not
    addressed.

	The customer/admin may not be able to determine which
	particular device to replace or fix, and may have trouble
	locating where that physical piece actually is.

9.3 List future projects/get-well plans to address the gaps listed in
    Section 9.1.  Provide target date and/or release information as to
    when these gaps will be addressed.

	See RFE: 6683960.

10. Dependencies

10.1 List all project and other portfolio dependencies to fully
     realize the targeted FMA feature set for this portfolio.  A
     portfolio may have dependencies on infrastructure projects.  For
     example, the "Sun4u PCI hostbridge" and "PCI-X" projects have a
     dependency on the events/ereports defined within the "PCI Local
     Bus" portfolio.

	2005/019 ZFS FMA Phase 0
	2006/005 ZFS FMA Phase 1
	2007/006 ZFS FMA Phase 2

11. References

11.1 Provide pointers to all documents referenced in previous sections
     (for example, list pointers to error handling and diagnosis
     philosophy documents, test plans, etc.)

	CR:
	http://monaco.sfbay.sun.com/detail.jsf?cr=6623234

	Prior FMA portfolios:
	http://fma.eng/documents/engineering/portfolios/2005/019.zfs/
	http://fma.eng/documents/engineering/portfolios/2006/005.zfs-phase1/
	http://fma.eng/documents/engineering/portfolios/2007/006.ZFS-P2/portfolio.txt

	Ercheck output:
	http://jurassic-x4600.sfbay/net/zday.sfbay/export/ekstarz/fma_failmode/summary.html

	Events workspace:
	/net/zday.sfbay/export/ekstarz/fma_failmode

	ON project workspace:
	/net/zday.sfbay/export/ekstarz/minor_things