1. Introduction

1.1 Portfolio Name: ZFS FMA Phase 2
1.2 Portfolio Author: Eric Schrock
1.3 Submission Date
1.4 Project Team Aliases: zfs-eng@sun.com
1.5 Interest List: zfs-eng@sun.com
1.6 List of Reviewers

2. Portfolio description

FMA portfolio 2005/019 (ZFS FMA Phase 0) established basic knowledge articles for use by 'zpool status'. This was later extended by 2006/005 (ZFS FMA Phase 1) to include diagnosis of failed pools and devices. One of the major remaining gaps for ZFS FMA is predictive diagnosis of I/O and checksum errors. ZFS currently generates ereports for I/O and checksum errors, but these ereports are not diagnosed and no action is taken in response. ZFS is able to identify devices that fail completely (those that cannot be re-opened) but will continue to use devices that generate persistent I/O errors. The result is that a bad drive experiencing I/O errors (and associated timeouts) can slow all pool I/O to a crawl, with no notification to the administrator other than a stream of ereports and an increasing error count in 'zpool status' output. This project enhances the ZFS diagnosis engine to diagnose these ereports and generate faults, and adds an agent that communicates these faults to ZFS.

3. Fault Boundary Analysis (FBA)

3.1 For systems, subsystems, components or services that make up this portfolio, list all resources that will be diagnosed and all the ASRUs and FRUs (see RAS glossary for definitions) associated with each diagnosis in which the resource may be a suspect.

This project will perform additional diagnosis on the FRUs and ASRUs introduced by 2006/005. In particular, it will operate on individual vdevs within a pool, such as:

    zfs://pool=test/vdev=fdf22317eb1ed115

No new FRUs or ASRUs are introduced.

3.2 Diagrams or a description of the faults that may be present in the subsystem.
A suitable format for this information is an Eversholt Fault Tree (see http://eversholt.central) that describes the ASRU and FRU boundaries, the faults that can be present within those boundaries, and the error propagation telemetry for those faults.

This project does not change any of the ereports introduced by 2006/005. Two new faults are introduced, generated in response to I/O and checksum ereports:

(1) ereport.fs.zfs.io
            |
            V
    fault.fs.zfs.vdev.io

(2) ereport.fs.zfs.checksum
            |
            V
    fault.fs.zfs.vdev.checksum

4. Diagnosis Strategy

4.1 Provide a diagnosis philosophy document or a pointer to a portfolio that describes the algorithm used to diagnose the faults described in Section 3.2 and the reasons for using said strateg(y/ies).

The ZFS diagnosis engine is enhanced to include per-vdev I/O and checksum SERD engines. These SERD engines will be fed I/O and checksum ereports and will generate the corresponding fault when the SERD engine fires.

This diagnosis is overly simplistic in the face of many disk failure pathologies. Regions of devices (blocks) are not identified, so a single bad block in an unreplicated pool could generate a series of checksum errors when accessed repeatedly. In this case, the entire disk may be diagnosed as faulty, when in reality it is only a single bad block (perhaps due to a phantom write or administrative error). A more comprehensive diagnosis strategy is desirable, but there is currently insufficient data to design one, and even the simplistic diagnosis presented here is a vast improvement over what is available today.

By the same token, the initial N and T values for the SERD engines will be only estimates. A vdev can represent an arbitrary device (disk, file, iSCSI target, etc.), and the rate at which I/O is performed is proportional to the amount of activity within the ZFS pool. This makes it difficult to define a 'one size fits all' set of SERD values.
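The SERD mechanism relied on here follows the usual fmd model: an engine fires once more than N events are observed within a sliding window of T seconds. A minimal illustrative sketch of that firing rule (class and method names are invented for this document, not fmd interfaces):

```python
from collections import deque

class SerdSketch:
    """Toy soft-error-rate discriminator: fires when more than N
    events fall within a sliding window of T seconds.  This sketches
    the firing rule only; it is not fmd's implementation."""

    def __init__(self, n, t):
        self.n = n             # event count threshold (N)
        self.t = t             # window length in seconds (T)
        self.events = deque()  # timestamps of recent ereports

    def record(self, timestamp):
        """Record one ereport; return True if the engine fires."""
        self.events.append(timestamp)
        # Discard events that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) > self.n
```

With intentionally pessimistic values (say, N=10 and T=600), ten errors spread across a day never fire, while a burst of eleven within ten minutes would diagnose the vdev as faulted.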
This decision is also hampered by a lack of data, particularly for checksum errors, which before ZFS would have gone undetected. The values will be intentionally pessimistic, intended to diagnose only truly faulted devices regardless of type.

Finally, this diagnosis does not correlate the vdev to the underlying device. Virtually all of the I/O errors can be attributed to the underlying device (modulo any transport errors). Without any underlying I/O diagnosis, it is impossible to perform coordinated diagnosis of the underlying device. More sophisticated analysis beyond the simple catch-all SERD engines should likely be done by the underlying I/O framework and coordinated with ZFS. Along these lines, the ZFS diagnosis engine will not consume the disk faults introduced by 2006/012 until a more complete strategy is defined.

4.2 If your fault management activity (error handling, diagnosis or recovery) spans multiple fault manager regions, explain how each activity is coordinated between regions. For example, a Service Processor and Solaris domain may need to coordinate common error telemetry for diagnosis or provide interfaces to effect recovery operations.

N/A

5. Error Handling Strategy

5.1 How are errors handled? Include a description of the immediate error reactions taken to capture error state and keep the system available without compromising the integrity of the rest of the system or user data. In the case of a device driver being hardened, describe the recovery/retry behavior, if any.

N/A

5.2 What new error report (ereport) events will be defined and registered with the SMI event registry? Include all FMA Protocol ereport specifications. Provide a pointer to your ercheck output.

No new ereports are being generated; however, new faults are being introduced as described in Section 3.2.
ercheck output at:

    http://zday.sfbay/export/eschrock/zfs-events/summary.html

5.3 If you are *not* using a reference fault manager (fmd(1M)) on your system, how are you persisting ereports and communicating them to Sun Services?

N/A

5.4 For more complex system portfolios (like Niagara2), provide a comprehensive error handling philosophy document that describes how errors are handled by all components involved in error handling (including Service Processors, LDOMs, etc.) [As an example, for sun4v platforms this may include specs for reset/config, POST, hypervisor, Solaris, and service processor software components.]

N/A

6. Recovery/Reaction

6.1 Are you introducing any new recovery agent(s)? If so, please provide a description of the recovery agent(s).

N/A

6.2 What existing fma modules will be used in response to your faults?

zfs-agent

6.3 Are you modifying any existing (Section 6.2) recovery agents? If so, please indicate the agents below, with a brief description of how they will be modified.

The zfs-agent module will respond to these faults by notifying ZFS (via libzfs) of the new vdev state. This will update the in-kernel ZFS state to match the diagnosis, preventing any further I/O from being issued to a faulted device. This will also trigger a hot spare replacement, if one is available.

6.4 Describe any immediate (e.g. offlining) and long-term (e.g. black-listing) recovery.

For I/O faults (fault.fs.zfs.vdev.io), the device will be offlined and marked FAULTED in the pool configuration. This information will be persistently recorded on disk to eliminate the window between pool open and fmd fault replay.

For checksum faults (fault.fs.zfs.vdev.checksum), the device will be marked DEGRADED in the pool configuration and stored persistently on disk. This will not cause any change in ZFS behavior, although in the future ZFS may be enhanced to avoid allocating new blocks from such devices when possible.
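These reactions amount to a small mapping from fault class to new vdev state. A hypothetical sketch of that mapping (the table and function name are invented here for illustration; the real zfs-agent effects the transitions through libzfs):

```python
# Assumed mapping, restating the reactions described in 6.4: I/O
# faults take the device out of service, while checksum faults only
# degrade it, since the device still completes I/O successfully.
FAULT_TO_VDEV_STATE = {
    "fault.fs.zfs.vdev.io": "FAULTED",         # offlined; hot spare may kick in
    "fault.fs.zfs.vdev.checksum": "DEGRADED",  # still in use; data suspect
}

def vdev_state_for(fault_class):
    """Return the new vdev state for a diagnosed fault, or None when
    the fault class is not one the agent subscribes to."""
    return FAULT_TO_VDEV_STATE.get(fault_class)
```

Only the FAULTED transition stops further I/O to the device and makes it eligible for hot spare replacement.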
This is preferred over faulting the device because the device itself is still functional (I/O can be completed successfully), but the data being read from the device does not match the expected values. Faulting the device is unnecessarily drastic, and may do more harm than good (such as faulting the entire pool).

The persistent on-disk state will be ignored when importing a pool, as the diagnosis may only be applicable to the system on which it was originally made. If the underlying devices are truly at fault, then FMA will quickly diagnose the vdevs as faulty on the new system.

Either of these states can be cleared through 'fmadm repair' as well as 'zpool clear' (already a documented ZFS repair procedure). Both of these actions will communicate between ZFS and FMA to ensure that the in-kernel ZFS state and the fmd resource cache stay in sync.

6.5 Provide pointers to dictionary/po entries and knowledge articles.

Updated dictionary/po entries at:

    http://zday.sfbay/export/eschrock/zfs-events/generated/dicts/ZFS.dict
    http://zday.sfbay/export/eschrock/zfs-events/generated/dicts/ZFS.po

New knowledge articles can be found at:

    http://zday.sfbay/export/eschrock/zfs-events/generated/msgdoc/.articles/ZFS/8000-EY
    http://zday.sfbay/export/eschrock/zfs-events/generated/msgdoc/.articles/ZFS/8000-FD

7. FRUID Implementation

7.1 Complete this section if you're submitting a portfolio for a platform.

N/A

8. Test

8.1 Provide a pointer to your test plan(s) and specification(s). Make sure to list all FMA functionalities that are/are not covered by the test plan(s) and specification(s).

The ZFS test suite has an existing tool, zinject, which is capable of simulating faults for vdevs. The test suite will be expanded to verify that the appropriate faults are diagnosed, the correct actions are taken, and any repair actions are correctly reflected in the system state.

8.2 Explain the risks associated with the test gaps, if any.
Due to the wide variety of devices that can serve as a ZFS vdev, the myriad ways that devices can be connected to the system, and the general difficulty of obtaining devices in the process of failing, it will be impossible to test all scenarios on real hardware. Every attempt will be made to procure failing devices, but the SERD values used may not reflect realistic failure modes.

9. Gaps

9.1 List any gaps that prevent a full FMA feature set. This includes but is not limited to insufficient error detectors, error reporting, and software infrastructure.

As described in Section 4.1, the current diagnosis does not account for block-level information, nor does it coordinate diagnosis with the underlying I/O framework.

9.2 Provide a risk assessment of the gaps listed in Section 9.1. Describe the customer and/or service impact if said gaps are not addressed.

Since no I/O diagnosis framework exists today (SCSI, SATA, or otherwise), it is difficult to theorize how one would interact with the existing ZFS system. Once the first such framework is developed, some of the design decisions made here may need to be revisited. This also requires coordination with the planned io-retire agent. If any such project is introduced without coordination with ZFS, the resulting user experience would be suboptimal (such as multiple diagnoses for a single underlying fault).

The lack of any block-level diagnosis may result in poor diagnosis, particularly for persistent checksum errors within an unreplicated pool. Given that even this simplistic diagnosis is better than the nonexistent diagnosis available today, this does not pose a significant risk.

9.3 List future projects/get-well plans to address the gaps listed in Section 9.1. Provide target date and/or release information as to when these gaps will be addressed.

There are no current plans to address these gaps in the near future.
Once the initial SCSI FMA and/or io-retire agent portfolio is completed, the issue of how to correlate faults between devices and ZFS can be reexamined.

10. Dependencies

10.1 List all project and other portfolio dependencies to fully realize the targeted FMA feature set for this portfolio. A portfolio may have dependencies on infrastructure projects. For example, the "Sun4u PCI hostbridge" and "PCI-X" projects have a dependency on the events/ereports defined within the "PCI Local Bus" portfolio.

This portfolio has the following dependencies:

    2005/019 ZFS FMA Phase 0
    2006/005 ZFS FMA Phase 1

11. References

11.1 Provide pointers to all documents referenced in previous sections (for example, list pointers to error handling and diagnosis philosophy documents, test plans, etc.)

Prior FMA portfolios:

    http://fma.eng/documents/engineering/portfolios/2005/019.zfs/
    http://fma.eng/documents/engineering/portfolios/2006/005.zfs-phase1/

Ercheck output:

    http://zday.sfbay/export/eschrock/zfs-events/summary.html

Events workspace:

    /net/zday.sfbay/export/eschrock/zfs-events

ON project workspace (includes unrelated work):

    /net/zday.sfbay/export/eschrock/zfs-fma