1. Introduction

   1.1 Portfolio Name: ZFS FMA Phase 1
   1.2 Portfolio Author: Eric Schrock
   1.3 Submission Date: 02/24/06
   1.4 Project Team Aliases: zfs-team@sun.com
   1.5 Interest List: zfs-team@sun.com

2. Portfolio description

This document describes the first phase of ZFS FMA integration, which includes the ability to generate comprehensive ereports for checksum, I/O, device, and pool errors. A simple DE will be provided that is capable of diagnosing pool failure and complete device failure. The DE will not include the ability to perform predictive analysis of checksum and I/O errors, nor will it include an agent capable of taking proactive actions based on any faults. These facilities will be provided as part of a future project.

3. Fault Boundary Analysis (FBA)

3.1 ASRU and FRU (see RAS glossary for definitions) boundaries for the systems, subsystems, components or services that make up this portfolio.

A ZFS fault exists at the pool or device level. The ASRU for one of these faults describes either a complete pool or a single device within a pool. There is no FRU associated with these faults, as they are a purely software abstraction.

zfs FMRI name-value pairs:

    Member Name   Data Type   Comment
    -----------   ---------   -------
    scheme        string      value="zfs"
    version       uint8       Version of this FMRI specification.
                              Initially set to 1.
    authority     authority   System identifier
    pool_guid     uint64      zfs pool global identifier
    vdev_guid     uint64      zfs vdev global identifier

The zfs scheme specifies a zfs pool and/or vdev associated with a particular fault event.

zfs FMRI string syntax:

    zfs://[<authority>]/pool=<pool>/vdev=<vdev>

For example, the FMRI zfs://pool=test/vdev=ced7b92ef29e7fb5 describes the zfs pool 'test' and the virtual device ced7b92ef29e7fb5 within that pool. Future work may tie together the notion of a ZFS 'virtual device' with the underlying physical device, once a more comprehensive strategy for I/O FMA is developed.

3.2 Diagrams or a description of the faults that may be present in the subsystem.
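Both faults described in this section are named by the zfs FMRI scheme defined in section 3.1. As an illustrative sketch only of that string syntax, here are hypothetical helpers for composing and parsing the string form; they are not part of ZFS or the fault manager:

```python
# Illustrative sketch only: composing and parsing the zfs FMRI string
# form from section 3.1.  These helpers are hypothetical and are not
# an actual ZFS or fmd interface.

def zfs_fmri(pool, vdev=None, authority=""):
    """Build a zfs scheme FMRI; authority and vdev are optional."""
    auth = authority + "/" if authority else ""
    fmri = "zfs://%spool=%s" % (auth, pool)
    if vdev:
        fmri += "/vdev=%s" % vdev
    return fmri

def parse_zfs_fmri(fmri):
    """Split a zfs FMRI back into its name-value pairs."""
    prefix = "zfs://"
    assert fmri.startswith(prefix)
    pairs = {"scheme": "zfs"}
    for field in fmri[len(prefix):].split("/"):
        if "=" in field:
            name, value = field.split("=", 1)
            pairs[name] = value
        elif field:
            pairs["authority"] = field
    return pairs
```

For instance, zfs_fmri("test", "ced7b92ef29e7fb5") yields the example FMRI from section 3.1, zfs://pool=test/vdev=ced7b92ef29e7fb5.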
There are two faults which can be present on the system: a device fault and a pool fault.

A faulted pool is one which cannot be opened due to corrupted metadata and/or unavailable devices. If enough devices cannot be opened that a top-level vdev is unavailable, the pool cannot be opened. Similarly, if a piece of metadata needed to open the pool is unreadable, the pool cannot be opened.

A device fault indicates that a device which was previously open is no longer openable. In the future, this will be expanded to include devices explicitly faulted by the DE after a certain number of errors has occurred. A faulted device remains in the ZFS configuration but is not used for any I/O.

Provide a pointer to the latest version of your fault tree (.esc):

    N/A

Provide a pointer to a summary of your changes to the Event Registry, as summarized by the "ercheck" tool. The HTML report from ercheck is the preferred format:

    http://nome.eng/u/ws/dilpreet/zfs-events/report.html

4. Diagnosis Strategy

4.1 How are the faults described in section 2 diagnosed?

I/O, checksum, and data errors not associated with pool open activity are used to update internal counters, but are otherwise discarded. Future work will enable predictive analysis based on these error patterns.

All device, I/O, checksum, and data errors associated with an attempt to open a pool are grouped into a single case. If a pool-wide ereport is seen, an appropriate pool fault is generated. If no pool ereport is seen after a certain period of time, the case is closed and no fault is generated. If a device ereport is received independent of any pool open attempt, a device fault is immediately generated for the corresponding ZFS device.

4.2 If your fault management activity (error handling, diagnosis or recovery) spans multiple fault manager regions, explain how each activity is coordinated between regions.

N/A.

5. Error Handling Strategy

5.1 How are errors handled?
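The answer below notes that related ereports are chained together using a common ENA. As a toy illustration only of that grouping (all names here are hypothetical; real ENA generation happens in the kernel and is far more involved):

```python
# Illustrative sketch only: ereports generated while handling one
# logical failure (e.g. a pool open) all carry the same ENA, so the
# DE can group them into one case.  Names are hypothetical.

import itertools

_ena_counter = itertools.count(1)

def new_ena():
    """Hand out a fresh Error Numeric Association (toy version)."""
    return next(_ena_counter)

def post_ereports(classes, ena=None):
    """Post a chain of ereports that all share one ENA."""
    if ena is None:
        ena = new_ena()
    return [{"class": c, "ena": ena} for c in classes]
```

Under this model, every ereport posted for a single failed pool open shares one ENA, while an unrelated error posted later carries a different one.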
An I/O or checksum error results in an immediate retry for the majority of I/Os. Certain types of I/O (including speculative I/Os that are expected to fail) do not go through the same retry logic and do not generate ereports. If the retry fails, an appropriate ereport is generated. The system also attempts to re-open the device to detect whether it is still available on the system. If this re-open fails, a device-wide ereport is generated. If the I/O failure results in a logical piece of data being unavailable due to a lack of available replicas, a subsequent data ereport is generated.

The ZFS stack (DMU, DSL, ZPL, and SPA) has been hardened to survive read failures for any data on the system. Write failures (or reads needed in order to write, such as determining available space) result in a system panic. Future work will enable the system to retry writes on different devices, and we ultimately hope to survive write failures while syncing a transaction group.

If a series of errors results in a pool being unopenable, a pool-wide ereport is generated. The ereports are chained together using a common ENA. Any I/O or device failures incurred as part of a pool open use the same ENA. During normal operation, ereports belonging to the same logical piece of data use the same ENA.

5.2 What new error report (ereport) events will be defined and registered with the SMI event registry?

    http://nome.eng/u/ws/dilpreet/zfs-events/report.html

5.3 If you are *not* using a reference fault manager (fmd(1M)) on your system, how are you persisting ereports and communicating them to Sun Services?

N/A.

6. Recovery/Reaction

6.1 Are you introducing any new recovery agents?

No.

6.2 What existing fma modules will be used in response to your faults?

    [ ] cpumem-retire

6.3 Are you modifying any existing (Section 6.2) recovery agents?

No.

7. Test

7.1 Describe the fault and error injection and error simulation coverage for error handling and diagnosis engines.
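The description below covers the two injection types supported by zinject(1M): device errors, which span an entire device, and data errors, which target a logical object range. As a purely illustrative model of that matching (all names and structures here are hypothetical, not zinject's actual interface; the real framework is in-kernel):

```python
# Illustrative sketch only: a toy model of the two injection types.
# Handler names and fields are hypothetical; zinject(1M) is the
# user-level front end to an in-kernel framework.

handlers = []

def inject_device(vdev_guid, error):
    """Register a device error: all I/O to this vdev fails."""
    handlers.append({"type": "device", "vdev": vdev_guid, "error": error})

def inject_data(dataset, obj, start, end, error):
    """Register a data error: I/O to one logical object range fails."""
    handlers.append({"type": "data", "dataset": dataset, "object": obj,
                     "range": (start, end), "error": error})

def check_injection(io):
    """Return the injected error matching this I/O, or None."""
    for h in handlers:
        if h["type"] == "device" and io.get("vdev") == h["vdev"]:
            return h["error"]
        if (h["type"] == "data"
                and io.get("dataset") == h["dataset"]
                and io.get("object") == h["object"]
                and h["range"][0] <= io.get("offset", 0) < h["range"][1]):
            return h["error"]
    return None
```

Because data errors match on the logical object rather than a physical block, an injected error follows the data regardless of how it is stored on the underlying devices.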
A new tool, zinject(1M), is introduced (but not shipped with Solaris). A corresponding in-kernel framework allows two types of errors to be injected: device errors and data errors. Device errors are injected across an entire device and can simulate continued I/O failure as well as the inability to re-open the device. Data errors allow I/O and checksum errors to be injected into a particular logical piece of data, regardless of how it is stored on the underlying devices. This allows simulated failure of any block in the pool, including files, directories, and various types of pool metadata. A translation scheme is provided so that blocks can be referred to by name rather than by numeric identifier.

7.2 How are error handling code paths tested?

Using the above error injection tool, a variety of errors are injected into active pools, verifying that the correct ereports are generated and the appropriate diagnosis is made. The output of 'zpool status' is also verified to be correct, using the existing faults from FMA Portfolio 2005/019. The zinject tool can control certain behaviors, such as flushing the ZFS cache and reloading a pool, so that different scenarios can be tested. For example, after injecting an I/O error into a particular file and flushing the cache, no fault is generated, but 'zpool status' reports the following:

      pool: test
     state: ONLINE
    status: One or more devices has experienced an error resulting in data
            corruption.  Applications may be affected.
    action: Restore the file in question if possible.  Otherwise restore the
            entire pool from backup.
       see: http://www.sun.com/msg/ZFS-8000-8A
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            test        ONLINE       1     0     0
              mirror    ONLINE       1     0     0
                c0t0d0  ONLINE       2     0     0
                c0t1d0  ONLINE       2     0     0

    errors: The following persistent errors have been detected:

              DATASET  OBJECT  RANGE
              test     4       0-512

8. Gaps

8.1 List any gaps that prevent a full FMA feature set.
This portfolio provides a complete set of ereports from ZFS, as well as a simplistic DE capable of identifying the most obvious faults. Future work will cover the following gaps to complete the feature set:

  - The ability to perform predictive analysis of checksum and I/O errors
    to determine if a device is failing or likely to fail.
  - An automated agent that is capable of proactively offlining or
    replacing failing devices, coordinated with a ZFS hot sparing strategy.
  - Integration with hotplug events to better diagnose when devices become
    (un)available on the system.
  - Integration with a larger I/O FMA strategy to coordinate device failure
    with the underlying physical hardware.

9. Dependencies

9.1 List all project and other portfolio dependencies to fully realize the targeted FMA feature set for this portfolio.

N/A.

10. References

http://www.opensolaris.org/os/community/zfs