/*
 * Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 *
 * portfolio.txt 1.2 07/11/16
 */

# ident	"@(#)portfolio_template.txt	1.11	07/02/07 SMI"

1. Introduction

	1.1 Portfolio Name

	    Repair Observability changes
 
	1.2 Portfolio Author

	    steve.hanson@sun.com

	1.3 Submission Date 
	    07/24/2008

	1.4 Project Team Aliases: 
	    fma-core@sun.com

	1.5 Interest List
 
	1.6 List of Reviewers
	    List any individuals/groups that have reviewed and/or approved
	    this portfolio.  It is recommended that the portfolio be pre-
	    reviewed by groups such as Service, RAS review committees, Quality 
	    Engineering, etc.
	
	    fm-convict-policy@sun.com
		- steve.chessin@sun.com   
		- gavin.maltby@sun.com
		- russ.mcmanus@sun.com
		- alex.noordergraaf@sun.com       
		- gary.wuollet@sun.com            
		- roman.zajcew@sun.com            
		
2. Portfolio description

	This portfolio records more detailed information about individual
	members of a suspect list: whether each has been physically
	repaired, replaced or removed, or has been acquitted. The acquittal
	may be automatic (as part of an improved conviction policy) or manual
	(e.g. by following instructions in a knowledge article).

	Two new "lists" are defined, "list.updated" and "list.resolved".
	See file:///net/nome.eng/u1/ws/stephh/repair_obs_report.html

	The following new interfaces are added

	fmadm commands
	--------------
	fmadm repaired fmri | label
	fmadm replaced fmri | label
	fmadm acquit fmri | label
	fmadm acquit uuid [ fmri | label ]

	(Note that the existing "fmadm repair uuid | fmri | label" is
	retained for compatibility and is effectively subsumed by
	"fmadm repaired".)

	fmd api interfaces
	------------------
	fmd_nvl_fmri_has_fault(fmd_hdl_t *hdl, nvlist_t *nvl, int type,
	    char *class)
	fmd_nvl_fmri_replaced(fmd_hdl_t *hdl, nvlist_t *nvl)
	fmd_nvl_fmri_service_state(fmd_hdl_t *hdl, nvlist_t *nvl)
	fmd_case_uuresolved(fmd_hdl_t *handle, const char *uuid)

	scheme interfaces
	-----------------
	fmd_fmri_replaced(nvlist_t *nvl)
	fmd_fmri_service_state(nvlist_t *nvl)

	topo interfaces
	---------------
	topo_fmri_replaced(topo_hdl_t *thp, nvlist_t *nvl, int *err)
	topo_fmri_service_state(topo_hdl_t *thp, nvlist_t *nvl, int *err)

	eversholt functions
	-------------------
	has_fault(path, class)
	    
	For details, see the design document fmadm_acquit.txt.
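
	As a rough illustration of the fmadm commands listed above, the
	sketch below maps each subcommand to the per-suspect status it
	would record. The enum and function names (sus_status_t,
	verb_to_status) are hypothetical and are not part of fmd or fmadm;
	treating "repair" as equivalent to "repaired" is an assumption
	based on the compatibility note above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-suspect status; these names are not from fmd. */
typedef enum {
	SUS_FAULTY,	/* still a suspect, no repair action recorded */
	SUS_REPAIRED,	/* "fmadm repaired" (or the older "fmadm repair") */
	SUS_REPLACED,	/* "fmadm replaced" */
	SUS_ACQUITTED	/* "fmadm acquit" */
} sus_status_t;

/*
 * Map an fmadm subcommand name to the status it would record.
 * Returns 0 and sets *out on success, -1 for an unknown subcommand.
 */
int
verb_to_status(const char *verb, sus_status_t *out)
{
	assert(verb != NULL && out != NULL);

	if (strcmp(verb, "repaired") == 0 || strcmp(verb, "repair") == 0)
		*out = SUS_REPAIRED;
	else if (strcmp(verb, "replaced") == 0)
		*out = SUS_REPLACED;
	else if (strcmp(verb, "acquit") == 0)
		*out = SUS_ACQUITTED;
	else
		return (-1);
	return (0);
}
```

	The point of the sketch is only that the three outcomes are now
	recorded separately for each suspect, rather than as a single
	"repaired" state for the whole case.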
	
3. Fault Boundary Analysis (FBA)
     3.1 For systems, subsystems, components or services that make up 
         this portfolio, list all resources that will be diagnosed and
         all the ASRUs and FRUs (see RAS glossary for definitions) 
         associated with each diagnosis in which the resource may be a 
         suspect.

	 N/A

4. Diagnosis Strategy
	4.1 Provide a diagnosis philosophy document or a pointer to a
	    portfolio that describes the algorithm used to diagnose the 
	    faults described in Section 3.2 and the reasons for using said 
	    strateg(y/ies).

	    N/A

	4.2 If your fault management activity (error handling, diagnosis 
	    or recovery) spans multiple fault manager regions, explain
	    how each activity is coordinated between regions.  For example,
	    a Service Processor and Solaris domain may need to coordinate
  	    common error telemetry for diagnosis or provide interfaces
	    to effect recovery operations.

	    N/A

5. Error Handling Strategy
	5.1 How are errors handled?  Include a description of the immediate
	    error reactions taken to capture error state and keep the 
	    system available without compromising the integrity of the 
	    rest of the system or user data.  In the case of a device 
	    driver being hardened, describe the recovery/retry behavior,
	    if any.

	    N/A


	5.2 What new error report (ereport) events will be defined and
	    registered with the SMI event registry? Include all FMA Protocol 
	    ereport specifications.  Provide a pointer to your ercheck
	    output.

	    N/A


 	5.3 If you are *not* using a reference fault manager (fmd(1M))
            on your system, how are you persisting ereports and communicating
	    them to Sun Services?

	    N/A

	5.4 For more complex system portfolios (like Niagara2), provide a
	    comprehensive error handling philosophy document that describes 
	    how errors are handled by all components involved in error 
	    handling (including Service Processors, LDOMs, etc.)
	    [As an example, for sun4v platforms this may include specs for 
	    reset/config, POST, hypervisor, Solaris, and service processor 
	    software components.]

	    N/A

6. Recovery/Reaction
	6.1 Are you introducing any new recovery agent(s)?  If so, please
	    provide a description of the recovery agent(s).

	    N/A

	6.2 What existing fma modules will be used in response to your faults?

	    N/A

	6.3 Are you modifying any existing (Section 6.2) recovery agents? 
	    If so, please indicate the agents below, with a brief description
	    of how they will be modified.

	    There are some changes to the existing recovery agents so that
	    they unisolate individually repaired suspects on receipt of a
	    list.updated event. These agents are also modified to call
	    fmd_case_uuresolved() when all suspects have been successfully
	    unretired on receipt of a list.repaired event.

	    These changes affect cpumem-retire, io-retire, zfs-retire and
	    disk-monitor.
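
	    The agent behaviour described above might be sketched as
	    follows. This is a standalone model, not code from any of the
	    agents: suspect_t, unretire(), agent_recv() and
	    stub_case_uuresolved() are hypothetical stand-ins, and the
	    stub only counts the calls that a real agent would make to
	    fmd_case_uuresolved().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one suspect as tracked by a retire agent. */
typedef struct suspect {
	bool repaired;	/* suspect repaired, replaced or acquitted */
	bool retired;	/* agent currently has the resource isolated */
} suspect_t;

/* Hypothetical stand-in for fmd_case_uuresolved(); counts calls only. */
static int resolved_calls;

static void
stub_case_uuresolved(const char *uuid)
{
	(void) uuid;
	resolved_calls++;
}

/* Hypothetical unretire; a real agent would online the resource. */
static bool
unretire(suspect_t *sp)
{
	sp->retired = false;
	return (true);
}

/* Handle one list event; returns the number of suspects unretired. */
int
agent_recv(const char *class, const char *uuid, suspect_t *sus, size_t n)
{
	bool updated, repaired, all_ok = true;
	int done = 0;
	size_t i;

	assert(class != NULL && sus != NULL);
	updated = (strcmp(class, "list.updated") == 0);
	repaired = (strcmp(class, "list.repaired") == 0);
	if (!updated && !repaired)
		return (0);

	for (i = 0; i < n; i++) {
		if (!sus[i].retired)
			continue;
		/* list.updated: only touch suspects already repaired */
		if (updated && !sus[i].repaired)
			continue;
		if (unretire(&sus[i]))
			done++;
		else
			all_ok = false;
	}

	/* list.repaired: report the case resolved once all are unretired */
	if (repaired && all_ok)
		stub_case_uuresolved(uuid);
	return (done);
}
```

	    In this model a list.updated event leaves unrepaired suspects
	    isolated, and the case is reported resolved only from the
	    list.repaired path.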

	6.4 Describe any immediate (e.g. offlining) and long-term
	    (e.g. black-listing) recovery.

	    N/A

	6.5 Provide pointers to dictionary/po entries and knowledge
	    articles.

	    See FMA.dict and FMA.po in
	    file:///net/nome.eng/u/ws/stephh/fma2/webrev/index.html

7.  FRUID Implementation

    	7.1 Complete this section if you're submitting a portfolio for a 
    	    platform.

	    N/A

8. Test
	8.1 Provide a pointer to your test plan(s) and specification(s).
	    Make sure to list all FMA functionalities that are/are not
	    covered by the test plan(s) and specification(s).

	    Tested using various io and cpumem fault injections, trying
	    out the new commands, simulating removal and replacement of
	    FRUs, etc. The testing looked especially at cases with
	    multi-entry suspect lists.

	    Note that the "fakenotpresent" configurable, which allowed
	    simulation of a device being removed from the system, is being
	    extended here to allow both removal and replacement to be
	    simulated.

	    Specific test cases include:
	    - pciex hardware fault injector
	    - cpumem fault harness testing (sparc/intel/amd)
	    - pciex fmsim tests
	    - pci fmsim tests
	    - check fmadm acquit/repaired/replaced with multi-entry suspect
	      lists generated via the above, and check behaviour of fmadm
	      faulty, fmdump and syslog as individual suspects are
	      acquitted/repaired
	    - use the fakenotpresent feature to simulate removal/replacement
	      of suspects and check behaviour of fmadm faulty, fmdump and
	      syslog
	    - check replay code on fmd restart for list.suspect/list.updated/
	      list.repaired/list.resolved
	    - run fmstress and check for no leaks
	    - simulate an io device reporting service impact degraded and check
	      this is reported correctly
	    - check behaviour of retire agents is correct

       8.2 Explain the risks associated with the test gaps, if any.

	   N/A

9. Gaps

	9.1 List any gaps that prevent a full FMA feature set.
	    This includes but is not limited to insufficient error 
	    detectors, error reporting, and software infrastructure.
	
	    N/A

	9.2 Provide a risk assessment of the gaps listed in Section 9.1.
	    Describe the customer and/or service impact if said gaps
	    are not addressed.

	    N/A


	9.3 List future projects/get-well plans to address the gaps listed
	    in Section 9.1.  Provide target date and/or release information 
	    as to when these gaps will be addressed.

	    N/A

10. Dependencies
       10.1 List all project and other portfolio dependencies to fully realize
	    the targeted FMA feature set for this portfolio. A portfolio may
	    have dependencies on infrastructure projects. For example,
	    the "Sun4u PCI hostbridge" and "PCI-X" projects have a dependency
	    on the events/ereports defined within the "PCI Local Bus" 
	    portfolio.

	    N/A

11. References
      11.1 Provide pointers to all documents referenced in previous
	    sections (for example, list pointers to error handling
	    and diagnosis philosophy documents, test plans, 
	    etc.)

	    N/A