/*
 * Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 *
 * portfolio.txt 1.2 07/11/16
 */

# ident "@(#)portfolio_template.txt 1.11 07/02/07 SMI"

1. Introduction

1.1 Portfolio Name

    Repair Observability changes

1.2 Portfolio Author

    steve.hanson@sun.com

1.3 Submission Date

    07/24/2008

1.4 Project Team Aliases:

    fma-core@sun.com

1.5 Interest List

1.6 List of Reviewers

    List any individuals/groups that have reviewed and/or approved this
    portfolio. It is recommended that the portfolio be pre-reviewed by
    groups such as Service, RAS review committees, Quality Engineering,
    etc.

    fm-convict-policy@sun.com
        - steve.chessin@sun.com
        - gavin.maltby@sun.com
        - russ.mcmanus@sun.com
        - alex.noordergraaf@sun.com
        - gary.wuollet@sun.com
        - roman.zajcew@sun.com

2. Portfolio description

    This portfolio records more detailed information about individual
    members of a suspect list: whether each has been physically
    repaired, replaced or removed, or has been acquitted. The acquittal
    can be automatic (as part of an improved conviction policy) or
    manual (e.g. by following instructions in a knowledge article).

    Two new "lists" are defined, "list.updated" and "list.resolved". See

    file:///net/nome.eng/u1/ws/stephh/repair_obs_report.html

    The following new interfaces are added:

    fmadm commands
    --------------
    fmadm repaired fmri | label
    fmadm replaced fmri | label
    fmadm acquit fmri | label
    fmadm acquit uuid [ fmri | label ]

    (Note that the existing "fmadm repair uuid | fmri | label" is
    retained for compatibility and is effectively subsumed by
    "fmadm repaired".)
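    The per-suspect bookkeeping behind these commands can be pictured
    with a small standalone C model. This is only a sketch: the types,
    function names and labels are hypothetical and are not fmd
    interfaces, and the rule used to choose between list.updated and
    list.repaired (updated while any suspect is still faulty, repaired
    once none is) is an assumption inferred from the description in
    this portfolio.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-suspect states; these are not fmd data structures. */
typedef enum {
	SUSPECT_FAULTY,		/* still considered faulty */
	SUSPECT_REPAIRED,	/* e.g. after "fmadm repaired" */
	SUSPECT_REPLACED,	/* e.g. after "fmadm replaced" */
	SUSPECT_ACQUITTED	/* e.g. after "fmadm acquit" */
} suspect_state_t;

typedef struct {
	const char *label;	/* illustrative label, not a real FRU label */
	suspect_state_t state;
} suspect_t;

/* Return the index of the suspect with the given label, or -1. */
int
suspect_find(const suspect_t *list, size_t n, const char *label)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(list[i].label, label) == 0)
			return ((int)i);
	}
	return (-1);
}

/* Count the suspects that are still marked faulty. */
size_t
suspects_outstanding(const suspect_t *list, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (list[i].state == SUSPECT_FAULTY)
			count++;
	}
	return (count);
}

/*
 * Record a new state for one suspect and return the name of the list
 * event this model would publish, or NULL if nothing changed.
 * Assumption: "list.updated" while other suspects remain faulty,
 * "list.repaired" once none do.
 */
const char *
suspect_update(suspect_t *list, size_t n, const char *label,
    suspect_state_t state)
{
	int idx;

	assert(list != NULL && label != NULL);

	idx = suspect_find(list, n, label);
	if (idx < 0 || state == SUSPECT_FAULTY)
		return (NULL);

	list[idx].state = state;
	return (suspects_outstanding(list, n) > 0 ?
	    "list.updated" : "list.repaired");
}
```

    In this model, acquitting one member of a two-entry suspect list
    yields "list.updated", and replacing the remaining member yields
    "list.repaired".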
    fmd api interfaces
    ------------------
    fmd_nvl_fmri_has_fault(fmd_hdl_t *hdl, nvlist_t *nvl, int type,
        char *class)
    fmd_nvl_fmri_replaced(fmd_hdl_t *hdl, nvlist_t *nvl)
    fmd_nvl_fmri_service_state(fmd_hdl_t *hdl, nvlist_t *nvl)
    fmd_case_uuresolved(fmd_hdl_t *handle, const char *uuid)

    scheme interfaces
    -----------------
    fmd_fmri_replaced(nvlist_t *nvl)
    fmd_nvl_fmri_service_state(fmd_hdl_t *hdl, nvlist_t *nvl)

    topo interfaces
    ---------------
    topo_fmri_replaced(topo_hdl_t *thp, nvlist_t *nvl, int *err)
    topo_fmri_service_state(topo_hdl_t *thp, nvlist_t *nvl, int *err)

    eversholt functions
    -------------------
    has_fault(path, class)

    For details see design document fmadm_acquit.txt

3. Fault Boundary Analysis (FBA)

3.1 For systems, subsystems, components or services that make up this
    portfolio, list all resources that will be diagnosed and all the
    ASRUs and FRUs (see RAS glossary for definitions) associated with
    each diagnosis in which the resource may be a suspect.

    N/A

4. Diagnosis Strategy

4.1 Provide a diagnosis philosophy document or a pointer to a portfolio
    that describes the algorithm used to diagnose the faults described
    in Section 3.2 and the reasons for using said strateg(y/ies).

    N/A

4.2 If your fault management activity (error handling, diagnosis or
    recovery) spans multiple fault manager regions, explain how each
    activity is coordinated between regions. For example, a Service
    Processor and Solaris domain may need to coordinate common error
    telemetry for diagnosis or provide interfaces to effect recovery
    operations.

    N/A

5. Error Handling Strategy

5.1 How are errors handled? Include a description of the immediate
    error reactions taken to capture error state and keep the system
    available without compromising the integrity of the rest of the
    system or user data. In the case of a device driver being hardened,
    describe the recovery/retry behavior, if any.

    N/A

5.2 What new error report (ereport) events will be defined and
    registered with the SMI event registry?
    Include all FMA Protocol ereport specifications. Provide a pointer
    to your ercheck output.

    N/A

5.3 If you are *not* using a reference fault manager (fmd(1M)) on your
    system, how are you persisting ereports and communicating them to
    Sun Services?

    N/A

5.4 For more complex system portfolios (like Niagara2), provide a
    comprehensive error handling philosophy document that describes how
    errors are handled by all components involved in error handling
    (including Service Processors, LDOMs, etc.)
    [As an example, for sun4v platforms this may include specs for
    reset/config, POST, hypervisor, Solaris, and service processor
    software components.]

    N/A

6. Recovery/Reaction

6.1 Are you introducing any new recovery agent(s)? If so, please
    provide a description of the recovery agent(s).

    N/A

6.2 What existing fma modules will be used in response to your faults?

    N/A

6.3 Are you modifying any existing (Section 6.2) recovery agents? If
    so, please indicate the agents below, with a brief description of
    how they will be modified.

    The existing recovery agents are changed so that they unisolate
    individually repaired suspects on receipt of a list.updated event.
    These agents are also modified to call fmd_case_uuresolved() when
    all suspects have been successfully unretired on receipt of a
    list.repaired event. These changes affect cpumem-retire, io-retire,
    zfs-retire and disk-monitor.

6.4 Describe any immediate (e.g. offlining) and long-term (e.g.
    black-listing) recovery.

    N/A

6.5 Provide pointers to dictionary/po entries and knowledge articles.

    See FMA.dict and FMA.po in
    file:///net/nome.eng/u/ws/stephh/fma2/webrev/index.html

7. FRUID Implementation

7.1 Complete this section if you're submitting a portfolio for a
    platform.

    N/A

8. Test

8.1 Provide a pointer to your test plan(s) and specification(s). Make
    sure to list all FMA functionalities that are/are not covered by
    the test plan(s) and specification(s).

    Tested using various io and cpumem fault injections, trying out the
    new commands, simulating removal and replacement of FRUs, etc. The
    testing looked especially at cases with multi-entry suspect lists.
    Note that the "fakenotpresent" configurable, which allowed
    simulation of a device being removed from the system, is extended
    here to allow both removal and replacement to be simulated.

    Specific test cases include:

    - pciex hardware fault injector
    - cpumem fault harness testing (sparc/intel/amd)
    - pciex fmsim tests
    - pci fmsim tests
    - check fmadm acquit/repaired/replaced with multi-entry suspect
      lists generated via the above, and check behaviour of fmadm
      faulty, fmdump and syslog as individual suspects are
      acquitted/repaired.
    - use the fakenotpresent feature to simulate removal/replacement of
      suspects and check behaviour of fmadm faulty, fmdump and syslog
    - check replay code on fmd restart for list.suspect/list.updated/
      list.repaired/list.replaced
    - run fmstress and check for no leaks
    - simulate an io device reporting service impact degraded and check
      that this is reported correctly.
    - check that the behaviour of the retire agents is correct.

8.2 Explain the risks associated with the test gaps, if any.

    N/A

9. Gaps

9.1 List any gaps that prevent a full FMA feature set. This includes
    but is not limited to insufficient error detectors, error
    reporting, and software infrastructure.

    N/A

9.2 Provide a risk assessment of the gaps listed in Section 9.1.
    Describe the customer and/or service impact if said gaps are not
    addressed.

    N/A

9.3 List future projects/get-well plans to address the gaps listed in
    Section 9.1. Provide target date and/or release information as to
    when these gaps will be addressed.

    N/A

10. Dependencies

10.1 List all project and other portfolio dependencies to fully realize
    the targeted FMA feature set for this portfolio. A portfolio may
    have dependencies on infrastructure projects.
    For example, the "Sun4u PCI hostbridge" and "PCI-X" projects have a
    dependency on the events/ereports defined within the "PCI Local
    Bus" portfolio.

    N/A

11. References

11.1 Provide pointers to all documents referenced in previous sections
    (for example, list pointers to error handling and diagnosis
    philosophy documents, test plans, etc.)

    N/A