#
# Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
#
# ident	"%Z%%M%	%I%	%E% SMI"
#

1. Introduction

1.1 Portfolio Name

    Disk enumeration for Sun Fire X4200 and X4200 M2

1.2 Portfolio Authors

    Carla Mowers, David Zhang

1.3 Submission Date

    08/02/2007

1.4 Project Team Aliases

    Eric.Schrock@sun.com, Chris.Horne@sun.com,
    Carla.Mowers@sun.com, David.Zhang@sun.com

1.5 Interest List

    sanmas@sun.com

1.6 List of Reviewers

    Eric.Schrock@Sun.Com

2. Portfolio Description

    This portfolio describes the first phase of FMA integration for
    disks in the X4200 and X4200 M2 chassis. It follows the design
    described in FMA 2007/016 (Generic Disk Topology) and presents a
    similar topology for the four internal disks.

    This case does not define any LED capabilities for the drives. The
    'ok2rm' LED is not physically connected to anything on these
    chassis. While it would be possible to expose only the 'fault' LED,
    the plan is to wait for future phases of FMA 2007/015 (Unified Disk
    FMA), which move this functionality into libtopo in a more generic
    fashion.

    This portfolio allows SMART data to be extracted from the disks and
    diagnosed by the disk-transport module introduced in FMA 2007/007
    (Generic Disk Monitoring).

3. Fault Boundary Analysis (FBA)

3.1 For systems, subsystems, components or services that make up this
    portfolio, list all resources that will be diagnosed and all the
    ASRUs and FRUs (see RAS glossary for definitions) associated with
    each diagnosis in which the resource may be a suspect.

    Please refer to section 11.3 for background on the enumeration of a
    generic SCSI device. The X4200 uses this generic model to generate
    a new platform-specific XML file describing the internal storage of
    an X4200. The only change made was to the perl script that
    generates the product-specific XML file.
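    As a rough illustration of what such a platform-specific topology
    map looks like, the fragment below follows the general shape of
    existing hc-scheme topology map files (topology/range/node/propgroup
    elements). The ranges, property values, and the single expanded node
    shown here are illustrative assumptions for a four-bay chassis, not
    the contents of the delivered X4200 file.

    ```xml
    <!-- Hypothetical sketch of a platform hc topology map; node names,
         ranges, and property values are illustrative only. -->
    <topology name='bay' scheme='hc'>
        <range name='bay' min='0' max='3'>
            <node instance='0'>
                <propgroup name='protocol' version='1'
                    name-stability='Private' data-stability='Private'>
                    <propval name='label' type='string' value='HDD_0' />
                </propgroup>
                <dependents grouping='children'>
                    <range name='disk' min='0' max='0'>
                        <enum-method name='disk' version='1' />
                    </range>
                </dependents>
            </node>
            <!-- nodes for bay instances 1-3 follow the same pattern -->
        </range>
    </topology>
    ```

    Each bay node carries a label property (matching the physical
    'HDD_n' silkscreen) and delegates enumeration of the disk beneath
    it to the generic disk enumerator.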
    The underlying resource diagnosed is the same as that described in
    FMA 2007/007, and the same underlying methodology will be used:

    Resource:
        hc://:product-id=Sun-Fire-X4200-Server:chassis-id=0648AM0378:
        server-id=dmg4200b:serial=3110SYWX 3LB0SYWX:
        part=SEAGATE-ST973401LSUN72G:revision=0556/bay=0/disk=0

    ASRU:
        dev:///:devid=id1,sd@SSEAGATE_ST973401LSUN72G_3110SYWX____________3LB0SYWX
        //pci@0,0/pci1022,7450@2/pci1000,3060@3/sd@0,0

    FRU:
        hc://:product-id=Sun-Fire-X4200-Server:chassis-id=0648AM0378:
        server-id=dmg4200b:serial=3110SYWX 3LB0SYWX:
        part=SEAGATE-ST973401LSUN72G:revision=0556/bay=0/disk=0

    Label: HDD_0

3.2 Diagrams or a description of the faults that may be present in the
    subsystem. A suitable format for this information is an Eversholt
    Fault Tree (see http://eversholt.central) that describes the ASRU
    and FRU boundaries, the faults that can be present within those
    boundaries and the error propagation telemetry for those faults.

    See 11.3 <2007/007 Generic Disk Monitoring (sfx4500 phase 2)> for a
    description of the faults.

4. Diagnosis Strategy

4.1 Provide a diagnosis philosophy document or a pointer to a portfolio
    that describes the algorithm used to diagnose the faults described
    in Section 3.2 and the reasons for using said strategy(/ies).

    See 11.3 <2007/007 Generic Disk Monitoring (sfx4500 phase 2)>

4.2 If your fault management activity (error handling, diagnosis or
    recovery) spans multiple fault manager regions, explain how each
    activity is coordinated between regions. For example, a Service
    Processor and Solaris domain may need to coordinate common error
    telemetry for diagnosis or provide interfaces to effect recovery
    operations.

    N/A

5. Error Handling Strategy

5.1 How are errors handled? Include a description of the immediate
    error reactions taken to capture error state and keep the system
    available without compromising the integrity of the rest of the
    system or user data. In the case of a device driver being hardened,
    describe the recovery/retry behavior, if any.
    See 11.3 <2007/007 Generic Disk Monitoring (sfx4500 phase 2)>

5.2 What new error report (ereport) events will be defined and
    registered with the SMI event registry? Include all FMA Protocol
    ereport specifications. Provide a pointer to your ercheck output.

    N/A

5.3 If you are *not* using a reference fault manager (fmd(1M)) on your
    system, how are you persisting ereports and communicating them to
    Sun Services?

    N/A

5.4 For more complex system portfolios (like Niagara2), provide a
    comprehensive error handling philosophy document that describes how
    errors are handled by all components involved in error handling
    (including Service Processors, LDOMs, etc.) [As an example, for
    sun4v platforms this may include specs for reset/config, POST,
    hypervisor, Solaris, and service processor software components.]

    N/A

6. Recovery/Reaction

6.1 Are you introducing any new recovery agent(s)? If so, please
    provide a description of the recovery agent(s).

    N/A

6.2 What existing fma modules will be used in response to your faults?

    See 11.4 <2007/004 IO Retire Agents>

6.3 Are you modifying any existing (Section 6.2) recovery agents? If
    so, please indicate the agents below, with a brief description of
    how they will be modified.

    N/A

6.4 Describe any immediate (e.g. offlining) and long-term (e.g.
    black-listing) recovery.

    N/A

6.5 Provide pointers to dictionary/po entries and knowledge articles.

    N/A

7. FRUID Implementation

7.1 Complete this section if you're submitting a portfolio for a
    platform.

    N/A

8. Test

8.1 Provide a pointer to your test plan(s) and specification(s). Make
    sure to list all FMA functionalities that are/are not covered by
    the test plan(s) and specification(s).

    N/A

8.2 Explain the risks associated with the test gaps, if any.

    N/A

9. Gaps

9.1 List any gaps that prevent a full FMA feature set. This includes
    but is not limited to insufficient error detectors, error
    reporting, and software infrastructure.
    This portfolio is only part of the long-term disk diagnosis
    strategy outlined in 2007/016. As such, it does not seek to address
    any of the known gaps outlined in that portfolio, namely LED
    management, SCSI transport diagnosis, and unified ZFS diagnosis.

9.2 Provide a risk assessment of the gaps listed in Section 9.1.
    Describe the customer and/or service impact if said gaps are not
    addressed.

    See 11.2 <2007/015 Unified Disk Diagnosis>

9.3 List future projects/get-well plans to address the gaps listed in
    Section 9.1. Provide target date and/or release information as to
    when these gaps will be addressed.

    See 11.2 <2007/015 Unified Disk Diagnosis>

10. Dependencies

10.1 List all project and other portfolio dependencies to fully realize
     the targeted FMA feature set for this portfolio. A portfolio may
     have dependencies on infrastructure projects. For example, the
     "Sun4u PCI hostbridge" and "PCI-X" projects have a dependency on
     the events/ereports defined within the "PCI Local Bus" portfolio.

    This portfolio has the following dependencies:

        2007/016 Generic Topology for Internal Disks (sfx4500 phase 3)
        2007/015 Unified Disk Diagnosis
        2007/007 Generic Disk Monitoring (sfx4500 phase 2)
        2006/012 Sun Fire X4500 Disk Failures: Phase I
        2007/004 IO Retire Agent

11. References

11.1 Provide pointers to all documents referenced in previous sections
     (for example, list pointers to error handling and diagnosis
     philosophy documents, test plans, etc.)
    [1] "Sun Fire X4500 Disk Failure: Phase I" portfolio:
        http://fma.eng/documents/engineering/portfolios/2006/012.sfx4500-disk

    [2] "Unified Disk Diagnosis" portfolio:
        http://fma.eng/documents/engineering/portfolios/2007/015.Unified-Disk-FMA

    [3] "Generic Disk Monitoring (sfx4500 phase 2)" portfolio:
        http://fma.eng/documents/engineering/portfolios/2007/007.Generic-disk-monitoring-sfx4500-p2

    [4] "IO Retire Agent (2007.004.IO_Retireagent)" portfolio:
        http://fma.eng/documents/engineering/portfolios/2007/004.IO_Retireagent/

    [5] Project workspaces:
        /net/anthrax.central/export/ws/cth/onnv-sanfma
        /net/anthrax.central/export/ws/cth/events-sanfma
        /net/boora.central/brmnas/yz203490/ws_fma