1. Introduction

1.1 Portfolio Name

    Intel 5400 chipset Memory Controller Hub

1.2 Portfolio Author

    Adrian Frost

1.3 Submission Date

    11 March 2008

1.4 Project Team Aliases

    Adrian.Frost@sun.com

1.5 Interest List

    Fadi.Salem@Sun.COM, Foz.Saeed@Sun.COM, Sridhar.Yedunuthula@Sun.COM,
    Kim.V.Tran@Sun.COM, fma-core@sun.com

1.6 List of Reviewers

    List any individuals/groups that have reviewed and/or approved this
    portfolio. It is recommended that the portfolio be pre-reviewed by
    groups such as Service, RAS review committees, Quality Engineering,
    etc.

    Reviewer   Group   Version    Date   Comments
                       Reviewed          (Approved/Rejected/Other)
    --------   -----   --------   ----   -------------------------

2. Portfolio Description

    The 5400 Memory Controller Hub is an enhancement to the 5000 series.
    It supports a faster front-side bus, lower-latency I/O, and larger
    memory DIMMs.

3. Fault Boundary Analysis (FBA)

3.1 For systems, subsystems, components or services that make up this
    portfolio, list all resources that will be diagnosed and all the
    ASRUs and FRUs (see RAS glossary for definitions) associated with
    each diagnosis in which the resource may be a suspect.

    This is the same as the 5000 series; see section 3.1 of FMA
    portfolio 2007.022.Intel:
    http://wikihome.sfbay.sun.com/fma-portfolio/Wiki.jsp?page=2007.022.Intel

3.2 Diagrams or a description of the faults that may be present in the
    subsystem. A suitable format for this information is an Eversholt
    Fault Tree (see http://eversholt.central) that describes the ASRU
    and FRU boundaries, the faults that can be present within those
    boundaries and the error propagation telemetry for those faults.

    The topology of the 5400 is the same as the 5000 series. There is
    one new fault, fault.cpu.intel.nb.otf, and two new ereports,
    ereport.cpu.intel.nb.otf and ereport.cpu.intel.nb.spd. All three
    are raised against existing topology nodes.
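    Since section 3.2 names an Eversholt fault tree as the suitable
    format, the new events and their propagation could be sketched
    roughly as below. This is an illustrative sketch only, not the
    shipped rules: the motherboard path, the N@T counts, and the use of
    an upset.discard event to drop the spd ereport are all assumptions
    made for the example.

    ```
    /* Illustrative sketch -- paths and counts are assumptions. */

    event ereport.cpu.intel.nb.otf@motherboard;
    event ereport.cpu.intel.nb.spd@motherboard;
    event fault.cpu.intel.nb.otf@motherboard;

    /* An otf ereport is always diagnosed to the otf fault. */
    prop fault.cpu.intel.nb.otf@motherboard (1)->
        ereport.cpu.intel.nb.otf@motherboard;

    /* spd is recoverable and expected only at system initialization,
     * so its ereport is discarded rather than diagnosed. */
    event upset.discard@motherboard;
    prop upset.discard@motherboard (1)->
        ereport.cpu.intel.nb.spd@motherboard;
    ```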
    http://hyper.sfbay.sun.com/net/hyper/tank/ws/af/intel5400.portfolio/fmtopo-p
    http://hyper.sfbay.sun.com/net/hyper/tank/ws/af/intel5400.portfolio/ercheck.html

4. Diagnosis Strategy

4.1 Provide a diagnosis philosophy document or a pointer to a portfolio
    that describes the algorithm used to diagnose the faults described
    in Section 3.2 and the reasons for using said strategy/strategies.

    There are two new ereports. ereport.cpu.intel.nb.otf will always be
    diagnosed to fault.cpu.intel.nb.otf. ereport.cpu.intel.nb.spd is a
    recoverable error that should only occur at system initialization;
    that ereport is discarded in Eversholt.

4.2 If your fault management activity (error handling, diagnosis or
    recovery) spans multiple fault manager regions, explain how each
    activity is coordinated between regions. For example, a Service
    Processor and Solaris domain may need to coordinate common error
    telemetry for diagnosis or provide interfaces to effect recovery
    operations.

    N/A

5. Error Handling Strategy

5.1 How are errors handled? Include a description of the immediate
    error reactions taken to capture error state and keep the system
    available without compromising the integrity of the rest of the
    system or user data. In the case of a device driver being hardened,
    describe the recovery/retry behavior, if any.

    See FMA portfolio 2007.022.Intel.

5.2 What new error report (ereport) events will be defined and
    registered with the SMI event registry? Include all FMA Protocol
    ereport specifications. Provide a pointer to your ercheck output.

    The payload of some other ereports is extended to include extra
    error registers.
    http://hyper.sfbay.sun.com/net/hyper/tank/ws/af/intel5400.portfolio/ercheck.html

5.3 If you are *not* using a reference fault manager (fmd(1M)) on your
    system, how are you persisting ereports and communicating them to
    Sun Services?
    N/A

5.4 For more complex system portfolios (like Niagara2), provide a
    comprehensive error handling philosophy document that describes how
    errors are handled by all components involved in error handling
    (including Service Processors, LDOMs, etc.). [As an example, for
    sun4v platforms this may include specs for reset/config, POST,
    hypervisor, Solaris, and service processor software components.]

    N/A

6. Recovery/Reaction

6.1 Are you introducing any new recovery agent(s)? If so, please
    provide a description of the recovery agent(s).

    N/A

6.2 What existing FMA modules will be used in response to your faults?

    [ X ] cpumem-retire

6.3 Are you modifying any existing (Section 6.2) recovery agents? If
    so, please indicate the agents below, with a brief description of
    how they will be modified.

    N/A

6.4 Describe any immediate (e.g. offlining) and long-term (e.g.
    black-listing) recovery.

    N/A

6.5 Provide pointers to dictionary/po entries and knowledge articles.

    http://hyper.sfbay.sun.com/net/hyper/tank/ws/af/intel5400.portfolio/INTEL.dict
    http://hyper.sfbay.sun.com/net/hyper/tank/ws/af/intel5400.portfolio/INTEL.po

7. FRUID Implementation

7.1 Complete this section if you're submitting a portfolio for a
    platform. (Refer to http://webhome.sfbay/FRUID/ for additional
    information on FRU ID requirements and reference material.)

7.1.1 Summarize the platform's level of conformance to the policies
    described in "The Policies and Best Practices for the Recording of
    FMA Status and Event Data in FRUID Storage Devices". [Refer to
    http://fma.eng.sun.com/developer/psh_tech/psh-tech.html for a copy
    of this document.]

7.1.2 Indicate which FRUs listed in Section 3.1 comply with the
    policies & best practices and which FRUs do not.

7.1.3 Provide a link to the document describing the component map for
    each FRU. An example can be found in Appendix C of the FRUID Common
    Dynamic Data Definition Version 1.2.3.
    (Refer to http://fruid.sfbay/externalspecs/fruiddyn1)

7.1.4 Provide a link to the document describing what platform-specific
    event information, if any, will be recorded in the "diagdata" field
    of the Status_EventsR record for each message id.

8. Test

8.1 Provide a pointer to your test plan(s) and specification(s). Make
    sure to list all FMA functionalities that are/are not covered by
    the test plan(s) and specification(s).

    Testing will be carried out using software error injection on an
    Intel Stoakley system and a prototype Sun Venus. There will be
    regression testing on systems with the 5000 series and 7300
    Northbridge.

8.2 Explain the risks associated with the test gaps, if any.

    N/A

9. Gaps

9.1 List any gaps that prevent a full FMA feature set. This includes
    but is not limited to insufficient error detectors, error
    reporting, and software infrastructure.

    N/A

9.2 Provide a risk assessment of the gaps listed in Section 9.1.
    Describe the customer and/or service impact if said gaps are not
    addressed.

    N/A

9.3 List future projects/get-well plans to address the gaps listed in
    Section 9.1. Provide target date and/or release information as to
    when these gaps will be addressed.

    N/A

10. Dependencies

10.1 List all project and other portfolio dependencies to fully
    realize the targeted FMA feature set for this portfolio. A
    portfolio may have dependencies on infrastructure projects. For
    example, the "Sun4u PCI hostbridge" and "PCI-X" projects have a
    dependency on the events/ereports defined within the "PCI Local
    Bus" portfolio.

    N/A

11. References

11.1 Provide pointers to all documents referenced in previous sections
    (for example, list pointers to error handling and diagnosis
    philosophy documents, test plans, etc.)

    http://download.intel.com/design/chipsets/datashts/318610.pdf