1. Introduction

1.1 Portfolio Name

    Panic/FMA integration
    Reference: 2010/015.panic-integration-ph1

1.2 Portfolio Author

    Chris.Beal@oracle.com
    Gavin.Maltby@oracle.com

1.3 Submission Date

    24 May 2010

1.4 Project Team Aliases:

    swfma@sun.com

1.5 Interest List

    N/A

1.6 List of Reviewers

    Reviewer Group Version Date     Comments
    -------- ----- ------- -------- ---------------------

2. Portfolio description

2.1 Are you leveraging any existing portfolio(s), or is this an umbrella
    portfolio that includes other portfolios? Please provide a pointer
    to the location of those portfolio(s).

    We leverage 2010/006.stabilities.ireport.swscheme and
    2010/007.fmd-infrastructure-additions for the publication of
    ireports from a userland utility and for hosting our diagnosis
    activities in software-diagnosis.

2.2 Provide a Customer visible FMA features document

    We model Solaris kernel panics with a corresponding defect diagnosis
    in FMA.

2.3 Provide an Architecture specification/document

    It is important for all systems portfolios to provide an
    architecture spec that describes the various components comprising
    the system's FMA (fault boundaries, telemetry flow, etc.)

    Panics in Solaris Today

    When the Solaris kernel panics, a crash dump is written to the dump
    device (if any). When the dumpadm service starts during the
    subsequent reboot it checks whether savecore is enabled (dumpadm -y)
    and, if it is, runs savecore to extract the dump to the appointed
    savecore directory. With recent changes, this process does not by
    default uncompress the dump. If savecore is not enabled then no
    message is produced - not even one mentioning the dump present on
    the device. If savecore is enabled but encounters an error while
    attempting to extract the dump, at most token messaging is produced.

    Kernel hardware error handlers may also initiate a panic, and
    telemetry describing the terminal event is preserved on the dump
    device.
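The boot-time behavior described above can be summarized with a minimal sketch. This is an illustrative Python model only (function and return-value names are ours); the real logic lives in the dumpadm(1M) service start method and savecore(1M):

```python
# Illustrative model of today's boot-time dump handling; NOT the real
# dumpadm/savecore implementation (names here are assumptions).

def boot_time_dump_handling(dump_present: bool, savecore_enabled: bool) -> str:
    """Return what the dumpadm service start method does today."""
    if not dump_present:
        return "nothing-to-do"
    if not savecore_enabled:
        # dumpadm -n: the dump stays on the device, with no message
        # produced at all
        return "silent-no-extract"
    # dumpadm -y: extract to the savecore directory as a compressed
    # vmdump.N (not inflated by default)
    return "extract-compressed-dump"
```

The key point the sketch illustrates is the silent branch: with savecore disabled, an unconsumed dump sits on the device with no indication to the administrator.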
    What's *Not* Changing

    We will not undo any of the CPU and I/O savings that compressed
    crash dumps have brought; that is, we will not force 'dumpadm -z
    off' into effect, and we will not run savecore to inflate any
    vmdump.N files in the savecore directory.

    This portfolio also does not introduce any analysis tools or attempt
    to analyse or summarize any fresh panic. The CPU and I/O overhead of
    uncompressing a dump and/or running analysis tools on it is too
    great - we don't want to delay the return to production of a
    recently-crashed system. We feel that this sort of analysis/summary
    activity belongs on service-owned datacenter management hubs rather
    than on the production host, and the high-level information we can
    obtain without decompression (panic string, time, panic stack) is
    sufficient for the purposes of automated call logging.

    We will not produce a defect diagnosis for an FMA-initiated panic -
    it is for other diagnosis software to deduce the fault, and the
    panic is simply a symptom.

    Changes with this portfolio

    a) At each boot a Solaris instance will generate a UUID for that
       instance of the operating system and embed this within the
       running kernel. This UUID serves to identify a given boot of the
       operating system and any crash dump arising therefrom. A
       separate ARC case will cover the introduction of a Solaris image
       UUID. If this Solaris instance should panic then we will use the
       instance UUID as the case UUID that we solve in diagnosing a
       defect.

    b) The dumpadm service start method will always run savecore, even
       if dumpadm -n is in effect. If an unconsumed crash dump is
       present on the dump device then, provided this is not an
       FMA-initiated panic, savecore raises an event to indicate that a
       fresh crash is pending on the device. If dumpadm -n is in effect
       then savecore indicates this in the event and then exits without
       extracting the crash dump. The event raised includes the panic
       time, string, stack, etc., all extracted from a (modified) dump
       header.
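The event-raising decision in (b) can be sketched as follows. This is an illustrative Python model, not the savecore implementation; the helper and member names are assumptions, though the ireport class is the one this portfolio registers:

```python
# Sketch of the event savecore publishes at boot under (b); names like
# savecore_boot_event and "will-attempt-savecore" are assumptions.

def savecore_boot_event(dump_pending, fm_panic, savecore_enabled, hdr):
    """Return the ireport to publish for a fresh dump, or None.

    hdr carries the details extracted from the (modified) dump header:
    panic time, panic string, panic stack, OS instance UUID.
    """
    if not dump_pending:
        return None  # no unconsumed dump on the device
    if fm_panic:
        # FMA-initiated panic: fmd will diagnose the underlying fault
        # from replayed ereports; a second diagnosis is unwanted.
        return None
    ev = {
        "class": "ireport.os.sunos.panic.dump_pending_on_device",
        "will-attempt-savecore": savecore_enabled,
    }
    ev.update(hdr)  # panic time/string/stack from the dump header
    return ev
```

Note that the event is raised whether or not extraction will follow; a disabled savecore (dumpadm -n) is simply recorded in the event.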
       FMA-initiated panics do not result in an event - these will be
       diagnosed by fmd, and we do not want two diagnoses from a single
       panic. Note that fmd does not require that a dump be extracted
       from the dump device in order to replay ereport telemetry
       carried on the dump device, so dumpadm -n does not affect that
       functionality.

    c) If savecore is enabled (dumpadm -y) then savecore attempts to
       extract the dump. On success it raises an event to indicate that
       the dump is available; on failure it raises an event to record
       that failure.

    d) When our diagnosis software receives the "dump pending" event,
       it opens a new fmd case with fmd_case_open_uuid (to match the
       dump UUID). If the event indicates that savecore is not enabled
       (dumpadm -n) then we go ahead and solve the case now. The defect
       we diagnose can still record the panic time, UUID, panic string
       and panic stack as extracted from the dump header and included
       in the "pending" event. If savecore is later run manually to
       extract the dump, we will not update our diagnosis. While there
       are perhaps optimizations that could be made around this
       dumpadm -n case, it is our opinion that enterprise installations
       will not run with dumpadm -n, and incremental improvements
       around that case are not worthwhile.

       If savecore is enabled then we expect one of the "success" or
       "failed" events. We arm a timer to guard against the case that
       neither arrives - if that timer trips we will solve the case
       with what we have. If "success" is received then we know that
       the dump is available in the savecore directory, and the event
       tells us the pathname. We solve the case, appending this
       information to the defect. If "failure" is received we solve the
       case anyway, but we cannot point to a path for the extracted
       dump.

    e) When we solve the case as a defect diagnosis it appears to
       observers just as any other fmd problem diagnosis - e.g., it
       appears in fmadm faulty, will be rendered by syslog-msgs, and
       can be notified by email or SNMP.
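The case lifecycle in (d) can be sketched as a small state machine. This is an illustrative Python model, not the fmd module API; the class and method names are ours. Opening is keyed by the dump UUID, as with fmd_case_open_uuid, so a repeated event for the same panic cannot create a second case:

```python
# Sketch of the software-diagnosis case lifecycle for (d); names are
# assumptions, modelling (not calling) the fmd module API.

class PanicDiagnosis:
    def __init__(self):
        self.cases = {}  # uuid -> "open" or a solved state

    def dump_pending(self, uuid, will_extract):
        """Handle ireport.os.sunos.panic.dump_pending_on_device."""
        if uuid in self.cases:
            return False  # only one open per UUID succeeds (dedup)
        if will_extract:
            # savecore enabled: arm the guard timer, await success/failed
            self.cases[uuid] = "open"
        else:
            # dumpadm -n: solve now with header-derived details only
            self.cases[uuid] = "solved-no-extracted-dump"
        return True

    def dump_available(self, uuid, path):
        """'success' event: solve, recording the extracted dump path."""
        if self.cases.get(uuid) == "open":
            self.cases[uuid] = "solved-dump-at:" + path

    def dump_failed(self, uuid):
        """'failed' event: solve anyway, with no dump path to record."""
        if self.cases.get(uuid) == "open":
            self.cases[uuid] = "solved-no-extracted-dump"

    def guard_timer_fired(self, uuid):
        # Neither success nor failed arrived: solve with what we have.
        self.dump_failed(uuid)
```

The dedup check in dump_pending also illustrates the point made in section 5.8: because the case UUID is the panic image's UUID, a repeated savecore run can only open and solve a case once.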
       The fault-specific data in the fault-list array entry includes
       pointers to the dump, etc.

3. Fault Boundary Analysis

3.1 For systems, subsystems, components or services that make up this
    portfolio, list all resources that will be diagnosed and all the
    ASRUs and FRUs associated with each diagnosis in which the resource
    may be a suspect.

    An example shows the resource and ASRU we use. On reboot after
    panic we see the following on the console:

    SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
    EVENT-TIME: Tue May 18 19:08:25 PDT 2010
    PLATFORM: Sun-Fire-V40z, CSN: XG051535088, HOSTNAME: parity
    SOURCE: software-diagnosis, REV: 0.1
    EVENT-ID: 5099bb12-ab7b-6601-f07a-c51c1484f6cb
    DESC: The system has rebooted after a kernel panic.
      Refer to http://sun.com/msg/SUNOS-8000-KL for more information.
    AUTO-RESPONSE: The failed system image was dumped to the dump
      device. If savecore is enabled (see dumpadm(1M)) a copy of the
      dump will be written to the savecore directory
      /var/crash/opensolaris.
    IMPACT: There may be some performance impact while the panic is
      copied to the savecore directory. Disk space usage by panics can
      be substantial.
    REC-ACTION: Please log a call with your support vendor and provide
      them with this information. If savecore is not enabled then
      please take steps to preserve the crash image.

    # fmadm faulty
    --------------- ------------------------------------ -------------- ---------
    TIME            EVENT-ID                             MSG-ID         SEVERITY
    --------------- ------------------------------------ -------------- ---------
    May 18 19:08:25 5099bb12-ab7b-6601-f07a-c51c1484f6cb SUNOS-8000-KL  Major

    Host        : parity
    Platform    : Sun-Fire-V40z     Chassis_id : XG051535088
    Product_sn  :

    Fault class : defect.sunos.kernel.panic
    Problem in  : sw:///:path=/var/crash/opensolaris/.5099bb12-ab7b-6601-f07a-c51c1484f6cb
                  faulted and taken out of service

    Description : The system has rebooted after a kernel panic. Refer
                  to http://sun.com/msg/SUNOS-8000-KL for more
                  information.
    Response    : The failed system image was dumped to the dump
                  device. If savecore is enabled (see dumpadm(1M)) a
                  copy of the dump will be written to the savecore
                  directory /var/crash/opensolaris.

    Impact      : There may be some performance impact while the panic
                  is copied to the savecore directory. Disk space usage
                  by panics can be substantial.

    Action      : Please log a call with your support vendor and
                  provide them with this information. If savecore is
                  not enabled then please take steps to preserve the
                  crash image.

    # fmdump -Vp -u 5099bb12-ab7b-6601-f07a-c51c1484f6cb
    TIME                           UUID                                 SUNW-MSG-ID
    May 18 2010 19:08:25.931755000 5099bb12-ab7b-6601-f07a-c51c1484f6cb SUNOS-8000-KL

      TIME                 CLASS                                          ENA
      May 18 19:08:25.8732 ireport.os.sunos.panic.dump_available          0x0000000000000000
      May 18 19:07:37.1681 ireport.os.sunos.panic.dump_pending_on_device  0x0000000000000000

    nvlist version: 0
            version = 0x0
            class = list.suspect
            uuid = 5099bb12-ab7b-6601-f07a-c51c1484f6cb
            code = SUNOS-8000-KL
            diag-time = 1274234905 891131
            de = fmd:///module/software-diagnosis
            fault-list-sz = 0x1
            fault-list = (array of embedded nvlists)
            (start fault-list[0])
            nvlist version: 0
                    version = 0x0
                    class = defect.sunos.kernel.panic
                    certainty = 0x64
                    resource = sw:///:path=/var/crash/opensolaris/.5099bb12-ab7b-6601-f07a-c51c1484f6cb
                    savecore-succcess = 1
                    os-instance-uuid = 5099bb12-ab7b-6601-f07a-c51c1484f6cb
                    savedir = /var/crash/opensolaris
                    instance = 1
                    panicstr = forced crash dump initiated at user request
                    panicstack = genunix:kadmin+16e () | genunix:uadmin+10f () | unix:brand_sys_syscall32+272 () |
                    fm-panic = 0
                    crashtime = 1274234794
                    compressed = 1
            (end fault-list[0])
            fault-status = 0x1
            severity = Major
            __ttl = 0x1
            __tod = 0x4bf34819 0x378973f8

    Note that the path used in the resource "sw" scheme FMRI does not
    exist in the filesystem - we continue to save dumps at
    /{unix/vmcore,vmdump}.N - but we need a unique resource FMRI to
    enter into the fmd resource cache. We choose /.
    based upon the mooted suggestion of one day gathering aspects of a
    single dump (image, analysis file, etc.) in a single subdirectory
    for that dump.

    The fault-list entry contains a number of fault-specific items.
    These are aimed at software that may subscribe to the diagnosis.

3.2 Diagrams or a description of the faults that may be present.

    o Provide a pointer to the latest version of your fault tree which
      lists each fault and how errors are propagated both internally
      within the subsystem and between subsystems.

        ireport.os.sunos.panic.dump_pending_on_device
                        |
                        |
                        V
        defect.sunos.kernel.panic

      This diagnosis is made in all cases, as described in 2.3.

    o Provide a pointer to a summary of your changes to the Event
      Registry, as summarized by the "ercheck" tool. The HTML report
      (summary.html) from ercheck is the preferred format:

      See ercheck.html, and webrevs and erapache per email.

4. Diagnosis Strategy

4.1 How are faults diagnosed? Provide a diagnosis philosophy document
    or a pointer to a portfolio that describes the algorithm(s) used to
    diagnose the faults described in Section 3.2 and the reasons for
    using said strateg(y/ies).

    See 2.3.

4.2 What new fault events will be defined and registered with the SMI
    event registry? Include all FMA Protocol ereport specifications,
    and provide a pointer to your ercheck output.

    See ercheck.html, and webrevs and erapache per email.

4.3 If your fault management activity (error handling, diagnosis or
    recovery) spans multiple fault manager regions, explain how each
    activity is coordinated between regions.

    N/A

5. Error Handling & Reporting Strategy

5.1 How are errors detected and handled? Include a description of the
    immediate error reactions taken to capture error state and keep
    the system available without compromising the integrity of the
    rest of the system or user data. In the case of a device driver
    being hardened, describe the recovery/retry behavior, if any.
    Describe how storms of errors are mitigated.

    N/A

5.2 How are ereports generated?
    Describe the software components involved in generating ereports.
    Describe any filtering, hysteresis, etc. that occurs prior to
    ereport generation.

    We use the userland event publication mechanism of
    2010/007.fmd-infrastructure-additions.

5.3 What new error report (ereport) events will be defined and
    registered with the SMI event registry? Include all FMA Protocol
    ereport specifications, and provide a pointer to your ercheck
    output.

    See ercheck.html, and webrevs and erapache per email.

5.4 If you are *not* using a reference fault manager (fmd(1M)) on your
    system, how are you persisting ereports and communicating them to
    Sun Services?

    N/A

5.5 For more complex system portfolios provide a comprehensive error
    handling philosophy document that describes how errors are handled
    by all components involved in error handling.

    N/A

5.6 If this portfolio includes new errors, but leverages existing
    ereports, provide a description of how the errors map to the
    ereports.

    N/A

5.7 Describe the error telemetry flow between all software components.
    A block diagram would be a good method to do this.

    N/A

5.8 How can error telemetry (ereports) be disabled? Describe any
    configuration file settings, environment variables, etc. that can
    be used to turn off ereport generation.

    When savecore extracts a dump it marks it as consumed, so in the
    normal course of events we will not raise repeated events from a
    single panic. Since we use the UUID of the panic image, only one
    fmd_case_open_uuid will ever succeed anyway. So if someone repeats
    the savecore run (e.g., savecore -d to disregard the header and
    save anyway) we'll only open and solve a case once. If a system
    panics repeatedly, you'll have one case solved per unique panic.

    A special case is that of a booted Solaris instance that is
    snapshotted in some virtualization environment (e.g., VirtualBox)
    and that instance resumed repeatedly.
    Such a setup can break the usual rule that a given OS instance UUID
    can only panic once - we could see different panic signatures all
    with the same UUID (a resume before each, of course). At each panic
    the case UUID will be new (since fmd state is part of the OS
    snapshot) and we can still solve a case - but an external observer
    may see repeated panics for the same UUID.

6. Recovery/Reaction

6.1 Are you introducing or modifying any response agent(s)? If so,
    provide a description of the agent(s).

6.2 What existing fma modules will be used in response to your faults?

    None

6.3 Are you modifying any existing (Section 6.2) response agents? If
    so, indicate the agents below, with a brief description of how
    they will be modified.

    None

6.4 Describe any immediate (e.g. offlining) and long-term (e.g.
    black-listing) retiring/disabling of components.

    N/A

6.5 Provide pointers to dictionary/po entries and knowledge articles.

    See ercheck.html, and webrevs and erapache per email.

7. Event Transport Mechanisms

7.1 Are you introducing or modifying any event transport mechanisms?
    If so please provide a description of the transport mechanism(s).

7.2 What events are transported? Provide a list of ereport, fault, and
    list.* events which are transported if this is not already covered
    above in sections 4 and 5.

    This transport does not have any subscriptions.

8. FRUID Implementation

8.1 Complete this section if you're submitting a portfolio for a
    platform.

    N/A. This is not a platform portfolio.

9. Test

9.1 Provide a pointer to your test plan(s) and specification(s). Make
    sure to list all FMA functionalities that are/are not covered by
    the test plan(s) and specification(s).

    http://ontestreview.central.sun.com/wiki/index.php/CoreFMA
    http://ontestreview.central.sun.com/wiki/index.php/CoreFMA_Assertion

9.2 Explain the risks associated with the test gaps, if any.

    See http://infoshare.sfbay/twiki/bin/view/Fma/SwFmaTestGaps

10. Gaps

10.1 List any gaps that prevent a full FMA feature set.
    This includes but is not limited to insufficient error detectors,
    error reporting, and software infrastructure.

    #
    # Note:
    #
    # For each gap listed:
    #   - Provide a description of the gap.
    #   - Provide a reason/justification for the gap.
    #   - Provide a risk assessment of the gap.
    #     Describe the customer and/or service impact if the
    #     gap is not addressed.
    #   - List future projects/get-well plans to address the gap.
    #     Provide bugids, target date, and/or release information
    #     as to when these gaps will be addressed.
    #

    None at this time.

11. Dependencies

11.1 List all project and other portfolio dependencies to fully
     realize the targeted FMA feature set for this portfolio. A
     portfolio may have dependencies on infrastructure projects.

    #
    # For example,
    # The "Sun4u PCI hostbridge" and "PCI-X" projects have
    # a dependency on the events/ereports defined within the
    # "PCI Local Bus" portfolio.
    #

    None.

12. References

12.1 Provide pointers to all documents referenced in previous sections.

    #
    # Note:
    #
    # - Documents referenced here should include things like:
    #   architecture spec, 1-pagers, functional spec,
    #   error handling specs, diagnosis philosophy docs,
    #   and test plans/specs.
    # - All documents that directly relate to this portfolio
    #   should eventually be archived in the repository, and
    #   have a reference here.
    # - The document pointer may initially point to someplace
    #   other than the repository, but when/if the document is
    #   archived in the repository the pointer should change.
    # - References to these documents in the sections above
    #   should include a document identifier and/or name.
    #
    # An example of a document reference in this section
    # might look like this:
    # [1] Error Handling Philosophy Spec
    #     http://...
    #
    # An example of a reference to the document in a section
    # above may look like this:
    # For a description of error handling see [1].
    #