1. Introduction

1.1 Portfolio Name

    Panic/FMA integration
    Reference: 2010/015.panic-integration-ph1

1.2 Portfolio Author

    Chris.Beal@oracle.com
    Gavin.Maltby@oracle.com

1.3 Submission Date

    24 May 2010

1.4 Project Team Aliases:

    swfma@sun.com

1.5 Interest List

    N/A

1.6 List of Reviewers

    Reviewer Group Version Date     Comments
    -------- ----- ------- -------- ---------------------

2. Portfolio description

2.1 Are you leveraging any existing portfolio(s), or is this an umbrella
    portfolio that includes other portfolios? Please provide a pointer
    to the location of those portfolio(s).

    We leverage 2010/006.stabilities.ireport.swscheme and
    2010/007.fmd-infrastructure-additions for the publication of
    ireports from a userland utility and for hosting our diagnosis
    activities in software-diagnosis.

2.2 Provide a Customer visible FMA features document

    We model Solaris kernel panics with a corresponding defect diagnosis
    in FMA.

2.3 Provide an Architecture specification/document

    It is important for all systems portfolios to provide an
    architecture spec that describes the various components comprising
    the system's FMA (fault boundaries, telemetry flow, etc.)

    Panics in Solaris Today

    When the Solaris kernel panics, a crash dump is written to the dump
    device (if any). When the dumpadm service starts during the
    subsequent reboot it checks whether savecore is enabled (dumpadm -y)
    and, if it is, runs savecore to extract the dump to the appointed
    savecore directory. With recent changes, this process does not by
    default uncompress the dump. If savecore is not enabled then no
    message is produced - not even one mentioning the dump present on
    the device. If savecore is enabled but encounters an error while
    attempting to extract the dump, at most token messaging is produced.

    Kernel hardware error handlers may also initiate a panic, and
    telemetry describing the terminal event is preserved on the dump
    device.
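The boot-time behavior described above can be summarized with a minimal sketch. This is an illustrative Python model only (function and return-value names are ours); the real logic lives in the dumpadm(1M) service start method and savecore(1M):

```python
# Illustrative model of today's boot-time dump handling; NOT the real
# dumpadm/savecore implementation (names here are assumptions).

def boot_time_dump_handling(dump_present: bool, savecore_enabled: bool) -> str:
    """Return what the dumpadm service start method does today."""
    if not dump_present:
        return "nothing-to-do"
    if not savecore_enabled:
        # dumpadm -n: the dump stays on the device, with no message
        # produced at all
        return "silent-no-extract"
    # dumpadm -y: extract to the savecore directory as a compressed
    # vmdump.N (not inflated by default)
    return "extract-compressed-dump"
```

The key point the sketch illustrates is the silent branch: with savecore disabled, an unconsumed dump sits on the device with no indication to the administrator.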
    What's *Not* Changing

    We will not undo any of the CPU and I/O savings that compressed
    crash dumps have brought; that is, we will not force 'dumpadm -z
    off' into effect, and we will not run savecore to inflate any
    vmdump.N files in the savecore directory.

    This portfolio also does not introduce any analysis tools or attempt
    to analyse or summarize any fresh panic. The CPU and I/O overhead of
    uncompressing a dump and/or running analysis tools on it is too
    great - we don't want to delay the return to production of a
    recently-crashed system. We feel that this sort of analysis/summary
    activity belongs on service-owned datacenter management hubs rather
    than on the production host, and the high-level information we can
    obtain without decompression (panic string, time, panic stack) is
    sufficient for the purposes of automated call logging.

    We will not produce a defect diagnosis for an FMA-initiated panic -
    it is for other diagnosis software to deduce the fault, and the
    panic is simply a symptom.

    Changes with this portfolio

    a) At each boot a Solaris instance will generate a UUID for that
       instance of the operating system and embed this within the
       running kernel. This UUID serves to identify a given boot of the
       operating system and any crash dump arising therefrom. A
       separate ARC case will cover the introduction of a Solaris image
       UUID. If this Solaris instance should panic then we will use the
       instance UUID as the case UUID that we solve in diagnosing a
       defect.

    b) The dumpadm service start method will always run savecore, even
       if dumpadm -n is in effect. If an unconsumed crash dump is
       present on the dump device then, provided this is not an
       FMA-initiated panic, savecore raises an event to indicate that a
       fresh crash is pending on the device. If dumpadm -n is in effect
       then savecore indicates this in the event and then exits without
       extracting the crash dump. The event raised includes the panic
       time, string, stack, etc., all extracted from a (modified) dump
       header.
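The event-raising decision in (b) can be sketched as follows. This is an illustrative Python model, not the savecore implementation; the helper and member names are assumptions, though the ireport class is the one this portfolio registers:

```python
# Sketch of the event savecore publishes at boot under (b); names like
# savecore_boot_event and "will-attempt-savecore" are assumptions.

def savecore_boot_event(dump_pending, fm_panic, savecore_enabled, hdr):
    """Return the ireport to publish for a fresh dump, or None.

    hdr carries the details extracted from the (modified) dump header:
    panic time, panic string, panic stack, OS instance UUID.
    """
    if not dump_pending:
        return None  # no unconsumed dump on the device
    if fm_panic:
        # FMA-initiated panic: fmd will diagnose the underlying fault
        # from replayed ereports; a second diagnosis is unwanted.
        return None
    ev = {
        "class": "ireport.os.sunos.panic.dump_pending_on_device",
        "will-attempt-savecore": savecore_enabled,
    }
    ev.update(hdr)  # panic time/string/stack from the dump header
    return ev
```

Note that the event is raised whether or not extraction will follow; a disabled savecore (dumpadm -n) is simply recorded in the event.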
       FMA-initiated panics do not result in an event - these will be
       diagnosed by fmd, and we do not want two diagnoses from a single
       panic. Note that fmd does not require that a dump be extracted
       from the dump device in order to replay ereport telemetry
       carried on the dump device, so dumpadm -n does not affect that
       functionality.

    c) If savecore is enabled (dumpadm -y) then savecore attempts to
       extract the dump. On success it raises an event to indicate that
       the dump is available; on failure it raises an event to record
       that failure.

    d) When our diagnosis software receives the "dump pending" event,
       it opens a new fmd case with fmd_case_open_uuid (to match the
       dump UUID). If the event indicates that savecore is not enabled
       (dumpadm -n) then we go ahead and solve the case now. The defect
       we diagnose can still record the panic time, UUID, panic string
       and panic stack as extracted from the dump header and included
       in the "pending" event. If savecore is later run manually to
       extract the dump, we will not update our diagnosis. While there
       are perhaps optimizations that could be made around this
       dumpadm -n case, it is our opinion that enterprise installations
       will not run with dumpadm -n, and incremental improvements
       around that case are not worthwhile.

       If savecore is enabled then we expect one of the "success" or
       "failed" events. We arm a timer to guard against the case that
       neither arrives - if that timer trips we will solve the case
       with what we have. If "success" is received then we know that
       the dump is available in the savecore directory, and the event
       tells us the pathname. We solve the case, appending this
       information to the defect. If "failure" is received we solve the
       case anyway, but we cannot point to a path for the extracted
       dump.

    e) When we solve the case as a defect diagnosis it appears to
       observers just as any other fmd problem diagnosis - e.g., it
       appears in fmadm faulty, will be rendered by syslog-msgs, and
       can be notified by email or SNMP.
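The case lifecycle in (d) can be sketched as a small state machine. This is an illustrative Python model, not the fmd module API; the class and method names are ours. Opening is keyed by the dump UUID, as with fmd_case_open_uuid, so a repeated event for the same panic cannot create a second case:

```python
# Sketch of the software-diagnosis case lifecycle for (d); names are
# assumptions, modelling (not calling) the fmd module API.

class PanicDiagnosis:
    def __init__(self):
        self.cases = {}  # uuid -> "open" or a solved state

    def dump_pending(self, uuid, will_extract):
        """Handle ireport.os.sunos.panic.dump_pending_on_device."""
        if uuid in self.cases:
            return False  # only one open per UUID succeeds (dedup)
        if will_extract:
            # savecore enabled: arm the guard timer, await success/failed
            self.cases[uuid] = "open"
        else:
            # dumpadm -n: solve now with header-derived details only
            self.cases[uuid] = "solved-no-extracted-dump"
        return True

    def dump_available(self, uuid, path):
        """'success' event: solve, recording the extracted dump path."""
        if self.cases.get(uuid) == "open":
            self.cases[uuid] = "solved-dump-at:" + path

    def dump_failed(self, uuid):
        """'failed' event: solve anyway, with no dump path to record."""
        if self.cases.get(uuid) == "open":
            self.cases[uuid] = "solved-no-extracted-dump"

    def guard_timer_fired(self, uuid):
        # Neither success nor failed arrived: solve with what we have.
        self.dump_failed(uuid)
```

The dedup check in dump_pending also illustrates the point made in section 5.8: because the case UUID is the panic image's UUID, a repeated savecore run can only open and solve a case once.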
       The fault-specific data in the fault-list array entry includes
       pointers to the dump, etc.

3. Fault Boundary Analysis

3.1 For systems, subsystems, components or services that make up this
    portfolio, list all resources that will be diagnosed and all the
    ASRUs and FRUs associated with each diagnosis in which the resource
    may be a suspect.

    An example shows the resource and ASRU we use. On reboot after
    panic we see the following on the console:

    SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
    EVENT-TIME: Tue May 18 19:08:25 PDT 2010
    PLATFORM: Sun-Fire-V40z, CSN: XG051535088, HOSTNAME: parity
    SOURCE: software-diagnosis, REV: 0.1
    EVENT-ID: 5099bb12-ab7b-6601-f07a-c51c1484f6cb
    DESC: The system has rebooted after a kernel panic.
      Refer to http://sun.com/msg/SUNOS-8000-KL for more information.
    AUTO-RESPONSE: The failed system image was dumped to the dump
      device. If savecore is enabled (see dumpadm(1M)) a copy of the
      dump will be written to the savecore directory
      /var/crash/opensolaris.
    IMPACT: There may be some performance impact while the panic is
      copied to the savecore directory. Disk space usage by panics can
      be substantial.
    REC-ACTION: Please log a call with your support vendor and provide
      them with this information. If savecore is not enabled then
      please take steps to preserve the crash image.

    # fmadm faulty
    --------------- ------------------------------------ -------------- ---------
    TIME            EVENT-ID                             MSG-ID         SEVERITY
    --------------- ------------------------------------ -------------- ---------
    May 18 19:08:25 5099bb12-ab7b-6601-f07a-c51c1484f6cb SUNOS-8000-KL  Major

    Host        : parity
    Platform    : Sun-Fire-V40z     Chassis_id : XG051535088
    Product_sn  :

    Fault class : defect.sunos.kernel.panic
    Problem in  : sw:///:path=/var/crash/opensolaris/.5099bb12-ab7b-6601-f07a-c51c1484f6cb
                  faulted and taken out of service

    Description : The system has rebooted after a kernel panic. Refer
                  to http://sun.com/msg/SUNOS-8000-KL for more
                  information.
    Response    : The failed system image was dumped to the dump
                  device. If savecore is enabled (see dumpadm(1M)) a
                  copy of the dump will be written to the savecore
                  directory /var/crash/opensolaris.

    Impact      : There may be some performance impact while the panic
                  is copied to the savecore directory. Disk space usage
                  by panics can be substantial.

    Action      : Please log a call with your support vendor and
                  provide them with this information. If savecore is
                  not enabled then please take steps to preserve the
                  crash image.

    # fmdump -Vp -u 5099bb12-ab7b-6601-f07a-c51c1484f6cb
    TIME                           UUID                                 SUNW-MSG-ID
    May 18 2010 19:08:25.931755000 5099bb12-ab7b-6601-f07a-c51c1484f6cb SUNOS-8000-KL

      TIME                 CLASS                                          ENA
      May 18 19:08:25.8732 ireport.os.sunos.panic.dump_available          0x0000000000000000
      May 18 19:07:37.1681 ireport.os.sunos.panic.dump_pending_on_device  0x0000000000000000

    nvlist version: 0
            version = 0x0
            class = list.suspect
            uuid = 5099bb12-ab7b-6601-f07a-c51c1484f6cb
            code = SUNOS-8000-KL
            diag-time = 1274234905 891131
            de = fmd:///module/software-diagnosis
            fault-list-sz = 0x1
            fault-list = (array of embedded nvlists)
            (start fault-list[0])
            nvlist version: 0
                    version = 0x0
                    class = defect.sunos.kernel.panic
                    certainty = 0x64
                    resource = sw:///:path=/var/crash/opensolaris/.5099bb12-ab7b-6601-f07a-c51c1484f6cb
                    savecore-succcess = 1
                    os-instance-uuid = 5099bb12-ab7b-6601-f07a-c51c1484f6cb
                    savedir = /var/crash/opensolaris
                    instance = 1
                    panicstr = forced crash dump initiated at user request
                    panicstack = genunix:kadmin+16e () | genunix:uadmin+10f () | unix:brand_sys_syscall32+272 () |
                    fm-panic = 0
                    crashtime = 1274234794
                    compressed = 1
            (end fault-list[0])
            fault-status = 0x1
            severity = Major
            __ttl = 0x1
            __tod = 0x4bf34819 0x378973f8

    Note that the path used in the resource "sw" scheme FMRI does not
    exist in the filesystem - we continue to save dumps at
    /{unix/vmcore,vmdump}.N - but we need a unique resource FMRI to
    enter into the fmd resource cache. We choose /.
    based upon the mooted suggestion of one day gathering aspects of a
    single dump (image, analysis file, etc.) in a single subdirectory
    for that dump.

    The fault-list entry contains a number of fault-specific items.
    These are aimed at software that may subscribe to the diagnosis.

3.2 Diagrams or a description of the faults that may be present.

    o Provide a pointer to the latest version of your fault tree which
      lists each fault and how errors are propagated both internally
      within the subsystem and between subsystems.

        ireport.os.sunos.panic.dump_pending_on_device
                        |
                        |
                        V
        defect.sunos.kernel.panic

      This diagnosis is made in all cases, as described in 2.3.

    o Provide a pointer to a summary of your changes to the Event
      Registry, as summarized by the "ercheck" tool. The HTML report
      (summary.html) from ercheck is the preferred format:

      See ercheck.html, and webrevs and erapache per email.

4. Diagnosis Strategy

4.1 How are faults diagnosed? Provide a diagnosis philosophy document
    or a pointer to a portfolio that describes the algorithm(s) used to
    diagnose the faults described in Section 3.2 and the reasons for
    using said strateg(y/ies).

    See 2.3.

4.2 What new fault events will be defined and registered with the SMI
    event registry? Include all FMA Protocol ereport specifications,
    and provide a pointer to your ercheck output.

    See ercheck.html, and webrevs and erapache per email.

4.3 If your fault management activity (error handling, diagnosis or
    recovery) spans multiple fault manager regions, explain how each
    activity is coordinated between regions.

    N/A

5. Error Handling & Reporting Strategy

5.1 How are errors detected and handled? Include a description of the
    immediate error reactions taken to capture error state and keep
    the system available without compromising the integrity of the
    rest of the system or user data. In the case of a device driver
    being hardened, describe the recovery/retry behavior, if any.
    Describe how storms of errors are mitigated.

    N/A

5.2 How are ereports generated?
    Describe the software components involved in generating ereports.
    Describe any filtering, hysteresis, etc. that occurs prior to
    ereport generation.

    We use the userland event publication mechanism of
    2010/007.fmd-infrastructure-additions.

5.3 What new error report (ereport) events will be defined and
    registered with the SMI event registry? Include all FMA Protocol
    ereport specifications, and provide a pointer to your ercheck
    output.

    See ercheck.html, and webrevs and erapache per email.

5.4 If you are *not* using a reference fault manager (fmd(1M)) on your
    system, how are you persisting ereports and communicating them to
    Sun Services?

    N/A

5.5 For more complex system portfolios provide a comprehensive error
    handling philosophy document that describes how errors are handled
    by all components involved in error handling.

    N/A

5.6 If this portfolio includes new errors, but leverages existing
    ereports, provide a description of how the errors map to the
    ereports.

    N/A

5.7 Describe the error telemetry flow between all software components.
    A block diagram would be a good method to do this.

    N/A

5.8 How can error telemetry (ereports) be disabled? Describe any
    configuration file settings, environment variables, etc. that can
    be used to turn off ereport generation.

    When savecore extracts a dump it marks it as consumed, so in the
    normal course of events we will not raise repeated events from a
    single panic. Since we use the UUID of the panic image, only one
    fmd_case_open_uuid will ever succeed anyway. So if someone repeats
    the savecore run (e.g., savecore -d to disregard the header and
    save anyway) we'll only open and solve a case once. If a system
    panics repeatedly, you'll have one case solved per unique panic.

    A special case is that of a booted Solaris instance that is
    snapshotted in some virtualization environment (e.g., VirtualBox)
    and that instance resumed repeatedly.
    Such a setup can break the usual rule that a given OS instance UUID
    can only panic once - we could see different panic signatures all
    with the same UUID (a resume before each, of course). At each panic
    the case UUID will be new (since fmd state is part of the OS
    snapshot) and we can still solve a case - but an external observer
    may see repeated panics for the same UUID.

6. Recovery/Reaction

6.1 Are you introducing or modifying any response agent(s)? If so,
    provide a description of the agent(s).

6.2 What existing fma modules will be used in response to your faults?

    None

6.3 Are you modifying any existing (Section 6.2) response agents? If
    so, indicate the agents below, with a brief description of how
    they will be modified.

    None

6.4 Describe any immediate (e.g. offlining) and long-term (e.g.
    black-listing) retiring/disabling of components.

    N/A

6.5 Provide pointers to dictionary/po entries and knowledge articles.

    See ercheck.html, and webrevs and erapache per email.

7. Event Transport Mechanisms

7.1 Are you introducing or modifying any event transport mechanisms?
    If so please provide a description of the transport mechanism(s).

7.2 What events are transported? Provide a list of ereport, fault, and
    list.* events which are transported if this is not already covered
    above in sections 4 and 5.

    This transport does not have any subscriptions.

8. FRUID Implementation

8.1 Complete this section if you're submitting a portfolio for a
    platform.

    N/A. This is not a platform portfolio.

9. Test

9.1 Provide a pointer to your test plan(s) and specification(s). Make
    sure to list all FMA functionalities that are/are not covered by
    the test plan(s) and specification(s).

    http://ontestreview.central.sun.com/wiki/index.php/CoreFMA
    http://ontestreview.central.sun.com/wiki/index.php/CoreFMA_Assertion

9.2 Explain the risks associated with the test gaps, if any.

    See http://infoshare.sfbay/twiki/bin/view/Fma/SwFmaTestGaps

10. Gaps

10.1 List any gaps that prevent a full FMA feature set.
    This includes but is not limited to insufficient error detectors,
    error reporting, and software infrastructure.

    #
    # Note:
    #
    # For each gap listed:
    #   - Provide a description of the gap.
    #   - Provide a reason/justification for the gap.
    #   - Provide a risk assessment of the gap.
    #     Describe the customer and/or service impact if the
    #     gap is not addressed.
    #   - List future projects/get-well plans to address the gap.
    #     Provide bugids, target date, and/or release information
    #     as to when these gaps will be addressed.
    #

    None at this time.

11. Dependencies

11.1 List all project and other portfolio dependencies to fully
     realize the targeted FMA feature set for this portfolio. A
     portfolio may have dependencies on infrastructure projects.

    #
    # For example,
    # The "Sun4u PCI hostbridge" and "PCI-X" projects have
    # a dependency on the events/ereports defined within the
    # "PCI Local Bus" portfolio.
    #

    None.

12. References

12.1 Provide pointers to all documents referenced in previous sections.

    #
    # Note:
    #
    # - Documents referenced here should include things like:
    #   architecture spec, 1-pagers, functional spec,
    #   error handling specs, diagnosis philosophy docs,
    #   and test plans/specs.
    # - All documents that directly relate to this portfolio
    #   should eventually be archived in the repository, and
    #   have a reference here.
    # - The document pointer may initially point to someplace
    #   other than the repository, but when/if the document is
    #   archived in the repository the pointer should change.
    # - References to these documents in the sections above
    #   should include a document identifier and/or name.
    #
    # An example of a document reference in this section
    # might look like this:
    # [1] Error Handling Philosophy Spec
    #     http://...
    #
    # An example of a reference to the document in a section
    # above may look like this:
    # For a description of error handling see [1].
    #