1. Introduction

	1.1 Portfolio Name

		SMF/FMA integration Phase 1

		Reference: 2010/014.smf-integration-ph1
 
	1.2 Portfolio Author

		Antonello.Cruz@oracle.com
		Gavin.Maltby@oracle.com

	1.3 Submission Date 

		18 May 2010

	1.4 Project Team Aliases: 

		swfma@sun.com

	1.5 Interest List

		N/A
 
	1.6 List of Reviewers

		Reviewer	Group	Version	Date     Comments
		--------	-----	------- -------- ---------------------

2. Portfolio description

	New ireport leaf events are defined to describe all SMF instance
	state transitions.
 
	Transitions to SMF maintenance state are modelled in FMA as a defect
	diagnosis, with associated case lifecycle events and management.

	The smtp and snmp notification agents of portfolios 2009/027 and
	2009/028 are already configured to understand our SMF state
	transition events and to be able to raise customized notifications
	of such events as dictated by notification preferences
	expressed within SMF.

	2.1 Are you leveraging any existing portfolio(s), or is this an
	    umbrella portfolio that includes other portfolios?
	    Please provide a pointer to the location of those portfolio(s).

	This portfolio imports its sibling portfolio
	2010/006.stabilities.ireport.swscheme "FMRI and FMA Event Stability,
	'ireport' category 1 event class, and the 'sw' FMRI scheme" which
	defines the protocol extensions required for the present portfolio.

	Portfolio 2008/038.fma-sw-servicesBasic describes software service
	diagnosis in the fishworks appliance; it was never approved as a
	portfolio and was instead simply used to formalize required changes
	to the "svc" FMRI scheme and to introduce the SMF maintenance defect
	event.  The current portfolio describes SMF software service
	diagnosis for mainstream Solaris.  The "svc" FMRI scheme changes
	documented in 2008/038.fma-sw-servicesBasic are already present
	in Solaris.  In submitting the current portfolio we propose
	marking 2008/038 as approved for the aspects detailing the "svc"
	FMRI scheme.

	We also import portfolios for smtp and snmp notification agents,
	2009/027.smtp-agent and 2009/028.snmp-agent, and the configuration
	mechanism of PSARC/2009/617.

	2.2 Provide a Customer visible FMA features document

		The SMF graph engine within svc.startd is modified to 
		be able to raise events describing instance state transitions.
		This support applies to all restarters - svc.startd itself,
		the inetd delegated restarter, or a custom delegated
		restarter.  An event is raised for a transition if:

		a) the transition is to or from maintenance state, or
		b) the transition is online -> offline following a hardware
		   error event in the service contract, or
		c) an administrator has configured a notification preference
		   for this transition.

		If none of these apply then no event is raised, so note that
		the log is not necessarily a complete history of all instance
		state transitions.
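
		As a minimal sketch of rules a) to c) above, the decision can
		be modelled as a single predicate.  This is an illustrative
		Python model only, not svc.startd code; the function and
		parameter names are hypothetical.

```python
# Illustrative model only: names are assumptions, not svc.startd internals.
MAINTENANCE = "maintenance"


def should_raise_event(old_state, new_state, hw_error=False,
                       has_preference=False):
    """Return True if a transition would be published under rules a) to c)."""
    # a) any transition to or from maintenance state
    if MAINTENANCE in (old_state, new_state):
        return True
    # b) online -> offline following a hardware error event in the contract
    if old_state == "online" and new_state == "offline" and hw_error:
        return True
    # c) an administrator has configured a preference for this transition
    return has_preference
```

		With no notification preferences configured, an "innocent"
		transition such as offline -> online yields False, which is
		why the log need not contain every transition.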

		Instance transition events are used in fmd for two purposes:

		1) Notification of the transition (e.g., via email or snmp).
		   Such notification is possible for "innocent" transitions
		   that fault management software won't be applying any
		   diagnosis to, such as a transition of an instance from
		   offline to online.

		   Notification preferences are now held within the SMF
		   repository, and configured through svccfg(1M) or through
		   the service manifest.  

		2) Modelling SMF maintenance state with a corresponding
		   defect diagnosis.  If the affected service is cleared in
		   SMF (svcadm clear) we will repair the case in fmd; if we
		   instead choose to repair the case in fmd (fmadm repair) we
		   propagate that back into SMF as if 'svcadm clear' had been
		   used.

	2.3 Provide an Architecture specification/document
	    It is important for all systems portfolios to provide an
	    architecture spec that describes the various components 
	    comprising the system's FMA (fault boundaries, telemetry
	    flow, etc.)

		N/A.
		This is not a platform portfolio.

3. Fault Boundary Analysis

	3.1 For systems, subsystems, components or services that make up 
	    this portfolio, list all resources that will be diagnosed and
	    all the ASRUs and FRUs  associated with each diagnosis in
	    which the resource may be a suspect.

	This portfolio adds only one new diagnosis, that of a
	defect corresponding to an SMF maintenance event.  The fault
	boundary is that defined by an SMF service instance.  There is
	no diagnosis as such:  we are simply mirroring the diagnosis
	already made in SMF when it decided to place the instance
	into maintenance state.

	An example illustrates the resource and ASRU FMRIs in such
	a defect diagnosis; they identify the affected service instance,
	and there is no FRU or location label in the fault-list entry for this
	defect.

	Services are identified using the 'svc' FMRI scheme as defined
	by 2008/038.fma-sw-servicesBasic and clarified in
	2010/006.stabilities.ireport.swscheme.

	An example of an SMF maintenance event defect diagnosis: 

# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
May 12 22:52:47 915cb64b-e16b-4f49-efe6-de81ff96fce7  SMF-8000-YX    major     

Host        : parity
Platform    : Sun-Fire-V40z     Chassis_id  : XG051535088
Product_sn  : 

Fault class : defect.sunos.smf.svc.maintenance
Affects     : svc:///system/intrd:default
                  faulted and taken out of service
Problem in  : svc:///system/intrd:default
                  faulted and taken out of service

Description : A service failed - it is restarting too quickly.
              Refer to http://sun.com/msg/SMF-8000-YX for more information.

Response    : The service has been placed into the maintenance state.

Impact      : svc:/system/intrd:default is unavailable.

Action      : Run 'svcs -xv svc:/system/intrd:default' to determine why the
              service failed and the location of logfiles, if any.

	Topology is not required for diagnosis, but is required for
	case management.  We add full enumeration support to the 'svc'
	scheme in libtopo, which currently supports only the nvl2str,
	str2nvl and service state operations.

	Default 'fmtopo' and 'fmtopo -p' output is unchanged.  The
	svc topology can be viewed using 'fmtopo -s svc':

# /usr/lib/fm/fmd/fmtopo -s svc -p
TIME                 UUID
Dec 22 18:40:28 73f7403f-2346-40f4-fa51-a1ef19333153

svc://system/boot-archive
        ASRU: -
        FRU: -
        Label: svc:/system/boot-archive

svc://system/boot-archive:default
        ASRU: -
        FRU: -
        Label: svc:/system/boot-archive:default

svc://system/device/local
        ASRU: -
        FRU: -
        Label: svc:/system/device/local

svc://system/device/local:default
        ASRU: -
        FRU: -
        Label: svc:/system/device/local:default

... and so on.

	3.2 Diagrams or a description of the faults that may be present.

	    o Provide a pointer to the latest version of your fault tree
	      which lists each fault and how errors are propagated both
	      internally within the subsystem and between subsystems. 

		This is a tough one:

		ireport.os.smf.state-transition.maintenance
			|
			|
			V
		defect.sunos.smf.svc.maintenance

		The affected service is detailed in the event payload of
		the ireport - see the event definition below. 
		It is this "svc" scheme FMRI that is the 'resource' and
		'asru' in the diagnosed defect.

	    o Provide a pointer to a summary of your changes to the Event
	      Registry, as summarized by the "ercheck" tool.  The HTML
	      report (summary.html) from ercheck is the preferred format:

		See ercheck.html.  See also the erapache and webrev provided;
		the ercheck output is a little overwhelmed by the stability
		level edits.

		Note that defect.sunos.smf.svc.maintenance already exists -
		it was introduced by 2008/038.fma-sw-servicesBasic.  While
		we share the same defect class (and associated KA) we will
		use the ireport.os.smf.state-transition.maintenance
		telemetry class, whereas fishworks employed an ereport class
		event for this.  We also arrange to use the SMF dictionary
		instead of the SUNOS dictionary so that our dictionary
		entries and article content can be independent of
		those used for the fishworks appliance today.

		In addition to ireport.os.smf.state-transition.maintenance
		we introduce leaf events corresponding to all new states
		for an SMF instance transition.  So the full set of new
		events is:

			ireport.os.smf.state-transition.maintenance
			ireport.os.smf.state-transition.uninitialized
			ireport.os.smf.state-transition.online
			ireport.os.smf.state-transition.offline
			ireport.os.smf.state-transition.degraded
			ireport.os.smf.state-transition.disabled

		These event class names are Committed.
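
		Since each leaf corresponds to the new state of the
		transition, the class name can be derived from that state.
		The following Python sketch illustrates the naming only; the
		helper name is hypothetical and the state list is taken from
		the event list in section 4.2.

```python
# Illustrative only: derive the ireport class from the state being entered.
CLASS_PREFIX = "ireport.os.smf.state-transition"
STATES = ("uninitialized", "offline", "online",
          "degraded", "maintenance", "disabled")


def transition_class(new_state):
    """Return the event class name for a transition into new_state."""
    if new_state not in STATES:
        raise ValueError("unknown SMF instance state: %s" % new_state)
    return "%s.%s" % (CLASS_PREFIX, new_state)
```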

		Notification mechanisms such as the smtp-notify and snmp-notify
		daemons can subscribe not only to higher-level "list.*"
		case lifecycle events, but also to lower-level "ireport.*"
		events.  They may therefore choose to notify of events that
		FMA will not diagnose (or in addition to FMA diagnosis if so
		desired), such as notification of service restart, the
		service being disabled, etc.  They decide whether or not
		to raise a notification based on the preferences stored
		in SMF, as described in the portfolios for smtp and snmp
		notification services.

		Document 02-event-payload.txt in the portfolio materials
		details the event payload for each of the new events.

4. Diagnosis Strategy

	4.1 How are faults diagnosed?
	    Provide a diagnosis philosophy document or a pointer to a
	    portfolio that describes the algorithm(s) used to diagnose the 
	    faults described in Section 3.2 and the reasons for using
	    said strateg(y/ies).

	    Diagnosis:

		Diagnosis as such has occurred within SMF, and we are simply
		mirroring and tracking maintenance state in fmd.

		Our diagnosis code, such as it is, is hosted as a subsidiary
		named "smf diagnosis" of the software-diagnosis module
		introduced in 2010/007.fmd-infrastructure-additions.
		We subscribe to ireport.os.smf.state-transition.maintenance
		(published, indirectly, by svc.startd on a transition to
		maintenance).

		When an event is received, we begin by inspecting the
		reason for the transition to maintenance.  If the state
		was requested by an administrator (a reason-short of
		'administrative_request') then we return taking no
		action;  otherwise we use fmd_case_open_uuid to open a case
		with the same UUID as the ireport.

		Next we add a defect.sunos.smf.svc.maintenance to the
		case, listing the svc scheme FMRI of the affected service
		as resource and asru, and attaching the reason-short and
		reason-long to the defect.  We then immediately solve the
		case, but despite the asru already being isolated in SMF we
		do not close the case - our sibling smf response logic will
		do that after caching the case UUID.
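
		The diagnosis steps above can be sketched as follows.  This
		is a simplified Python model rather than the fmd module code:
		the Case class, the diagnose() helper and the event dictionary
		keys 'uuid' and 'svc_fmri' are assumptions made for
		illustration (the real payload is defined in
		02-event-payload.txt).

```python
from dataclasses import dataclass, field

DEFECT_CLASS = "defect.sunos.smf.svc.maintenance"


@dataclass
class Case:
    """Toy stand-in for an fmd case; not the fmd data structure."""
    uuid: str
    suspects: list = field(default_factory=list)
    solved: bool = False
    closed: bool = False


def diagnose(event, cases):
    """Mirror a transition to maintenance as a solved, still-open case."""
    # Administrator-requested maintenance is not diagnosed.
    if event["reason-short"] == "administrative_request":
        return None
    # The case shares its UUID with the ireport that triggered it.
    case = Case(uuid=event["uuid"])
    case.suspects.append({
        "class": DEFECT_CLASS,
        "resource": event["svc_fmri"],   # hypothetical payload key
        "asru": event["svc_fmri"],
        "reason-short": event["reason-short"],
        "reason-long": event["reason-long"],
    })
    # Solved immediately; closing is left to the response logic.
    case.solved = True
    cases[case.uuid] = case
    return case
```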

	    Response:

		Our response logic lives as a subsidiary of the
		software-response module introduced in portfolio
		2010/007.fmd-infrastructure-additions.

		We do not need to perform isolation actions since
		the service affected by a transition to maintenance state
		is already isolated/unusable.  Instead the function here
		is to coordinate service clear/repair actions via
		svcadm/fmadm.

		Thus we subscribe to ireport.os.smf.state-transition.*
		so that we can watch for transitions *out* of maintenance
		state (as would be the result of a svcadm clear on an
		instance in maintenance state).  When such an event
		is observed we use fmd_repair_asru to mark the asru
		as repaired, which then leads to the case being resolved.

		If instead of using svcadm clear an admin chooses to use
		fmadm to clear (repair[ed]/acquit/replaced) a service, then
		we must propagate this request into SMF via a libscf
		interface.  We receive list.repaired events that are
		published as a result of the fmadm repair action, and
		use smf_restore_instance to propagate the request to
		SMF if we can see that the resource is still unusable
		(in maintenance state).

		The fmd_repair_asru we use (as above) to propagate svcadm
		clear requests into the fmd resource cache *also*
		results in fmd publishing a list.repaired event for the
		case.  We must avoid mistaking this as the result of
		an fmadm repair action, and so not be tempted to propagate
		the request back to SMF!  To achieve this we subscribe
		to the defect class that we diagnose above which results
		in us receiving list.suspect containing that defect class,
		and we use this to cache the set of all case UUIDs
		for SMF maintenance.  We update this cache as we propagate
		events (as described above) and will not propagate
		a list.repaired for the same UUID more than once.
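
		The loop-avoidance described above can be sketched as a small
		state machine.  This is a Python model under stated
		assumptions, not the software-response code: the class,
		method and callback names are hypothetical, with the
		callbacks standing in for fmd_repair_asru,
		smf_restore_instance and a check of the current instance
		state.

```python
class SmfResponse:
    """Toy model of the svcadm/fmadm coordination; names are assumptions."""

    def __init__(self, repair_asru, restore_instance, in_maintenance):
        self.repair_asru = repair_asru            # stands in for fmd_repair_asru
        self.restore_instance = restore_instance  # stands in for smf_restore_instance
        self.in_maintenance = in_maintenance      # queries current SMF state
        self.cases = {}  # case UUID -> {"fmri": str, "propagated": bool}

    def on_list_suspect(self, uuid, fmri):
        # Cache every SMF maintenance case as it is diagnosed.
        self.cases[uuid] = {"fmri": fmri, "propagated": False}

    def on_transition(self, fmri, old_state, new_state):
        # Only transitions *out* of maintenance are of interest here.
        if old_state != "maintenance" or new_state == "maintenance":
            return
        pending = [e for e in self.cases.values()
                   if e["fmri"] == fmri and not e["propagated"]]
        for entry in pending:
            entry["propagated"] = True
        if pending:
            self.repair_asru(fmri)  # SMF -> fmd

    def on_list_repaired(self, uuid):
        entry = self.cases.get(uuid)
        # Unknown case, or a repair we caused ourselves: do nothing.
        if entry is None or entry["propagated"]:
            return
        entry["propagated"] = True
        if self.in_maintenance(entry["fmri"]):
            self.restore_instance(entry["fmri"])  # fmd -> SMF
```

		In this model a list.repaired that follows our own repair of
		the asru finds the UUID already marked as propagated and is
		ignored, so the request is never sent back to SMF.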

	4.2 What new fault events will be defined and registered with the
	    SMI event registry? Include all FMA Protocol ereport
	    specifications, and provide a pointer to your ercheck output.

		See 3.2 above for the definition of these new events:

		ireport.os.smf.state-transition.maintenance
		ireport.os.smf.state-transition.online
		ireport.os.smf.state-transition.offline
		ireport.os.smf.state-transition.degraded
		ireport.os.smf.state-transition.uninitialized
		ireport.os.smf.state-transition.disabled

	4.3 If your fault management activity (error handling, diagnosis 
	    or recovery) spans multiple fault manager regions, explain
	    how each activity is coordinated between regions.

		N/A

5. Error Handling & Reporting Strategy

	5.1 How are errors detected and handled?
	    Include a description of the immediate error reactions taken
	    to capture error state and keep the system available without
	    compromising the integrity of the rest of the system or user
	    data.  In the case of a device driver being hardened, describe
	    the recovery/retry behavior, if any. Describe how storms of
	    errors are mitigated.

		N/A

	5.2 How are ereports generated?
	    Describe the software components involved in generating ereports.
	    Describe any filtering, hysteresis, etc. that occurs prior to
	    ereport generation.

		Ireports are raised from svc.startd using the publication
		mechanism of 2010/007.fmd-infrastructure-additions.

		Events involving maintenance state are always propagated
		out of svc.startd; other events are only propagated if
		notification preferences exist for the particular
		transition.  There is no further filtering or hysteresis
		control at this stage.  Enabling notifications for all
		service instance transitions will generate a reasonable
		number of events, but the rate is limited by SMF (e.g., an
		instance restarting too frequently should be placed into
		maintenance state).

	5.3 What new error report (ereport) events will be defined and
	    registered with the SMI event registry? Include all FMA
	    Protocol ereport specifications, and provide a pointer to
	    your ercheck output.

		See 4.2

	5.4 If you are *not* using a reference fault manager (fmd(1M))
	    on your system, how are you persisting ereports and communicating
	    them to Sun Services?

		N/A

	5.5 For more complex system portfolios provide a comprehensive
	    error handling philosophy document that describes how errors
	    are handled by all components involved in error handling.

		N/A

	5.6 If this portfolio includes new errors, but leverages existing
	    ereports, provide a description of how the errors map to the
	    ereports.

		N/A

	5.7 Describe the error telemetry flow between all software components
	    A block diagram would be a good method to do this.

		N/A

	5.8 How can error telemetry (ereports) be disabled?
	    Describe any configuration file settings, environment variables,
	    etc. that can be used to turn off ereport generation.

		The ext-event-transport module receives "raw" events from
		svc.startd and transforms them into full protocol events.
		Including a line as follows in ext-event-transport.conf
		will cause the module to stop post-processing of events
		received from svc.startd:

		    setprop inbound_postprocess_smf false

		The default value is, of course, "true".

		Observations are published for all events involving
		maintenance state, and otherwise for those for which
		notification preferences exist.  So one means of
		avoiding events is to delete notification preferences -
		but those involving maintenance state will remain.

		Another brute-force approach is to unload the
		ext-event-transport module from fmd.  A side-effect
		of this is that libfmevent consumers will receive no
		events - ext-event-transport forwards events to those
		consumers and is also responsible for the post-processing
		of incoming events.

6. Recovery/Reaction

	6.1 Are you introducing or modifying any response agent(s)?
	    If so, provide a description of the agent(s).

		See 4.1

	6.2 What existing fma modules will be used in response to your faults?

		software-response

	6.3 Are you modifying any existing (Section 6.2) response agents? 
	    If so, indicate the agents below, with a brief description
	    of how they will be modified.

		See 4.1

	6.4 Describe any immediate (e.g. offlining) and long-term
	    (e.g. black-listing) retiring/disabling of components.

		N/A

	6.5 Provide pointers to dictionary/po entries and knowledge articles.

		See the event registry webrevs mailed to the portfolio-review
		alias.


7. Event Transport Mechanisms

	7.1 Are you introducing or modifying any event transport mechanisms?
	    If so please provide a description of the transport mechanism(s).

		No

	7.2 What events are transported?
	    Provide a list of ereport, fault, and list.* events which
	    are transported if this is not already covered above in
	    sections 4 and 5.

8. FRUID Implementation

	8.1 Complete this section if you're submitting a portfolio for a 
    	    platform.

		N/A.
		This is not a platform portfolio.

9. Test

	9.1 Provide a pointer to your test plan(s) and specification(s).
	    Make sure to list all FMA functionalities that are/are not
	    covered by the test plan(s) and specification(s).

		http://ontestreview.central.sun.com/wiki/index.php/CoreFMA
		http://ontestreview.central.sun.com/wiki/index.php/CoreFMA_Assertion

	9.2 Explain the risks associated with the test gaps, if any.

		See http://infoshare.sfbay/twiki/bin/view/Fma/SwFmaTestGaps


10. Gaps

	10.1 List any gaps that prevent a full FMA feature set.
	     This includes but is not limited to insufficient error 
	     detectors, error reporting, and software infrastructure.

		#
		# Note:
		#
		# For each gap listed:
		# - Provide a description of the gap.
		# - Provide a reason/justification for the gap.
		# - Provide a risk assessment of the gap.
		#   Describe the customer and/or service impact if the
		#   gap is not addressed.
		# - List future projects/get-well plans to address the gap.
		#   Provide bugids, target date, and/or release information
		#   as to when these gap will be addressed.
		#

	a) Our set of reasons for transitions does not map 1:1 onto the
	   set of reasons used in svcs -x.  The svcs -x implementation
	   digs around for more detail than we are able to include in
	   our events at this time.

	   For example if a start method fails with SMF_EXIT_ERR_CONFIG
	   our event will have a reason-short of "method_failed", and
	   does not indicate which method failed or how.  The svcs -x
	   output for this case will say "Start method exited with
	   $SMF_EXIT_ERR_CONFIG.".

	b) Related to a) - we diagnose the same defect class whatever the
	   reason for a transition to maintenance, and so all causes
	   map to the same message id (SMF-8000-YX).

	   The svcs -x output, on the other hand, hardcodes distinct
	   message ids for each cause it can derive.  At this time we
	   are not able to include that message id in the transition
	   event, and so are not able to point the admin directly to
	   that more-specific article.  We do, however, provide a
	   full suggested 'svcs -x <fmri>' command line to run for
	   more information.

	c) Even the more-specific articles that svcs -x refers to are
	   still generic for the particular reason involved.  For example,
	   the article for "Start method exited with $SMF_EXIT_ERR_CONFIG."
	   (SMF-8000-KS) only talks of what that means generically and
	   is not customized to a specific service.

	   It would be nice to have service-specific article content.
	   For example if service ntp fails as above we could generate
	   a message id that links to an article that covers the common
	   reason for ntp failing in this way.

	   Anyway, such additional resolution is not a feature of the
	   current portfolio.

11. Dependencies

	11.1 List all project and other portfolio dependencies to fully realize
	     the targeted FMA feature set for this portfolio. A portfolio may
	     have dependencies on infrastructure projects.

		#
		# For Example,
		# The "Sun4u PCI hostbridge" and "PCI-X" projects have
		# a dependency on the events/ereports defined within the
		# "PCI Local Bus" portfolio.
		#

		None.


12. References

	12.1 Provide pointers to all documents referenced in previous sections.

		#
		# Note:
		#
		# - Documents referenced here should include things like:
		#   architecture spec, 1pagers, functional spec,
		#   error handling specs, diagnosis philosophy docs,
		#   and test plans/specs.
		# - All documents that directly relate to this portfolio
		#   should eventually be archived in the repository, and
		#   have a reference here.
		# - The document pointer may initially point to someplace
		#   other than the repository, but when/if the document is
		#   archived in the repository the pointer should change.
		# - References to these documents in the sections above
		#   should include a document identifier and/or name.
		#

		#
		# An example of a document reference in this section
		# might look like this:
		#	[1] Error Handling Philosophy Spec
		#	    http://...
		#
		# An example of a reference to the document in a section
		# above may look like this:
		#	For a description of error handling see [1].
		#