1. Introduction

1.1 Portfolio Name

SMF/FMA integration Phase 1
Reference: 2010/014.smf-integration-ph1

1.2 Portfolio Author

Antonello.Cruz@oracle.com
Gavin.Maltby@oracle.com

1.3 Submission Date

18 May 2010

1.4 Project Team Aliases:

swfma@sun.com

1.5 Interest List

N/A

1.6 List of Reviewers

    Reviewer   Group   Version   Date       Comments
    --------   -----   -------   --------   ---------------------

2. Portfolio description

New ireport leaf events are defined to describe all SMF instance state transitions. Transitions to SMF maintenance state are modelled in FMA as a defect diagnosis, with associated case lifecycle events and management. The smtp and snmp notification agents of portfolios 2009/027 and 2009/028 are already configured to understand our SMF state transition events and can raise customized notifications of such events as dictated by notification preferences expressed within SMF.

2.1 Are you leveraging any existing portfolio(s), or is this an umbrella portfolio that includes other portfolios? Please provide a pointer to the location of those portfolio(s).

This portfolio imports its sibling portfolio 2010/006.stabilities.ireport.swscheme "FMRI and FMA Event Stability, 'ireport' category 1 event class, and the 'sw' FMRI scheme" which defines the protocol extensions required for the present portfolio.

Portfolio 2008/038.fma-sw-servicesBasic describes software service diagnosis in the fishworks appliance; it was never approved as a portfolio and was instead simply used to formalize required changes to the "svc" FMRI scheme and to introduce the SMF maintenance defect event. The current portfolio describes SMF software service diagnosis for mainstream Solaris. The "svc" FMRI scheme changes documented in 2008/038.fma-sw-servicesBasic are already present in Solaris. In submitting the current portfolio we propose marking 2008/038 as approved for the aspects detailing the "svc" FMRI scheme.
We also import portfolios for smtp and snmp notification agents, 2009/027.smtp-agent and 2009/028.snmp-agent, and the configuration mechanism of PSARC/2009/617.

2.2 Provide a Customer visible FMA features document

The SMF graph engine within svc.startd is modified to be able to raise events describing instance state transitions. This support applies to all restarters - svc.startd itself, the inetd delegated restarter, or a custom delegated restarter. An event is raised for a transition if:

a) the transition is to or from maintenance state, or
b) the transition is online -> offline following a hardware error event in the service contract, or
c) an administrator has configured a notification preference for this transition.

If none of these apply then no event is raised, so note that the log is not necessarily a complete history of all instance state transitions.

Instance transition events are used in fmd for two purposes:

1) Notification of the transition (e.g., via email or snmp). Such notification is possible for "innocent" transitions that fault management software won't be applying any diagnosis to, such as a transition of an instance from offline to online. Notification preferences are now held within the SMF repository, and configured through svccfg(1M) or through the service manifest.

2) Modelling SMF maintenance state with a corresponding defect diagnosis. If the affected service is cleared in SMF (svcadm clear) we will repair the case in fmd; if we instead choose to repair the case in fmd (fmadm repair) we propagate that back into SMF as if 'svcadm clear' had been used.

2.3 Provide an Architecture specification/document

It is important for all systems portfolios to provide an architecture spec that describes the various components comprising the system's FMA (fault boundaries, telemetry flow, etc.)

N/A. This is not a platform portfolio.

3. Fault Boundary Analysis

3.1 For systems, subsystems, components or services that make up this portfolio, list all resources that will be diagnosed and all the ASRUs and FRUs associated with each diagnosis in which the resource may be a suspect.

This portfolio adds only one new diagnosis, that of a defect corresponding to an SMF maintenance event. The fault boundary is that defined by an SMF service instance. There is no diagnosis as such: we are simply mirroring the diagnosis already made in SMF when it decided to place the instance into maintenance state.

An example illustrates the resource and ASRU FMRIs in such a defect diagnosis; they identify the affected service instance, and there is no FRU or location label in the fault-list entry for this defect. Services are identified using the 'svc' FMRI scheme as defined by 2008/038.fma-sw-servicesBasic and clarified in 2010/006.stabilities.ireport.swscheme.

An example of an SMF maintenance event defect diagnosis:

    # fmadm faulty
    --------------- ------------------------------------ -------------- ---------
    TIME            EVENT-ID                             MSG-ID         SEVERITY
    --------------- ------------------------------------ -------------- ---------
    May 12 22:52:47 915cb64b-e16b-4f49-efe6-de81ff96fce7 SMF-8000-YX    major

    Host        : parity
    Platform    : Sun-Fire-V40z     Chassis_id  : XG051535088
    Product_sn  :

    Fault class : defect.sunos.smf.svc.maintenance
    Affects     : svc:///system/intrd:default
                      faulted and taken out of service
    Problem in  : svc:///system/intrd:default
                      faulted and taken out of service

    Description : A service failed - it is restarting too quickly.
                  Refer to http://sun.com/msg/SMF-8000-YX for more information.

    Response    : The service has been placed into the maintenance state.

    Impact      : svc:/system/intrd:default is unavailable.

    Action      : Run 'svcs -xv svc:/system/intrd:default' to determine why
                  the service failed and the location of logfiles, if any.

Topology is not required for diagnosis, but is required for case management.
We add full enumeration support to the 'svc' scheme in libtopo, which currently supports only nvl2str, str2nvl and service state operations. Default 'fmtopo' and 'fmtopo -p' output is unchanged. The svc topology can be viewed using 'fmtopo -s svc':

    # /usr/lib/fm/fmd/fmtopo -s svc -p
    TIME                 UUID
    Dec 22 18:40:28 73f7403f-2346-40f4-fa51-a1ef19333153

    svc://system/boot-archive
            ASRU: -
            FRU: -
            Label: svc:/system/boot-archive

    svc://system/boot-archive:default
            ASRU: -
            FRU: -
            Label: svc:/system/boot-archive:default

    svc://system/device/local
            ASRU: -
            FRU: -
            Label: svc:/system/device/local

    svc://system/device/local:default
            ASRU: -
            FRU: -
            Label: svc:/system/device/local:default

    ... and so on.

3.2 Diagrams or a description of the faults that may be present.

o Provide a pointer to the latest version of your fault tree which lists each fault and how errors are propagated both internally within the subsystem and between subsystems.

This is a tough one:

    ireport.os.smf.state-transition.maintenance
                        |
                        |
                        V
    defect.sunos.smf.svc.maintenance

The affected service is detailed in the event payload of the ireport - see the event definition below. It is this "svc" scheme FMRI that is the 'resource' and 'asru' in the diagnosed defect.

o Provide a pointer to a summary of your changes to the Event Registry, as summarized by the "ercheck" tool. The HTML report (summary.html) from ercheck is the preferred format:

See ercheck.html. See also erapache and webrev provided - the ercheck is a little overwhelmed by the stability level edits.

Note that defect.sunos.smf.svc.maintenance already exists - it was introduced by 2008/038.fma-sw-servicesBasic. While we share the same defect class (and associated KA) we will use the ireport.os.smf.state-transition.maintenance telemetry class, whereas fishworks employed an ereport class event for this.
We also arrange to use the SMF dictionary instead of the SUNOS dictionary so that our dictionary entries and article content can be independent of those used for the fishworks appliance today.

In addition to ireport.os.smf.state-transition.maintenance we introduce leaf events corresponding to all new states for an SMF instance transition. So the full set of new events is:

    ireport.os.smf.state-transition.maintenance
    ireport.os.smf.state-transition.uninitialized
    ireport.os.smf.state-transition.online
    ireport.os.smf.state-transition.offline
    ireport.os.smf.state-transition.degraded
    ireport.os.smf.state-transition.disabled

These event class names are Committed.

Notification mechanisms such as the smtp-notify and snmp-notify daemons can subscribe not only to higher-level "list.*" case lifecycle events, but also to lower-level "ireport.*" events. They may therefore choose to notify of events that FMA will not diagnose (or in addition to FMA diagnosis if so desired), such as notification of service restart, the service being disabled, etc. They decide whether or not to raise a notification based on the preferences stored in SMF, as described in the portfolios for smtp and snmp notification services.

Document 02-event-payload.txt in the portfolio materials details the event payload for each of the new events.

4. Diagnosis Strategy

4.1 How are faults diagnosed? Provide a diagnosis philosophy document or a pointer to a portfolio that describes the algorithm(s) used to diagnose the faults described in Section 3.2 and the reasons for using said strateg(y/ies).

Diagnosis:

Diagnosis as such has occurred within SMF, and we are simply mirroring and tracking maintenance state in fmd. Our diagnosis code, such as it is, is hosted as a subsidiary named "smf diagnosis" of the software-diagnosis module introduced in 2010/007.fmd-infrastructure-additions.
We subscribe to ireport.os.smf.state-transition.maintenance (published, indirectly, by svc.startd on a transition to maintenance). When an event is received, we begin by inspecting the reason for the transition to maintenance. If the state was requested by an administrator (a reason-short of 'administrative_request') then we return taking no action; otherwise we use fmd_case_open_uuid to open a case with the same UUID as the ireport. Next we add a defect.sunos.smf.svc.maintenance to the case, listing the svc scheme FMRI of the affected service as resource and asru, and add the reason-short and reason-long to the defect. We then immediately solve the case, but despite the asru already being isolated in SMF we do not close the case - our sibling smf response logic will do that after caching the case UUID.

Response:

Our response logic lives as a subsidiary of the software-response module introduced in portfolio 2010/007.fmd-infrastructure-additions.

We do not need to perform isolation actions since the service affected by a transition to maintenance state is already isolated/unusable. Instead the function here is to coordinate service clear/repair actions via svcadm/fmadm.

Thus we subscribe to ireport.os.smf.state-transition.* so that we can watch for transitions *out* of maintenance state (as would be the result of a svcadm clear on an instance in maintenance state). When such an event is observed we use fmd_repair_asru to mark the asru as repaired, which then leads to the case being resolved.

If instead of using svcadm clear an admin chooses to use fmadm to clear (repair[ed]/acquit/replaced) a service, then we must propagate this request into SMF via a libscf interface. We receive list.repaired events that are published as a result of the fmadm repair action, and use smf_restore_instance to propagate the request to SMF if we can see that the resource is still unusable (in maintenance state).
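The diagnosis and response flow described above can be sketched as a small, self-contained Python model. This is an illustration under stated assumptions only, not the module's code: the real logic lives in the fmd software-diagnosis and software-response modules and uses fmd_case_open_uuid, fmd_repair_asru and smf_restore_instance. The class, method and field names below (SmfCaseModel, on_maintenance_ireport and so on) are hypothetical, and the model simply records what those interfaces would be asked to do.

```python
from dataclasses import dataclass

DEFECT_CLASS = "defect.sunos.smf.svc.maintenance"


@dataclass
class Case:
    """Hypothetical stand-in for an fmd case opened for one instance."""
    uuid: str          # same UUID as the triggering ireport
    fmri: str          # svc scheme FMRI, used as both resource and asru
    reason_short: str
    reason_long: str
    defect_class: str = DEFECT_CLASS
    solved: bool = False
    repaired: bool = False


class SmfCaseModel:
    """Toy model of the smf diagnosis and smf response subsidiaries."""

    def __init__(self):
        self.cases = {}        # case UUID -> Case
        self.smf_cleared = []  # FMRIs we asked SMF to restore

    def on_maintenance_ireport(self, uuid, fmri, reason_short, reason_long=""):
        """Diagnosis: mirror a transition to maintenance as a solved case."""
        if reason_short == "administrative_request":
            return None  # requested by an administrator: no case is opened
        case = Case(uuid, fmri, reason_short, reason_long)
        case.solved = True  # solved immediately, but deliberately not closed
        self.cases[uuid] = case
        return case

    def on_transition_out_of_maintenance(self, fmri):
        """Response: 'svcadm clear' was used, so mark the asru repaired."""
        repaired = []
        for case in self.cases.values():
            if case.fmri == fmri and not case.repaired:
                case.repaired = True  # models the fmd_repair_asru step
                repaired.append(case.uuid)
        return repaired

    def on_fmadm_repair(self, uuid, instance_in_maintenance):
        """Response: 'fmadm repair' was used, so propagate the clear to SMF."""
        case = self.cases.get(uuid)
        if case is None or not instance_in_maintenance:
            return False
        case.repaired = True
        self.smf_cleared.append(case.fmri)  # models smf_restore_instance
        return True
```

In this sketch an event with a reason-short of 'administrative_request' opens no case, any other reason yields a solved but unclosed case, and each of the two clear paths (svcadm clear seen as a transition, or fmadm repair seen as list.repaired) marks the case repaired from its own side.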
The fmd_repair_asru we use (as above) to propagate svcadm clear requests into the fmd resource cache *also* results in fmd publishing a list.repaired event for the case. We must avoid mistaking this as the result of an fmadm repair action, and so not be tempted to propagate the request back to SMF! To achieve this we subscribe to the defect class that we diagnose above, which results in us receiving list.suspect containing that defect class, and we use this to cache the set of all case UUIDs for SMF maintenance. We update this cache as we propagate events (as described above) and will not propagate a list.repaired for the same UUID more than once.

4.2 What new fault events will be defined and registered with the SMI event registry? Include all FMA Protocol ereport specifications, and provide a pointer to your ercheck output.

See 3.2 above for the definition of these new events:

    ireport.os.smf.state-transition.maintenance
    ireport.os.smf.state-transition.online
    ireport.os.smf.state-transition.offline
    ireport.os.smf.state-transition.degraded
    ireport.os.smf.state-transition.uninitialized
    ireport.os.smf.state-transition.disabled

4.3 If your fault management activity (error handling, diagnosis or recovery) spans multiple fault manager regions, explain how each activity is coordinated between regions.

N/A

5. Error Handling & Reporting Strategy

5.1 How are errors detected and handled? Include a description of the immediate error reactions taken to capture error state and keep the system available without compromising the integrity of the rest of the system or user data. In the case of a device driver being hardened, describe the recovery/retry behavior, if any. Describe how storms of errors are mitigated.

N/A

5.2 How are ereports generated? Describe the software components involved in generating ereports. Describe any filtering, hysteresis, etc. that occurs prior to ereport generation.
Ireports are raised from svc.startd using the publication mechanism of 2010/007.fmd-infrastructure-additions. Events involving maintenance state are always propagated out of svc.startd; other events are only propagated if notification preferences exist for the particular transition. There is no further filtering or hysteresis control at this stage. Enabling notifications for all service instance transitions will generate a reasonable number of events, but the rate is limited by SMF (e.g., an instance restarting too frequently should be placed into maintenance state).

5.3 What new error report (ereport) events will be defined and registered with the SMI event registry? Include all FMA Protocol ereport specifications, and provide a pointer to your ercheck output.

See 4.2

5.4 If you are *not* using a reference fault manager (fmd(1M)) on your system, how are you persisting ereports and communicating them to Sun Services?

N/A

5.5 For more complex system portfolios provide a comprehensive error handling philosophy document that describes how errors are handled by all components involved in error handling.

N/A

5.6 If this portfolio includes new errors, but leverages existing ereports, provide a description of how the errors map to the ereports.

N/A

5.7 Describe the error telemetry flow between all software components. A block diagram would be a good method to do this.

N/A

5.8 How can error telemetry (ereports) be disabled? Describe any configuration file settings, environment variables, etc. that can be used to turn off ereport generation.

The ext-event-transport module receives "raw" events from svc.startd and transforms them into full protocol events. Including a line as follows in ext-event-transport.conf will cause the module to stop post-processing of events received from svc.startd:

    setprop inbound_postprocess_smf false

The default value is, of course, "true".
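The publication rule stated in 2.2 and 5.2 amounts to a simple predicate, which determines which events remain once notification preferences are removed. A minimal Python sketch, assuming hypothetical function and argument names (svc.startd is not written in Python, and its internal interfaces are not described in this portfolio):

```python
def should_publish(from_state, to_state, hw_error=False, has_preference=False):
    """Return True if a state-transition event would be raised.

    hw_error: the transition follows a hardware error event in the
        service contract (only relevant for online -> offline).
    has_preference: a notification preference is configured for this
        transition.
    """
    if "maintenance" in (from_state, to_state):
        return True  # transitions to or from maintenance are always raised
    if (from_state, to_state) == ("online", "offline") and hw_error:
        return True
    return has_preference


# With no notification preferences configured, only maintenance-related
# transitions (and hardware-error offlines) still produce events.
print(should_publish("online", "maintenance"))
print(should_publish("offline", "online"))
```

The two calls at the end show the consequence for this section: the first transition is published regardless of preferences, the second only when a preference exists.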
Observations are published for all events involving maintenance state, and otherwise for those for which notification preferences exist. So one means of avoiding events is to delete notification preferences - but those involving maintenance state will remain.

Another brute-force approach is to unload the ext-event-transport module from fmd. A side-effect of this is that libfmevent consumers will receive no events - ext-event-transport forwards events to those consumers and is also responsible for the post-processing of incoming events.

6. Recovery/Reaction

6.1 Are you introducing or modifying any response agent(s)? If so, provide a description of the agent(s).

See 4.1

6.2 What existing fma modules will be used in response to your faults?

software-response

6.3 Are you modifying any existing (Section 6.2) response agents? If so, indicate the agents below, with a brief description of how they will be modified.

See 4.1

6.4 Describe any immediate (e.g. offlining) and long-term (e.g. black-listing) retiring/disabling of components.

N/A

6.5 Provide pointers to dictionary/po entries and knowledge articles.

See events registry webrevs mail to portfolio-review alias.

7. Event Transport Mechanisms

7.1 Are you introducing or modifying any event transport mechanisms? If so please provide a description of the transport mechanism(s).

No

7.2 What events are transported? Provide a list of ereport, fault, and list.* events which are transported if this is not already covered above in sections 4 and 5.

8. FRUID Implementation

8.1 Complete this section if you're submitting a portfolio for a platform.

N/A. This is not a platform portfolio.

9. Test

9.1 Provide a pointer to your test plan(s) and specification(s). Make sure to list all FMA functionalities that are/are not covered by the test plan(s) and specification(s).
http://ontestreview.central.sun.com/wiki/index.php/CoreFMA
http://ontestreview.central.sun.com/wiki/index.php/CoreFMA_Assertion

9.2 Explain the risks associated with the test gaps, if any.

See http://infoshare.sfbay/twiki/bin/view/Fma/SwFmaTestGaps

10. Gaps

10.1 List any gaps that prevent a full FMA feature set. This includes but is not limited to insufficient error detectors, error reporting, and software infrastructure.

#
# Note:
#
# For each gap listed:
#   - Provide a description of the gap.
#   - Provide a reason/justification for the gap.
#   - Provide a risk assessment of the gap.
#     Describe the customer and/or service impact if the
#     gap is not addressed.
#   - List future projects/get-well plans to address the gap.
#     Provide bugids, target date, and/or release information
#     as to when these gaps will be addressed.
#

a) Our set of reasons for transitions does not map 1:1 onto the set of reasons used in svcs -x. The svcs -x implementation digs around for more detail than we are able to include in our events at this time. For example if a start method fails with SMF_EXIT_ERR_CONFIG our event will have a reason-short of "method_failed", and does not indicate which method failed or how. The svcs -x output for this case will say "Start method exited with $SMF_EXIT_ERR_CONFIG.".

b) Related to a) - we diagnose the same defect class whatever the reason for a transition to maintenance, and so all causes map to the same message id (SMF-8000-YX). The svcs -x output, on the other hand, hardcodes distinct message ids for each cause it can derive. At this time we are not able to include that message id in the transition event, and so are not able to point the admin directly to that more-specific article. We do, however, provide a full suggested 'svcs -x ' command line to run for more information.

c) Even the more-specific articles that svcs -x refers to are still generic for the particular reason involved.
For example, the article for "Start method exited with $SMF_EXIT_ERR_CONFIG." (SMF-8000-KS) only talks of what that means generically and is not customized to a specific service. It would be nice to have service-specific article content. For example if service ntp fails as above we could generate a message id that links to an article that covers the common reason for ntp failing in this way. Anyway, such additional resolution is not a feature of the current portfolio.

11. Dependencies

11.1 List all project and other portfolio dependencies to fully realize the targeted FMA feature set for this portfolio. A portfolio may have dependencies on infrastructure projects.

#
# For Example,
#   The "Sun4u PCI hostbridge" and "PCI-X" projects have
#   a dependency on the events/ereports defined within the
#   "PCI Local Bus" portfolio.
#

None.

12. References

12.1 Provide pointers to all documents referenced in previous sections.

#
# Note:
#
#   - Documents referenced here should include things like:
#     architecture spec, 1pagers, functional spec,
#     error handling specs, diagnosis philosophy docs,
#     and test plans/specs.
#   - All documents that directly relate to this portfolio
#     should eventually be archived in the repository, and
#     have a reference here.
#   - The document pointer may initially point to someplace
#     other than the repository, but when/if the document is
#     archived in the repository the pointer should change.
#   - References to these documents in the sections above
#     should include a document identifier and/or name.
#
#
# An example of a document reference in this section
# might look like this:
#     [1] Error Handling Philosophy Spec
#         http://...
#
# An example of a reference to the document in a section
# above may look like this:
#     For a description of error handling see [1].
#