#
# Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
# Use is subject to license terms.
#
# ident	"@(#)portfolio	1.3	08/08/18 SMI"

1. Introduction

1.1 Portfolio Name

	SCSI Disk Device-as-Detector Diagnosis (phase3)

1.2 Portfolio Authors

	Chris Horne, Liu Ti, Xiao Li, David Zhang

1.3 Submission Date

	Mon Aug 25

1.4 Project Team Aliases

	scsifma-bj@sun.com

1.5 Interest List

	scsifma-bj@sun.com

1.6 List of Reviewers

	Eric.Schrock@Sun.Com
	Stephen.Hanson@sun.com

2. Portfolio description

This portfolio describes the third phase of FMA integration for SCSI
disks. The first phase addressed topology [1], the second phase built
infrastructure [2], this phase (third) addresses 'device-as-detector'
diagnosis, and a future phase will address transport diagnosis and
health monitoring.

Section 2.1 of this portfolio provides an overview of how the disk
driver and fmd(1M) coordinate fault diagnosis and response. The
overview covers:

    o The derivation of an ereport detector 'device-path'.

    o The driver rules and fmd(1M) topology implications related to
      having a 'devid' in an ereport detector.

    o Design considerations and structure of proposed ereport classes,
      including class-specific payload.

    o The definition of the common 'driver-assessment' ereport payload
      property, its values, and its role in Eversholt propagation
      rules.

    o How ereport telemetry relates to general driver messaging, the
      implementation of a structured log, and current driver
      scsi_log() and /var/adm/messages use.

Sections 3-11 of this portfolio follow the standard portfolio
template, and provide specific implementation information - often in
the form of references and links to additional material. Many of the
standard portfolio topics are covered in the overview below.

2.1 Overview

The term 'device-as-detector' (the name of this case) refers to a SCSI
device operating as an FMA error detector. The device may be an
internal disk, or it may be a device located somewhere in an external
enclosure.
In both cases, the device detects problems and reports them using T10
standards-defined SCSI transport and protocol [19]. The Solaris
endpoint for this SCSI standards-defined telemetry is the leaf
(disk/tape) driver (sd(7D)).

NOTE: In this overview we use the term leaf driver, instead of disk
driver, because we expect all leaf drivers to use the approach
outlined. Delivery will be limited to the disk leaf driver.

The leaf driver is responsible for converting SCSI protocol defined
telemetry into FMA ereport form. The leaf driver, and not the SCSA
framework, must fill this role because the framework does not
understand state beyond a single scsi_pkt(9S) and does not understand
device-specific behavior.

In addition to a leaf driver's detector-oriented role, the driver also
performs its own low-level error handling. This low-level error
handling initiates and coordinates command-level retry and recovery
procedures within the driver itself. These procedures are carried out
independently of any direct fmd(1M) control, but all ereports
generated during this process have a 'driver-assessment' property. The
'driver-assessment' property allows Eversholt device-as-detector rules
to track 'driver-assessment', effectively turning the low-level error
handling results into a low-level diagnosis. The 'driver-assessment'
property is in some respects like an in-band 'service impact' [18].

The following 'driver-assessment' values are used:

    fatal:	The driver has failed an operation which, in the
		absence of a fault, should have succeeded.

    retry:	The driver will retry a failed operation which, in the
		absence of a fault, should have succeeded.

    recovered:	A retry was successful.

    fail:	The driver has failed an operation for reasons
		unrelated to hardware.

    info:	The driver encountered sense data, but the operation
		was successful.

The term 'detector' refers to something that detects a problem.
However, in the context of driver generated ereports, 'detector' is
the name of an nvlist embedded in all ereports (this can be seen by
looking at the error log using 'fmdump -ev'). The ereport detector
'device-path' property describes the physical transport hardware that
encountered a problem. For mpxio operation, the path_instance is used
to construct the detector 'device-path' [2].

When the driver can guarantee the identity of a device, the ereport
detector should contain a 'devid' property. An ereport with a 'devid'
is considered a device-as-detector event. All fmd(1M) diagnosis-engine
processing covered by this portfolio is device-as-detector oriented.
The same ereport without a 'devid' is considered a transport-detector
event. The fmd(1M) diagnosis-engine responsible for processing
transport-detector ereports is future work [4].

The driver must pay close attention to whether the ereport detector
should have a 'devid' property or not. The driver writer should focus
on the accuracy of this choice, not the topology and diagnosis-engine
ramifications described below. When the driver can't guarantee the
identity of the device, the detector should not have a 'devid'. The
driver's choice may be influenced by knowledge of transport
addressing: providing an accurate answer for a transport that
addresses the receptacle where a device is located (@target,lun) is
more complex than providing an accurate answer for a transport that
addresses the device directly (@wWWN,lun).

Device-as-detector events are processed by the Eversholt
diagnosis-engine. The Eversholt term for topology is 'config', and
config information is obtained from libtopo snapshots. Basic storage
topology was defined by [1], and [2] enhanced Eversholt in two ways
for dealing with storage:

    o The Eversholt language was enhanced by adding a
      'discard_if_config_unknown' ereport property.
    o The Eversholt diagnosis-engine front end was enhanced to match
      topology based on the detector 'devid', and to silently discard
      telemetry that does not match topology/config for ereport
      classes defined as 'discard_if_config_unknown'.

For device-as-detector events, propagation rules are tied directly to
the common 'driver-assessment' property provided by the leaf driver.
This allows user-land fmd(1M) operation to track low-level driver
diagnosis. This approach keeps existing low-level driver diagnosis
procedures largely intact, yet allows them to trigger more
sophisticated fmd(1M) agent activity like io-retire [10].

For device-as-detector ereports with a 'devid' topology match in the
Eversholt front end, the Eversholt rules delivered by this portfolio
will cause an ereport with a 'driver-assessment' property value of
'fatal' to trigger a fault event of the same class. Stated another
way, our .esc rules generate faults when an ereport has a 'devid', the
'devid' matches topology, and the 'driver-assessment' is 'fatal'.
Ereports with a 'devid' topology match and a non-'fatal'
'driver-assessment' generate upsets. Like many Eversholt consumers, we
expect upsets to be discarded. We also expect fmd(1M) to treat fault
events with the same FRU and ASRU as an existing fault as duplicates.
Raw ereport telemetry is always available from the error log via
'fmdump -ev', even when a discard occurs.

Agent activity for storage is not new: selected platforms have
supported internal fmd(1M) generated 'ereport.io.scsi.disk'
'.predictive-failure', '.self-test-failure', and '.over-temperature'
ereports (and associated faults) for a long time.

For ereport and fault classes, we chose to define a minimal set of
event classes based on available payload information and the FRU/ASRU
orientation of that information. We intentionally decided not to
define classes based on interpretation of payload information. The
structure of new classes is shown in the diagram below.
To keep things simple, the same class structure is used for all types
of events. While other types of SCSI devices, like tape, are expected
to use a similar class structure, we have a '.disk' device-type level
in the proposed class structure to allow for knowledge articles
specific to a device-type, and to allow for future flexibility in
areas like disk-vs-tape SERD.

Eversholt rules map ereports of class '.merr' and '.derr' to faults
when they have a 'devid', the 'devid' matches topology, and the
'driver-assessment' property value is 'fatal'. A '.uderr' will map to
a fault/defect at some future date, depending on how gap 4 is
resolved. A '.recovered' ereport will always have a
'driver-assessment' of 'recovered'. Ereports of class '.tran' never
have a 'devid', and will be discarded by the Eversholt
diagnosis-engine front end. All other ereports that end up in
Eversholt (match topology) map to upsets - with a structured error log
view of device behavior available in the 'fmdump -e' error log.

                                 io.scsi
                                    |
                                  .cmd   [driver-assessment,op-code,cdb,
                                    |     pkt-reason,pkt-state,pkt-stats]
                                  .disk
                                    |
       +----------------------------+----------------------------+
       |                            |                            |
  .recovered                      .dev                         .tran
                              [stat-code]                        :
                                    |                        {future}
                +--------------------+--------------------+
                |                    |                    |
              .rqs                .serr               .uderr++
     [key*,asc*,ascq*,sense-data]                    [info,value]
                |
           +---------+
           |         |
         .merr+    .derr+
         [lba*]

    :LEGEND:
	.cmd		scsi_command
	.derr		device_error
	.dev		have_info_from_device
	.disk		disk_related
	.merr		media_error
	.recovered	recovered
	.rqs		request_sense
	.serr		scsi_status_error
	.tran		transport_error
	.uderr		unexpected_data_error

	[payload_property[,payload_property]*]
	+	maps to 'device-as-detector' fault
	++	future fault/defect, see gap 4
	*	property promoted into fault

Our choice of simple class names means that the fault names do not
directly provide a detailed interpretation of the fault. Instead, we
promote specific ereport properties into the fault events (like SCSI
T10 standard defined key/asc/ascq properties).
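As a sketch of how such a fatal-to-fault propagation might be written
in Eversholt (illustrative only: the event names follow the class
diagram above, but the FITrate, FRU/ASRU bindings, and exact
constraints of the delivered disk.esc are omitted or assumed):

```
/*
 * Illustrative sketch, not the delivered disk.esc: a '.merr' ereport
 * whose 'driver-assessment' payload value is 'fatal' propagates to a
 * fault of the same class. FITrate, FRU/ASRU declarations, and the
 * discard_if_config_unknown handling are omitted here.
 */
event ereport.io.scsi.cmd.disk.dev.rqs.merr@bay/disk;
event fault.io.scsi.cmd.disk.dev.rqs.merr@bay/disk;

prop fault.io.scsi.cmd.disk.dev.rqs.merr@bay/disk (0)->
    ereport.io.scsi.cmd.disk.dev.rqs.merr@bay/disk
    { payloadprop("driver-assessment") == "fatal" };
```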
Promoted properties, and their values, are available as "hc-specific"
members of the fault FMRI, and can be viewed with 'fmdump -V'.

All ereports generated will show up in the error log and can be
displayed and filtered using 'fmdump -e'. The fmdump command supports
a number of different filtering mechanisms ([15], [2]). A future
project could provide human readable error log output - and
reconstruct the scsi_log()-like messages currently in
/var/adm/messages. The 'recovered' ereport is generated so that each
'retry' sequence has a resolution of either 'recovered' or 'fault' in
'fmdump -e' output of the error log.

The project is introducing one private sd(7D)/ssd(7D) driver.conf(4)
property to control reporting of FMA telemetry via /var/adm/messages:

    'fm-scsi-log'

	The default value is 0; set this to 1 to enable FMA telemetry
	logging via scsi_log(9F) messages captured in the
	/var/adm/messages file.

The 'fm-scsi-log' property is supported to mitigate risk due to
unforeseen dependencies on /var/adm/messages. Any use of this tunable
should be considered a short-term fix. This tunable may be removed,
without notice, at a future date.

The generation of ereports can be disabled via the standard
driver.conf(4) facility mentioned in ddi_fm_init(9F). In addition,
there is a system(4) global variable called 'scsi_fm_capable' that
provides the default value when the 'fm-capable' property is
undefined. The default value of 'scsi_fm_capable' is
DDI_FM_EREPORT_CAPABLE.

From an ereport rate perspective, we have the conflicting goals of
providing complete error log information and of providing reliable
fault diagnosis. These goals conflict because the first wants to
capture complete retry sequences, and the second wants to ensure that
ereports leading to faults are never dropped. The driver already
implements various forms of delay-before-retry.
Also, in situations where upper level software tells the driver that
another valid copy of the data exists on a different device, one
failing retry sequence results in failure of all active IO to the
device (without additional retry/delay): both SVM and ZFS use
B_FAILFAST. These mechanisms will help reduce the ereport rate, and
reduce the chance of dropping ereports. If we experience problems with
dropped ereports, there are two things we can do to help:

    o Enhance the framework and SCSI ereport post code so that the
      posting code can assign a priority to an ereport: high priority
      ereports are less likely to be dropped. For what we are doing, a
      'driver-assessment' of 'fatal' would be high priority because we
      know it can generate fault events.

    o Limit the rate at which transport-detector events (i.e. events
      without devids) are generated (dropping events that exceed the
      maximum rate).

The highest ereport rate is expected to occur with transport-detector
events. An example would be a 'switch' failure under heavy load to
lots of disks (no B_FAILFAST). A single switch fault can affect many
initiator ports and disks. The future diagnosis of transport-detector
events ([4]) is expected to issue active probes to determine fault
location - an initial event is needed to trigger the generation of
active probes, but dropping some of the initial transport-detector
events should not affect the final diagnosis.

3. Fault Boundary Analysis (FBA)

3.1 For systems, subsystems, components or services that make up this
    portfolio, list all resources that will be diagnosed and all the
    ASRUs and FRUs (see RAS glossary for definitions) associated with
    each diagnosis in which the resource may be a suspect.

	See [1] and [2]

3.2 Diagrams or a description of the faults that may be present in the
    subsystem.
    A suitable format for this information is an Eversholt Fault Tree
    (see http://eversholt.central) that describes the ASRU and FRU
    boundaries, the faults that can be present within those
    boundaries, and the error propagation telemetry for those faults.

	See overview.
	Ereports: See [99] 'ereport' file
	Fault Tree: See [99] 'disk.esc' file
	Event Registry Changes: See [99] 'report.html' file

4. Diagnosis Strategy

4.1 Provide a diagnosis philosophy document or a pointer to a
    portfolio that describes the algorithm used to diagnose the faults
    described in Section 3.2 and the reasons for using said
    strategy(ies).

	See overview.

4.2 If your fault management activity (error handling, diagnosis or
    recovery) spans multiple fault manager regions, explain how each
    activity is coordinated between regions. For example, a Service
    Processor and Solaris domain may need to coordinate common error
    telemetry for diagnosis or provide interfaces to effect recovery
    operations.

	N/A

5. Error Handling Strategy

5.1 How are errors handled? Include a description of the immediate
    error reactions taken to capture error state and keep the system
    available without compromising the integrity of the rest of the
    system or user data. In the case of a device driver being
    hardened, describe the recovery/retry behavior, if any.

	See overview.
	Demo of use: See [99] 'demo' file.

5.2 What new error report (ereport) events will be defined and
    registered with the SMI event registry? Include all FMA Protocol
    ereport specifications.
    Provide a pointer to your ercheck output.

	New ereports: See [99] 'ereport' file
	Output of ercheck: See [99] 'report.html' file

5.3 If you are *not* using a reference fault manager (fmd(1M)) on your
    system, how are you persisting ereports and communicating them to
    Sun Services?

	N/A

5.4 For more complex system portfolios (like Niagara2), provide a
    comprehensive error handling philosophy document that describes
    how errors are handled by all components involved in error
    handling (including Service Processors, LDOMs, etc.) [As an
    example, for sun4v platforms this may include specs for
    reset/config, POST, hypervisor, Solaris, and service processor
    software components.]

	N/A

6. Recovery/Reaction

6.1 Are you introducing any new recovery agent(s)? If so, please
    provide a description of the recovery agent(s).

	N/A

6.2 What existing fma modules will be used in response to your faults?

	syslog-msgs [15]
	io-retire [10]

6.3 Are you modifying any existing (Section 6.2) recovery agents? If
    so, please indicate the agents below, with a brief description of
    how they will be modified.

	N/A

6.4 Describe any immediate (e.g. offlining) and long-term (e.g.
    black-listing) recovery.

	N/A

6.5 Provide pointers to dictionary/po entries and knowledge articles.

	See [99] 'DISK.dict' and 'DISK.po' files.

7. FRUID Implementation

7.1 Complete this section if you're submitting a portfolio for a
    platform.

	N/A

8. Test

8.1 Provide a pointer to your test plan(s) and specification(s). Make
    sure to list all FMA functionalities that are/are not covered by
    the test plan(s) and specification(s).

	See [99] 'testplan_ut.txt' and 'testplan_ut_result.txt' files.

	The sd driver supports a scsi_pkt(9S) fault injection mechanism
	at the front end of sdintr(). Our tools and test scripts use
	this mechanism. The demos in [20] are also generated using this
	mechanism. See [21] for a description.

	During development, we also developed a dtrace 'inject' fault
	injection mechanism.
	An 'inject' was similar to the dtrace 'breakpoint', but with
	the D program providing a string with kmdb commands to execute
	to automate fault injection. It was not considered a
	deliverable approach, so we switched to the method above.

8.2 Explain the risks associated with the test gaps, if any.

	1) Testing can't be run on all types of HBAs - we will however
	   ensure coverage on auto_request_sense and
	   non-auto_request_sense HBAs.

	2) There is a testing hole when the system is issuing SCSI
	   commands during HBA driven device enumeration - before a
	   disk device node is even created. Testing will cover
	   sdattach but will not cover HBA attach or transport
	   enumeration.

9. Gaps

9.1 List any gaps that prevent a full FMA feature set. This includes
    but is not limited to insufficient error detectors, error
    reporting, and software infrastructure.

	1) FRUID_for_disk: disk faults lack Sun part-number information
	   http://monaco.sfbay.sun.com/detail.jsf?cr=6740012

	2) ereport priority: detector indicates priority
	   http://monaco.sfbay.sun.com/detail.jsf?cr=6740013

	3) disk service-impact mapping: devinfo state and cfgadm output
	   http://monaco.sfbay.sun.com/detail.jsf?cr=6740014

	4) FRU for firmware:
	   http://monaco.sfbay.sun.com/detail.jsf?cr=6740015

	A) Topology: Device-as-detector diagnosis depends on
	   understanding topology, and the system does not know how to
	   represent the topology of all disks in the system.

	B) Health-Monitor: Implement phase 4 [4].

	C) Transport-DE: Implement phase 4 [4].

	D) fmdump: Human readable 'fmdump -e' output that looks like
	   current scsi_log() messages in /var/adm/messages.

	E) media: '.merr' faults have ASRUs at the device level instead
	   of at the block level; agents don't know how to handle block
	   level faults.

	F) SERD: SERD engine for recovered errors.

	G) iSCSI: 'device-path' does not map to the NIC hardware used.

9.2 Provide a risk assessment of the gaps listed in Section 9.1.
    Describe the customer and/or service impact if said gaps are not
    addressed.

	1) FRUID_for_disk: Delay and confusion in obtaining the proper
	   Sun qualified replacement part, increased potential for use
	   of a non-qualified replacement.

	2) ereport priority: Faults not being processed correctly.

	3) disk service-impact mapping: The customer sees poor
	   integration across the various Solaris utilities that
	   process system state.

	4) FRU for firmware: Mandated replacement of components that
	   are unlikely to resolve the problem.

	A) Topology: When topology is unknown, ereports associated with
	   fatal conditions will not produce faults.

	B) Health-Monitor: Faults on components that are not being
	   accessed may remain unexposed for extended periods of time,
	   and are often exposed when they are needed the most (standby
	   path, hot spare).

	C) Transport-DE: Ereports associated with fatal conditions will
	   not produce faults.

	D) fmdump: Customers must look at raw 'fmdump -e' output
	   instead of the more familiar scsi_log() representation of
	   the same information.

	E) media: We may be faulting entire devices in situations where
	   a finer-grained lba/partition approach is more appropriate.

	F) SERD: We may continue to use marginal/suspect hardware.

	G) iSCSI: Impact is minimal for device-as-detector ereports
	   (this phase [3]); for the next phase [4], transport-detector
	   faults will not be able to identify specific NIC ports.

9.3 List future projects/get-well plans to address the gaps listed in
    Section 9.1. Provide target date and/or release information as to
    when these gaps will be addressed.

	1) FRUID_for_disk: CR filed, considered high priority, schedule
	   planning is needed.

	2) ereport priority: CR filed, schedule planning is needed.

	3) disk service-impact mapping: CR filed, schedule planning is
	   needed.

	4) FRU for firmware: CR filed, schedule planning is needed.

	A) Topology: One possibility is to populate disks that lack
	   topology in an "unknown enclosure". More planning is needed.
	B-C) Health-Monitor, Transport-DE: While some prototype work
	   has been done, after putback of this phase (three), we
	   intend to start working on a more detailed schedule for
	   phase 4. This may include breaking up the phase 4 goals into
	   smaller sub-phases that are delivered independently.

	D-G) At this point in time we don't have any plans to address
	   these issues.

10. Dependencies

10.1 List all project and other portfolio dependencies to fully
     realize the targeted FMA feature set for this portfolio. A
     portfolio may have dependencies on infrastructure projects. For
     example, the "Sun4u PCI hostbridge" and "PCI-X" projects have a
     dependency on the events/ereports defined within the "PCI Local
     Bus" portfolio.

	See [0].

11. References

11.1 Provide pointers to all documents referenced in previous sections
     (for example, list pointers to error handling and diagnosis
     philosophy documents, test plans, etc.)

	See [99] for pointers to project specific information.

UMBRELLA:

[ 0] Umbrella for Disk FMA: Unified Disk FMA
     http://wikihome.sfbay/fma-portfolio/Wiki.jsp?page=2007.015.UnifiedDisk

STORAGE FMA:

[ 1] Phase1: Topology: Generic Topology for Internal Disks
     http://wikihome.sfbay/fma-portfolio/Wiki.jsp?page=2007.016.DiskTopology
     http://sac.sfbay/PSARC/2007/388
     http://www.opensolaris.org/os/community/arc/caselog/2007/388

[ 2] Phase2: Infrastructure: Multiplexed I/O Enhancements to Support FMA
     http://wikihome.sfbay.sun.com/fma-portfolio/Wiki.jsp?page=2008.004.MPXIO
     http://sac.sfbay/PSARC/2008/077
     http://www.opensolaris.org/os/community/arc/caselog/2008/077

[ 3] Phase3: Device-as-Detector (THIS CASE)
     [99] 'portfolio' file
     http://wikihome.sfbay.sun.com/fma-portfolio/Wiki.jsp?page=2008.032.SCSIP3
     http://sac.sfbay/PSARC/2008/XXX...TBS
     http://www.opensolaris.org/os/community/arc/caselog/2008/XXX...TBS

[ 4] Phase4: Transport Diagnosis (Future)
     ...future

RELATED_THUMPER_WORK:

[ 5] FMA 2006/012: Sun Fire X4500 Disk Failures: Phase I
     http://fma.eng/documents/engineering/portfolios/2006/012.thumper
     http://sac.eng/PSARC/2006/322
     http://www.opensolaris.org/os/community/arc/caselog/2006/322

[ 6] "Generic Disk Monitoring (sfx4500 phase 2)" portfolio
     http://fma.eng/documents/engineering/portfolios/2007/007.Generic-disk-monitoring-sfx4500-p2
     http://sac.eng/Archives/CaseLog/arc/PSARC/2007/202
     http://www.opensolaris.org/os/community/arc/caselog/2007/202

RELATED_ZFS_WORK:

[ 7] FMA 2005/019 ZFS FMA Phase 0
     http://fma.eng/documents/engineering/portfolios/2005/019.zfs

[ 8] FMA 2006/005 ZFS FMA Phase 1
     http://fma.eng/documents/engineering/portfolios/2006/005.zfs-phase1
     http://sac.eng/PSARC/2006/139
     http://www.opensolaris.org/os/community/arc/caselog/2006/139

[ 9] FMA 2007/006 ZFS FMA Phase 2
     http://fma.eng/documents/engineering/portfolios/2007/006.ZFS-P2
     http://sac.eng/PSARC/2007/283/
     http://www.opensolaris.org/os/community/arc/caselog/2007/283

IORETIRE:

[10] FMA 2007/004 Solaris I/O Retire Agent
     http://fma.eng/documents/engineering/portfolios/2007/004.IO_Retireagent
     http://sac.eng/PSARC/2007/290
     http://www.opensolaris.org/os/community/arc/caselog/2007/290

MISC:

[11] Improved Disk-Drive Failure Warnings
     http://charlotte.ucsd.edu/users/elkan/ieeereliability.pdf

[12] Dev scheme specification - Section 8.4.3
     http://fma.eng/documents/engineering/protocol_whtppr.pdf

[13] EVERSHOLT: Eversholt Diagnosis Technology
     http://sac.eng/PSARC/2003/428

[14] EVERSHOLT: Eversholt Language Manual (Version 1.5 10/04/06)
     http://eversholt.central/docs/language/

[15] FMD: Solaris Fault Management Daemon
     http://sac.sfbay/PSARC/2003/089/

[16] OLD: FMA 2006/013 SCSI FMA Phase 1 (Withdrawn)
     http://fma.eng/documents/engineering/portfolios/2006/013.scsi-phase1

[18] "service impact" information
     ddi_fm_ereport_post(9F)
     http://fma.sfbay/documents/engineering/fmaioprm/chap3-9.html
     http://sac.eng/PSARC/2007/290
     http://sac.eng/PSARC/2002/288

[19] T10 SCSI Standards
     http://t10.org

PROJECT:

[20] Project details
     http://fogbroom.prc/bjroot/users/yz203490/FMA_phase3

[21] Project details: fault injection
     http://fogbroom.prc/bjroot/users/xiaoli/onnv-fma3/unit_test/README
     http://fogbroom.prc/bjroot/users/xiaoli/onnv-fma3/unit_test/

[99] SCSI Disk Device-as-Detector Diagnosis (phase3)
     http://wikihome.sfbay.sun.com/fma-portfolio/Wiki.jsp?page=2008.032.SCSIP3