Requirement
-----------

To record more detailed information about individual members of a
suspect list: whether each has been physically repaired, replaced or
removed, or has been acquitted. The acquittal could be automatic (as
part of an improved conviction policy) or manual (e.g. by following
instructions in a knowledge article).

Proposal
--------

1. Add the following new options to fmadm

       fmadm repaired fmri | label
       fmadm replaced fmri | label
       fmadm acquit fmri | label
       fmadm acquit uuid [ fmri | label ]

   These replace the existing "fmadm repair", and behave in the same
   way as "fmadm repair", except that fmd remembers whether the fmri
   or uuid has been repaired, replaced or acquitted. For compatibility
   reasons, we will retain "fmadm repair" as a synonym for
   "fmadm repaired".

   "fmadm repaired" should be used when some physical repair has been
   carried out which it is hoped will resolve the problem - such as
   reseating a card or straightening a bent pin.

   "fmadm replaced" should be used to indicate that the suspect FRU
   has been replaced. If it is automatically discovered that a FRU has
   been replaced (serial number has changed), then this is treated in
   the same way as if "fmadm replaced" had been typed. Also,
   "fmadm replaced" will not be allowed if fmd can automatically
   confirm that the FRU has not been replaced (serial number has not
   changed).

   If it is automatically discovered that a FRU has been removed but
   not replaced, then the current behaviour is unchanged - the suspect
   will be displayed as "not present", but will not be considered as
   permanently removed until the rsrc.age time has expired.

   Note that replacement takes precedence over repair and both take
   precedence over acquittal. So you can acquit something and then
   subsequently repair it, but you can't acquit something that has
   already been repaired.
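The precedence rule (replacement over repair, repair over acquittal) can be sketched as an ordered enumeration. This is only an illustration: the type and function names below are hypothetical and not part of fmd, and the sketch assumes that repeating an action already recorded is accepted.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical per-suspect disposition, ordered by the precedence given
 * in the text: acquittal < repair < replacement. These names are
 * illustrative only and are not part of fmd.
 */
typedef enum {
	SUSPECT_FAULTY = 0,	/* no repair action recorded yet */
	SUSPECT_ACQUITTED,	/* "fmadm acquit" */
	SUSPECT_REPAIRED,	/* "fmadm repaired" */
	SUSPECT_REPLACED	/* "fmadm replaced" or serial number change */
} suspect_disp_t;

/*
 * An action is accepted only if it does not rank below what is already
 * recorded. Assumption: repeating the same action is accepted.
 */
bool
suspect_action_allowed(suspect_disp_t current, suspect_disp_t requested)
{
	return (requested != SUSPECT_FAULTY && requested >= current);
}

/* Record the action if allowed; otherwise leave the suspect unchanged. */
suspect_disp_t
suspect_apply(suspect_disp_t current, suspect_disp_t requested)
{
	suspect_disp_t result;

	result = suspect_action_allowed(current, requested) ?
	    requested : current;
	assert(result >= current);	/* precedence never moves backwards */
	return (result);
}
```

With this ordering, suspect_action_allowed(SUSPECT_ACQUITTED, SUSPECT_REPAIRED) is true (acquit, then repair), while suspect_action_allowed(SUSPECT_REPAIRED, SUSPECT_ACQUITTED) is false.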
   A case is considered repaired (it moves into the FMD_CASE_REPAIRED
   state and a list.repaired event is generated) when either its uuid
   is acquitted or all suspects have been either repaired, replaced,
   removed or acquitted.

   It is expected that the most common usages would be
   "fmadm acquit label", "fmadm replaced label" or
   "fmadm repaired label". Thus if we have a case with a two-entry
   suspect list, one for a FRU with label "PCIE1" and the other for
   "MB", and the KA suggests replacing the FRU in "PCIE1" but
   acquitting "MB", then the user should replace the FRU in "PCIE1"
   then type

       fmadm replaced PCIE1
       fmadm acquit MB

   Typically the user would only want to "acquit by fmri/label" if it
   was determined that the resource was not guilty in all current
   cases in which it is a suspect. However, to allow a FRU to be
   manually acquitted in one case while remaining a suspect in all
   others, an option has been provided which allows the user to
   specify both uuid and fmri/label, i.e.

       fmadm acquit uuid [ fmri | label ]

2. Extend the fmd case state model to include a final state
   FMD_CASE_RESOLVED. This state can only be transitioned to from the
   FMD_CASE_REPAIRED state, and indicates that all agents have
   successfully carried out their actions resulting from the
   list.repaired event.

   The suspects for the case will remain in the resource cache and
   visible through "fmadm faulty -a" as long as the case remains in
   the FMD_CASE_REPAIRED state. The message put out by
   "fmadm faulty -a" for an ASRU that is still unusable but is
   associated with one or more resources that are no longer faulty is
   currently "unknown, not present or disabled". This should be
   changed to the more descriptive "out of service, but associated
   components no longer faulty".
   For example

   # fmadm faulty -a
   --------------- ------------------------------------ -------------- -------
   TIME            EVENT-ID                             MSG-ID         SEVERITY
   --------------- ------------------------------------ -------------- -------
   Sep 21 10:01:36 d482f935-5c8f-e9ab-9f25-d0aaafec1e6c PCIEX-9999-7J  Major

   Fault class : fault.io.pciex.device_xxxx
   Affects     : dev:///pci@0,0/pci1022,7458@11/pci1000,3060@0
                     out of service, but associated components no longer faulty
   FRU         : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
                     replaced
   Description : A problem was detected for a PCIEX device.
                 Refer to http://sun.com/msg/PCIEX-9999-7J for more information.
   Response    : One or more device instances may be disabled
   Impact      : Possible loss of services provided by the device instances
                 associated with this fault
   Action      : Schedule a repair procedure to replace the affected device.
                 A reboot of the system will be required to bring the resource
                 back into service after the repair.

   The case will finally be deleted from the resource cache (and cease
   to be displayed by "fmadm faulty -a") only when the case state
   moves to FMD_CASE_RESOLVED. When the case does finally move to the
   FMD_CASE_RESOLVED state, an additional list.resolved event will be
   broadcast and logged to indicate that the case is now finally
   resolved.

   In order to implement this we need an additional interface between
   the agent and fmd. This is the equivalent of the fmd_case_uuclose()
   call made by the agent when retire actions have been successfully
   completed. So a similar interface to be called when all unretire
   actions have completed would be

       fmd_case_uuresolved(fmd_hdl_t *handle, const char *uuid)

   If the case is in the FMD_CASE_REPAIRED state but has not
   progressed to the FMD_CASE_RESOLVED state, and fmd is restarted
   (e.g. after a reboot) then the list.repaired will be replayed.

   fmd_case_uuresolved() will fail unless all the suspects in the
   suspect list are repaired, replaced, acquitted, or removed.

3. Add an additional list.updated event.
   This will be sent whenever an individual suspect is repaired,
   replaced, removed or acquitted. It will be logged in the fltlog (in
   the same way as list.repaired is now logged), and can be subscribed
   to by retire agents etc so they can unretire an individual suspect
   as soon as it has been repaired/replaced/acquitted/removed. It will
   not be logged in syslog nor sent as an SNMP trap.

   So in the example above a list.updated will be generated when
   "fmadm replaced PCIE1" is typed, and the list.repaired is then
   generated when "fmadm acquit MB" is subsequently typed.

   If one or more suspects have been repaired, replaced, removed or
   acquitted but the case has not progressed to the FMD_CASE_REPAIRED
   state, and fmd is restarted (e.g. after a reboot) then the
   list.updated will be replayed.

4. Add a new fmd_api interface

       fmd_nvl_fmri_replaced(fmd_hdl_t *hdl, nvlist_t *nvl)

   This will tell the caller whether the FRU at the specified fmri has
   been replaced by another with a different serial/part number. This
   is different from the existing fmd_nvl_fmri_present() in that the
   latter doesn't distinguish between a FRU being replaced and just
   not being present. If the FRU serial number has changed, we can
   automatically consider it as having been replaced straight away. If
   it is just not present, we should wait for rsrc.age to expire
   before considering it to have been permanently removed.

   In order to implement this we will need a new scheme interface

       fmd_fmri_replaced(nvlist_t *nvl)

   and a new topo interface

       topo_fmri_replaced(topo_hdl_t *thp, nvlist_t *nvl, int *err);

   These interfaces should return one of four values

       FMD_REPLACED_NOT_PRESENT  nothing present at the location
       FMD_REPLACED_FALSE        same resource is still at the location
       FMD_REPLACED_TRUE         the resource has been replaced
       FMD_REPLACED_UNKNOWN      there is something at the location but
                                 no serial/part number support is
                                 available to determine if it is the
                                 same part as before

5. The status array in the list event is extended.
   Currently it contains the following bits

       #define FM_SUSPECT_FAULTY       0x1
       #define FM_SUSPECT_UNUSABLE     0x2
       #define FM_SUSPECT_NOT_PRESENT  0x4

   This is extended by adding

       #define FM_SUSPECT_ACQUITTED    0x8
       #define FM_SUSPECT_REPAIRED     0x10
       #define FM_SUSPECT_REPLACED     0x20

   Here FM_SUSPECT_ACQUITTED means acquitted by "fmadm acquit",
   FM_SUSPECT_REPAIRED means repair notified by "fmadm repaired", and
   FM_SUSPECT_REPLACED means repair notified by "fmadm replaced" or
   replacement auto-detected by serial number change.

   These additional bits will be present in the list.repaired event
   and the list.updated event. The FM_SUSPECT_ACQUITTED bit may also
   be present in the list.suspect event if the acquittal was carried
   out by automated conviction policy.

   These bits are also passed across the libfmd_adm interface and so
   will be visible via "fmadm faulty".

6. Enhance fmdump so that it displays the status bits in a more
   human-readable fashion.

   # fmdump
   TIME                 UUID                                 SUNW-MSG-ID
   Mar 24 17:52:20.9963 7b51569c-92aa-c7da-d0c7-84baea334f59 PCI-8000-8S
   Mar 24 18:12:00.0000 7b51569c-92aa-c7da-d0c7-84baea334f59 Updated
   Mar 24 18:13:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Repaired
   Mar 24 18:14:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Resolved

   # fmdump -v
   TIME                 UUID                                 SUNW-MSG-ID
   Mar 24 17:52:20.9963 7b51569c-92aa-c7da-d0c7-84baea334f59 PCI-8000-8S
     67%  fault.io.pci.device-interr

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: PCIE1

     33%  fault.io.pci.bus-linkerr

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: MB

   Mar 24 18:12:00.0000 7b51569c-92aa-c7da-d0c7-84baea334f59 Updated
     67%  fault.io.pci.device-interr    Repair Attempted

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: PCIE1

     33%  fault.io.pci.bus-linkerr

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: MB

   Mar 24 18:13:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Repaired
     67%  fault.io.pci.device-interr    Repair Attempted

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: PCIE1

     33%  fault.io.pci.bus-linkerr      Acquitted

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: MB

   Mar 24 18:14:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Resolved
     67%  fault.io.pci.device-interr    Repair Attempted

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: PCIE1

     33%  fault.io.pci.bus-linkerr      Acquitted

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: MB

   The example above is where a repair has been attempted on one
   suspect (notified by "fmadm repaired"), and then the other suspect
   is acquitted via "fmadm acquit". A further example below is where
   conviction policy acquits one suspect and subsequently the other
   suspect is replaced.

   # fmdump
   TIME                 UUID                                 SUNW-MSG-ID
   Mar 24 17:52:20.9963 7b51569c-92aa-c7da-d0c7-84baea334f59 PCI-8000-8S
   Mar 24 18:13:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Repaired
   Mar 24 18:14:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Resolved

   # fmdump -v
   TIME                 UUID                                 SUNW-MSG-ID
   Mar 24 17:52:20.9963 7b51569c-92aa-c7da-d0c7-84baea334f59 PCI-8000-8S
     67%  fault.io.pci.device-interr

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: PCIE1

     33%  fault.io.pci.bus-linkerr      Acquitted

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: MB

   Mar 24 18:13:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Repaired
     67%  fault.io.pci.device-interr    Replaced

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: PCIE1

     33%  fault.io.pci.bus-linkerr      Acquitted

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: MB

   Mar 24 18:14:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Resolved
     67%  fault.io.pci.device-interr    Replaced

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: PCIE1

     33%  fault.io.pci.bus-linkerr      Acquitted

           Problem in: ...
              Affects: ...
                  FRU: ...
             Location: MB

7. Enhance "fmadm faulty" so it distinguishes "repaired", "replaced"
   and "acquitted" components. So in the example below SLOT 2 and
   SLOT 3 now say "repair attempted" and "acquitted" respectively.
   # fmadm faulty
   --------------- ------------------------------------ -------------- -------
   TIME            EVENT-ID                             MSG-ID         SEVERITY
   --------------- ------------------------------------ -------------- -------
   Sep 21 10:01:36 d482f935-5c8f-e9ab-9f25-d0aaafec1e6c PCIEX-8000-7J  Major

   Fault class : fault.io.pciex.device_invreq
   Affects     : dev:///pci@0,0/pci1022,7458@11/pci1000,3060@0
                 dev:///pci@0,0/pci1022,7458@11/pci1000,3060@1
                     ok and in service
                 dev:///pci@0,0/pci1022,7458@11/pci1000,3060@2
                 dev:///pci@0,0/pci1022,7458@11/pci1000,3060@3
                     faulty and taken out of service
   FRU         : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
                     repair attempted
                 "SLOT 3" (hc://.../pciexrc=3/pciexbus=4/pciexdev=1)
                     acquitted
                 "SLOT 4" (hc://.../pciexrc=3/pciexbus=4/pciexdev=2)
                     not present
                 "SLOT 5" (hc://.../pciexrc=3/pciexbus=4/pciexdev=3)
                     faulty
   Description : A problem was detected for a PCIEX device.
                 Refer to http://sun.com/msg/PCIEX-8000-7J for more information.
   Response    : One or more device instances may be disabled
   Impact      : Possible loss of services provided by the device instances
                 associated with this fault
   Action      : Schedule a repair procedure to replace the affected device.

8. There is already an fmd_nvl_fmri_faulty() interface, though it is
   currently not documented. It tells the caller whether there is any
   active case in which nvl is a suspect that has not been
   repaired/replaced/removed/acquitted. It currently works only if nvl
   is an ASRU. This needs to be extended so that it also works for
   fmris representing FRUs/resources. It could then be used to allow
   an agent (e.g. fault LED agent or DFRUID agent) to tell if a
   component is still an active suspect in any outstanding cases.

   A further feature which would be very useful is to add an extra
   "class" parameter, so an agent/DE can enquire whether there is an
   active fault of specific class "class" which applies to this
   resource.
   So I propose renaming the function as follows

       fmd_nvl_fmri_has_fault(fmd_hdl_t *hdl, nvlist_t *nvl, int type, char *class)

   The type parameter describes what type of fmri is represented by
   nvl, and can be one of

       FMD_HAS_FAULT_FRU
       FMD_HAS_FAULT_ASRU
       FMD_HAS_FAULT_RESOURCE

   The "class" parameter can be wildcarded in the same way as for
   fmd_hdl_subscribe().

   An agent which receives a list.repaired may call
   fmd_nvl_fmri_has_fault() to check if there are any other faults
   still affecting that resource before proceeding with re-onlining
   that resource.

   This mechanism would also allow a DE to check if a particular fault
   had been previously detected on a given resource in the system and
   use that information as part of its diagnosis algorithm. To that
   end I propose adding a new eversholt function
   "has_fault(path, class)" which reports if the resource at address
   "path" has an outstanding fault of class "class".

9. The SNMP FMA MIB needs to be extended to represent the additional
   suspect states. A SUNFMFAULTEVENT_COL_STATUS column needs to be
   added to the sunFmFaultEventTable, to give the status of each
   suspect in the list. While we are at it, it looks like a
   SUNFMFAULTEVENT_COL_LOCATION column is also missing and needs to be
   added.

10. There is already an fmd_nvl_fmri_unusable() interface which is
    used to tell whether an ASRU is in a usable or unusable state.
    However there are a number of other states which an ASRU could be
    in, such as degraded (usable, but running with reduced
    performance/functionality - this would for example represent the
    "service_degraded" state reported by hardened leaf drivers). Also
    it might be useful in some circumstances to give further
    information about the unusable state such as "unusable until next
    reset" and "unusable until replaced" (the latter is apparently an
    issue on SPs where even if a resource is repaired/acquitted, the
    low level firmware will not re-online it if the FRUID serial
    number is unchanged).
    So I propose adding a further new interface

        fmd_nvl_fmri_service_state(fmd_hdl_t *hdl, nvlist_t *nvl)

    In order to implement this we will need a new scheme interface

        fmd_fmri_service_state(nvlist_t *nvl)

    and a new topo interface

        topo_fmri_service_state(topo_hdl_t *thp, nvlist_t *nvl, int *err);

    Each of these interfaces will return one of the following

        FMD_SERVICE_STATE_UNKNOWN
        FMD_SERVICE_STATE_OK
        FMD_SERVICE_STATE_DEGRADED
        FMD_SERVICE_STATE_UNUSABLE

    and this can be extended to include additional states such as

        FMD_SERVICE_STATE_DEGRADED_PENDING_RESET
        FMD_SERVICE_STATE_UNUSABLE_PENDING_RESET
        FMD_SERVICE_STATE_UNUSABLE_UNTIL_REPLACED

11. A further feature which would be useful is to provide a simple log
    file which logs all "fmadm" commands that have been typed along
    with a timestamp. This could be a file called /var/fm/fmd/admlog,
    and could be subject to rotation like errlog and fltlog.
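
To illustrate how a caller might consume the service states proposed in item 10, the sketch below collapses the extended states back to a simple usable/unusable answer, as fmd_nvl_fmri_unusable() gives today. The state names come from the proposal; the numeric values, the service_state_t type and both helper functions are assumptions made for this sketch only. It also assumes that the unknown and degraded states are not reported as unusable.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * State names are taken from the proposal; the numeric values are
 * assumptions made for this sketch only.
 */
typedef enum {
	FMD_SERVICE_STATE_UNKNOWN = 0,
	FMD_SERVICE_STATE_OK,
	FMD_SERVICE_STATE_DEGRADED,
	FMD_SERVICE_STATE_UNUSABLE,
	FMD_SERVICE_STATE_DEGRADED_PENDING_RESET,
	FMD_SERVICE_STATE_UNUSABLE_PENDING_RESET,
	FMD_SERVICE_STATE_UNUSABLE_UNTIL_REPLACED
} service_state_t;

/*
 * Hypothetical helper: is the resource out of service in any form?
 * Assumption: unknown and degraded states count as not unusable.
 */
bool
service_state_is_unusable(service_state_t state)
{
	assert(state <= FMD_SERVICE_STATE_UNUSABLE_UNTIL_REPLACED);

	switch (state) {
	case FMD_SERVICE_STATE_UNUSABLE:
	case FMD_SERVICE_STATE_UNUSABLE_PENDING_RESET:
	case FMD_SERVICE_STATE_UNUSABLE_UNTIL_REPLACED:
		return (true);
	default:
		return (false);
	}
}

/*
 * Hypothetical helper: true when repair or acquittal alone will not
 * bring the resource back into service (the SP case in item 10).
 */
bool
service_state_needs_replacement(service_state_t state)
{
	return (state == FMD_SERVICE_STATE_UNUSABLE_UNTIL_REPLACED);
}
```

An agent could use a check like service_state_needs_replacement() to avoid attempting to re-online a resource that the low level firmware will keep offline until the FRU is replaced.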