Requirement
-----------

Record more detailed information about each member of a suspect list: whether
it has been physically repaired, replaced or removed, or has been acquitted.
The acquittal could be automatic (as part of an improved conviction policy)
or manual (e.g. by following instructions in a knowledge article).

Proposal
--------

1. Add the following new subcommands to fmadm:

	fmadm repaired fmri | label
	fmadm replaced fmri | label
	fmadm acquit fmri | label
	fmadm acquit uuid [ fmri | label ]

   These replace the existing "fmadm repair" and behave in the same way,
   except that fmd remembers whether the fmri or uuid has been repaired,
   replaced or acquitted. For compatibility reasons, we will retain
   "fmadm repair" as a synonym for "fmadm repaired".

   "fmadm repaired" should be used when some physical repair has been carried
   out which it is hoped will resolve the problem - such as reseating a card
   or straightening a bent pin. "fmadm replaced" should be used to indicate that
   the suspect FRU has been replaced.

   If it is automatically discovered that a FRU has been replaced (its serial
   number has changed), then this is treated in the same way as if
   "fmadm replaced" had been typed. Conversely, "fmadm replaced" will not be
   allowed if fmd can automatically confirm that the FRU has not been replaced
   (its serial number has not changed).
  
   If it is automatically discovered that a FRU has been removed but not
   replaced, then the current behaviour is unchanged - the suspect will be
   displayed as "not present", but will not be considered as permanently
   removed until the rsrc.age time has expired.

   Note that replacement takes precedence over repair and both take precedence
   over acquittal. So you can acquit something and then subsequently repair it,
   but you can't acquit something that has already been repaired.
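
   The precedence rule above can be sketched as a simple rank comparison.
   This is only an illustration under stated assumptions: the enum and the
   function name are hypothetical and are not part of fmd, and accepting a
   repeat of the same disposition is an assumption the text does not cover.

```c
#include <assert.h>

/*
 * Hypothetical per-suspect disposition, listed in increasing order of
 * precedence. These names are illustrative and are not fmd definitions.
 */
typedef enum {
	DISP_FAULTY = 0,	/* nothing recorded yet */
	DISP_ACQUITTED,
	DISP_REPAIRED,
	DISP_REPLACED
} suspect_disp_t;

/*
 * Return 1 if "request" may be recorded over "current", 0 otherwise.
 * Assumption: re-recording the same disposition is accepted as a no-op.
 */
int
disp_may_apply(suspect_disp_t current, suspect_disp_t request)
{
	/* the subcommands above never request "faulty" */
	assert(request != DISP_FAULTY);
	return (request >= current);
}
```

   With this ordering, repairing an acquitted suspect is accepted, while
   acquitting a repaired or replaced suspect is rejected.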

   A case is considered repaired (it moves into the FMD_CASE_REPAIRED state and
   a list.repaired event is generated) when either its uuid is acquitted or all
   suspects have been either repaired, replaced, removed or acquitted.

   It is expected that the most common usages would be "fmadm acquit label",
   "fmadm replaced label" or "fmadm repaired label". Thus if we have a case
   with a two-entry suspect list, one for a FRU with label "PCIE1" and the
   other for "MB", and the KA suggests replacing the FRU in "PCIE1" but
   acquitting "MB", then the user should replace the FRU in "PCIE1" then type

	fmadm replaced PCIE1
	fmadm acquit MB

   Typically the user would only want to "acquit by fmri/label" if it was
   determined that the resource was not guilty in all current cases in which it
   is a suspect. However, to allow a FRU to be manually acquitted in one case
   while remaining a suspect in all others, an option has been provided which
   allows the user to specify both uuid and fmri/label, i.e.

	fmadm acquit uuid [ fmri | label ]

2. Extend the fmd case state model to include a final state FMD_CASE_RESOLVED.
   This state can only be transitioned to from the FMD_CASE_REPAIRED state,
   and indicates that all agents have successfully carried out their
   actions resulting from the list.repaired event. The suspects for the case
   will remain in the resource cache and visible through "fmadm faulty -a" as
   long as the case remains in the FMD_CASE_REPAIRED state.

   The message put out by "fmadm faulty -a" for an ASRU that is still unusable
   but is associated with one or more resources that are no longer faulty is
   currently "unknown, not present or disabled". This should be changed to the
   more descriptive "out of service, but associated components no longer
   faulty". For example:

# fmadm faulty -a
 --------------- ------------------------------------  -------------- -------
 TIME            EVENT-ID                              MSG-ID         SEVERITY
 --------------- ------------------------------------  -------------- -------
 Sep 21 10:01:36 d482f935-5c8f-e9ab-9f25-d0aaafec1e6c  PCIEX-9999-7J  Major

 Fault class  : fault.io.pciex.device_xxxx
 Affects      : dev:///pci@0,0/pci1022,7458@11/pci1000,3060@0
                  out of service, but associated components no longer faulty
 FRU          : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
                  replaced

 Description  : A problem was detected for a PCIEX device.
                Refer to http://sun.com/msg/PCIEX-9999-7J for more information.

 Response     : One or more device instances may be disabled

 Impact       : Possible loss of services provided by the device instances
                associated with this fault

 Action       : Schedule a repair procedure to replace the affected device.
                A reboot of the system will be required to bring the resource
                back into service after the repair.


   The case will finally be deleted from the resource cache (and cease to be
   displayed by "fmadm faulty -a") only when the case state moves to
   FMD_CASE_RESOLVED.

   When the case moves to the FMD_CASE_RESOLVED state, an additional
   list.resolved event will be broadcast and logged to indicate that the
   case is now resolved.

   In order to implement this we need an additional interface between the agent
   and fmd. This is the equivalent of the fmd_case_uuclose() call made by the
   agent when retire actions have been successfully completed. A similar
   interface, to be called when all unretire actions have completed, would be

   fmd_case_uuresolved(fmd_hdl_t *handle, const char *uuid)

   If the case is in the FMD_CASE_REPAIRED state but has not progressed to
   the FMD_CASE_RESOLVED state, and fmd is restarted (e.g. after a reboot),
   then the list.repaired event will be replayed.

   fmd_case_uuresolved() will fail unless all the suspects in the suspect list
   have been repaired, replaced, acquitted or removed.
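
   The conditions under which a case may move to FMD_CASE_RESOLVED can be
   sketched with a small stand-alone model. Nothing here is fmd code: the
   case_model_t type, the CASE_* names and case_resolve() are hypothetical
   stand-ins, and the 0/-1 return convention is an assumption, since this
   proposal does not say how fmd_case_uuresolved() reports failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the FMD_CASE_* states discussed above. */
typedef enum { CASE_SOLVED, CASE_REPAIRED, CASE_RESOLVED } case_state_t;

typedef struct {
	case_state_t state;
	const int *suspect_done;  /* nonzero: repaired/replaced/acquitted/removed */
	size_t nsuspects;
} case_model_t;

/*
 * Model of the check behind fmd_case_uuresolved(): the case must be in the
 * repaired state and every suspect must have been dealt with.
 * Returns 0 and moves the case to CASE_RESOLVED on success, -1 otherwise.
 */
int
case_resolve(case_model_t *cp)
{
	size_t i;

	if (cp->state != CASE_REPAIRED)
		return (-1);	/* RESOLVED is only reachable from REPAIRED */

	for (i = 0; i < cp->nsuspects; i++) {
		if (!cp->suspect_done[i])
			return (-1);	/* a suspect is still outstanding */
	}

	cp->state = CASE_RESOLVED;
	return (0);
}
```

   In this model a second call on the same case fails, because the case is
   no longer in the repaired state.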

3. Add an additional list.updated event. This will be sent whenever an
   individual suspect is repaired, replaced, removed or acquitted. It will be
   logged in the fltlog (in the same way as list.repaired is now logged), and
   can be subscribed to by retire agents etc so they can unretire an individual
   suspect as soon as it has been repaired/replaced/acquitted/removed. It will
   not be logged in syslog nor sent as an SNMP trap.

   So in the example above a list.updated will be generated when
   "fmadm replaced PCIE1" is typed, and the list.repaired is then generated
   when "fmadm acquit MB" is subsequently typed.

   If one or more suspects have been repaired, replaced, removed or acquitted
   but the case has not progressed to the FMD_CASE_REPAIRED state, and fmd
   is restarted (e.g. after a reboot), then the list.updated event will be
   replayed.
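
   The replay rules in items 2 and 3 can be summarised in one small function.
   This is a sketch only: the CASE_* names and replay_event() are hypothetical,
   and any replay behaviour that exists today (for example for cases with no
   updated suspects) is deliberately not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the FMD_CASE_* states discussed above. */
typedef enum { CASE_SOLVED, CASE_REPAIRED, CASE_RESOLVED } case_state_t;

/*
 * Return the class of the event this proposal replays when fmd restarts,
 * or NULL if neither of the two replay rules applies.
 */
const char *
replay_event(case_state_t state, int any_suspect_updated)
{
	switch (state) {
	case CASE_REPAIRED:
		/* repaired but not yet resolved */
		return ("list.repaired");
	case CASE_SOLVED:
		/* some suspects dealt with, case not yet repaired */
		return (any_suspect_updated ? "list.updated" : NULL);
	default:
		/* resolved cases need no replay */
		return (NULL);
	}
}
```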

4. Add a new fmd_api interface
   
   fmd_nvl_fmri_replaced(fmd_hdl_t *hdl, nvlist_t *nvl)

   This will tell the caller whether the FRU at the specified fmri has been
   replaced by another with a different serial/part number. This is different
   from the existing fmd_nvl_fmri_present() in that the latter doesn't
   distinguish between a FRU being replaced and just not being present. If the
   FRU serial number has changed, we can consider it as having been replaced
   straight away. If it is just not present, we should wait for rsrc.age to
   expire before considering it to have been permanently removed.

   In order to implement this we will need a new scheme interface

   fmd_fmri_replaced(nvlist_t *nvl)

   and a new topo interface

   topo_fmri_replaced(topo_hdl_t *thp, nvlist_t *nvl, int *err);

   These interfaces should return one of four values

	FMD_REPLACED_NOT_PRESENT  nothing present at the location
	FMD_REPLACED_FALSE        same resource is still at the location
	FMD_REPLACED_TRUE         the resource has been replaced
	FMD_REPLACED_UNKNOWN      there is something at the location but
				  no serial/part number support is available
				  to determine if it is the same part as before
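
   How a scheme or topo method might arrive at one of these four values can
   be sketched as follows. The FMD_REPLACED_* names come from this proposal,
   but their numeric values are placeholders, classify_replaced() is a
   hypothetical helper, and comparing only a serial number string (with NULL
   meaning "no serial number support") is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Names from this proposal; the numeric values are placeholders. */
typedef enum {
	FMD_REPLACED_NOT_PRESENT,
	FMD_REPLACED_FALSE,
	FMD_REPLACED_TRUE,
	FMD_REPLACED_UNKNOWN
} replaced_t;

/*
 * Hypothetical helper. "present" says whether anything occupies the
 * location, "old_serial" is the serial number recorded in the fmri and
 * "cur_serial" is the one read from the location now. A NULL serial means
 * no serial number support is available.
 */
replaced_t
classify_replaced(int present, const char *old_serial, const char *cur_serial)
{
	if (!present)
		return (FMD_REPLACED_NOT_PRESENT);
	if (old_serial == NULL || cur_serial == NULL)
		return (FMD_REPLACED_UNKNOWN);
	if (strcmp(old_serial, cur_serial) == 0)
		return (FMD_REPLACED_FALSE);
	return (FMD_REPLACED_TRUE);
}
```

   Only FMD_REPLACED_TRUE allows the suspect to be treated as replaced
   straight away; FMD_REPLACED_NOT_PRESENT leaves it subject to the rsrc.age
   wait described above.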

5. The status array in the list event is extended. Currently it contains
   the following fields

	#define FM_SUSPECT_FAULTY               0x1
	#define FM_SUSPECT_UNUSABLE             0x2
	#define FM_SUSPECT_NOT_PRESENT          0x4

   This is extended by adding

	#define FM_SUSPECT_ACQUITTED          	0x8
	#define FM_SUSPECT_REPAIRED          	0x10
	#define FM_SUSPECT_REPLACED          	0x20

   Here FM_SUSPECT_ACQUITTED means acquitted by "fmadm acquit",
   FM_SUSPECT_REPAIRED means repair notified by "fmadm repaired", and 
   FM_SUSPECT_REPLACED means repair notified by "fmadm replaced" or
   replacement auto-detected by serial number change.

   These additional bits will be present in the list.repaired event and the
   list.updated event. The FM_SUSPECT_ACQUITTED bit may also be present in the
   list.suspect event if the acquittal was carried out by automated conviction
   policy.  These bits are also passed across the libfmd_adm interface and so
   will be visible via "fmadm faulty".
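
   A consumer of the status array could decode the new bits along these
   lines. The bit values are those listed above and the label strings are the
   ones used in the fmdump examples in item 6. The function name is
   hypothetical, checking the bits in precedence order follows item 1, and
   returning an empty label when none of the new bits is set is an
   assumption.

```c
#include <assert.h>
#include <string.h>

#define	FM_SUSPECT_FAULTY	0x1
#define	FM_SUSPECT_UNUSABLE	0x2
#define	FM_SUSPECT_NOT_PRESENT	0x4
#define	FM_SUSPECT_ACQUITTED	0x8
#define	FM_SUSPECT_REPAIRED	0x10
#define	FM_SUSPECT_REPLACED	0x20

/*
 * Hypothetical helper: map one entry of the status array to a display
 * label. Bits are tested in precedence order, so a suspect that was
 * acquitted and later replaced is reported as replaced.
 */
const char *
suspect_status_label(unsigned int status)
{
	if (status & FM_SUSPECT_REPLACED)
		return ("Replaced");
	if (status & FM_SUSPECT_REPAIRED)
		return ("Repair Attempted");
	if (status & FM_SUSPECT_ACQUITTED)
		return ("Acquitted");
	return ("");	/* assumption: no label for the pre-existing bits */
}
```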

6. Enhance fmdump so that it displays the status bits in a more human
   readable fashion.

# fmdump 
TIME                 UUID                                 SUNW-MSG-ID
Mar 24 17:52:20.9963 7b51569c-92aa-c7da-d0c7-84baea334f59 PCI-8000-8S
Mar 24 18:12:00.0000 7b51569c-92aa-c7da-d0c7-84baea334f59 Updated
Mar 24 18:13:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Repaired
Mar 24 18:14:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Resolved

# fmdump -v
TIME                 UUID                                 SUNW-MSG-ID
Mar 24 17:52:20.9963 7b51569c-92aa-c7da-d0c7-84baea334f59 PCI-8000-8S
   67%  fault.io.pci.device-interr
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: PCIE1
   33%  fault.io.pci.bus-linkerr
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: MB
Mar 24 18:12:00.0000 7b51569c-92aa-c7da-d0c7-84baea334f59 Updated
   67%  fault.io.pci.device-interr	Repair Attempted
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: PCIE1
   33%  fault.io.pci.bus-linkerr
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: MB
Mar 24 18:13:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Repaired
   67%  fault.io.pci.device-interr	Repair Attempted
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: PCIE1
   33%  fault.io.pci.bus-linkerr	Acquitted
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: MB
Mar 24 18:14:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Resolved
   67%  fault.io.pci.device-interr	Repair Attempted
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: PCIE1
   33%  fault.io.pci.bus-linkerr	Acquitted
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: MB

   The example above is where a repair has been attempted on one suspect
   (notified by "fmadm repaired"), and then the other suspect is acquitted via
   "fmadm acquit".  A further example below is where conviction policy acquits
   one suspect and subsequently the other suspect is replaced.

# fmdump 
TIME                 UUID                                 SUNW-MSG-ID
Mar 24 17:52:20.9963 7b51569c-92aa-c7da-d0c7-84baea334f59 PCI-8000-8S
Mar 24 18:13:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Repaired
Mar 24 18:14:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Resolved

# fmdump -v
TIME                 UUID                                 SUNW-MSG-ID
Mar 24 17:52:20.9963 7b51569c-92aa-c7da-d0c7-84baea334f59 PCI-8000-8S
   67%  fault.io.pci.device-interr
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: PCIE1
   33%  fault.io.pci.bus-linkerr        Acquitted
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: MB
Mar 24 18:13:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Repaired
   67%  fault.io.pci.device-interr	Replaced
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: PCIE1
   33%  fault.io.pci.bus-linkerr	Acquitted
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: MB
Mar 24 18:14:00.0900 7b51569c-92aa-c7da-d0c7-84baea334f59 Resolved
   67%  fault.io.pci.device-interr	Replaced
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: PCIE1
   33%  fault.io.pci.bus-linkerr	Acquitted
        Problem in: ...
           Affects: ...
               FRU: ...
          Location: MB

	
7. Enhance "fmadm faulty" so it distinguishes "repaired", "replaced" and
   "acquitted" components. So in the example below "SLOT 2" and "SLOT 3" now
   say "repair attempted" and "acquitted" respectively.

# fmadm faulty
 --------------- ------------------------------------  -------------- -------
 TIME            EVENT-ID                              MSG-ID         SEVERITY
 --------------- ------------------------------------  -------------- -------
 Sep 21 10:01:36 d482f935-5c8f-e9ab-9f25-d0aaafec1e6c  PCIEX-8000-7J  Major

 Fault class  : fault.io.pciex.device_invreq
 Affects      : dev:///pci@0,0/pci1022,7458@11/pci1000,3060@0
                dev:///pci@0,0/pci1022,7458@11/pci1000,3060@1
                  ok and in service
                dev:///pci@0,0/pci1022,7458@11/pci1000,3060@2
                dev:///pci@0,0/pci1022,7458@11/pci1000,3060@3
                  faulty and taken out of service
 FRU          : "SLOT 2" (hc://.../pciexrc=3/pciexbus=4/pciexdev=0)
                  repair attempted
                "SLOT 3" (hc://.../pciexrc=3/pciexbus=4/pciexdev=1)
                  acquitted
                "SLOT 4" (hc://.../pciexrc=3/pciexbus=4/pciexdev=2)
                  not present
                "SLOT 5" (hc://.../pciexrc=3/pciexbus=4/pciexdev=3)
                  faulty

 Description  : A problem was detected for a PCIEX device.
                Refer to http://sun.com/msg/PCIEX-8000-7J for more information.

 Response     : One or more device instances may be disabled

 Impact       : Possible loss of services provided by the device instances
                associated with this fault

 Action       : Schedule a repair procedure to replace the affected device.


8. There is already an fmd_nvl_fmri_faulty() interface, though it is currently
   not documented. It tells the caller whether there is any active case in
   which nvl is a suspect that has not been repaired/replaced/removed/acquitted.
   It only currently works if nvl is an ASRU.
   
   This needs to be extended so that it also works for fmris representing
   FRUs/resources. It could then be used to allow an agent (eg fault led agent
   or DFRUID agent) to tell if a component is still an active suspect in any
   outstanding cases.

   A further additional feature which would be very useful is to add an extra
   "class" parameter, so an agent/DE can enquire whether there is an active
   fault of specific class "class" which applies to this resource. So I propose
   renaming the function as follows
	
   fmd_nvl_fmri_has_fault(fmd_hdl_t *hdl, nvlist_t *nvl, int type, char *class)

   The type parameter describes what type of fmri is represented by nvl,
   and can be one of
	FMD_HAS_FAULT_FRU
	FMD_HAS_FAULT_ASRU
	FMD_HAS_FAULT_RESOURCE

   The "class" parameter can be wildcarded in the same way as for
   fmd_hdl_subscribe().

   An agent which receives a list.repaired may call fmd_nvl_fmri_has_fault() to
   check if there are any other faults still affecting that resource before
   proceeding with re-onlining that resource.

   This mechanism would also allow a DE to check if a particular fault had been
   previously detected on a given resource in the system and use that
   information as part of its diagnosis algorithm. To that end I propose adding
   a new eversholt function "has_fault(path, class)" which reports if the 
   resource at address "path" has an outstanding fault of class "class".
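
   The lookup behind such a query can be sketched over a flat table of fault
   records. Everything here is illustrative: fault_rec_t and has_fault() are
   hypothetical, resources are identified by plain strings rather than fmris,
   and POSIX fnmatch() is used only as a stand-in for the wildcard matching
   done by fmd_hdl_subscribe(); whether the two match identically is an
   assumption.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one suspect in one case. */
typedef struct {
	const char *path;	/* resource the fault applies to */
	const char *fclass;	/* fault class, e.g. "fault.io.pci.bus-linkerr" */
	int active;		/* 0 once repaired/replaced/removed/acquitted */
} fault_rec_t;

/*
 * Return 1 if "path" has an active fault whose class matches the
 * (possibly wildcarded) pattern "class_pat", 0 otherwise.
 */
int
has_fault(const fault_rec_t *faults, size_t n, const char *path,
    const char *class_pat)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (faults[i].active &&
		    strcmp(faults[i].path, path) == 0 &&
		    fnmatch(class_pat, faults[i].fclass, 0) == 0)
			return (1);
	}
	return (0);
}
```

   An agent handling a list.repaired event could make a query of this kind
   with a pattern such as "fault.*" before re-onlining the resource.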

9. The SNMP FMA MIB needs to be extended to represent the additional suspect
   states.

   A SUNFMFAULTEVENT_COL_STATUS column needs to be added to the
   sunFmFaultEventTable, to give the status of each suspect in the list.

   While we are at it, it looks like a SUNFMFAULTEVENT_COL_LOCATION column
   is also missing and needs to be added.

10.There is already an fmd_nvl_fmri_unusable() interface which is used to tell
   whether an ASRU is in a usable or unusable state. However there are a number
   of other states which an ASRU could be in, such as degraded (usable, but
   running with reduced performance/functionality - this would for example
   represent the "service_degraded" state reported by hardened leaf drivers).
 
   Also it might be useful in some circumstances to give further information
   about the unusable state such as "unusable until next reset" and "unusable
   until replaced" (the latter is apparently an issue on SPs where even if a
   resource is repaired/acquitted, the low level firmware will not re-online
   it if the FRUID serial number is unchanged).
   
   So I propose adding a further new interface 

   fmd_nvl_fmri_service_state(fmd_hdl_t *hdl, nvlist_t *nvl)

   In order to implement this we will need a new scheme interface

   fmd_fmri_service_state(nvlist_t *nvl)

   and a new topo interface

   topo_fmri_service_state(topo_hdl_t *thp, nvlist_t *nvl, int *err);

   Each of these interfaces will return one of the following
  
   FMD_SERVICE_STATE_UNKNOWN
   FMD_SERVICE_STATE_OK
   FMD_SERVICE_STATE_DEGRADED
   FMD_SERVICE_STATE_UNUSABLE

   and this can be extended to include additional states such as 

   FMD_SERVICE_STATE_DEGRADED_PENDING_RESET
   FMD_SERVICE_STATE_UNUSABLE_PENDING_RESET
   FMD_SERVICE_STATE_UNUSABLE_UNTIL_REPLACED

11.A further feature which would be useful is to provide a simple log file
   which logs all "fmadm" commands that have been typed along with a timestamp.
   This could be a file called /var/fm/fmd/admlog, and could be subject to
   rotation like errlog and fltlog.