Introduction
------------

The Cache-line-retire Project for Panther requires the following features
in the cpumem-diagnosis engine and the cpumem-retire agent:

    - Retire/unretire a cache line
    - Determine if a cache line is in the retired state
    - Read TAG info to compute ECC and syndrome to determine the faulty
      bit in case of TAG errors
    - Additions to the CPU FMRI scheme
    - New fault events

Mem Cache Driver
----------------

To achieve the above, a new driver named mem_cache was developed. It
provides a set of ioctls that the cpumem-diagnosis engine and the
cpumem-retire agent use. To satisfy the ioctl calls, the device driver
uses a set of assembly routines that have been added to the existing
us3_cheetahplus_asm.s file.

The mem_cache device driver is a pseudo device driver and resides at the
following location in the Solaris source tree:

    usr/src/uts/sun4u/io/mem_cache.c

The device driver nodes are as follows:

    /devices/pseudo/mem_cache@0:mem_cache
    /dev/mem_cache

Interface details
-----------------

    NAME                                   STABILITY LABEL  DESCRIPTION

    Interfaces Exported
    /dev/mem_cache                         Project private  Driver name
    /devices/pseudo/mem_cache@0:mem_cache  Project private  Driver name
    usr/src/uts/sun4u/sys/mem_cache.h      Project private  Header file

The ioctls provided below are all project private:

    #define MEM_CACHE_RETIRE                    (('C' << 8) | 0x01)
    #define MEM_CACHE_ISRETIRED                 (('C' << 8) | 0x02)
    #define MEM_CACHE_UNRETIRE                  (('C' << 8) | 0x03)
    #define MEM_CACHE_STATE                     (('C' << 8) | 0x04)
    #define MEM_CACHE_READ_TAGS                 (('C' << 8) | 0x05)
    #define MEM_CACHE_INJECT_ERR                (('C' << 8) | 0x06)
    #define MEM_CACHE_READ_ERROR_INJECTED_TAGS  (('C' << 8) | 0x07)

The data structure associated with the ioctl arg is as follows:

    typedef enum {
            L2_CACHE_DATA,
            L2_CACHE_TAG,
            L3_CACHE_DATA,
            L3_CACHE_TAG
    } cache_id_t;

    typedef struct cache_info {
            int             cpu_id;
            cache_id_t      cache;
            uint32_t        index;
            uint32_t        way;
            uint16_t        bit;
            void            *datap;
    } cache_info_t;

The valid values for cpu_id identify the CPUs that have not faulted and
are online on the system.
The valid values for cache are the following:

    L2_CACHE_DATA
    L2_CACHE_TAG
    L3_CACHE_DATA
    L3_CACHE_TAG

Note: Not every ioctl supports all four of the above cache values. Refer
to the Description section of each ioctl to determine which cache values
it supports.

The valid values for index are:

    0-8191    if the cache field is L2_CACHE_DATA or L2_CACHE_TAG
    0-131071  if the cache field is L3_CACHE_DATA or L3_CACHE_TAG

The valid values for way are:

    0-3

The valid values for bit are:

    0-63   for L2_CACHE_TAG and L3_CACHE_TAG
    0-511  for L2_CACHE_DATA and L3_CACHE_DATA

Description of IOCTL calls
--------------------------

MEM_CACHE_RETIRE

Input Parameters used in cache_info_t

    cpu_id
    cache
    index
    way
    bit

Output Parameters: None

Description

The MEM_CACHE_RETIRE ioctl marks the cache line identified by index and
way on the CPU as Not Available (NA). The cache field identifies the
L2/L3 cache. The cache field can be either L2_CACHE_DATA or L2_CACHE_TAG
to select the L2 cache. Similarly, the cache field can be either
L3_CACHE_DATA or L3_CACHE_TAG to select the L3 cache. If any of the input
parameters is outside the range of valid values, the ioctl call fails and
errno is set to EINVAL.

The bit field is used to set the faulty bit in the TAG to its stable
state. This ensures that we will not get any more errors due to this bit.
The TAG info of all 4 ways is used to compute the TAG ECC, so a way that
is marked NA is still read during ECC computation. This is the reason we
need to make sure that the faulty bit identified in the DE is set to its
stable value of 1 or 0. All other bits in the TAG are set to zero (0).

The 16-bit "bit field" is interpreted as follows:

    bit position of "bit field"    description

    15                             If set, the bit identified by 0-14 is
                                   set to 1. Otherwise the bit identified
                                   by 0-14 is set to 0.
    0-14                           Identify one of bits 0-63 to be
                                   set/reset.

Note: Only 6 bits (0-5) are actually needed to identify any one of the 64
bits in the TAG.
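
The "bit field" layout and the index/way ranges above can be sketched in
C as follows. This is only an illustration built from the values quoted in
this document: the helper names (encode_retire_bit, index_way_in_range) and
the limit macros are hypothetical and are not taken from mem_cache.h, the
type definitions are local copies of the ones shown earlier, and the
driver's actual argument checks may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copies of the types quoted in this document. */
typedef enum {
	L2_CACHE_DATA,
	L2_CACHE_TAG,
	L3_CACHE_DATA,
	L3_CACHE_TAG
} cache_id_t;

typedef struct cache_info {
	int		cpu_id;
	cache_id_t	cache;
	uint32_t	index;
	uint32_t	way;
	uint16_t	bit;
	void		*datap;
} cache_info_t;

/* Hypothetical names for the limits listed in this document. */
#define	L2_MAX_INDEX		8191u
#define	L3_MAX_INDEX		131071u
#define	MAX_WAY			3u
#define	MAX_TAG_BIT		63u
#define	RETIRE_VALUE_FLAG	0x8000u	/* bit 15: stable value is 1 */

/*
 * Build the 16-bit "bit field" for MEM_CACHE_RETIRE (hypothetical helper).
 * Bits 0-14 carry the TAG bit number (0-63); bit 15 carries the stable
 * value that bit should be forced to. Returns 0 on success and -1 if the
 * TAG bit number is out of range.
 */
static int
encode_retire_bit(unsigned int tag_bit, int stable_value, uint16_t *out)
{
	if (tag_bit > MAX_TAG_BIT)
		return (-1);
	*out = (uint16_t)(tag_bit | (stable_value ? RETIRE_VALUE_FLAG : 0u));
	return (0);
}

/*
 * Check index and way against the documented ranges (hypothetical helper).
 * Returns 1 if both are in range for the selected cache, 0 otherwise.
 */
static int
index_way_in_range(const cache_info_t *ci)
{
	uint32_t max_index;

	switch (ci->cache) {
	case L2_CACHE_DATA:
	case L2_CACHE_TAG:
		max_index = L2_MAX_INDEX;
		break;
	case L3_CACHE_DATA:
	case L3_CACHE_TAG:
		max_index = L3_MAX_INDEX;
		break;
	default:
		return (0);
	}
	return (ci->index <= max_index && ci->way <= MAX_WAY);
}
```

For example, forcing TAG bit 5 to a stable value of 1 would give a "bit
field" of 0x8005 under this reading, and forcing bit 63 to 0 would give
0x003f. A caller could run checks like these before issuing the ioctl to
avoid an EINVAL failure.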

Return values
--------------

Upon successful completion 0 is returned. Otherwise -1 is returned and
errno is set to one of the following:

    EIO
    EBADF
    EINVAL

----------------------------------------------------------------------------

MEM_CACHE_UNRETIRE

Input Parameters used in cache_info_t

    cpu_id
    cache
    index
    way

Output Parameters: None

Description

The MEM_CACHE_UNRETIRE ioctl marks the cache line identified by index and
way on the CPU as INVALID. The cache field identifies the L2/L3 cache.
The cache field can be either L2_CACHE_DATA or L2_CACHE_TAG to select the
L2 cache. Similarly, the cache field can be either L3_CACHE_DATA or
L3_CACHE_TAG to select the L3 cache. If any of the input parameters is
outside the range of valid values, the ioctl call fails and errno is set
to EINVAL.

Before the cache line is marked as INVALID, a check is made to determine
whether the cache line is in the Not Available state. If it is not, the
ioctl fails and errno is set to EINVAL.

Return values
--------------

Upon successful completion 0 is returned. Otherwise -1 is returned and
errno is set to one of the following:

    EIO
    EBADF
    EINVAL

----------------------------------------------------------------------------

MEM_CACHE_READ_TAGS

Input Parameters used in cache_info_t

    cpu_id
    cache
    index
    way
    datap

Output Parameters: Tag data returned in datap.

Description
-----------

The MEM_CACHE_READ_TAGS ioctl returns the TAG information of all four
ways of the cache line identified by index on the CPU in the array
pointed to by datap. Even though the way field is not relevant to this
ioctl, it must be a valid value in the range 0-3. The cache field
identifies the L2/L3 cache. The input parameter datap must be a pointer
to an array of four (4) uint64_t values.

The valid values of cache for this ioctl are:

    L2_CACHE_TAG
    L3_CACHE_TAG

Return values
--------------

Upon successful completion 0 is returned and the TAG information of all 4
ways of the cache line set is returned in the array pointed to by datap.
Otherwise -1 is returned and errno is set to one of the following:

    EIO
    EBADF
    EINVAL

---------------------------------------------------------------------------

MEM_CACHE_INJECT_ERR

Input Parameters used in cache_info_t

    cpu_id
    cache
    index
    way
    bit

Output Parameters: None

Description
-----------

The MEM_CACHE_INJECT_ERR ioctl is used to test the Diagnostic Engines. It
injects a Correctable Error (CE) in the TAG identified by index/way on
the CPU. The bit field identifies the bit to be flipped.

The valid values for the cache field are:

    L2_CACHE_TAG
    L3_CACHE_TAG

This ioctl injects the error as follows: it first reads the TAG at the
specified index/way, XORs the TAG info read with the bit identified by
the "bit field", and writes the TAG back. This ensures that only one bit
is flipped and that a CE will be detected. The ioctl does not take any
action to cause the TAG to be read. The CE is detected when the cache
scrubber runs.

The device driver stores the "bit field" in a variable called
"last_error_injected_bit" and the way in a variable called
"last_error_injected_way". These variables are used by the
MEM_CACHE_READ_ERROR_INJECTED_TAGS ioctl to return a corrupted TAG.

Return values
--------------

Upon successful completion 0 is returned. Otherwise -1 is returned and
errno is set to one of the following:

    EIO
    EBADF
    EINVAL

------------------------------------------------------------------------

MEM_CACHE_READ_ERROR_INJECTED_TAGS

Input Parameters used in cache_info_t

    cpu_id
    cache
    index
    way
    datap

Output Parameters: Tag data returned in datap.

Description
------------

The MEM_CACHE_READ_ERROR_INJECTED_TAGS ioctl is similar to
MEM_CACHE_READ_TAGS, except that it uses the variable
"last_error_injected_bit" to flip one bit in the TAG identified by
"last_error_injected_way". This ioctl simulates stuck-bit behavior. The
reason we need this ioctl is that when a CE in the TAG is detected, the
HW does not capture the syndrome.
When we read the TAG in the error handler we get HW-corrected TAG
information, and the DE will not find a faulty bit. With this ioctl we
return faulty TAG info, and the DE is able to detect the faulty bit. To
test the DE, we modified the code in the DE to use this ioctl call to
obtain TAG info when it handles THCE ereports.

Return values
--------------

Upon successful completion 0 is returned and the TAG information of all 4
ways of the cache line set is returned in the array pointed to by datap.
Otherwise -1 is returned and errno is set to one of the following:

    EIO
    EBADF
    EINVAL

---------------------------------------------------------------------------

MEM_CACHE_ISRETIRED
MEM_CACHE_STATE

The above 2 ioctls are not currently supported. We use
MEM_CACHE_READ_TAGS to determine whether the cache line is retired. The
state info is also available through the MEM_CACHE_READ_TAGS ioctl. At
some future date we may support the above-mentioned ioctls.

-----------------------------------------------------------------------------

CPU FMRI Additions
------------------

This project adds the following as optional members to the Sun Private
CPU scheme FMRI payload:

    uint32_t index;    /* The Cache Line Index */
    uint32_t way;      /* The Cache Line Way */
    uint16_t bit;      /* The Failing bit */
    uint8_t  type;     /* The Type of Cache (L2, L3) */

Current legal values are:

    index:  0 - 8191
    way:    0, 1, 2 and 3
    bit:    0 - 255
    type:   0 = L2, 1 = L3

Fault Event Specification
-------------------------

The fault events defined for this project can be found at

    http://wikihome.sfbay.sun.com/fma-portfolio/attach/2008.001.PantherCachelineRetire%2Freport.html

with a copy archived in the case directory.