ECC Information

AMD SMI: ECC Information

Data Structures

struct  amdsmi_cper_guid_t
 
struct  amdsmi_cper_timestamp_t
 
union  amdsmi_cper_valid_bits_t
 
struct  amdsmi_cper_hdr_t
 

Functions

amdsmi_status_t amdsmi_get_gpu_ecc_count (amdsmi_processor_handle processor_handle, amdsmi_gpu_block_t block, amdsmi_error_count_t *ec)
 Retrieve the error counts for a GPU block. Not supported on virtual machine guests.
 
amdsmi_status_t amdsmi_get_gpu_ecc_enabled (amdsmi_processor_handle processor_handle, uint64_t *enabled_blocks)
 Retrieve the enabled ECC bit-mask. Not supported on virtual machine guests.
 
amdsmi_status_t amdsmi_get_gpu_total_ecc_count (amdsmi_processor_handle processor_handle, amdsmi_error_count_t *ec)
 Return the total number of ECC errors (correctable, uncorrectable, and deferred) in the given GPU. Not supported on virtual machine guests.
 
amdsmi_status_t amdsmi_get_gpu_cper_entries (amdsmi_processor_handle processor_handle, uint32_t severity_mask, char *cper_data, uint64_t *buf_size, amdsmi_cper_hdr_t **cper_hdrs, uint64_t *entry_count, uint64_t *cursor)
 Retrieve CPER entries cached in the driver.
 

Detailed Description

Function Documentation

◆ amdsmi_get_gpu_ecc_count()

amdsmi_status_t amdsmi_get_gpu_ecc_count ( amdsmi_processor_handle  processor_handle,
amdsmi_gpu_block_t  block,
amdsmi_error_count_t *  ec 
)

Retrieve the error counts for a GPU block. Not supported on virtual machine guests.

See RAS Error Count sysfs Interface (AMDGPU RAS Support - Linux Kernel documentation) to learn how these error counts are accessed.

Platform:

gpu_bm_linux

host

Given a processor handle processor_handle, an amdsmi_gpu_block_t block and a pointer to an amdsmi_error_count_t ec, this function will write the error count values for the GPU block indicated by block to memory pointed to by ec.

Parameters
[in] processor_handle: A processor handle.
[in] block: The block for which error counts should be retrieved.
[in,out] ec: A pointer to an amdsmi_error_count_t to which the error counts should be written. If this parameter is nullptr, this function returns AMDSMI_STATUS_INVAL if the function is supported with the provided arguments, and AMDSMI_STATUS_NOT_SUPPORTED if it is not.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on failure
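
The null-pointer behavior described for ec can be used as a support probe before allocating anything. The following is a minimal sketch of that pattern, not the library's own code. Everything prefixed sk_ or SK_ is a hypothetical stand-in so that the example compiles without the AMD SMI header: the status enum, the count structure (its field names are assumptions chosen to mirror the correctable, uncorrectable, and deferred counts this page mentions), the simplified handle and block types, and the mock query. Real code would call amdsmi_get_gpu_ecc_count() with the library's own types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins so the sketch compiles without the AMD SMI header.
 * Real code would use amdsmi_status_t and amdsmi_error_count_t instead. */
typedef enum {
    SK_STATUS_SUCCESS = 0,
    SK_STATUS_INVAL,
    SK_STATUS_NOT_SUPPORTED
} sk_status_t;

typedef struct {
    uint64_t correctable_count;   /* assumed field name */
    uint64_t uncorrectable_count; /* assumed field name */
    uint64_t deferred_count;      /* assumed field name */
} sk_error_count_t;

/* Callback shaped like amdsmi_get_gpu_ecc_count(); handle and block are
 * simplified to void* and uint32_t for this sketch. */
typedef sk_status_t (*sk_ecc_count_fn)(void *handle, uint32_t block,
                                       sk_error_count_t *ec);

/* Probe with a null output pointer: INVAL means the query is supported for
 * these arguments; anything else is treated as "not supported". */
int sk_block_counts_supported(sk_ecc_count_fn query, void *handle,
                              uint32_t block)
{
    assert(query != NULL);
    return query(handle, block, NULL) == SK_STATUS_INVAL;
}

/* Mock query used only to exercise the helper: block 1 is the single
 * supported block and always reports the same counts. */
sk_status_t sk_mock_ecc_count(void *handle, uint32_t block,
                              sk_error_count_t *ec)
{
    (void)handle;
    if (block != 1u) return SK_STATUS_NOT_SUPPORTED;
    if (ec == NULL) return SK_STATUS_INVAL;
    ec->correctable_count = 3;
    ec->uncorrectable_count = 1;
    ec->deferred_count = 0;
    return SK_STATUS_SUCCESS;
}
```

A caller would probe first with sk_block_counts_supported() and only then pass a real structure to read the counts.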

◆ amdsmi_get_gpu_ecc_enabled()

amdsmi_status_t amdsmi_get_gpu_ecc_enabled ( amdsmi_processor_handle  processor_handle,
uint64_t *  enabled_blocks 
)

Retrieve the enabled ECC bit-mask. Not supported on virtual machine guests.

See RAS Error Count sysfs Interface (AMDGPU RAS Support - Linux Kernel documentation) to learn how these error counts are accessed.

Platform:

gpu_bm_linux

host

Given a processor handle processor_handle and a pointer to a uint64_t enabled_blocks, this function writes a bit-mask to the memory pointed to by enabled_blocks. After a successful call, the value in enabled_blocks can be AND'd with elements of the amdsmi_gpu_block_t enumeration to determine whether the corresponding block has ECC enabled. Note that whether a block has ECC enabled in the device is independent of whether the kernel supports error counting for that block: a block may have ECC enabled even though there is no kernel support for reading its error counters.

Parameters
[in] processor_handle: A processor handle.
[in,out] enabled_blocks: A pointer to a uint64_t to which the enabled-block bits will be written. If this parameter is nullptr, this function returns AMDSMI_STATUS_INVAL if the function is supported with the provided arguments, and AMDSMI_STATUS_NOT_SUPPORTED if it is not.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on failure
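
The AND test described above can be sketched as follows. This is an illustration of the bit-mask check only. The SK_BLOCK_* flags are hypothetical placeholders, not values from amdsmi_gpu_block_t. Real code would AND the mask written by amdsmi_get_gpu_ecc_enabled() with the library's own enumerators.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical block flags, one bit each, standing in for elements of
 * amdsmi_gpu_block_t. The names and values are placeholders. */
enum {
    SK_BLOCK_A = 1u << 0,
    SK_BLOCK_B = 1u << 1,
    SK_BLOCK_C = 1u << 2
};

/* Returns 1 when the bit for `block` is set in `enabled_blocks`, else 0. */
int sk_block_ecc_enabled(uint64_t enabled_blocks, uint64_t block)
{
    return (enabled_blocks & block) != 0;
}
```

A set bit only says that ECC is enabled for that block. As noted above, it does not guarantee that the kernel can report error counts for it.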

◆ amdsmi_get_gpu_total_ecc_count()

amdsmi_status_t amdsmi_get_gpu_total_ecc_count ( amdsmi_processor_handle  processor_handle,
amdsmi_error_count_t *  ec 
)

Return the total number of ECC errors (correctable, uncorrectable, and deferred) in the given GPU. Not supported on virtual machine guests.

See RAS Error Count sysfs Interface (AMDGPU RAS Support - Linux Kernel documentation) to learn how these error counts are accessed.

Platform:

gpu_bm_linux

host

guest_windows

Parameters
[in] processor_handle: The device to query.
[out] ec: Pointer to an ECC error count structure. Must be allocated by the user.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on failure

◆ amdsmi_get_gpu_cper_entries()

amdsmi_status_t amdsmi_get_gpu_cper_entries ( amdsmi_processor_handle  processor_handle,
uint32_t  severity_mask,
char *  cper_data,
uint64_t *  buf_size,
amdsmi_cper_hdr_t **  cper_hdrs,
uint64_t *  entry_count,
uint64_t *  cursor 
)

Retrieve CPER entries cached in the driver.

The user passes buffers to hold the CPER data and the CPER headers. The library fills the data buffer according to the severity_mask the user passed. It also parses the CPER headers and stores pointers to them in the cper_hdrs array, which the user can use to read the timestamp and other header information. A cursor is also returned, which can be used to fetch the next set of CPER entries.

If there is more data than any of the user-supplied buffers can hold, the library returns AMDSMI_STATUS_MORE_DATA. The user can call the API again with the cursor returned by the previous call to get more data. If the buffer is too small to hold even one entry, the library returns AMDSMI_STATUS_OUT_OF_RESOURCES.

Even when a call returns AMDSMI_STATUS_MORE_DATA, the next call may still report entry_count == 0, because the remaining entries in the driver cache may not match the severity the user is interested in. The API should return AMDSMI_STATUS_SUCCESS in this case so that the user can ignore that call.

Platform:

gpu_bm_linux

host

guest_1vf

Parameters
[in] processor_handle: Handle to the processor for which CPER entries are to be retrieved.
[in] severity_mask: The severity mask of the entries to be retrieved.
[in,out] cper_data: Pointer to a buffer where the CPER data will be stored. The user must allocate the buffer and set buf_size correctly.
[in,out] buf_size: Pointer to a variable that specifies the size of cper_data. On return, it contains the actual size of the data written to cper_data.
[in,out] cper_hdrs: Array of the parsed headers of cper_data. The user must allocate the array of pointers to amdsmi_cper_hdr_t. The library fills the array with pointers to the parsed headers. The underlying data stays in the cper_data buffer; only pointers are stored in this array.
[in,out] entry_count: Pointer to a variable that specifies the length of the cper_hdrs array the user allocated. On return, it contains the number of entries actually written to cper_hdrs.
[in,out] cursor: Pointer to a variable that will contain the cursor for the next call.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on failure
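
The cursor protocol described above can be sketched as a loop that keeps calling while the status is "more data", resets the in/out sizes before each call, and tolerates a call that yields zero entries. This is a sketch under stated assumptions, not the library's implementation. All sk_ and SK_ names are hypothetical stand-ins, the callback only mimics the shape of amdsmi_get_gpu_cper_entries(), the mock cache and its 16-byte entries are invented for the example, and starting the cursor at 0 is an assumption that this page does not confirm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the status codes named in the text. */
typedef enum {
    SK_STATUS_SUCCESS = 0,
    SK_STATUS_MORE_DATA,
    SK_STATUS_OUT_OF_RESOURCES
} sk_status_t;

/* Callback shaped like amdsmi_get_gpu_cper_entries(), with void* in place
 * of the processor handle and header types. */
typedef sk_status_t (*sk_cper_fn)(void *handle, uint32_t severity_mask,
                                  char *cper_data, uint64_t *buf_size,
                                  void **cper_hdrs, uint64_t *entry_count,
                                  uint64_t *cursor);

enum { SK_HDR_SLOTS = 2, SK_BUF_BYTES = 64 };

/* Drain all matching entries and return how many were seen in total. */
uint64_t sk_drain_cper(sk_cper_fn query, void *handle,
                       uint32_t severity_mask, sk_status_t *final_status)
{
    char buf[SK_BUF_BYTES];
    void *hdrs[SK_HDR_SLOTS];
    uint64_t cursor = 0; /* assumed initial cursor value */
    uint64_t total = 0;
    sk_status_t st;

    assert(query != NULL && final_status != NULL);
    do {
        /* Both sizes are in/out, so restore the capacities every call. */
        uint64_t buf_size = sizeof buf;
        uint64_t entry_count = SK_HDR_SLOTS;
        st = query(handle, severity_mask, buf, &buf_size, hdrs,
                   &entry_count, &cursor);
        if (st != SK_STATUS_SUCCESS && st != SK_STATUS_MORE_DATA)
            break;            /* real error: stop and report it */
        total += entry_count; /* may legitimately be 0 */
    } while (st == SK_STATUS_MORE_DATA);
    *final_status = st;
    return total;
}

/* Mock driver cache: five 16-byte entries, each tagged with a severity bit. */
static const uint32_t sk_mock_severity[5] = {1u, 1u, 2u, 2u, 2u};

sk_status_t sk_mock_cper(void *handle, uint32_t severity_mask,
                         char *cper_data, uint64_t *buf_size,
                         void **cper_hdrs, uint64_t *entry_count,
                         uint64_t *cursor)
{
    const uint64_t entry_bytes = 16, capacity = *entry_count, room = *buf_size;
    uint64_t written = 0, used = 0;

    (void)handle;
    if (room < entry_bytes) return SK_STATUS_OUT_OF_RESOURCES;
    while (*cursor < 5 && written < capacity && used + entry_bytes <= room) {
        if (sk_mock_severity[*cursor] & severity_mask) {
            cper_hdrs[written++] = cper_data + used; /* points into buffer */
            used += entry_bytes;
        }
        ++*cursor;
    }
    *entry_count = written;
    *buf_size = used;
    return (*cursor < 5) ? SK_STATUS_MORE_DATA : SK_STATUS_SUCCESS;
}
```

With severity mask 1, the mock's first call fills both header slots and reports more data, and the second call finds no further matching entries and reports success with zero entries, which is the case the description says callers should ignore.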