ECC Information
Data Structures | |
struct | amdsmi_cper_guid_t |
struct | amdsmi_cper_timestamp_t |
union | amdsmi_cper_valid_bits_t |
struct | amdsmi_cper_hdr_t |
Functions | |
amdsmi_status_t | amdsmi_get_gpu_ecc_count (amdsmi_processor_handle processor_handle, amdsmi_gpu_block_t block, amdsmi_error_count_t *ec) |
Retrieve the error counts for a GPU block. It is not supported on virtual machine guest. | |
amdsmi_status_t | amdsmi_get_gpu_ecc_enabled (amdsmi_processor_handle processor_handle, uint64_t *enabled_blocks) |
Retrieve the enabled ECC bit-mask. It is not supported on virtual machine guest. | |
amdsmi_status_t | amdsmi_get_gpu_total_ecc_count (amdsmi_processor_handle processor_handle, amdsmi_error_count_t *ec) |
Returns the total number of ECC errors (correctable, uncorrectable and deferred) in the given GPU. It is not supported on virtual machine guest. | |
amdsmi_status_t | amdsmi_get_gpu_cper_entries (amdsmi_processor_handle processor_handle, uint32_t severity_mask, char *cper_data, uint64_t *buf_size, amdsmi_cper_hdr_t **cper_hdrs, uint64_t *entry_count, uint64_t *cursor) |
Retrieve CPER entries cached in the driver. | |
Detailed Description
Function Documentation
◆ amdsmi_get_gpu_ecc_count()
amdsmi_status_t amdsmi_get_gpu_ecc_count (amdsmi_processor_handle processor_handle, amdsmi_gpu_block_t block, amdsmi_error_count_t *ec)
Retrieve the error counts for a GPU block. It is not supported on virtual machine guest.
See RAS Error Count sysfs Interface (AMDGPU RAS Support - Linux Kernel documentation) to learn how these error counts are accessed.
- Platform:
gpu_bm_linux
host
Given a processor handle processor_handle, an amdsmi_gpu_block_t block, and a pointer to an amdsmi_error_count_t ec, this function writes the error count values for the GPU block indicated by block to the memory pointed to by ec.
- Parameters
-
[in] processor_handle: A processor handle.
[in] block: The block for which error counts should be retrieved.
[in,out] ec: A pointer to an amdsmi_error_count_t to which the error counts should be written. If this parameter is nullptr, this function returns AMDSMI_STATUS_INVAL if the function is supported with the provided arguments, and AMDSMI_STATUS_NOT_SUPPORTED if it is not supported with the provided arguments.
- Returns
- amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on failure
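The nullptr rule described for ec can serve as a support probe before any buffer is prepared. The following is a minimal, self-contained sketch of that pattern and does not call the real library: demo_status_t, demo_error_count_t, demo_get_ecc_count, and the block values 0x1 and 0x4 are hypothetical stand-ins that only mimic the documented contract.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the library's status codes (values are illustrative). */
typedef enum {
    DEMO_STATUS_SUCCESS = 0,
    DEMO_STATUS_INVAL,
    DEMO_STATUS_NOT_SUPPORTED
} demo_status_t;

/* Hypothetical stand-in for amdsmi_error_count_t. */
typedef struct {
    uint64_t correctable_count;
    uint64_t uncorrectable_count;
} demo_error_count_t;

/* Mock following the documented contract; it pretends only blocks 0x1 and 0x4
 * are supported and returns made-up counts. */
demo_status_t demo_get_ecc_count(uint64_t block, demo_error_count_t *ec)
{
    int supported = (block == 0x1u) || (block == 0x4u);
    if (!supported) return DEMO_STATUS_NOT_SUPPORTED;
    if (ec == NULL) return DEMO_STATUS_INVAL;
    ec->correctable_count = 3;
    ec->uncorrectable_count = 0;
    return DEMO_STATUS_SUCCESS;
}

/* Probe: a null output pointer yields INVAL only when the query is supported. */
int demo_block_is_supported(uint64_t block)
{
    return demo_get_ecc_count(block, NULL) == DEMO_STATUS_INVAL;
}

/* Read counts only after a successful probe; the caller must pass a real buffer. */
demo_status_t demo_read_if_supported(uint64_t block, demo_error_count_t *ec)
{
    assert(ec != NULL);
    if (!demo_block_is_supported(block)) return DEMO_STATUS_NOT_SUPPORTED;
    return demo_get_ecc_count(block, ec);
}
```

With the real API, the same probe would presumably pass the actual processor handle and an amdsmi_gpu_block_t value, and compare the result against AMDSMI_STATUS_INVAL.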
◆ amdsmi_get_gpu_ecc_enabled()
amdsmi_status_t amdsmi_get_gpu_ecc_enabled (amdsmi_processor_handle processor_handle, uint64_t *enabled_blocks)
Retrieve the enabled ECC bit-mask. It is not supported on virtual machine guest.
See RAS Error Count sysfs Interface (AMDGPU RAS Support - Linux Kernel documentation) to learn how these error counts are accessed.
- Platform:
gpu_bm_linux
host
Given a processor handle processor_handle and a pointer to a uint64_t enabled_blocks, this function writes a bit-mask to the memory pointed to by enabled_blocks. After a successful call, enabled_blocks can be AND'd with elements of the amdsmi_gpu_block_t enumeration to determine whether the corresponding block has ECC enabled. Note that whether a block has ECC enabled in the device is independent of whether the kernel supports error counting for that block: a block may have ECC enabled even though there is no kernel support for reading its error counters.
- Parameters
-
[in] processor_handle: A processor handle.
[in,out] enabled_blocks: A pointer to a uint64_t to which the enabled-block bits will be written. If this parameter is nullptr, this function returns AMDSMI_STATUS_INVAL if the function is supported with the provided arguments, and AMDSMI_STATUS_NOT_SUPPORTED if it is not supported with the provided arguments.
- Returns
- amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on failure
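The AND test on the enabled mask can be sketched as below. This is a self-contained illustration, not library code: the DEMO_BLOCK_* constants are hypothetical placeholders for amdsmi_gpu_block_t enumerators, and the sketch assumes each block value has exactly one bit set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical block bits; real code would use the amdsmi_gpu_block_t enumerators. */
enum {
    DEMO_BLOCK_A = 1u << 0,
    DEMO_BLOCK_B = 1u << 1,
    DEMO_BLOCK_C = 1u << 2
};

/* Returns 1 when the block's bit is set in the enabled mask, otherwise 0. */
int demo_ecc_enabled(uint64_t enabled_blocks, uint64_t block_bit)
{
    /* Assumption of this sketch: a block value has exactly one bit set. */
    assert(block_bit != 0 && (block_bit & (block_bit - 1)) == 0);
    return (enabled_blocks & block_bit) != 0;
}

/* Counts how many of the given blocks are reported as enabled. */
unsigned demo_count_enabled(uint64_t enabled_blocks, const uint64_t *blocks, unsigned n)
{
    unsigned count = 0;
    for (unsigned i = 0; i < n; ++i)
        count += (unsigned)demo_ecc_enabled(enabled_blocks, blocks[i]);
    return count;
}
```

A set bit only says that ECC is enabled for the block; as noted above, reading that block's error counters may still be unsupported by the kernel.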
◆ amdsmi_get_gpu_total_ecc_count()
amdsmi_status_t amdsmi_get_gpu_total_ecc_count (amdsmi_processor_handle processor_handle, amdsmi_error_count_t *ec)
Returns the total number of ECC errors (correctable, uncorrectable and deferred) in the given GPU. It is not supported on virtual machine guest.
See RAS Error Count sysfs Interface (AMDGPU RAS Support - Linux Kernel documentation) to learn how these error counts are accessed.
- Platform:
gpu_bm_linux
host
guest_windows
- Parameters
-
[in] processor_handle: The device to query.
[out] ec: Reference to the ECC error count structure. Must be allocated by the user.
- Returns
- amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on failure
◆ amdsmi_get_gpu_cper_entries()
amdsmi_status_t amdsmi_get_gpu_cper_entries (amdsmi_processor_handle processor_handle, uint32_t severity_mask, char *cper_data, uint64_t *buf_size, amdsmi_cper_hdr_t **cper_hdrs, uint64_t *entry_count, uint64_t *cursor)
Retrieve CPER entries cached in the driver.
The user passes buffers to hold the CPER data and the CPER headers. The library fills the data buffer with the entries that match the severity_mask the user passed. It also parses each CPER header and stores a pointer to it in the cper_hdrs array, which the user can use to read the timestamp and other header information. A cursor is also returned to the user and can be used to get the next set of CPER entries.
If there is more data than fits in any of the buffers the user passed, the library returns AMDSMI_STATUS_MORE_DATA. The user can call the API again with the cursor returned by the previous call to get more data. If the buffer is too small to hold even one entry, the library returns AMDSMI_STATUS_OUT_OF_RESOURCES.
Even if a call returns AMDSMI_STATUS_MORE_DATA, the next call may still return entry_count == 0, because the driver cache may not contain any further entries of the severity the user is interested in. The API should return AMDSMI_STATUS_SUCCESS in this case so that the user can ignore that call.
- Platform:
gpu_bm_linux
host
guest_1vf
- Parameters
-
[in] processor_handle: Handle to the processor for which CPER entries are to be retrieved.
[in] severity_mask: The severity mask of the entries to be retrieved.
[in,out] cper_data: Pointer to a buffer where the CPER data will be stored. The user must allocate the buffer and set buf_size correctly.
[in,out] buf_size: Pointer to a variable that specifies the size of cper_data. On return, it contains the actual size of the data written to cper_data.
[in,out] cper_hdrs: Array of the parsed headers of cper_data. The user must allocate the array of pointers to amdsmi_cper_hdr_t. The library fills the array with pointers to the parsed headers. The underlying data is in the cper_data buffer; only the pointers are stored in this array.
[in,out] entry_count: Pointer to a variable that specifies the length of the cper_hdrs array the user allocated. On return, it contains the actual number of entries written to cper_hdrs.
[in,out] cursor: Pointer to a variable that will contain the cursor for the next call.
- Returns
- amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on failure
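The cursor protocol described above amounts to a drain loop: call, consume entry_count entries, and repeat with the returned cursor while the status is MORE_DATA. The sketch below is self-contained and does not use the real library. demo_status_t, demo_cache, demo_fetch, and demo_count_all are hypothetical, the mock omits the data buffer, buf_size, and AMDSMI_STATUS_OUT_OF_RESOURCES handling, and starting the cursor at 0 is an assumption of the mock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two status codes the loop cares about. */
typedef enum {
    DEMO_STATUS_SUCCESS = 0,
    DEMO_STATUS_MORE_DATA
} demo_status_t;

/* Mock driver cache: each element is the severity bit of one cached entry. */
static const uint32_t demo_cache[] = { 0x1, 0x2, 0x1, 0x1, 0x2 };
#define DEMO_CACHE_LEN (sizeof demo_cache / sizeof demo_cache[0])

/* Mock fetch. In: *entry_count is the capacity of indices, *cursor is where to
 * resume. Out: *entry_count is the number written, *cursor is the next position. */
demo_status_t demo_fetch(uint32_t severity_mask, uint64_t *indices,
                         uint64_t *entry_count, uint64_t *cursor)
{
    uint64_t capacity = *entry_count, written = 0, pos = *cursor;
    while (pos < DEMO_CACHE_LEN && written < capacity) {
        if (demo_cache[pos] & severity_mask)
            indices[written++] = pos;
        ++pos;
    }
    *entry_count = written;
    *cursor = pos;
    return pos < DEMO_CACHE_LEN ? DEMO_STATUS_MORE_DATA : DEMO_STATUS_SUCCESS;
}

/* Drain loop: keep calling with the returned cursor while MORE_DATA is reported. */
uint64_t demo_count_all(uint32_t severity_mask, uint64_t batch)
{
    uint64_t indices[8], cursor = 0, total = 0;
    demo_status_t status;
    assert(batch > 0 && batch <= 8);
    do {
        uint64_t count = batch;   /* in: capacity, out: entries written */
        status = demo_fetch(severity_mask, indices, &count, &cursor);
        total += count;           /* count may be 0; that is not an error */
    } while (status == DEMO_STATUS_MORE_DATA);
    return total;
}
```

In this mock, a first call with a capacity of 3 and mask 0x1 fills all three slots and reports MORE_DATA, yet the follow-up call writes zero entries and reports success, which mirrors the entry_count == 0 case described above.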