

22.4.10 AMD GPU

ROCGDB provides support for systems that have heterogeneous agents associated with commercially available AMD GPU devices (see Debugging Heterogeneous Programs) when the AMD ROCm platform is installed.

22.4.10.1 AMD GPU Architectures

The following AMD GPU architectures are supported:

AMD Vega 10

Displayed as ‘vega10’ by ROCGDB and denoted as ‘gfx900’ by the compiler.

AMD Vega 7nm

Displayed as ‘vega20’ by ROCGDB and denoted as ‘gfx906’ by the compiler.

AMD Instinct® MI100 accelerator

Displayed as ‘arcturus’ by ROCGDB and denoted as ‘gfx908’ by the compiler.

Aldebaran

Displayed as ‘aldebaran’ by ROCGDB and denoted as ‘gfx90a’ by the compiler.

Navi10

Displayed as ‘navi10’ by ROCGDB and denoted as ‘gfx1010’ by the compiler.

Navi12

Displayed as ‘navi12’ by ROCGDB and denoted as ‘gfx1011’ by the compiler.

Navi14

Displayed as ‘navi14’ by ROCGDB and denoted as ‘gfx1012’ by the compiler.

Sienna Cichlid

Displayed as ‘sienna_cichlid’ by ROCGDB and denoted as ‘gfx1030’ by the compiler.

Navy Flounder

Displayed as ‘navy_flounder’ by ROCGDB and denoted as ‘gfx1031’ by the compiler.

22.4.10.2 AMD ROCm Source Languages

ROCGDB supports the following source languages:

HIP

The HIP Programming Language is supported.

When compiling, the -g option should be used to produce debugging information suitable for use by ROCGDB. The --offload-arch option is used to specify the AMD GPU chips that the executable is required to support. For example, to compile a HIP program that can utilize “Vega 10” and “Vega 7nm” AMD GPU devices, with no optimization:

hipcc -O0 -g --offload-arch=gfx900 --offload-arch=gfx906 bit_extract.cpp -o bit_extract

The AMD ROCm compiler maps HIP source language device function work-items to the lanes of an AMD GPU wavefront, which are represented in ROCGDB as heterogeneous lanes.
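
For example, on an AMD GPU architecture that uses a 64-lane wavefront, a kernel launched with a work-group size of 256 work-items occupies 4 wavefronts, and work-items 0 through 63 of the work-group map to lanes 0 through 63 of the first wavefront. This mapping is illustrative; the wavefront size depends on the AMD GPU architecture and on how the kernel was compiled.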

Assembly Code

Assembly code kernels are supported.

Other Languages

Other languages, including OpenCL and Fortran, are currently supported as the minimal pseudo-language, provided they are compiled specifying at least the AMD GPU Code Object V3 and DWARF 4 formats. See Unsupported Languages.

22.4.10.3 AMD GPU Device Driver and AMD ROCm Runtime

ROCGDB requires a compatible AMD GPU device driver to be installed. A warning message is displayed if either the device driver version or the version of the debug support it implements is unsupported. For example,

amd-dbgapi: warning: AMD GPU driver's version 1.6 not supported (version 2.x where x >= 1 required)
amd-dbgapi: warning: AMD GPU driver's debug support version 9.0 not supported (version 10.x where x >= 1 required)

ROCGDB will continue to function except that no AMD GPU debugging will be possible.

ROCGDB requires each agent to have compatible firmware installed by the device driver. A warning message is displayed if unsupported firmware is detected. For example,

amd-dbgapi: warning: AMD GPU gpu_id 17619's firmware version 458 not supported (version >= 555 required)

ROCGDB will continue to function except that no AMD GPU debugging will be possible on the agent.

ROCGDB requires a compatible AMD ROCm runtime to be loaded in order to detect AMD GPU code objects and wavefronts. A warning message is displayed if an unsupported AMD ROCm runtime is detected, or there is an error or restriction that prevents debugging. For example,

amd-dbgapi: warning: AMD GPU runtime's r_debug::r_version 5 not supported (r_debug::r_version >= 6 required)

ROCGDB will continue to function except that no AMD GPU debugging will be possible.

22.4.10.4 AMD GPU Heterogeneous Agents

AMD GPU heterogeneous agents are not listed by the ‘info agents’ command until the inferior has started executing the program.

Debugging an agent is unsupported if its architecture is not supported by ROCGDB or the AMD GPU device driver, or if its firmware version is not supported by ROCGDB.

22.4.10.5 AMD GPU Heterogeneous Queues

The AMD GPU heterogeneous queue types reported by the ‘info queues’ command are:

HSA

An HSA AQL queue. The ‘(Single)’ suffix indicates it uses the single-producer protocol, the ‘(Multi)’ suffix indicates the multi-producer protocol, and the ‘(Coop)’ suffix indicates the multi-producer cooperative dispatch protocol.

PM4

An AMD PM4 queue.

DMA

A DMA queue.

XGMI

An XGMI queue.

22.4.10.6 AMD GPU Heterogeneous Dispatches

AMD GPU supports the following address spaces for the ‘info dispatches’ command:

Shared

Per work-group storage.

Private

Per work-item storage.

The ‘info dispatches’ command uses the following BNF syntax for AMD GPU heterogeneous dispatch fences:

fence     ::== [ barrier ] [ separator ] [ acquire ]  [ separator ] [ release ]
separator ::== "|"
barrier   ::== "B"
acquire   ::== "A" scope
release   ::== "R" scope
scope     ::== system | agent
system    ::== "s"
agent     ::== "a"

Where:

separator

The elements are separated by ‘|’.

barrier

If present, indicates that the next heterogeneous packet will not be initiated until the heterogeneous dispatch completes.

acquire

Indicates an acquire memory fence was performed before initiating the heterogeneous dispatch.

release

Indicates a release memory fence will be performed when the heterogeneous dispatch completes.

system

Indicates the memory fence is performed at the system memory scope.

agent

Indicates the memory fence is performed at the agent memory scope.
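
For example, a dispatch fence displayed as ‘B|Aa|Rs’ indicates that the packet acts as a barrier, that an acquire memory fence was performed at the agent memory scope before the dispatch was initiated, and that a release memory fence will be performed at the system memory scope when the dispatch completes.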

22.4.10.7 AMD GPU Wavefronts

An AMD GPU wavefront is represented in ROCGDB as a thread.

An AMD GPU wavefront enters the halt state when it is stopped by ROCGDB, or when it is delivered one of the signals described in AMD GPU Signals (see AMD GPU Signals).

When a wavefront is in the halt state, it executes no further instructions. In addition, a wavefront that is associated with a queue that is in the queue error state (see AMD GPU Signals) is inhibited from executing further instructions. Continuing such wavefronts will not hit any breakpoints nor report completion of a single step command. If necessary, ‘Ctrl-C’ can be used to cancel the command.

Note that some AMD GPU architectures may have restrictions on providing information about AMD GPU wavefronts created when ROCGDB is not attached (see AMD GPU Attaching Restrictions).

When scheduler-locking is in effect (see set scheduler-locking), new wavefronts created by the resumed thread (either CPU thread or GPU wavefront) are held in the halt state.

22.4.10.8 AMD GPU Registers

AMD GPU supports the following reggroup values for the ‘info registers reggroup’ command:

The number of scalar and vector registers is configured when a wavefront is created. Only allocated registers are displayed.

Scalar registers are reported as 32-bit signed integer values.

Vector registers are reported as a wavefront size vector of signed 32-bit values.

The pc is reported as a function pointer value.

The exec register is reported as a wavefront size-bit unsigned integer value.

The vcc and xnack_mask pseudo registers are reported as a wavefront size-bit unsigned integer value.

The flat_scratch pseudo register is reported as a 64-bit unsigned integer value.

The mode, status, and trapsts registers are reported as flag values. For example,

(gdb) p $mode
$1 = [ FP_ROUND.32=NEAREST_EVEN FP_ROUND.64_16=NEAREST_EVEN FP_DENORM.32=FLUSH_NONE FP_DENORM.64_16=FLUSH_NONE DX10_CLAMP IEEE CSP=0 ]
(gdb) p $status
$2 = [ SPI_PRIO=0 USER_PRIO=0 TRAP_EN VCCZ VALID ]
(gdb) p $trapsts
$3 = [ EXCP_CYCLE=0 DP_RATE=FULL ]

Use the ‘ptype’ command to see the type of any register.
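
For example, the following illustrates inspecting a register’s type. The output shown is only illustrative; the exact type reported depends on the architecture and the wavefront’s configuration.

(gdb) ptype $pc
type = void (*)()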

22.4.10.9 AMD GPU Code Objects

The ‘info sharedlibrary’ command will show the AMD GPU code objects together with the CPU code objects. For example:

(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007fd120664ac0  0x00007fd120682790  Yes (*)     /lib64/ld-linux-x86-64.so.2
...
0x00007fd0125d8ec0  0x00007fd015f21630  Yes (*)     /opt/rocm-3.5.0/hip/lib/../../lib/libamd_comgr.so
0x00007fd11d74e870  0x00007fd11d75a868  Yes (*)     /lib/x86_64-linux-gnu/libtinfo.so.5
0x00007fd11d001000  0x00007fd11d00173c  Yes         file:///home/rocm/examples/bit_extract#offset=6477&size=10832
0x00007fd11d008000  0x00007fd11d00adc0  Yes (*)     memory://95557/mem#offset=0x7fd0083e7f60&size=41416
(*): Shared library is missing debugging information.
(gdb)

The code object path for AMD GPU code objects is shown as a URI (Uniform Resource Identifier) with the following BNF syntax:

code_object_uri ::== file_uri | memory_uri
file_uri        ::== "file://" file_path [ range_specifier ]
memory_uri      ::== "memory://" process_id range_specifier
range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number
file_path       ::== URI_ENCODED_OS_FILE_PATH
process_id      ::== DECIMAL_NUMBER
number          ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER

Where:

number

A C integral literal where hexadecimal values are prefixed by ‘0x’ or ‘0X’, and octal values by ‘0’.

file_path

The file’s path specified as a URI encoded UTF-8 string. In URI encoding, every character that is not in the regular expression ‘[a-zA-Z0-9/_.~-]’ is encoded as two uppercase hexadecimal digits preceded by ‘%’. Directories in the path are separated by ‘/’.

offset

A 0-based byte offset to the start of the code object. For a file URI, it is from the start of the file specified by the file_path, and if omitted defaults to 0. For a memory URI, it is the memory address and is required.

size

The number of bytes in the code object. For a file URI, if omitted it defaults to the size of the file. It is required for a memory URI.

process_id

The identity of the process owning the memory. For Linux it is the C unsigned integral decimal literal for the process pid.
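
For example, in the ‘info sharedlibrary’ output shown earlier, the URI ‘file:///home/rocm/examples/bit_extract#offset=6477&size=10832’ identifies a code object that starts at byte offset 6477 of the file /home/rocm/examples/bit_extract and is 10832 bytes long, while ‘memory://95557/mem#offset=0x7fd0083e7f60&size=41416’ identifies a 41416 byte code object at address 0x7fd0083e7f60 in the memory of the process with pid 95557.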

AMD GPU code objects are loaded into each AMD GPU device separately. The ‘info sharedlibrary’ command will therefore show the same code object loaded multiple times. As a consequence, setting a breakpoint in AMD GPU code will result in multiple breakpoints if there are multiple AMD GPU devices.

If the source language runtime defers loading code objects until kernels are launched, then setting breakpoints may result in pending breakpoints that will be set when the code object is finally loaded.
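
For example, setting a breakpoint by function name before the corresponding code object is loaded may create a pending breakpoint. The session below is illustrative:

(gdb) break bit_extract_kernel
Function "bit_extract_kernel" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (bit_extract_kernel) pending.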

22.4.10.10 AMD GPU Heterogeneous Entity Target Identifiers and Convenience Variables

The AMD GPU heterogeneous entities have the following target identifier formats:

Agent Target ID

The AMD GPU agent target identifier agent_systag string has the following format:

AMDGPU Agent (GPUID target-agent-id)

It is used in the ‘Target ID’ column of the ‘info agents’ command and is available using the $_agent_systag convenience variable.

Queue Target ID

The AMD GPU queue target identifier queue_systag string has the following format:

AMDGPU Queue agent-id:queue-id (QID target-queue-id)

It is used in the ‘Target ID’ column of the ‘info queues’ command and is available using the $_queue_systag convenience variable.

Dispatch Target ID

The AMD GPU dispatch target identifier dispatch_systag string has the following format:

AMDGPU Dispatch agent-id:queue-id:dispatch-id (PKID target-packet-id)

It is used in the ‘Target ID’ column of the ‘info dispatches’ command and is available using the $_dispatch_systag convenience variable. The target-packet-id corresponds to the dispatch packet that initiated the dispatch.

Thread Target ID

The AMD GPU thread target identifier (systag) string has the following format:

AMDGPU Wave agent-id:queue-id:dispatch-id:wave-id (work-group-x,work-group-y,work-group-z)/work-group-thread-index

It is used in the ‘Target ID’ column of the ‘info threads’ command and is available using the $_thread_systag convenience variable.

Lane Target ID

The AMD GPU lane target identifier (lane_systag) string has the following format:

AMDGPU Lane agent-id:queue-id:dispatch-id:wave-id/lane-index (work-group-x,work-group-y,work-group-z)[work-item-x,work-item-y,work-item-z]

It is used in the ‘Target ID’ column of the ‘info lanes’ command and is available using the $_lane_systag convenience variable.

The AMD GPU heterogeneous entities have the following convenience variables:

$_dispatch_pos

The string returned by the $_dispatch_pos debugger convenience variable has the following format:

(work-group-x,work-group-y,work-group-z)/work-group-thread-index

$_thread_workgroup_pos

The string returned by the $_thread_workgroup_pos debugger convenience variable has the following format:

work-group-thread-index

$_lane_workgroup_pos

The string returned by the $_lane_workgroup_pos debugger convenience variable has the following format:

[work-item-x,work-item-y,work-item-z]

Where:

agent-id
queue-id
dispatch-id
wave-id

The AMD GPU target agent identifier, queue identifier, dispatch identifier, and wave identifier respectively. The identifiers are global across all inferiors.

target-agent-id
target-queue-id

The AMD GPU target driver agent identifier and queue identifier respectively. The identifiers are per process.

target-packet-id

The AMD GPU target driver packet identifier. The identifier is per queue.

work-group-x
work-group-y
work-group-z

The grid position of the thread’s work-group within the heterogeneous dispatch.

work-group-thread-index

The thread’s number within the heterogeneous work-group.

lane-index

The heterogeneous lane index within the thread.

work-item-x
work-item-y
work-item-z

The position of the heterogeneous lane’s work-item within the heterogeneous work-group.
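
For example, the convenience variables can be printed like any other convenience variable. The values shown are illustrative:

(gdb) print $_dispatch_pos
$1 = "(0,0,0)/1"
(gdb) print $_lane_workgroup_pos
$2 = "[1,0,0]"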

22.4.10.11 AMD GPU Address Spaces

AMD GPU heterogeneous agents support the following address spaces:

global

the default global virtual address space

region

the per heterogeneous agent shared address space (GDS (Global Data Store)) (see AMD GPU region Address Space Restrictions)

group

the per heterogeneous work-group shared address space (LDS (Local Data Store))

private

the per heterogeneous lane private address space (Scratch)

generic

the generic address space that can access the global, group, or private address spaces (Flat)

The AMD GPU architecture default address space is global.

The maint print address-spaces command can be used to display the AMD GPU architecture address spaces.

22.4.10.12 AMD GPU Signals

AMD GPU wavefronts can raise the following signals when executing instructions:

SIGILL

Execution of an illegal instruction.

SIGTRAP

Execution of a S_TRAP instruction other than:

  • S_TRAP 1 which is used by ROCGDB to insert breakpoints.
  • S_TRAP 2 which raises SIGABRT.

Note that S_TRAP 3 only raises a signal when ROCGDB is attached to the inferior. Otherwise, it is treated as a no-operation. The compiler generates S_TRAP 3 for the llvm.debugtrap intrinsic.

SIGABRT

Execution of a S_TRAP 2 instruction. The compiler generates S_TRAP 2 for the llvm.trap intrinsic which is used for assertions.

SIGFPE

Execution of a floating point or integer instruction detects a condition that is enabled to raise a signal. The conditions include:

  • Floating point operation is invalid.
  • Floating point operation had subnormal input that was rounded to zero.
  • Floating point operation performed a division by zero.
  • Floating point operation produced an overflow result. The result was rounded to infinity.
  • Floating point operation produced an underflow result. A subnormal result was rounded to zero.
  • Floating point operation produced an inexact result.
  • Integer operation performed a division by zero.

By default, these conditions are not enabled to raise signals. The ‘set $mode’ command can be used to change the AMD GPU wavefront’s mode register, whose bits control which conditions are enabled to raise signals. The ‘print $trapsts’ command can be used to inspect which conditions have been detected, even if they are not enabled to raise a signal.

SIGBUS

Execution of an instruction that accessed global memory using an address that is outside the virtual address range.

SIGSEGV

Execution of an instruction that accessed a global memory page that is either not mapped or accessed with incompatible permissions.

If a single instruction raises more than one signal, they will be reported one at a time each time the wavefront is continued.

If any of these signals are delivered to the wavefront, it will cause the wavefront to enter the halt state and cause the AMD ROCm runtime to put the associated queue into the queue error state. All wavefronts associated with a queue that is in the queue error state are inhibited from executing further instructions even if they are not in the halt state. In addition, when the AMD ROCm runtime puts a queue into the queue error state it may invoke an application registered callback that could either abort the application or delete the queue which will delete any wavefronts associated with the queue.

The ROCGDB signal-related commands (see Signals) can be used to control when a signal is delivered to the inferior, what signal is delivered to the inferior, and even if a signal should not be delivered to the inferior.
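
For example, the standard ‘handle’ command can be used to control whether ROCGDB stops and prints a message when an AMD GPU wavefront raises one of these signals. The session below is illustrative:

(gdb) handle SIGABRT nostop print pass
Signal        Stop      Print   Pass to program Description
SIGABRT       No        Yes     Yes             Aborted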

If the ‘signal’ or ‘queue-signal’ commands are used to deliver a signal other than those listed above to an AMD GPU wavefront, then the following error will be displayed when the wavefront is resumed:

Resuming with signal signal is not supported by this agent.

The wavefront will not be resumed and no signal will be delivered. Use the ‘signal’ or ‘queue-signal’ commands to change the signal to deliver, or use ‘signal 0’ or ‘queue-signal 0’ to suppress delivering a signal.
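
For example, to resume a wavefront without delivering the signal that stopped it, the session might look like this (illustrative):

(gdb) signal 0
Continuing with no signal.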

Note that some AMD GPU architectures may have restrictions on suppressing the delivery of signals to a wavefront (see AMD GPU Signal Restrictions).

22.4.10.13 AMD GPU Memory Violation Reporting

A wavefront can report memory violation and address watch access events. However, the program location at which they are reported may be after the machine instruction that caused them. This can result in the reported source statement being incorrect. The following commands can be used to control this behavior:

set amdgpu precise-memory mode

‘set amdgpu precise-memory’ controls how AMD GPU devices detect memory violations and address watch events. Where mode can be:

off

The program location may not be immediately after the instruction that caused the memory violation or address watch event. This is the default.

on

Requests that the program location be immediately after the instruction that caused a memory violation or address watch event. Enabling this mode may make the AMD GPU device execution significantly slower as it has to wait for each memory operation to complete before executing the next instruction.

For example:

(gdb) set amdgpu precise-memory off
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is off
(gdb)

If a memory violation or address watch access event is reported for an AMD GPU thread that supports controlling precise memory detection when the mode is ‘off’, then the message includes an indication that the position may not be accurate. For example:

(gdb) run
Warning: precise memory violation signal reporting is not enabled, reported
location may not be accurate.  See "show amdgpu precise-memory".

Thread 6 "bit_extract" received signal SIGSEGV, Segmentation fault.
0x00007ffee6a0a028 in bit_extract_kernel (C_d=<optimized out>, A_d=<optimized out>, N=<optimized out>) at bit_extract.cpp:38
38     size_t offset = (hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x);
(gdb)

The precise memory mode cannot be enabled until the inferior is started or attached. If at that time all AMD GPU devices accessible to the inferior support the ‘on’ mode, then it is enabled. For example:

(gdb) set amdgpu precise-memory on
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently disabled)
(gdb) run
...
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently enabled)
(gdb)

Alternatively, if at that time any of the AMD GPU devices accessible to the inferior do not support the ‘on’ mode, then a warning is reported and the mode is not enabled. For example:

(gdb) set amdgpu precise-memory on
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently disabled)
(gdb) run
AMDGPU precise memory access reporting could not be enabled
(gdb)

If the inferior is already executing when setting the ‘on’ mode, then a warning will be reported immediately. For example:

(gdb) set amdgpu precise-memory on
AMDGPU precise memory access reporting could not be enabled
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently disabled)
(gdb)

Otherwise, setting the ‘on’ mode will enable it immediately. For example:

(gdb) set amdgpu precise-memory on
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently enabled)
(gdb)

The set amdgpu precise-memory parameter is per-inferior. When an inferior forks and a child inferior is created as a result, the child inferior inherits the parameter value of the parent inferior.

show amdgpu precise-memory

‘show amdgpu precise-memory’ displays the currently requested AMD GPU precise memory setting. If ‘on’ has been requested, the message also indicates if it is currently enabled. For example:

(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently disabled)
(gdb)

22.4.10.14 AMD GPU Logging

The ‘set debug amdgpu log-level level’ command can be used to enable diagnostic messages for the AMD GPU target. The ‘show debug amdgpu log-level’ command displays the current AMD GPU target log level. See set debug amdgpu.

For example, the following will enable information messages and send the log to a new file:

(gdb) set debug amdgpu log-level info
(gdb) set logging overwrite
(gdb) set logging file log.out
(gdb) set logging debugredirect on
(gdb) set logging on

If you want to print the log to both the console and a file, omit the ‘set logging debugredirect on’ command. See Logging Output.

22.4.10.15 AMD GPU Version

The ‘show amdgpu version’ command can be used to print the ROCdbgapi library version and file path. For example:

(gdb) show amdgpu version
ROCdbgapi 0.56.0 (0.56.0-rocm-rel-4.5-56)
Loaded from `/opt/rocm-4.5.0/lib/librocm-dbgapi.so.0'

22.4.10.16 AMD GPU Restrictions

ROCGDB AMD GPU support is currently a prototype and has the following restrictions. Future releases aim to address these restrictions.

  1. The debugger convenience variables, convenience functions, and commands described in Debugging Heterogeneous Programs are not yet implemented unless noted below.

    The ‘info agents’, ‘info queues’, ‘info dispatches’, ‘queue find’, and ‘dispatch find’ commands are supported in the command line interface. However, they have no Python bindings.

    The debugger convenience variable $_wave_id is available; it returns a string that has the following format:

    (work-group-x,work-group-y,work-group-z)/work-group-thread-index
    

    Where:

    work-group-x
    work-group-y
    work-group-z

    The grid position of the thread’s work-group within the heterogeneous dispatch.

    work-group-thread-index

    The thread’s number within the heterogeneous work-group.

  2. The AMD ROCm compiler currently has the following limitations:
    • DWARF information for symbolic variables is not always generated at optimization levels above -O0.
    • DWARF information for symbolic variables allocated in the local address space is not generated correctly. The locations of these variables are always reported with an address of 0x0 in the local address space.
    • The AMD ROCm compiler currently adds the -mllvm -amdgpu-spill-cfi-saved-regs option for AMD GPU when the -g option is specified. This ensures registers not currently supported by the CFI generation are saved so the CFI information is correct. If this option is not used, the invalid DWARF may cause ROCGDB to report that it is unable to read memory (such as when reading arguments in a backtrace).
  3. ROCGDB does not currently make use of debug information describing an inactive lane’s logical current PC (and the AMD ROCm compiler currently does not generate it). Such debug information tells the debugger which instruction an inactive lane will execute next once it becomes active. Thus, currently, source locations displayed for inactive lanes always point to the wave’s physical PC. This is the same PC address as the PC of the lanes that are active.
  4. Only AMD GPU Code Object V3 and above is supported. This is the default for the AMD ROCm compiler. The following error will be reported for incompatible code objects:
    Error while mapping shared library sections:
    `file:///rocm/bit_extract#offset=6751&size=3136': ELF file ABI version (0) is not supported.
    
  5. No support yet for AMD GPU core dumps.
  6. When in non-stop mode, wavefronts may not hit breakpoints inserted while not stopped, nor see memory updates made while not stopped, until the wavefront is next stopped. Memory updated by non-stopped wavefronts may not be visible until the wavefront is next stopped.
  7. Single-stepping or resuming execution from an illegal instruction may execute differently in ROCGDB than on real hardware.
  8. Halting AMD GPU wavefronts in an inferior can result in preventing other inferiors from executing AMD GPU wavefronts.
  9. The performance of resuming from a breakpoint when a large number of threads have hit a breakpoint on a fully occupied single AMD GPU device is comparable to CPU debugging. However, the techniques described in Debugging Heterogeneous Programs can be used to improve responsiveness. Other techniques that can improve responsiveness are:
    • Try to avoid having a lot of threads stopping at a breakpoint. For example, by placing breakpoints in conditional paths only executed by one thread.
    • Use of tbreak so only one thread reports the breakpoint and the other threads hitting the breakpoint will be continued. A similar effect can be achieved by deleting the breakpoint manually when it is hit.
    • Reduce the number of wavefronts when debugging if practical.
  10. Some AMD GPU devices, such as ‘gfx90a’, can be in use by multiple processes that are being debugged by ROCGDB. For other devices the following warning message may be displayed.
    amd-dbgapi: warning: At least one agent is busy (debugging may be enabled by another process)
    

    ROCGDB will continue to function except that no AMD GPU debugging will be possible.

    The Linux cgroups facility can be used to limit which AMD GPU devices are used by a process. In order for a ROCGDB process to access the AMD GPU devices of the process it is debugging, the AMD GPU devices must be included in the ROCGDB process cgroup.

    Therefore, multiple ROCGDB processes can each debug a process provided the cgroups specify disjoint sets of AMD GPU devices. However, a single ROCGDB process cannot debug multiple inferiors that use AMD GPU devices even if those inferiors have cgroups that specify disjoint AMD GPU devices. This is because the ROCGDB process must have all the AMD GPU devices in its cgroups and so will attempt to enable debugging for all AMD GPU devices for all inferiors it is debugging.

    It is suggested to use Docker rather than cgroups directly to limit the AMD GPU devices visible inside a container:

    1. ‘/dev/kfd’ must be mapped into the container.
    2. The ‘/dev/dri/renderD<render-minor-number>’ and ‘/dev/dri/card<node-number>’ files corresponding to each AMD GPU device that is to be visible must be mapped into the container. Note that non-AMD GPU devices may also be present.

      The render-minor-number for a device can be obtained by looking at the ‘drm_render_minor’ field value from:

      cat /sys/class/kfd/kfd/topology/nodes/<node-number>/properties
      
    3. Make sure the container user is a member of the render group for Ubuntu 20.04 onward and the video group for all other distributions.
    4. Specify the ‘--cap-add=SYS_PTRACE’ and ‘--security-opt seccomp=unconfined’ options.
    5. Install the AMD ROCm packages in the container. See https://github.com/RadeonOpenCompute/ROCm-docker.

    All processes running in the container will see the same subset of devices. By having two containers with non-overlapping sets of AMD GPUs, it is possible to use ROCGDB in both containers at the same time since each AMD GPU device will only have one ROCGDB process accessing it.

    For example:

    docker run -it --rm --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
        --device=/dev/kfd --device=/dev/dri/card0 --device=/dev/dri/renderD128 \
        --group-add render ubuntu:20.04 /bin/bash
    
  11. The HIP runtime currently performs deferred code object loading by default. AMD GPU code objects are not loaded until the first kernel is launched. Before then, all breakpoints have to be set as pending breakpoints.

    If breakpoints are set using source line positions that only correspond to source lines in unloaded code objects, then ROCGDB may not create pending breakpoints, and may instead set breakpoints at unpredictable places in the loaded code objects if they contain code from the same file. This can result in unexpected breakpoint hits being reported. When the code object containing the source lines is loaded, the incorrect breakpoints will be removed and replaced by the correct ones. This problem can be avoided by only setting breakpoints in unloaded code objects using symbol or function names.

    The HIP_ENABLE_DEFERRED_LOADING environment variable can be used to disable deferred code object loading by the HIP runtime. This ensures all code objects will be loaded when the inferior reaches the beginning of the main function.

    For example,

    export HIP_ENABLE_DEFERRED_LOADING=0
    

    Note: If deferred code object loading is disabled and the application performs a fork, then the program may crash.

    Note: Disabling deferred code object loading can result in errors being reported when executing ROCGDB due to open file limitations when the application contains a large number of embedded device code objects. With deferred code object loading enabled, only the device code objects actually invoked are loaded, and so ROCGDB opens fewer files.

  12. ROCGDB supports watchpoints, but limits the capabilities to the lowest common denominator of the heterogeneous agents in the system. Hardware supported watchpoints are used when possible, otherwise software emulation is used. Software emulation involves using single stepping and reading memory to determine if values have changed, and as a result performs substantially slower than hardware watchpoints.

    The supported AMD GPU architectures provide a maximum of 4 hardware write watchpoints. Precise read watchpoints or access watchpoints are not supported.

    The x86 architecture provides 4 hardware watchpoints that can each monitor up to 8 bytes.

    When ROCGDB is used with x86 and AMD GPU devices, hardware watchpoints are therefore limited to at most 4 write watchpoints that have a collective size of up to 32 bytes. The collective size is calculated by adding the size of each watchpoint rounded up to a multiple of 8 bytes. Software emulation will be used for watchpoints that exceed the hardware limitations.

    Currently, watchpoints are only created on the CPU, and not the AMD GPU, until the AMD ROCm runtime is initialized. With deferred code object loading disabled this does not happen until the inferior reaches the beginning of the main function. With deferred code object loading enabled this does not happen until the first kernel is executed. This also means that, when the inferior is re-run, watchpoints are only re-activated on the CPU, not on the AMD GPU.

  13. Watchpoints are not reported for memory that is written by memory transfers performed by DMA (Direct Memory Access) hardware. The HSA_ENABLE_SDMA environment variable can be set to ‘0’ to prevent the AMD ROCm runtime from using DMA for transfers between the CPU and AMD GPU.
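
    For example,

    export HSA_ENABLE_SDMA=0
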
  14. When single stepping there can be times when ROCGDB appears to wait indefinitely for the single step to complete. If this happens, ‘Ctrl-C’ can be used to cancel the single step command so it can be tried again.
  15. If no CPU thread is running, then ‘Ctrl-C’ is not able to stop AMD GPU threads. This can happen for example if you enable scheduler-locking after the whole program stopped, and then resume an AMD GPU thread. For example:
    Thread 6 hit Breakpoint 1, with lanes [0-63], kernel () at test.cpp:38
    38          size_t l = 0;
    (gdb) info threads
      Id   Target Id                            Frame
      1    Thread 0x7ffff6493880 (LWP 2222574)  0x00007ffff6cb989b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
      2    Thread 0x7ffff6492700 (LWP 2222582)  0x00007ffff6ccb50b in ioctl () at ../sysdeps/unix/syscall-template.S:78
      4    Thread 0x7ffff5aff700 (LWP 2222584)  0x00007ffff6ccb50b in ioctl () at ../sysdeps/unix/syscall-template.S:78
      5    Thread 0x7ffff515d700 (LWP 2222585)  0x00007ffff6764d81 in rocr::core::InterruptSignal::WaitRelaxed() from /opt/rocm/lib/libhsa-runtime64.so.1
    * 6    AMDGPU Wave 1:1:1:1 (0,0,0)/0        kernel () at test.cpp:38
    (gdb) del 1
    (gdb) set scheduler-locking on
    (gdb) c
    Continuing.
    ^C
    

    Above, ROCGDB does not respond to ‘Ctrl-C’. The only way to unblock the situation is to kill the ROCGDB process.

  16. The HIP runtime currently loads code objects from memory, including when loading modules from a file, which results in code object URIs being reported as ‘memory://’.

    The HSA_LOADER_ENABLE_MMAP_URI environment variable can be used to request that the AMD ROCm runtime attempt to determine the file containing the code object memory so that ‘file://’ URIs can be reported.

    For example,

    export HSA_LOADER_ENABLE_MMAP_URI=1
    
  17. AMD GPU target does not currently support calling inferior functions.
  18. The gdbserver is not supported.
  19. No language specific support for Fortran or OpenCL. No OpenMP language extension support for C, C++, or Fortran.
  20. Does not support the AMD ROCm HCC compiler or runtime available as part of releases before ROCm 3.5.
  21. AMD GPU target does not currently support the compiler address, memory, or thread sanitizers.
  22. ROCGDB support for AMD GPU devices is not currently available under virtualization.
  23. Suppressing delivery of some signals to a wavefront for some AMD GPU architectures may not prevent the AMD ROCm runtime from putting the associated queue into the queue error state. For example, suppressing the SIGSEGV signal may prevent the wavefront from being put in the halt state, but the AMD ROCm runtime may still put the associated queue into the queue error state.

    Suppressing delivery of some signals, such as SIGSEGV, for a wavefront may also suppress the same signal raised by other AMD GPU hardware such as from DMA or from the packet processor, preventing the AMD ROCm runtime from being notified.

    See AMD GPU Signals.

  24. By default, for some architectures, the AMD GPU device driver causes all AMD GPU wavefronts created when ROCGDB is not attached to be unable to report the heterogeneous dispatch associated with the wavefront, or the wavefront’s heterogeneous work-group position. The ‘info threads’ command will display this missing information with a ‘?’.

    For example,

    (gdb) info threads
      Id   Target Id                                       Frame
    * 1    Thread 0x7ffff6987840 (LWP 62056) "bit_extract" 0x00007ffff6da489b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
      2    Thread 0x7ffff6986700 (LWP 62064) "bit_extract" 0x00007ffff6db650b in ioctl () at ../sysdeps/unix/syscall-template.S:78
      3    Thread 0x7ffff5f7f700 (LWP 62066) "bit_extract" 0x00007ffff6db650b in ioctl () at ../sysdeps/unix/syscall-template.S:78
      4    Thread 0x7ffff597f700 (LWP 62067) "bit_extract" 0x00007ffff6db650b in ioctl () at ../sysdeps/unix/syscall-template.S:78
      5    AMDGPU Wave 1:2:?:1 (?,?,?)/? "bit_extract"     bit_extract_kernel (C_d=<optimized out>, A_d=<optimized out>, N=<optimized out>) at bit_extract.cpp:41
    

    This does not affect wavefronts created while ROCGDB is attached, which are always capable of reporting this information.

    If the HSA_ENABLE_DEBUG environment variable is set to ‘1’ when the AMD ROCm runtime is initialized, then this information will be available for all architectures, even for wavefronts created when ROCGDB was not attached. Setting this environment variable may very marginally increase wavefront launch latency for some architectures for very short-lived wavefronts.

  25. If the inferior exits while there are wavefronts that have reported events, such as breakpoints, that ROCGDB has not completed processing, then the following error may be displayed.
    Couldn't get registers: No such process.
    
  26. The current ROCGDB implementation represents addresses using a 64-bit value in which the top 8 bits are used to represent information about the address space. This means that an explicit conversion between an integral value and a pointer (for example, in a user entered expression) may result in the top 8 bits of that value being forced to zero. The same happens if an integral value used with the ‘#’ address space qualifier is large enough to have the top 8 bits non-zero.
  27. ROCGDB does not support reading or writing to the region address space. A memory access error is reported.
  28. If an AMD GPU wavefront has the DX10_CLAMP bit set in the MODE register, enabled arithmetic exceptions will not be reported as SIGFPE signals. This happens if the DX10_CLAMP kernel descriptor field is enabled.

    See AMD GPU Signals.

  29. ROCGDB does not support single root I/O virtualization (SR-IOV) on any AMD GPU architecture that supports it. That includes ‘gfx1030’ and ‘gfx1031’.
