AMD ROCm ROCProfiler

Overview

The rocProf is a command line tool implemented on the top of rocProfiler and rocTracer APIs. Source code for rocProf can be found at GitHub: https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/bin/rocprof
This command line tool is implemented as a script which is setting up the environment for attaching the profiler and then run the provided application command line. The tool uses two profiling plugins loaded by ROC runtime and based on rocProfiler and rocTracer for collecting metrics/counters, HW traces and runtime API/activity traces. The tool consumes an input XML or text file with counters list or trace parameters and provides output profiling data and statistics in various formats as text, CSV and JSON traces. Google Chrome tracing can be used to visualize the JSON traces with runtime API/activity timelines and per kernel counters data.

Profiling Modes

‘rocprof’ can be used for GPU profiling using HW counters and application tracing

GPU profiling

GPU profiling is controlled with input file which defines a list of metrics/counters and a profiling scope. An input file is provided using option ‘-i ’. Output CSV file with a line per submitted kernel is generated. Each line has kernel name, kernel parameters and counter values. By option ‘—stats’ the kernel execution stats can be generated in CSV format. Currently profiling has limitation of serializing submitted kernels. An example of input file:

# Perf counters group 1
pmc : Wavefronts VALUInsts SALUInsts SFetchInsts
# Perf counters group 2
pmc : TCC_HIT[0], TCC_MISS[0]
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 1 : 4
gpu: 0 1 2 3
kernel: simple Pass1 simpleConvolutionPass2

An example of profiling command line for ‘MatrixTranspose’ application

$ rocprof -i input.txt MatrixTranspose
RPL: on '191018_011134' from '/…./rocprofiler_pkg' in '/…./MatrixTranspose'
RPL: profiling '"./MatrixTranspose"'
RPL: input file 'input.txt'
RPL: output dir '/tmp/rpl_data_191018_011134_9695'
RPL: result dir '/tmp/rpl_data_191018_011134_9695/input0_results_191018_011134'
ROCProfiler: rc-file '/…./rpl_rc.xml'
ROCProfiler: input from "/tmp/rpl_data_191018_011134_9695/input0.xml"
  gpu_index =
  kernel =
  range =
  4 metrics
    L2CacheHit, VFetchInsts, VWriteInsts, MemUnitStalled
  0 traces
Device name Ellesmere [Radeon RX 470/480/570/570X/580/580X]
PASSED!

ROCProfiler: 1 contexts collected, output directory /tmp/rpl_data_191018_011134_9695/input0_results_191018_011134
RPL: '/…./MatrixTranspose/input.csv' is generated

Counters and metrics

There are two profiling features, metrics and traces. Hardware performance counters are treated as the basic metrics and the formulas can be defined for derived metrics. Counters and metrics can be dynamically configured using XML configuration files with counters and metrics tables:

  • Counters table entry, basic metric: counter name, block name, event id

  • Derived metrics table entry: metric name, an expression for calculation the metric from the counters

Metrics XML File Example:

<gfx8>
    <metric name=L1_CYCLES_COUNTER block=L1 event=0 descr=”L1 cache cycles”></metric>
    <metric name=L1_MISS_COUNTER block=L1 event=33 descr=”L1 cache misses”></metric>
    . . .
</gfx8>

<gfx9>
    . . .
</gfx9>

<global>
  <metric
    name=L1_MISS_RATIO
    expr=L1_CYCLES_COUNT/L1_MISS_COUNTER
    descry=”L1 miss rate metric”
  ></metric>
</global>
Metrics query

Available counters and metrics can be queried by options ‘—list-basic’ for counters and ‘—list-derived’ for derived metrics. The output for counters indicates number of block instances and number of block counter registers. The output for derived metrics prints the metrics expressions. Examples:

$ rocprof --list-basic
RPL: on '191018_014450' from '/opt/rocm/rocprofiler' in '/…./MatrixTranspose'
ROCProfiler: rc-file '/…./rpl_rc.xml'
Basic HW counters:
  gpu-agent0 : GRBM_COUNT : Tie High - Count Number of Clocks
      block GRBM has 2 counters
  gpu-agent0 : GRBM_GUI_ACTIVE : The GUI is Active
      block GRBM has 2 counters
      . . .
  gpu-agent0 : TCC_HIT[0-15] : Number of cache hits.
      block TCC has 4 counters
  gpu-agent0 : TCC_MISS[0-15] : Number of cache misses. UC reads count as misses.
      block TCC has 4 counters
      . . .

$ rocprof --list-derived
RPL: on '191018_015911' from '/opt/rocm/rocprofiler' in '/home/evgeny/work/BUILD/0_MatrixTranspose'
ROCProfiler: rc-file '/home/evgeny/rpl_rc.xml'
Derived metrics:
  gpu-agent0 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
      TCC_HIT_sum = sum(TCC_HIT,16)
  gpu-agent0 : TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
      TCC_MISS_sum = sum(TCC_MISS,16)
  gpu-agent0 : TCC_MC_RDREQ_sum : Number of 32-byte reads. Sum over TCC instaces.
      TCC_MC_RDREQ_sum = sum(TCC_MC_RDREQ,16)
    . . .
Metrics collecting

Counters and metrics accumulated per kernel can be collected using input file with a list of metrics, see an example in 2.1. Currently profiling has limitation of serializing submitted kernels. The number of counters which can be dumped by one run is limited by GPU HW by number of counter registers per block. The number of counters can be different for different blocks and can be queried, see 2.1.1.1.

Blocks instancing

GPU blocks are implemented as several identical instances. To dump counters of specific instance square brackets can be used, see an example in 2.1. The number of block instances can be queried, see 2.1.1.1.

HW limitations

The number of counters which can be dumped by one run is limited by GPU HW by number of counter registers per block. The number of counters can be different for different blocks and can be queried, see 2.1.1.1.

  • Metrics groups

To dump a list of metrics exceeding HW limitations the metrics list can be split on groups. The tool supports automatic splitting on optimal metric groups:

$ rocprof -i input.txt ./MatrixTranspose
RPL: on '191018_032645' from '/opt/rocm/rocprofiler' in '/…./MatrixTranspose'
RPL: profiling './MatrixTranspose'
RPL: input file 'input.txt'
RPL: output dir '/tmp/rpl_data_191018_032645_12106'
RPL: result dir '/tmp/rpl_data_191018_032645_12106/input0_results_191018_032645'
ROCProfiler: rc-file '/…./rpl_rc.xml'
ROCProfiler: input from "/tmp/rpl_data_191018_032645_12106/input0.xml"
  gpu_index =
  kernel =
  range =
  20 metrics
    Wavefronts, VALUInsts, SALUInsts, SFetchInsts, FlatVMemInsts, LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize, WriteSize, L2CacheHit, VWriteInsts, GPUBusy, VALUBusy, SALUBusy, MemUnitStalled, WriteUnitStalled, LDSBankConflict, MemUnitBusy
  0 traces
Device name Ellesmere [Radeon RX 470/480/570/570X/580/580X]

Input metrics out of HW limit. Proposed metrics group set:
  group1: L2CacheHit VWriteInsts MemUnitStalled WriteUnitStalled MemUnitBusy FetchSize FlatVMemInsts LDSInsts VALUInsts SALUInsts SFetchInsts FlatLDSInsts GPUBusy Wavefronts
  group2: WriteSize GDSInsts VALUUtilization VALUBusy SALUBusy LDSBankConflict

ERROR: rocprofiler_open(), Construct(), Metrics list exceeds HW limits

Aborted (core dumped)
Error found, profiling aborted.
  • Collecting with multiple runs

To collect several metric groups a full application replay is used by defining several ‘pmc:’ lines in the input file, see 2.1.

Application tracing

Supported application tracing includes runtime API and GPU activity tracing’ Supported runtimes are: ROCr (HSA API) and HIP Supported GPU activity: kernel execution, async memory copy, barrier packets. The trace is generated in JSON format compatible with Chrome tracing. The trace consists of several sections with timelines for API trace per thread and GPU activity. The timelines events show event name and parameters. Supported options: ‘—hsa-trace’, ‘—hip-trace’, ‘—sys-trace’, where ‘sys trace’ is for HIP and HSA combined trace.

HIP runtime trace

The trace is generated by option ‘—hip-trace’ and includes HIP API timelines and GPU activity at the runtime level.

ROCr runtime trace

The trace is generated by option ‘—hsa-trace’ and includes ROCr API timelines and GPU activity at AQL queue level. Also, can provide counters per kernel.

KFD driver trace

The trace is generated by option ‘—kfd-trace’ and includes KFD Thunk API timelines.

It is planned to include memory allocations/migration activity tracing.

Code annotation

Support for application code annotation. Start/stop API is supported to programmatically control the profiling. A ‘roctx’ library provides annotation API. Annotation is visualized in JSON trace as a separate “Markers and Ranges” timeline section.

Start/stop API
// Tracing start API
void roctracer_start();

// Tracing stop API
void roctracer_stop();
rocTX basic markers API
// A marker created by given ASCII message
void roctxMark(const char* message);

// Returns the 0 based level of a nested range being started by given message associated to this range.
// A negative value is returned on the error.
int roctxRangePush(const char* message);

// Marks the end of a nested range.
// Returns the 0 based level the range.
// A negative value is returned on the error.
int roctxRangePop();

Multiple GPUs profiling

The profiler supports multiple GPU’s profiling and provide GPI id for counters and kernels data in CSV output file. Also, GPU id is indicating for respective GPU activity timeline in JSON trace.

Profiling control

Profiling can be controlled by specifying a profiling scope, by filtering trace events and specifying interesting time intervals.

Profiling scope

Counters profiling scope can be specified by GPU id list, kernel name substrings list and dispatch range. Supported range formats examples: “3:9”, “3:”, “3”. You can see an example of input file in 2.1.

Tracing control

Tracing can be filtered by events names using profiler input file and by enabling interesting time intervals by command line option.

Filtering traced APIs

A list of traced API names can be specified in profiler input file. An example of input file line for ROCr runtime trace (HAS API): hsa: hsa_queue_create hsa_amd_memory_pool_allocate

Tracing period

Tracing can be disabled on start so it can be enabled with start/stop API:

--trace-start <on|off>

Trace can be dumped periodically with initial delay, dumping period length and rate:

--trace-period <dealy:length:rate>

Concurrent kernels

Currently concurrent kernels profiling is not supported which is a planned feature. Kernels are serialized.

Multi-processes profiling

Multi-processes profiling is not currently supported.

Errors logging

Profiler errors are logged to global logs:

/tmp/aql_profile_log.txt
/tmp/rocprofiler_log.txt
/tmp/roctracer_log.txt

3rd party visualization tools

‘rocprof’ is producing JSON trace compatible with Chrome Tracing, which is an internal trace visualization tool in Google Chrome.

Chrome tracing

Good review can be found by the link: https://aras-p.info/blog/2017/01/23/Chrome-Tracing-as-Profiler-Frontend/

Command line options

The command line options can be printed with option ‘-h’:

rocprof [-h] [--list-basic] [--list-derived] [-i <input .txt/.xml file>] [-o <output CSV file>] <app command line>

Options:
-h - this help
--verbose - verbose mode, dumping all base counters used in the input metrics
--list-basic - to print the list of basic HW counters
--list-derived - to print the list of derived metrics with formulas
--cmd-qts <on|off> - quoting profiled cmd line [on]

-i <.txt|.xml file> - input file
    Input file .txt format, automatically rerun application for every pmc line:

      # Perf counters group 1
      pmc : Wavefronts VALUInsts SALUInsts SFetchInsts FlatVMemInsts LDSInsts FlatLDSInsts GDSInsts FetchSize
      # Perf counters group 2
      pmc : VALUUtilization,WriteSize L2CacheHit
      # Filter by dispatches range, GPU index and kernel names
      # supported range formats: "3:9", "3:", "3"
      range: 1 : 4
      gpu: 0 1 2 3
      kernel: simple Pass1 simpleConvolutionPass2

    Input file .xml format, for single profiling run:

      # Metrics list definition, also the form "<block-name>:<event-id>" can be used
      # All defined metrics can be found in the 'metrics.xml'
      # There are basic metrics for raw HW counters and high-level metrics for derived counters
      <metric name=SQ:4,SQ_WAVES,VFetchInsts
      ></metric>

      # Filter by dispatches range, GPU index and kernel names
      <metric
        # range formats: "3:9", "3:", "3"
        range=""
        # list of gpu indexes "0,1,2,3"
        gpu_index=""
        # list of matched sub-strings "Simple1,Conv1,SimpleConvolution"
        kernel=""
      ></metric>

-o <output file> - output CSV file [<input file base>.csv]
  The output CSV file columns meaning in the columns order:
    Index - kernels dispatch order index
    KernelName - the dispatched kernel name
    gpu-id - GPU id the kernel was submitted to
    queue-id - the ROCm queue unique id the kernel was submitted to
    queue-index - The ROCm queue write index for the submitted AQL packet
    tid - system application thread id which submitted the kernel
    grd - the kernel's grid size
    wgr - the kernel's work group size
    lds - the kernel's LDS memory size
    scr - the kernel's scratch memory size
    vgpr - the kernel's VGPR size
    sgpr - the kernel's SGPR size
    fbar - the kernel's barriers limitation
    sig - the kernel's completion signal
    ... - The columns with the counters values per kernel dispatch
    DispatchNs/BeginNs/EndNs/CompleteNs - timestamp columns if time-stamping was enabled

-d <data directory> - directory where profiler store profiling data including thread treaces [/tmp]
    The data directory is renoving autonatically if the directory is matching the temporary one, which is the default.
-t <temporary directory> - to change the temporary directory [/tmp]
    By changing the temporary directory you can prevent removing the profiling data from /tmp or enable removing from not '/tmp' directory.

--basenames <on|off> - to turn on/off truncating of the kernel full function names till the base ones [off]
--timestamp <on|off> - to turn on/off the kernel dispatches timestamps, dispatch/begin/end/complete [off]
  Four kernel timestamps in nanoseconds are reported:
      DispatchNs - the time when the kernel AQL dispatch packet was written to the queue
      BeginNs - the kernel execution begin time
      EndNs - the kernel execution end time
      CompleteNs - the time when the completion signal of the AQL dispatch packet was received

--ctx-limit <max number> - maximum number of outstanding contexts [0 - unlimited]
--heartbeat <rate sec> - to print progress heartbeats [0 - disabled]

--stats - generating kernel execution stats, file <output name>.stats.csv
--roctx-trace - to enable rocTX applicatin code annotation trace
  Will show the application code annotation in JSON trace "Markers and Ranges" section.
--sys-trace - to trace HIP/HSA APIs and GPU activity, generates stats and JSON trace chrome-tracing compatible
--hip-trace - to trace HIP, generates API execution stats and JSON file chrome-tracing compatible
--hsa-trace - to trace HSA, generates API execution stats and JSON file chrome-tracing compatible
--kfd-trace - to trace KFD, generates API execution stats and JSON file chrome-tracing compatible
  Generated files: <output name>.<domain>_stats.txt <output name>.json
  Traced API list can be set by input .txt or .xml files.
  Input .txt:
    hsa: hsa_queue_create hsa_amd_memory_pool_allocate
  Input .xml:
    <trace name="HSA">
      <parameters list="hsa_queue_create, hsa_amd_memory_pool_allocate">
      </parameters>
    </trace>

--trace-start <on|off> - to enable tracing on start [on]
--trace-period <dealy:length:rate> - to enable trace with initial delay, with periodic sample length and rate
  Supported time formats: <number(m|s|ms|us)>
--obj-tracking <on|off> - to turn on/off kernels code objects tracking [off]
  To support V3 code objects.

Configuration file:
You can set your parameters defaults preferences in the configuration file 'rpl_rc.xml'. The search path sequence: .:/home/      evgeny:<package path>
First the configuration file is looking in the current directory, then in your home, and then in the package directory.
Configurable options: 'basenames', 'timestamp', 'ctx-limit', 'heartbeat', 'obj-tracking'.
An example of 'rpl_rc.xml':
  <defaults
    basenames=off
    timestamp=off
    ctx-limit=0
    heartbeat=0
    obj-tracking=off
  ></defaults>

Publicly available counters and metrics

The following counters are publicly available for commercially available VEGA10/20 GPUs.

Counters:

•   GRBM_COUNT : Tie High - Count Number of Clocks
•   GRBM_GUI_ACTIVE : The GUI is Active
•   SQ_WAVES : Count number of waves sent to SQs. (per-simd, emulated, global)
•   SQ_INSTS_VALU : Number of VALU instructions issued. (per-simd, emulated)
•   SQ_INSTS_VMEM_WR : Number of VMEM write instructions issued (including FLAT). (per-simd, emulated)
•   SQ_INSTS_VMEM_RD : Number of VMEM read instructions issued (including FLAT). (per-simd, emulated)
•   SQ_INSTS_SALU : Number of SALU instructions issued. (per-simd, emulated)
•   SQ_INSTS_SMEM : Number of SMEM instructions issued. (per-simd, emulated)
•   SQ_INSTS_FLAT : Number of FLAT instructions issued. (per-simd, emulated)
•   SQ_INSTS_FLAT_LDS_ONLY : Number of FLAT instructions issued that read/wrote only from/to LDS (only works if EARLY_TA_DONE is enabled). (per-simd, emulated)
•   SQ_INSTS_LDS : Number of LDS instructions issued (including FLAT). (per-simd, emulated)
•   SQ_INSTS_GDS : Number of GDS instructions issued. (per-simd, emulated)
•   SQ_WAIT_INST_LDS : Number of wave-cycles spent waiting for LDS instruction issue. In units of 4 cycles. (per-simd, nondeterministic)
•   SQ_ACTIVE_INST_VALU : regspec 71? Number of cycles the SQ instruction arbiter is working on a VALU instruction. (per-simd, nondeterministic)
•   SQ_INST_CYCLES_SALU : Number of cycles needed to execute non-memory read scalar operations. (per-simd, emulated)
•   SQ_THREAD_CYCLES_VALU : Number of thread-cycles used to execute VALU operations (similar to INST_CYCLES_VALU but multiplied by # of active threads). (per-simd)
•   SQ_LDS_BANK_CONFLICT : Number of cycles LDS is stalled by bank conflicts. (emulated)
•   TA_TA_BUSY[0-15] : TA block is busy. Perf_Windowing not supported for this counter.
•   TA_FLAT_READ_WAVEFRONTS[0-15] : Number of flat opcode reads processed by the TA.
•   TA_FLAT_WRITE_WAVEFRONTS[0-15] : Number of flat opcode writes processed by the TA.
•   TCC_HIT[0-15] : Number of cache hits.
•   TCC_MISS[0-15] : Number of cache misses. UC reads count as misses.
•   TCC_EA_WRREQ[0-15] : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Atomics may travel over the same interface and are generally classified as write requests. This does not include probe commands.
•   TCC_EA_WRREQ_64B[0-15] : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface.
•   TCC_EA_WRREQ_STALL[0-15] : Number of cycles a write request was stalled.
•   TCC_EA_RDREQ[0-15] : Number of TCC/EA read requests (either 32-byte or 64-byte)
•   TCC_EA_RDREQ_32B[0-15] : Number of 32-byte TCC/EA read requests
•   TCP_TCP_TA_DATA_STALL_CYCLES[0-15] : TCP stalls TA data interface. Now Windowed.

The following derived metrics have been defined and the profiler metrics XML specification can be found at: https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/test/tool/metrics.xml.

Metrics:

•   TA_BUSY_avr : TA block is busy. Average over TA instances.
•   TA_BUSY_max : TA block is busy. Max over TA instances.
•   TA_BUSY_min : TA block is busy. Min over TA instances.
•   TA_FLAT_READ_WAVEFRONTS_sum : Number of flat opcode reads processed by the TA. Sum over TA instances.
•   TA_FLAT_WRITE_WAVEFRONTS_sum : Number of flat opcode writes processed by the TA. Sum over TA instances.
•   TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
•   TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
•   TCC_EA_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC instances.
•   TCC_EA_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC instances.
•   TCC_EA_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC instances.
•   TCC_EA_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write or CMPSWAP) over the TC_EA_wrreq interface. Sum over TCC instances.
•   TCC_WRREQ_STALL_max : Number of cycles a write request was stalled. Max over TCC instances.
•   TCC_MC_WRREQ_sum : Number of 32-byte effective writes. Sum over TCC instaces.
•   FETCH_SIZE : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
•   WRITE_SIZE : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
•   GPUBusy : The percentage of time GPU was busy.
•   Wavefronts : Total wavefronts.
•   VALUInsts : The average number of vector ALU instructions executed per work-item (affected by flow control).
•   SALUInsts : The average number of scalar ALU instructions executed per work-item (affected by flow control).
•   VFetchInsts : The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that fetch from video memory.
•   SFetchInsts : The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
•   VWriteInsts : The average number of vector write instructions to the video memory executed per work-item (affected by flow control). Excludes FLAT instructions that write to video memory.
•   FlatVMemInsts : The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch.
•   LDSInsts : The average number of LDS read or LDS write instructions executed per work item (affected by flow control).  Excludes FLAT instructions that read from or write to LDS.
•   FlatLDSInsts : The average number of FLAT instructions that read or write to LDS executed per work item (affected by flow control).
•   GDSInsts : The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
•   VALUUtilization : The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).
•   VALUBusy : The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
•   SALUBusy : The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
•   Mem32Bwrites :
•   FetchSize : The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
•   WriteSize : The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
•   L2CacheHit : The percentage of fetch, write, atomic, and other instructions that hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).
•   MemUnitBusy : The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
•   MemUnitStalled : The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
•   WriteUnitStalled : The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).
•   ALUStalledByLDS : The percentage of GPUTime ALU units are stalled by the LDS input queue being full or the output queue being not ready. If there are LDS bank conflicts, reduce them. Otherwise, try reducing the number of LDS accesses if possible. Value range: 0% (optimal) to 100% (bad).
•   LDSBankConflict : The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).

AMD ROCProfiler Library

ROC profiler library. Profiling with perf-counters and derived metrics. Library supports GFX8/GFX9.

HW specific low-level performance analysis interface for profiling of GPU compute applications. The profiling includes HW performance counters with complex performance metrics.

GitHub: https://github.com/ROCm-Developer-Tools/rocprofiler

Metrics

API specification

To get sources

To clone ROC Profiler from GitHub:

git clone https://github.com/ROCm-Developer-Tools/rocprofiler

The library source tree:

  *  bin
      *  rocprof - Profiling tool run script
  *  doc - Documentation
  *  inc/rocprofiler.h - Library public API
  *  src - Library sources
      *  core - Library API sources
      *  util - Library utils sources
      *  xml - XML parser
  *  test - Library test suite
      *  tool - Profiling tool
          *  tool.cpp - tool sources
          *  metrics.xml - metrics config file
      *  ctrl - Test controll
      *  util - Test utils
      *  simple_convolution - Simple convolution test kernel

Build

Build environment:

export CMAKE_PREFIX_PATH=<path to hsa-runtime includes>:<path to hsa-runtime library>
export CMAKE_BUILD_TYPE=<debug|release> # release by default
export CMAKE_DEBUG_TRACE=1 # to enable debug tracing

To Build with the current installed ROCm:

To build and install to /opt/rocm/rocprofiler
export CMAKE_PREFIX_PATH=/opt/rocm/include/hsa:/opt/rocm
cd ../rocprofiler
mkdir build
cd build
cmake ..
make
make install

Internal ‘simple_convolution’ test run script:

cd ../rocprofiler/build
./run.sh

To enable error messages logging to ‘/tmp/rocprofiler_log.txt’:

export ROCPROFILER_LOG=1

To enable verbose tracing:

export ROCPROFILER_TRACE=1

AMD ROCm ROCTracer

ROCtracer library, Runtimes Generic Callback/Activity APIs. The goal of the implementation is to provide a generic independent from specific runtime profiler to trace API and asyncronous activity.

The API provides functionality for registering the runtimes API callbacks and asyncronous activity records pool support.

GitHub: https://github.com/ROCm-Developer-Tools/roctracer

API specification

To get sources

To clone ROC Tracer from GitHub:

git clone -b amd-master https://github.com/ROCm-Developer-Tools/roctracer

The library source tree:

  *  inc/roctracer.h - Library public API
  *  src - Library sources
      *  core - Library API sources
      *  util - Library utils sources
  *  test - test suit
      *  MatrixTranspose - test based on HIP MatrixTranspose sample

Build and run test

- Python is required
  The required modules: CppHeaderParser, argparse.
  To install:
  sudo pip install CppHeaderParser argparse

- To customize environment, below are defaults
 export HIP_PATH=/opt/rocm/HIP
 export HCC_HOME=/opt/rocm/hcc/
 export CMAKE_PREFIX_PATH=/opt/rocm

- Build ROCtracer
 export CMAKE_BUILD_TYPE=<debug|release> # release by default
 cd <your path>/roctracer && mkdir build && cd build && cmake -DCMAKE_INSTALL_PREFIX=/opt/rocm .. && make -j <nproc>

- To build and run test
 make mytest
 run.sh

- To install
 make install
 or
 make package && dpkg -i *.deb