AMD ROCm Profiler
Overview
Profiling Modes
‘rocprof’ can be used for GPU profiling using HW counters and application tracing.
GPU profiling
GPU profiling is controlled with an input file which defines a list of metrics/counters and a profiling scope. The input file is provided using the option '-i <input file>'. An output CSV file is generated with one line per submitted kernel; each line contains the kernel name, kernel parameters, and counter values. With the option '--stats', kernel execution stats can be generated in CSV format. Currently, profiling has the limitation that submitted kernels are serialized. An example of an input file:
# Perf counters group 1
pmc : Wavefronts VALUInsts SALUInsts SFetchInsts
# Perf counters group 2
pmc : TCC_HIT[0], TCC_MISS[0]
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 1 : 4
gpu: 0 1 2 3
kernel: simple Pass1 simpleConvolutionPass2
An example of a profiling command line for the 'MatrixTranspose' application:
$ rocprof -i input.txt MatrixTranspose
RPL: on '191018_011134' from '/…./rocprofiler_pkg' in '/…./MatrixTranspose'
RPL: profiling '"./MatrixTranspose"'
RPL: input file 'input.txt'
RPL: output dir '/tmp/rpl_data_191018_011134_9695'
RPL: result dir '/tmp/rpl_data_191018_011134_9695/input0_results_191018_011134'
ROCProfiler: rc-file '/…./rpl_rc.xml'
ROCProfiler: input from "/tmp/rpl_data_191018_011134_9695/input0.xml"
gpu_index =
kernel =
range =
4 metrics
L2CacheHit, VFetchInsts, VWriteInsts, MemUnitStalled
0 traces
Device name Ellesmere [Radeon RX 470/480/570/570X/580/580X]
PASSED!
ROCProfiler: 1 contexts collected, output directory
/tmp/rpl_data_191018_011134_9695/input0_results_191018_011134
RPL: '/…./MatrixTranspose/input.csv' is generated
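Kernel execution stats can additionally be generated with the '--stats' option; per the option description below, the stats are written to a separate file '<output name>.stats.csv':
$ rocprof --stats -i input.txt MatrixTranspose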
Counters and metrics
There are two profiling features: metrics and traces. Hardware performance counters are treated as the basic metrics, and formulas can be defined for derived metrics. Counters and metrics can be dynamically configured using XML configuration files with counters and metrics tables:
Counters table entry, basic metric: counter name, block name, event id
Derived metrics table entry: metric name, an expression for calculating the metric from the counters
Metrics XML File Example:
<gfx8>
<metric name=L1_CYCLES_COUNTER block=L1 event=0 descr="L1 cache cycles"></metric>
<metric name=L1_MISS_COUNTER block=L1 event=33 descr="L1 cache misses"></metric>
. . .
</gfx8>
<gfx9>
. . .
</gfx9>
<global>
<metric
name=L1_MISS_RATIO
expr=L1_MISS_COUNTER/L1_CYCLES_COUNTER
descr="L1 miss rate metric"
></metric>
</global>
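Once defined, a derived metric can be requested by name in the profiler input file like any other metric; for example, for the 'L1_MISS_RATIO' metric defined above:
pmc : L1_MISS_RATIO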
Metrics query
Available counters and metrics can be queried with the options '--list-basic' for counters and '--list-derived' for derived metrics. The output for counters indicates the number of block instances and the number of block counter registers. The output for derived metrics prints the metrics' expressions. Examples:
$ rocprof --list-basic
RPL: on '191018_014450' from '/opt/rocm/rocprofiler' in '/…./MatrixTranspose'
ROCProfiler: rc-file '/…./rpl_rc.xml'
Basic HW counters:
gpu-agent0 : GRBM_COUNT : Tie High - Count Number of Clocks
block GRBM has 2 counters
gpu-agent0 : GRBM_GUI_ACTIVE : The GUI is Active
block GRBM has 2 counters
. . .
gpu-agent0 : TCC_HIT[0-15] : Number of cache hits.
block TCC has 4 counters
gpu-agent0 : TCC_MISS[0-15] : Number of cache misses. UC reads count as misses.
block TCC has 4 counters
. . .
$ rocprof --list-derived
RPL: on '191018_015911' from '/opt/rocm/rocprofiler' in '/…./MatrixTranspose'
ROCProfiler: rc-file '/…./rpl_rc.xml'
Derived metrics:
gpu-agent0 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
TCC_HIT_sum = sum(TCC_HIT,16)
gpu-agent0 : TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
TCC_MISS_sum = sum(TCC_MISS,16)
gpu-agent0 : TCC_MC_RDREQ_sum : Number of 32-byte reads. Sum over TCC instances.
TCC_MC_RDREQ_sum = sum(TCC_MC_RDREQ,16)
. . .
Metrics collecting
Counters and metrics accumulated per kernel can be collected using an input file with a list of metrics; see the example in 2.1. Currently, profiling has the limitation that submitted kernels are serialized. The number of counters which can be dumped in one run is limited by the GPU HW, namely by the number of counter registers per block. The number of counters can differ between blocks and can be queried; see 2.1.1.1.
Blocks instancing
GPU blocks are implemented as several identical instances. To dump counters of a specific instance, square brackets can be used; see the example in 2.1. The number of block instances can be queried; see 2.1.1.1.
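For example, to dump the cache hits of TCC instances 0 and 1 only:
pmc : TCC_HIT[0] TCC_HIT[1]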
HW limitations
The number of counters which can be dumped in one run is limited by the GPU HW, namely by the number of counter registers per block. The number of counters can differ between blocks and can be queried; see 2.1.1.1.
Metrics groups
To dump a list of metrics exceeding the HW limitations, the metrics list can be split into groups. The tool supports automatic splitting into optimal metric groups:
$ rocprof -i input.txt ./MatrixTranspose
RPL: on '191018_032645' from '/opt/rocm/rocprofiler' in '/…./MatrixTranspose'
RPL: profiling './MatrixTranspose'
RPL: input file 'input.txt'
RPL: output dir '/tmp/rpl_data_191018_032645_12106'
RPL: result dir '/tmp/rpl_data_191018_032645_12106/input0_results_191018_032645'
ROCProfiler: rc-file '/…./rpl_rc.xml'
ROCProfiler: input from "/tmp/rpl_data_191018_032645_12106/input0.xml"
gpu_index =
kernel =
range =
20 metrics
Wavefronts, VALUInsts, SALUInsts, SFetchInsts, FlatVMemInsts, LDSInsts, FlatLDSInsts, GDSInsts, VALUUtilization, FetchSize, WriteSize, L2CacheHit, VWriteInsts, GPUBusy, VALUBusy, SALUBusy, MemUnitStalled, WriteUnitStalled, LDSBankConflict, MemUnitBusy
0 traces
Device name Ellesmere [Radeon RX 470/480/570/570X/580/580X]
Input metrics out of HW limit. Proposed metrics group set:
group1: L2CacheHit VWriteInsts MemUnitStalled WriteUnitStalled MemUnitBusy FetchSize FlatVMemInsts LDSInsts VALUInsts SALUInsts SFetchInsts FlatLDSInsts GPUBusy Wavefronts
group2: WriteSize GDSInsts VALUUtilization VALUBusy SALUBusy LDSBankConflict
ERROR: rocprofiler_open(), Construct(), Metrics list exceeds HW limits
Aborted (core dumped)
Error found, profiling aborted.
Collecting with multiple runs
To collect several metric groups, a full application replay is used by defining several 'pmc:' lines in the input file; see 2.1.
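For example, the 20-metric list from the previous example can be collected with the proposed group split by defining one 'pmc:' line per group, causing the application to be rerun for each group:
# Metrics group 1
pmc : L2CacheHit VWriteInsts MemUnitStalled WriteUnitStalled MemUnitBusy FetchSize FlatVMemInsts LDSInsts VALUInsts SALUInsts SFetchInsts FlatLDSInsts GPUBusy Wavefronts
# Metrics group 2
pmc : WriteSize GDSInsts VALUUtilization VALUBusy SALUBusy LDSBankConflict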
Application tracing
Supported application tracing includes runtime API and GPU activity tracing. Supported runtimes are ROCr (HSA API) and HIP. Supported GPU activity includes kernel execution, async memory copy, and barrier packets. The trace is generated in JSON format compatible with Chrome tracing. The trace consists of several sections with timelines for the API trace per thread and the GPU activity. The timeline events show the event name and parameters. Supported options: '--hsa-trace', '--hip-trace', '--sys-trace', where the 'sys trace' is a combined HIP and HSA trace.
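An example of a command line producing a HIP trace as a Chrome-tracing compatible JSON file:
$ rocprof --hip-trace ./MatrixTranspose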
HIP runtime trace
The trace is generated by the option '--hip-trace' and includes HIP API timelines and GPU activity at the runtime level.
ROCr runtime trace
The trace is generated by the option '--hsa-trace' and includes ROCr API timelines and GPU activity at the AQL queue level. It can also provide counters per kernel.
KFD driver trace
The trace is generated by the option '--kfd-trace' and includes KFD Thunk API timelines.
It is planned to include memory allocation/migration activity tracing.
Code annotation
The profiler supports application code annotation: a start/stop API is provided to programmatically control the profiling, and the 'roctx' library provides an annotation API. Annotations are visualized in the JSON trace as a separate "Markers and Ranges" timeline section.
Start/stop API
// Tracing start API
void roctracer_start();
// Tracing stop API
void roctracer_stop();
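A minimal usage sketch; the start/stop declarations above are assumed to come from the roctracer extension header (the 'roctracer_ext.h' header name is an assumption here). Combined with '--trace-start off', only the region between the calls is traced:
#include <roctracer_ext.h> // assumed header declaring roctracer_start/stop

void compute() { /* kernel launches to be profiled */ }

int main() {
    compute();          // not traced when run with --trace-start off
    roctracer_start();  // begin collecting trace data
    compute();          // region of interest, captured in the trace
    roctracer_stop();   // stop collecting trace data
    return 0;
}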
rocTX basic markers API
// A marker created by the given ASCII message
void roctxMark(const char* message);
// Returns the 0-based level of the nested range being
// started by the given message associated with this range.
// A negative value is returned on error.
int roctxRangePush(const char* message);
// Marks the end of a nested range.
// Returns the 0-based level of the range.
// A negative value is returned on error.
int roctxRangePop();
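A minimal annotation sketch using the API above (the 'roctx.h' header name is an assumption); with the '--roctx-trace' option the markers and ranges appear in the "Markers and Ranges" section of the JSON trace:
#include <roctx.h> // assumed header declaring the rocTX annotation API

void run_pass() { /* work to annotate */ }

int main() {
    roctxMark("init done");        // an instant marker on the timeline
    roctxRangePush("outer pass");  // returns 0, the first nesting level
    roctxRangePush("inner pass");  // returns 1
    run_pass();
    roctxRangePop();               // closes 'inner pass'
    roctxRangePop();               // closes 'outer pass'
    return 0;
}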
Multiple GPUs profiling
The profiler supports profiling on multiple GPUs and provides the GPU id for the counters and kernels data in the CSV output file. The GPU id is also indicated on the respective GPU activity timeline in the JSON trace.
Profiling control
Profiling can be controlled by specifying a profiling scope, by filtering trace events, and by specifying interesting time intervals.
Profiling scope
The counters profiling scope can be specified by a GPU id list, a list of kernel name substrings, and a dispatch range. Supported range format examples: "3:9", "3:", "3". An example of an input file can be found in 2.1.
Tracing control
Tracing can be filtered by event names using the profiler input file, and interesting time intervals can be enabled by a command line option.
Filtering Traced APIs
A list of traced API names can be specified in the profiler input file. An example of an input file line for the ROCr runtime trace (HSA API):
hsa:hsa_queue_create hsa_amd_memory_pool_allocate
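The filtered trace is then collected by passing the input file together with the tracing option:
$ rocprof --hsa-trace -i input.txt ./MatrixTranspose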
Tracing period
Tracing can be disabled at start so that it can be enabled later with the start/stop API:
--trace-start <on|off>
The trace can be dumped periodically with an initial delay, a dumping period length, and a rate:
--trace-period <delay:length:rate>
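A hypothetical example using the time formats from the options section below: after a 10 ms initial delay, samples of 100 ms length are dumped once per second:
$ rocprof --hip-trace --trace-period 10ms:100ms:1s ./MatrixTranspose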
Concurrent kernels
Concurrent kernels profiling is not currently supported; it is a planned feature. Kernels are serialized.
Multi-process profiling
Multi-process profiling is not currently supported.
Error logging
Profiler errors are logged to global logs:
/tmp/aql_profile_log.txt
/tmp/rocprofiler_log.txt
/tmp/roctracer_log.txt
3rd party visualization tools
'rocprof' produces a JSON trace compatible with Chrome Tracing, the trace visualization tool built into Google Chrome (chrome://tracing).
For more information about Chrome Tracing, see https://aras-p.info/blog/2017/01/23/Chrome-Tracing-as-Profiler-Frontend/
Runtime Environment Setup
You must add the ROCm 'bin' directory to the 'PATH' environment variable so the profiler can find the correct ROCm setup and query the ROCm info metadata. For example: export PATH=$PATH:/opt/rocm/bin
Command line options
The command line options can be printed with the option '-h':
rocprof [-h] [--list-basic] [--list-derived] [-i <input .txt/.xml file>]
[-o <output CSV file>] <app command line>
Options:
-h - this help
--verbose - verbose mode, dumping all base counters used in the input metrics
--list-basic - to print the list of basic HW counters
--list-derived - to print the list of derived metrics with formulas
--cmd-qts <on|off> - quoting profiled cmd line [on]
-i <.txt|.xml file> - input file
Input file .txt format; the application is automatically rerun for every pmc line:
# Perf counters group 1
pmc : Wavefronts VALUInsts SALUInsts SFetchInsts FlatVMemInsts LDSInsts
FlatLDSInsts GDSInsts FetchSize
# Perf counters group 2
pmc : VALUUtilization,WriteSize L2CacheHit
# Filter by dispatches range, GPU index and kernel names
# supported range formats: "3:9", "3:", "3"
range: 1 : 4
gpu: 0 1 2 3
kernel: simple Pass1 simpleConvolutionPass2
Input file .xml format, for single profiling run:
# Metrics list definition, also the form "<block-name>:<event-id>" can be used
# All defined metrics can be found in the 'metrics.xml'
# There are basic metrics for raw HW counters & high-level metrics for derived counters
<metric name=SQ:4,SQ_WAVES,VFetchInsts
></metric>
# Filter by dispatches range, GPU index and kernel names
<metric
# range formats: "3:9", "3:", "3"
range=""
# list of gpu indexes "0,1,2,3"
gpu_index=""
# list of matched sub-strings "Simple1,Conv1,SimpleConvolution"
kernel=""
></metric>
-o <output file> - output CSV file [<input file base>.csv]
The meaning of the output CSV file columns, in column order:
Index - kernels dispatch order index
KernelName - the dispatched kernel name
gpu-id - GPU id the kernel was submitted to
queue-id - the ROCm queue unique id the kernel was submitted to
queue-index - The ROCm queue write index for the submitted AQL packet
tid - system application thread id which submitted the kernel
grd - the kernel's grid size
wgr - the kernel's work group size
lds - the kernel's LDS memory size
scr - the kernel's scratch memory size
vgpr - the kernel's VGPR size
sgpr - the kernel's SGPR size
fbar - the kernel's barriers limitation
sig - the kernel's completion signal
... - The columns with the counters values per kernel dispatch
DispatchNs/BeginNs/EndNs/CompleteNs - timestamp columns if time-stamping was enabled
-d <data directory> - directory where the profiler stores profiling data including
                       thread traces [/tmp]. The data directory is removed
                       automatically if it matches the temporary one, which
                       is the default.
-t <temporary directory> - to change the temporary directory [/tmp].
                       By changing the temporary directory you can prevent
                       removing the profiling data from /tmp, or enable
                       removing from a non-'/tmp' directory.
--basenames <on|off> - to turn on/off truncating the full kernel function
                       names to the base ones [off]
--timestamp <on|off> - to turn on/off the kernel dispatches timestamps,
dispatch/begin/end/complete [off]
Four kernel timestamps in nanoseconds are reported:
DispatchNs - the time when the kernel AQL dispatch packet was written to the queue
BeginNs - the kernel execution begin time
EndNs - the kernel execution end time
CompleteNs - the time when the completion signal of the AQL dispatch
packet was received
--ctx-limit <max number> - maximum number of outstanding contexts [0 - unlimited]
--heartbeat <rate sec> - to print progress heartbeats [0 - disabled]
--stats - generating kernel execution stats, file <output name>.stats.csv
--roctx-trace - to enable the rocTX application code annotation trace.
                       The application code annotation is shown in the JSON trace
                       "Markers and Ranges" section.
--sys-trace - to trace HIP/HSA APIs and GPU activity; generates stats and a
                       chrome-tracing compatible JSON trace
--hip-trace - to trace HIP; generates API execution stats and a
                       chrome-tracing compatible JSON file
--hsa-trace - to trace HSA; generates API execution stats and a
                       chrome-tracing compatible JSON file
--kfd-trace - to trace KFD; generates API execution stats and a
                       chrome-tracing compatible JSON file
Generated files: <output name>.<domain>_stats.txt <output name>.json
Traced API list can be set by input .txt or .xml files.
Input .txt:
hsa: hsa_queue_create hsa_amd_memory_pool_allocate
Input .xml:
<trace name="HSA">
<parameters list="hsa_queue_create, hsa_amd_memory_pool_allocate">
</parameters>
</trace>
--trace-start <on|off> - to enable tracing on start [on]
--trace-period <delay:length:rate> - to enable trace with an initial delay, with
                       periodic sample length and rate. Supported time formats: <number(m|s|ms|us)>
--obj-tracking <on|off> - to turn on/off kernels code objects tracking [off]
To support V3 code objects.
Configuration file:
You can set your default parameter preferences in the configuration file
'rpl_rc.xml'. The search path sequence: .:<home directory>:<package path>
The configuration file is looked for first in the current directory, then in your
home directory, and then in the package directory. Configurable options: 'basenames',
'timestamp', 'ctx-limit', 'heartbeat', 'obj-tracking'.
An example of 'rpl_rc.xml':
<defaults
basenames=off
timestamp=off
ctx-limit=0
heartbeat=0
obj-tracking=off
></defaults>
Publicly available counters and metrics
The following counters are publicly available for commercially available VEGA10/20 GPUs.
Counters:
• GRBM_COUNT : Tie High - Count Number of Clocks
• GRBM_GUI_ACTIVE : The GUI is Active
• SQ_WAVES : Count number of waves sent to SQs. (per-simd, emulated, global)
• SQ_INSTS_VALU : Number of VALU instructions issued. (per-simd, emulated)
• SQ_INSTS_VMEM_WR : Number of VMEM write instructions issued (including FLAT).
(per-simd, emulated)
• SQ_INSTS_VMEM_RD : Number of VMEM read instructions issued (including FLAT).
(per-simd, emulated)
• SQ_INSTS_SALU : Number of SALU instructions issued. (per-simd, emulated)
• SQ_INSTS_SMEM : Number of SMEM instructions issued. (per-simd, emulated)
• SQ_INSTS_FLAT : Number of FLAT instructions issued. (per-simd, emulated)
• SQ_INSTS_FLAT_LDS_ONLY : Number of FLAT instructions issued that read/wrote
only from/to LDS (only works if EARLY_TA_DONE is
enabled). (per-simd, emulated)
• SQ_INSTS_LDS : Number of LDS instructions issued (including FLAT).
(per-simd, emulated)
• SQ_INSTS_GDS : Number of GDS instructions issued. (per-simd, emulated)
• SQ_WAIT_INST_LDS : Number of wave-cycles spent waiting for LDS instruction
issue. In units of 4 cycles. (per-simd, nondeterministic)
• SQ_ACTIVE_INST_VALU : regspec 71? Number of cycles the SQ instruction
arbiter is working on a VALU instruction. (per-simd,
nondeterministic)
• SQ_INST_CYCLES_SALU : Number of cycles needed to execute non-memory read
scalar operations. (per-simd, emulated)
• SQ_THREAD_CYCLES_VALU : Number of thread-cycles used to execute VALU
operations (similar to INST_CYCLES_VALU but
multiplied by # of active threads). (per-simd)
• SQ_LDS_BANK_CONFLICT : Number of cycles LDS is stalled by bank conflicts.
(emulated)
• TA_TA_BUSY[0-15] : TA block is busy. Perf_Windowing not supported for
this counter.
• TA_FLAT_READ_WAVEFRONTS[0-15] : Number of flat opcode reads processed by the TA.
• TA_FLAT_WRITE_WAVEFRONTS[0-15] : Number of flat opcode writes processed by the TA.
• TCC_HIT[0-15] : Number of cache hits.
• TCC_MISS[0-15] : Number of cache misses. UC reads count as misses.
• TCC_EA_WRREQ[0-15] : Number of transactions (either 32-byte or 64-byte)
going over the TC_EA_wrreq interface. Atomics may
travel over the same interface and are generally
classified as write requests. This does not include
probe commands.
• TCC_EA_WRREQ_64B[0-15] : Number of 64-byte transactions going (64-byte
write or CMPSWAP) over the TC_EA_wrreq interface.
• TCC_EA_WRREQ_STALL[0-15] : Number of cycles a write request was stalled.
• TCC_EA_RDREQ[0-15] : Number of TCC/EA read requests (either 32-byte or 64-byte)
• TCC_EA_RDREQ_32B[0-15] : Number of 32-byte TCC/EA read requests
• TCP_TCP_TA_DATA_STALL_CYCLES[0-15] : TCP stalls TA data interface. Now Windowed.
The following derived metrics have been defined; the profiler metrics XML specification can be found at https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/test/tool/metrics.xml.
Metrics:
• TA_BUSY_avr : TA block is busy. Average over TA instances.
• TA_BUSY_max : TA block is busy. Max over TA instances.
• TA_BUSY_min : TA block is busy. Min over TA instances.
• TA_FLAT_READ_WAVEFRONTS_sum : Number of flat opcode reads processed by
the TA. Sum over TA instances.
• TA_FLAT_WRITE_WAVEFRONTS_sum : Number of flat opcode writes processed by
the TA. Sum over TA instances.
• TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
• TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
• TCC_EA_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over
TCC instances.
• TCC_EA_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or
64-byte). Sum over TCC instances.
• TCC_EA_WRREQ_sum : Number of transactions (either 32-byte or 64-byte)
going over the TC_EA_wrreq interface. Sum over TCC instances.
• TCC_EA_WRREQ_64B_sum : Number of 64-byte transactions going (64-byte write
or CMPSWAP) over the TC_EA_wrreq interface. Sum over
TCC instances.
• TCC_WRREQ_STALL_max : Number of cycles a write request was stalled. Max over
TCC instances.
• TCC_MC_WRREQ_sum : Number of 32-byte effective writes. Sum over TCC instances.
• FETCH_SIZE : The total kilobytes fetched from the video memory. This is
measured with all extra fetches and any cache or memory effects
taken into account.
• WRITE_SIZE : The total kilobytes written to the video memory. This is measured
with all extra fetches and any cache or memory effects taken
into account.
• GPUBusy : The percentage of time GPU was busy.
• Wavefronts : Total wavefronts.
• VALUInsts : The average number of vector ALU instructions executed per work-item
(affected by flow control).
• SALUInsts : The average number of scalar ALU instructions executed per work-item
(affected by flow control).
• VFetchInsts : The average number of vector fetch instructions from the video
memory executed per work-item (affected by flow control). Excludes
FLAT instructions that fetch from video memory.
• SFetchInsts : The average number of scalar fetch instructions from the video
memory executed per work-item (affected by flow control).
• VWriteInsts : The average number of vector write instructions to the video
memory executed per work-item (affected by flow control).
Excludes FLAT instructions that write to video memory.
• FlatVMemInsts : The average number of FLAT instructions that read from or write
to the video memory executed per work item (affected by flow control).
Includes FLAT instructions that read from or write to scratch.
• LDSInsts : The average number of LDS read or LDS write instructions executed
per work item (affected by flow control). Excludes FLAT instructions
that read from or write to LDS.
• FlatLDSInsts : The average number of FLAT instructions that read or write
to LDS executed per work item (affected by flow control).
• GDSInsts : The average number of GDS read or GDS write instructions executed
per work item (affected by flow control).
• VALUUtilization : The percentage of active vector ALU threads in a wave. A
lower number can mean either more thread divergence in a
wave or that the work-group size is not a multiple of 64.
Value range: 0% (bad), 100% (ideal - no thread divergence).
• VALUBusy : The percentage of GPUTime during which vector ALU instructions are
processed. Value range: 0% (bad) to 100% (optimal).
• SALUBusy : The percentage of GPUTime during which scalar ALU instructions are
processed. Value range: 0% (bad) to 100% (optimal).
• Mem32Bwrites :
• FetchSize : The total kilobytes fetched from the video memory. This is measured
with all extra fetches and any cache or memory effects taken into account.
• WriteSize : The total kilobytes written to the video memory. This is measured
with all extra fetches and any cache or memory effects taken into account.
• L2CacheHit : The percentage of fetch, write, atomic, and other instructions that
hit the data in L2 cache. Value range: 0% (no hit) to 100% (optimal).
• MemUnitBusy : The percentage of GPUTime the memory unit is active. The result
includes the stall time (MemUnitStalled). This is measured with all
extra fetches and writes and any cache or memory effects taken into
account. Value range: 0% to 100% (fetch-bound).
• MemUnitStalled : The percentage of GPUTime the memory unit is stalled. Try
reducing the number or size of fetches and writes if possible.
Value range: 0% (optimal) to 100% (bad).
• WriteUnitStalled : The percentage of GPUTime the Write unit is stalled.
Value range: 0% to 100% (bad).
• ALUStalledByLDS : The percentage of GPUTime ALU units are stalled by the LDS input
queue being full or the output queue being not ready. If there
are LDS bank conflicts, reduce them. Otherwise, try reducing the
number of LDS accesses if possible. Value range: 0% (optimal) to
100% (bad).
• LDSBankConflict : The percentage of GPUTime LDS is stalled by bank conflicts. Value
range: 0% (optimal) to 100% (bad).
AMD ROCProfiler API
The ROC Profiler library supports profiling with performance counters and derived metrics on GFX8/GFX9. It provides a HW-specific, low-level performance analysis interface for profiling GPU compute applications, covering HW performance counters and complex derived performance metrics.
GitHub: https://github.com/ROCm-Developer-Tools/rocprofiler
Metrics
API specification
To get sources
To clone ROC Profiler from GitHub:
git clone https://github.com/ROCm-Developer-Tools/rocprofiler
The library source tree:
* bin
* rocprof - Profiling tool run script
* doc - Documentation
* inc/rocprofiler.h - Library public API
* src - Library sources
* core - Library API sources
* util - Library utils sources
* xml - XML parser
* test - Library test suite
* tool - Profiling tool
* tool.cpp - tool sources
* metrics.xml - metrics config file
* ctrl - Test control
* util - Test utils
* simple_convolution - Simple convolution test kernel
Build
Build environment:
export CMAKE_PREFIX_PATH=<path to hsa-runtime includes>:<path to hsa-runtime library>
export CMAKE_BUILD_TYPE=<debug|release> # release by default
export CMAKE_DEBUG_TRACE=1 # to enable debug tracing
To build with the currently installed ROCm and install to /opt/rocm/rocprofiler:
export CMAKE_PREFIX_PATH=/opt/rocm/include/hsa:/opt/rocm
cd ../rocprofiler
mkdir build
cd build
cmake ..
make
make install
Internal ‘simple_convolution’ test run script:
cd ../rocprofiler/build
./run.sh
To enable error message logging to '/tmp/rocprofiler_log.txt':
export ROCPROFILER_LOG=1
To enable verbose tracing:
export ROCPROFILER_TRACE=1
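For example, to run the internal test with both error logging and verbose tracing enabled:
export ROCPROFILER_LOG=1
export ROCPROFILER_TRACE=1
cd ../rocprofiler/build
./run.sh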