HIP graphs

Note

The HIP graph API is currently in Beta. Some features can change and might have outstanding issues. Not all features supported by CUDA graphs are yet supported. For a list of all currently supported functions, see the HIP graph API documentation.

HIP graphs are an alternative way of executing tasks on a GPU that can provide performance benefits over launching kernels using the standard method via streams. A HIP graph is made up of nodes and edges. The nodes of a HIP graph represent the operations performed, while the edges mark dependencies between those operations.

The nodes can be one of the following:

  • empty nodes

  • nested graphs

  • kernel launches

  • host-side function calls

  • HIP memory functions (copy, memset, …)

  • HIP events

  • signalling or waiting on external semaphores

Note

The available node types are specified by hipGraphNodeType.

The following figure visualizes the concept of graphs, compared to using streams.

Diagram depicting the difference between using streams to execute kernels with dependencies, resolved by explicitly synchronizing, or using graphs, where the edges denote the dependencies.

The standard method of launching kernels incurs a small overhead for each iteration of the operation involved. That overhead is negligible when the kernel is launched directly with the HIP C/C++ API, but depending on the framework used, there can be several levels of indirection before the actual kernel is launched by the HIP runtime, leading to significant overhead. Especially in some AI frameworks, a GPU kernel might run faster than the time it takes the framework to set up and launch it, so the overhead of repeatedly launching kernels can have a significant impact on performance.

HIP graphs are designed to address this issue by predefining the HIP API calls and their dependencies in a graph and performing most of the initialization beforehand. Launching a graph requires only a single call, after which the HIP runtime takes care of executing the operations within the graph. Graphs can provide additional performance benefits by enabling optimizations that are only possible when the dependencies between the operations are known.

Diagram depicting the speed up achievable with HIP graphs compared to HIP streams when launching many short-running kernels.

Qualitative presentation of the execution time of many short-running kernels when launched using HIP streams versus HIP graphs. This does not include the time needed to set up the graph.

Using HIP graphs

There are two ways of creating graphs: capturing kernel launches from a stream, or explicitly creating graphs. The difference between the two approaches is explained later in this chapter.

The general flow for using HIP graphs includes the following steps.

  1. Create a hipGraph_t graph template using one of the two approaches described in this chapter

  2. Create a hipGraphExec_t executable instance of the graph template using hipGraphInstantiate()

  3. Use hipGraphLaunch() to launch the executable graph to a stream

  4. After execution completes, free and destroy graph resources

The first two steps are the initial setup and only need to be executed once. The first step is the definition of the operations (nodes) and the dependencies (edges) between them. The second step is the instantiation of the graph, which validates and initializes the graph in order to reduce the overhead when it is executed. The third step is the execution of the graph, which launches all the kernels and executes the operations while respecting their dependencies and the necessary synchronizations as specified.

Because HIP graphs require some setup and initialization overhead before their first execution, graphs only provide a benefit for workloads that require many iterations to complete.

In both methods the hipGraph_t template is used to define the graph. In order to actually launch a graph, the template needs to be instantiated using hipGraphInstantiate(), which results in an executable graph of type hipGraphExec_t. This executable graph can then be launched with hipGraphLaunch(), replaying the operations within the graph. Note that launching a graph is fundamentally no different from executing other HIP functions on a stream: scheduling the operations within the graph incurs less overhead and can enable some optimizations, but the graph still needs to be associated with a stream for execution.
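
The following is a minimal sketch of this lifecycle. It assumes that graph is a hipGraph_t template that was already defined using one of the two approaches described below, and that iterations is the number of times the workload is repeated; error checking is omitted for brevity.

// Minimal sketch of the graph lifecycle; assumes an already defined hipGraph_t graph.
hipGraphExec_t graphExec;
hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
hipGraphDestroy(graph); // the template is no longer needed after instantiation

hipStream_t stream;
hipStreamCreate(&stream);
for(int i = 0; i < iterations; ++i){
    // The executable graph can be launched repeatedly without re-instantiation
    hipGraphLaunch(graphExec, stream);
}
hipStreamSynchronize(stream);

hipGraphExecDestroy(graphExec);
hipStreamDestroy(stream);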

Memory management

Memory that is used by operations in graphs can either be pre-allocated or managed within the graph. Graphs can contain nodes that take care of allocating memory on the device or copying memory between the host and the device. Whether you want to pre-allocate the memory or manage it within the graph depends on the use case. If the graph is executed in a tight loop, the performance is usually better when the memory is pre-allocated, so that it does not need to be reallocated in every iteration.

The same rules as for normal memory allocations apply to memory allocated and freed by nodes, meaning that nodes that access memory allocated in a graph must be ordered after the corresponding allocation node and before the corresponding free node.

Memory management within the graph enables the runtime to take care of memory reuse and optimizations. The lifetime of memory managed in a graph begins when execution reaches the node allocating the memory, and ends either when the corresponding free node within the graph is reached, or, after graph execution, when a corresponding hipFreeAsync() or hipFree() call is made. The memory can also be freed with a free node in a different graph that is associated with the same memory address.
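
The following fragment sketches the latter case: freeing graph-allocated memory after the graph has executed. It assumes that allocParams.dptr was set by the graph's allocation node, that the graph contains no corresponding free node, and that graphExec and stream have been set up as shown above; error checking is omitted.

// Freeing memory allocated by a graph node after the graph has executed
hipGraphLaunch(graphExec, stream);
hipFreeAsync(allocParams.dptr, stream); // ends the lifetime of the graph-allocated memory
hipStreamSynchronize(stream);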

Unlike device memory that is not associated with a graph, freed graph memory is not necessarily returned to the operating system immediately. Graphs can retain a memory pool for quickly reusing memory within the graph. This can be especially useful when memory is freed and reallocated later on within a graph, as that memory doesn't have to be requested from the operating system again. It also potentially reduces the total memory footprint of the graph, by reusing the same memory.

The amount of memory allocated for graph memory pools on a specific device can be queried using hipDeviceGetGraphMemAttribute(). In order to return the freed memory, use hipDeviceGraphMemTrim(). This returns any memory that is not in active use by graphs.
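
The following self-contained sketch queries the currently reserved pool size on device 0 and then trims it. hipGraphMemAttrReservedMemCurrent is one of the attributes defined by hipGraphMemAttributeType; error checking is omitted for brevity.

#include <hip/hip_runtime.h>
#include <iostream>

int main(){
    // Query the memory currently reserved for graph memory pools on device 0
    size_t reservedBytes = 0;
    hipDeviceGetGraphMemAttribute(0, hipGraphMemAttrReservedMemCurrent, &reservedBytes);
    std::cout << "Graph memory pools reserve " << reservedBytes << " bytes" << std::endl;

    // Return any memory that is not in active use by graphs to the operating system
    hipDeviceGraphMemTrim(0);
}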

These memory allocations can also be set up to allow access from multiple GPUs, just like normal allocations. HIP then takes care of allocating and mapping the memory to the GPUs. When capturing a graph from a stream, the node sets the accessibility according to hipMemPoolSetAccess() at the time of capturing.
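
For explicitly created allocation nodes, accessibility can be requested through the accessDescs and accessDescCount members of hipMemAllocNodeParams. The following is a sketch of such parameters, assuming a system with two devices, where the memory resides on device 0 and is additionally made accessible from device 1.

// Allocation node parameters for memory owned by device 0 that is
// also accessible from device 1 (assumes both devices exist)
hipMemAccessDesc accessDescs[2]{};
accessDescs[0].location.type = hipMemLocationTypeDevice;
accessDescs[0].location.id = 0;
accessDescs[0].flags = hipMemAccessFlagsProtReadWrite;
accessDescs[1].location.type = hipMemLocationTypeDevice;
accessDescs[1].location.id = 1;
accessDescs[1].flags = hipMemAccessFlagsProtReadWrite;

hipMemAllocNodeParams allocParams{};
allocParams.poolProps.allocType = hipMemAllocationTypePinned;
allocParams.poolProps.location.type = hipMemLocationTypeDevice;
allocParams.poolProps.location.id = 0; // GPU on which the memory resides
allocParams.accessDescs = accessDescs;
allocParams.accessDescCount = 2;
allocParams.bytesize = 1024 * sizeof(double);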

Capture graphs from a stream

The easiest way to integrate HIP graphs into existing code is to use hipStreamBeginCapture() and hipStreamEndCapture() to obtain a hipGraph_t graph template that includes the captured operations.

When starting to capture operations for a graph using hipStreamBeginCapture(), the operations assigned to the stream are captured into a graph instead of being executed. The associated graph is returned when calling hipStreamEndCapture(), which also stops capturing operations. In order to capture to an already existing graph use hipStreamBeginCaptureToGraph().

The functions assigned to the capturing stream are not executed, but instead are captured and defined as nodes in the graph, to be run when the instantiated graph is launched.

Functions must be associated with a stream in order to be captured. This means that non-HIP API functions are not captured by default, but are executed as standard functions when encountered and not added to the graph. In order to assign host functions to a stream, use hipLaunchHostFunc(), as shown in the following code example. They will then be captured and defined as a host node in the resulting graph, and won't be executed when encountered.

Synchronous HIP API calls that are implicitly assigned to the default stream are not permitted while capturing a stream, and will return an error. This is because they implicitly synchronize and cause a dependency that cannot be captured within the stream. This includes functions like hipMalloc(), hipMemcpy() and hipFree(). In order to capture these operations to the stream, replace them with the corresponding asynchronous calls like hipMallocAsync(), hipMemcpyAsync() or hipFreeAsync(), as sketched below.
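
The following before/after fragment is a sketch of this replacement, assuming d_array, h_array, numBytes and the capturing stream captureStream are defined elsewhere; error checking is omitted.

// Not capturable: hipMemcpy implicitly synchronizes the default stream.
// hipMemcpy(d_array, h_array, numBytes, hipMemcpyHostToDevice);

// Capturable: the asynchronous variant is explicitly assigned to the capturing stream.
hipMemcpyAsync(d_array, h_array, numBytes, hipMemcpyHostToDevice, captureStream);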

The general flow for using stream capture to create a graph template is:

  1. Create a stream from which to capture the operations

  2. Call hipStreamBeginCapture() before the first operation to be captured

  3. Call hipStreamEndCapture() after the last operation to be captured

    1. Define a hipGraph_t graph template to which hipStreamEndCapture() passes the captured graph

The following code is an example of how to use the HIP graph API to capture a graph from a stream.

#include <hip/hip_runtime.h>
#include <vector>
#include <iostream>

#define HIP_CHECK(expression)                \
{                                            \
    const hipError_t status = expression;    \
    if(status != hipSuccess){                \
            std::cerr << "HIP error "        \
                << status << ": "            \
                << hipGetErrorString(status) \
                << " at " << __FILE__ << ":" \
                << __LINE__ << std::endl;    \
    }                                        \
}


__global__ void kernelA(double* arrayA, size_t size){
    const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
    if(x < size){arrayA[x] *= 2.0;}
}
__global__ void kernelB(int* arrayB, size_t size){
    const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
    if(x < size){arrayB[x] = 3;}
}
__global__ void kernelC(double* arrayA, const int* arrayB, size_t size){
    const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
    if(x < size){arrayA[x] += arrayB[x];}
}

struct set_vector_args{
    std::vector<double>& h_array;
    double value;
};

void set_vector(void* args){
    set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};

    std::vector<double>& vec{h_args.h_array};
    vec.assign(vec.size(), h_args.value);
}

int main(){
    constexpr int numOfBlocks = 1024;
    constexpr int threadsPerBlock = 1024;
    constexpr size_t arraySize = 1U << 20;

    // This example assumes that kernelA operates on data that needs to be initialized on
    // and copied from the host, while kernelB initializes the array that is passed to it.
    // Both arrays are then used as input to kernelC, where arrayA is also used as
    // output that is copied back to the host, while arrayB is only read from and not modified.

    double* d_arrayA;
    int* d_arrayB;
    std::vector<double> h_array(arraySize);
    constexpr double initValue = 2.0;

    hipStream_t captureStream;
    HIP_CHECK(hipStreamCreate(&captureStream));

    // Start capturing the operations assigned to the stream
    HIP_CHECK(hipStreamBeginCapture(captureStream, hipStreamCaptureModeGlobal));

    // The asynchronous hipMallocAsync and hipMemcpyAsync are needed, so that the operations can be assigned to the stream and captured
    HIP_CHECK(hipMallocAsync(&d_arrayA, arraySize*sizeof(double), captureStream));
    HIP_CHECK(hipMallocAsync(&d_arrayB, arraySize*sizeof(int), captureStream));

    // Assign host function to the stream
    // Needs a custom struct to pass the arguments
    set_vector_args args{h_array, initValue};
    HIP_CHECK(hipLaunchHostFunc(captureStream, set_vector, &args));

    HIP_CHECK(hipMemcpyAsync(d_arrayA, h_array.data(), arraySize*sizeof(double), hipMemcpyHostToDevice, captureStream));

    kernelA<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, arraySize);
    kernelB<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayB, arraySize);
    kernelC<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, d_arrayB, arraySize);

    HIP_CHECK(hipMemcpyAsync(h_array.data(), d_arrayA, arraySize*sizeof(*d_arrayA), hipMemcpyDeviceToHost, captureStream));

    HIP_CHECK(hipFreeAsync(d_arrayA, captureStream));
    HIP_CHECK(hipFreeAsync(d_arrayB, captureStream));

    // Stop capturing
    hipGraph_t graph;
    HIP_CHECK(hipStreamEndCapture(captureStream, &graph));

    // Create an executable graph from the captured graph
    hipGraphExec_t graphExec;
    HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));

    // The graph template can be deleted after the instantiation if it's not needed for later use
    HIP_CHECK(hipGraphDestroy(graph));

    // Actually launch the graph. The stream does not have
    // to be the same as the one used for capturing.
    HIP_CHECK(hipGraphLaunch(graphExec, captureStream));

    // Wait for the graph to finish executing before accessing the results on the host
    HIP_CHECK(hipStreamSynchronize(captureStream));

    // Verify results
    constexpr double expected = initValue * 2.0 + 3;
    bool passed = true;
    for(size_t i = 0; i < arraySize; ++i){
        if(h_array[i] != expected){
            passed = false;
            std::cerr << "Validation failed! Expected " << expected << " got " << h_array[i] << std::endl;
            break;
        }
    }
    if(passed){
        std::cout << "Validation passed." << std::endl;
    }

    // Free graph and stream resources after usage
    HIP_CHECK(hipGraphExecDestroy(graphExec));
    HIP_CHECK(hipStreamDestroy(captureStream));
}

Explicit graph creation

Graphs can also be created directly using the HIP graph API, giving more fine-grained control over the graph. In this case, the graph nodes are created explicitly, together with their parameters and dependencies, which specify the edges of the graph, thereby forming the graph structure.

The nodes are represented by the generic hipGraphNode_t type. The actual node type is implicitly defined by the specific function used to add the node to the graph, for example hipGraphAddKernelNode(). See the HIP graph API documentation for the available functions; they are of the form hipGraphAdd{Type}Node. Each type of node also has a predefined set of parameters depending on the operation, for example hipKernelNodeParams for a kernel launch. See the documentation for the generic hipGraphNodeParams type for a list of available parameter types and their members.

The general flow for explicitly creating a graph is usually:

  1. Create a graph hipGraph_t

  2. Create the nodes and their parameters and add them to the graph

    1. Define a hipGraphNode_t

    2. Define the parameter struct for the desired operation by explicitly setting the appropriate struct's members.

    3. Use the appropriate hipGraphAdd{Type}Node function to add the node to the graph.

      1. The dependencies can be defined when adding the node to the graph, or afterwards by using hipGraphAddDependencies()

The following code example demonstrates how to explicitly create nodes in order to create a graph.

#include <hip/hip_runtime.h>
#include <vector>
#include <iostream>

#define HIP_CHECK(expression)                \
{                                            \
    const hipError_t status = expression;    \
    if(status != hipSuccess){                \
            std::cerr << "HIP error "        \
                << status << ": "            \
                << hipGetErrorString(status) \
                << " at " << __FILE__ << ":" \
                << __LINE__ << std::endl;    \
    }                                        \
}

__global__ void kernelA(double* arrayA, size_t size){
    const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
    if(x < size){arrayA[x] *= 2.0;}
}
__global__ void kernelB(int* arrayB, size_t size){
    const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
    if(x < size){arrayB[x] = 3;}
}
__global__ void kernelC(double* arrayA, const int* arrayB, size_t size){
    const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
    if(x < size){arrayA[x] += arrayB[x];}
}

struct set_vector_args{
    std::vector<double>& h_array;
    double value;
};

void set_vector(void* args){
    set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};

    std::vector<double>& vec{h_args.h_array};
    vec.assign(vec.size(), h_args.value);
}

int main(){
    constexpr int numOfBlocks = 1024;
    constexpr int threadsPerBlock = 1024;
    // Not declared constexpr, because a non-const pointer to it is passed as a kernel argument below
    size_t arraySize = 1U << 20;

    // The pointers to the device memory don't need to be declared here,
    // they are contained within the hipMemAllocNodeParams as the dptr member
    std::vector<double> h_array(arraySize);
    constexpr double initValue = 2.0;

    // Create an empty graph
    hipGraph_t graph;
    HIP_CHECK(hipGraphCreate(&graph, 0));

    // Parameters to allocate arrays
    hipMemAllocNodeParams allocArrayAParams{};
    allocArrayAParams.poolProps.allocType = hipMemAllocationTypePinned;
    allocArrayAParams.poolProps.location.type = hipMemLocationTypeDevice;
    allocArrayAParams.poolProps.location.id = 0; // GPU on which memory resides
    allocArrayAParams.bytesize = arraySize * sizeof(double);

    hipMemAllocNodeParams allocArrayBParams{};
    allocArrayBParams.poolProps.allocType = hipMemAllocationTypePinned;
    allocArrayBParams.poolProps.location.type = hipMemLocationTypeDevice;
    allocArrayBParams.poolProps.location.id = 0; // GPU on which memory resides
    allocArrayBParams.bytesize = arraySize * sizeof(int);

    // Add the allocation nodes to the graph. They don't have any dependencies
    hipGraphNode_t allocNodeA, allocNodeB;
    HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeA, graph, nullptr, 0, &allocArrayAParams));
    HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeB, graph, nullptr, 0, &allocArrayBParams));

    // Parameters for the host function
    // Needs custom struct to pass the arguments
    set_vector_args args{h_array, initValue};
    hipHostNodeParams hostParams{};
    hostParams.fn = set_vector;
    hostParams.userData = static_cast<void*>(&args);

    // Add the host node that initializes the host array. It also doesn't have any dependencies
    hipGraphNode_t hostNode;
    HIP_CHECK(hipGraphAddHostNode(&hostNode, graph, nullptr, 0, &hostParams));

    // Add memory copy node, that copies the initialized host array to the device.
    // It has to wait for the host array to be initialized and the device memory to be allocated
    hipGraphNode_t cpyNodeDependencies[] = {allocNodeA, hostNode};
    hipGraphNode_t cpyToDevNode;
    HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToDevNode, graph, cpyNodeDependencies, 2, allocArrayAParams.dptr, h_array.data(), arraySize * sizeof(double), hipMemcpyHostToDevice));

    // Parameters for kernelA
    hipKernelNodeParams kernelAParams;
    void* kernelAArgs[] = {&allocArrayAParams.dptr, static_cast<void*>(&arraySize)};
    kernelAParams.func = reinterpret_cast<void*>(kernelA);
    kernelAParams.gridDim = numOfBlocks;
    kernelAParams.blockDim = threadsPerBlock;
    kernelAParams.sharedMemBytes = 0;
    kernelAParams.kernelParams = kernelAArgs;
    kernelAParams.extra = nullptr;

    // Add the node for kernelA. It has to wait for the memory copy to finish, as it depends on the values from the host array.
    hipGraphNode_t kernelANode;
    HIP_CHECK(hipGraphAddKernelNode(&kernelANode, graph, &cpyToDevNode, 1, &kernelAParams));

    // Parameters for kernelB
    hipKernelNodeParams kernelBParams;
    void* kernelBArgs[] = {&allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
    kernelBParams.func = reinterpret_cast<void*>(kernelB);
    kernelBParams.gridDim = numOfBlocks;
    kernelBParams.blockDim = threadsPerBlock;
    kernelBParams.sharedMemBytes = 0;
    kernelBParams.kernelParams = kernelBArgs;
    kernelBParams.extra = nullptr;

    // Add the node for kernelB. It only has to wait for the memory to be allocated, as it initializes the array.
    hipGraphNode_t kernelBNode;
    HIP_CHECK(hipGraphAddKernelNode(&kernelBNode, graph, &allocNodeB, 1, &kernelBParams));

    // Parameters for kernelC
    hipKernelNodeParams kernelCParams;
    void* kernelCArgs[] = {&allocArrayAParams.dptr, &allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
    kernelCParams.func = reinterpret_cast<void*>(kernelC);
    kernelCParams.gridDim = numOfBlocks;
    kernelCParams.blockDim = threadsPerBlock;
    kernelCParams.sharedMemBytes = 0;
    kernelCParams.kernelParams = kernelCArgs;
    kernelCParams.extra = nullptr;

    // Add the node for kernelC. It has to wait on both kernelA and kernelB to finish, as it depends on their results.
    hipGraphNode_t kernelCNode;
    hipGraphNode_t kernelCDependencies[] = {kernelANode, kernelBNode};
    HIP_CHECK(hipGraphAddKernelNode(&kernelCNode, graph, kernelCDependencies, 2, &kernelCParams));

    // Copy the results back to the host. Has to wait for kernelC to finish.
    hipGraphNode_t cpyToHostNode;
    HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToHostNode, graph, &kernelCNode, 1, h_array.data(), allocArrayAParams.dptr, arraySize * sizeof(double), hipMemcpyDeviceToHost));

    // Free array of allocNodeA. It needs to wait for the copy to finish, as kernelC stores its results in it.
    hipGraphNode_t freeNodeA;
    HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeA, graph, &cpyToHostNode, 1, allocArrayAParams.dptr));
    // Free array of allocNodeB. It only needs to wait for kernelC to finish, as it is not written back to the host.
    hipGraphNode_t freeNodeB;
    HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeB, graph, &kernelCNode, 1, allocArrayBParams.dptr));

    // Instantiate the graph in order to execute it
    hipGraphExec_t graphExec;
    HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));

    // The graph can be freed after the instantiation if it's not needed for other purposes
    HIP_CHECK(hipGraphDestroy(graph));

    // Actually launch the graph
    hipStream_t graphStream;
    HIP_CHECK(hipStreamCreate(&graphStream));
    HIP_CHECK(hipGraphLaunch(graphExec, graphStream));

    // Wait for the graph to finish executing before accessing the results on the host
    HIP_CHECK(hipStreamSynchronize(graphStream));

    // Verify results
    constexpr double expected = initValue * 2.0 + 3;
    bool passed = true;
    for(size_t i = 0; i < arraySize; ++i){
        if(h_array[i] != expected){
            passed = false;
            std::cerr << "Validation failed! Expected " << expected << " got " << h_array[i] << std::endl;
            break;
        }
    }
    if(passed){
        std::cout << "Validation passed." << std::endl;
    }

    HIP_CHECK(hipGraphExecDestroy(graphExec));
    HIP_CHECK(hipStreamDestroy(graphStream));
}