Cooperative groups#

Cooperative kernel launches#

The following host-side functions are used for cooperative kernel launches.

hipError_t hipModuleLaunchCooperativeKernel(hipFunction_t f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, hipStream_t stream, void **kernelParams)#

launches kernel f with launch parameters and shared memory on stream with arguments passed to kernelParams, where thread blocks can cooperate and synchronize as they execute

Please note, HIP does not support kernel launch with total work items defined in dimension with size \( gridDim \cdot blockDim \geq 2^{32} \).

Parameters:
  • f[in] Kernel to launch.

  • gridDimX[in] X grid dimension specified as multiple of blockDimX.

  • gridDimY[in] Y grid dimension specified as multiple of blockDimY.

  • gridDimZ[in] Z grid dimension specified as multiple of blockDimZ.

  • blockDimX[in] X block dimension specified in work-items.

  • blockDimY[in] Y block dimension specified in work-items.

  • blockDimZ[in] Z block dimension specified in work-items.

  • sharedMemBytes[in] Amount of dynamic shared memory to allocate for this kernel. The HIP-Clang compiler provides support for extern shared declarations.

  • stream[in] Stream where the kernel should be dispatched. May be 0, in which case the default stream is used with associated synchronization rules.

  • kernelParams[in] A list of kernel arguments.

Returns:

hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidHandle, hipErrorInvalidImage, hipErrorInvalidValue, hipErrorInvalidConfiguration, hipErrorLaunchFailure, hipErrorLaunchOutOfResources, hipErrorLaunchTimeOut, hipErrorCooperativeLaunchTooLarge, hipErrorSharedObjectInitFailed

hipError_t hipModuleLaunchCooperativeKernelMultiDevice(hipFunctionLaunchParams *launchParamsList, unsigned int numDevices, unsigned int flags)#

Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they execute.

Parameters:
  • launchParamsList[in] List of launch parameters, one per device.

  • numDevices[in] Size of the launchParamsList array.

  • flags[in] Flags to control launch behavior.

Returns:

hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidHandle, hipErrorInvalidImage, hipErrorInvalidValue, hipErrorInvalidConfiguration, hipErrorInvalidResourceHandle, hipErrorLaunchFailure, hipErrorLaunchOutOfResources, hipErrorLaunchTimeOut, hipErrorCooperativeLaunchTooLarge, hipErrorSharedObjectInitFailed

hipError_t hipLaunchCooperativeKernel(const void *f, dim3 gridDim, dim3 blockDimX, void **kernelParams, unsigned int sharedMemBytes, hipStream_t stream)#

Launches kernel f with launch parameters and shared memory on stream with arguments passed to kernelparams or extra, where thread blocks can cooperate and synchronize as they execute.

Please note, HIP does not support kernel launch with total work items defined in dimension with size \( gridDim \cdot blockDim \geq 2^{32} \).

Parameters:
  • f[in] Kernel to launch.

  • gridDim[in] Grid dimensions specified as multiple of blockDim.

  • blockDimX[in] Block dimensions specified in work-items

  • kernelParams[in] A list of kernel arguments

  • sharedMemBytes[in] Amount of dynamic shared memory to allocate for this kernel. The HIP-Clang compiler provides support for extern shared declarations.

  • stream[in] Stream where the kernel should be dispatched. May be 0, in which case th default stream is used with associated synchronization rules.

Returns:

hipSuccess, hipErrorNotInitialized, hipErrorInvalidValue, hipErrorCooperativeLaunchTooLarge

hipError_t hipLaunchCooperativeKernelMultiDevice(hipLaunchParams *launchParamsList, int numDevices, unsigned int flags)#

Launches kernels on multiple devices where thread blocks can cooperate and synchronize as they execute.

Parameters:
  • launchParamsList[in] List of launch parameters, one per device.

  • numDevices[in] Size of the launchParamsList array.

  • flags[in] Flags to control launch behavior.

Returns:

hipSuccess, hipErrorNotInitialized, hipErrorInvalidValue, hipErrorCooperativeLaunchTooLarge

template<class T>
inline hipError_t hipLaunchCooperativeKernel(T f, dim3 gridDim, dim3 blockDim, void **kernelParams, unsigned int sharedMemBytes, hipStream_t stream)#

Launches a device function.

Template Parameters:

T – The type of the kernel function.

Parameters:
  • f[in] Kernel function to launch.

  • gridDim[in] Grid dimensions specified as multiple of blockDim.

  • blockDim[in] Block dimensions specified in work-items.

  • kernelParams[in] A list of kernel arguments.

  • sharedMemBytes[in] Amount of dynamic shared memory to allocate for this kernel. The HIP-Clang compiler provides support for extern shared declarations.

  • stream[in] Stream which on the kernel launched.

Returns:

hipSuccess, hipErrorLaunchFailure, hipErrorInvalidValue, hipErrorInvalidResourceHandle

template<class T>
inline hipError_t hipLaunchCooperativeKernelMultiDevice(hipLaunchParams *launchParamsList, unsigned int numDevices, unsigned int flags = 0)#

Launches kernel function on multiple devices, where thread blocks can cooperate and synchronize on execution.

Parameters:
  • launchParamsList[in] List of kernel launch parameters, one per device.

  • numDevices[in] Size of launchParamsList array.

  • flags[in] Flag to handle launch behavior.

Returns:

hipSuccess, hipErrorLaunchFailure, hipErrorInvalidValue, hipErrorInvalidResourceHandle

Cooperative groups classes#

The following cooperative groups classes can be used on the device side.

class thread_group#

The base type of all cooperative group types.

Holds the key properties of a constructed cooperative group types object, like the group type, its size, etc.

Note

Cooperative groups feature is implemented on Linux, under development on Microsoft Windows.

Subclassed by cooperative_groups::coalesced_group, cooperative_groups::grid_group, cooperative_groups::multi_grid_group, cooperative_groups::thread_block, cooperative_groups::tiled_group

class thread_block : public cooperative_groups::thread_group#

The workgroup (thread-block in CUDA terminology) cooperative group type.

Represents an intra-workgroup cooperative group type, where the participating threads within the group are the same threads that participated in the currently executing workgroup.

Note

This function is implemented on Linux and is under development on Microsoft Windows.

class grid_group : public cooperative_groups::thread_group#

The grid cooperative group type.

Represents an inter-workgroup cooperative group type, where the participating threads within the group spans across multiple workgroups running the (same) kernel on the same device.

Note

This is implemented on Linux and is under development on Microsoft Windows.

class multi_grid_group : public cooperative_groups::thread_group#

The multi-grid cooperative group type.

Represents an inter-device cooperative group type, where the participating threads within the group span across multiple devices, running the (same) kernel on these devices.

Note

The multi-grid cooperative group type is implemented on Linux, under development on Microsoft Windows.

template<unsigned int size, class ParentCGTy>
class thread_block_tile : public cooperative_groups::impl::thread_block_tile_internal<size, ParentCGTy>#

Group type - thread_block_tile.

Represents one tiled thread group in a wavefront. This group type also supports sub-wave level intrinsics.

Note

This type is implemented on Linux, under development on Microsoft Windows.

Public Functions

unsigned int thread_rank() const#

Rank of the calling thread within [0, size() ).

void sync()#

Synchronizes the threads in the group.

Causes all threads in the group to wait at this synchronization point, and for all shared and global memory accesses by the threads to complete, before running synchronization. This guarantees the visibility of accessed data for all threads in the group.

Note

There are potential read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW) hazards, when threads in the group access the same addresses in shared or global memory. The data hazards can be avoided with synchronization of the group.

unsigned int meta_group_rank() const#

Returns the linear rank of the group within the set of tiles partitioned from a parent group (bounded by meta_group_size)

unsigned int meta_group_size() const#

Returns the number of groups created when the parent group was partitioned.

template<class T>
T shfl(T var, int srcRank) const#

Shuffle operation on group level.

Exchanging variables between threads without use of shared memory. Shuffle operation is a direct copy of var from srcRank thread ID of group.

Template Parameters:

T – The type can be a 32-bit integer or single-precision floating point.

Parameters:
  • var – [in] The source variable to copy. Only the srcRank thread ID of group is copied to other threads.

  • srcRank – [in] The source thread ID of the group for copy.

template<class T>
T shfl_down(T var, unsigned int lane_delta) const#

Shuffle down operation on group level.

Exchanging variables between threads without use of shared memory. Shuffle down operation is copy of var from thread with thread ID of group relative higher with lane_delta to caller thread ID.

Template Parameters:

T – The type can be a 32-bit integer or single-precision floating point.

Parameters:
  • var – [in] The source variable to copy.

  • lane_delta – [in] The lane_delta is the relative thread ID difference between caller thread ID and source of copy thread ID. sourceID = (threadID + lane_delta) % size()

template<class T>
T shfl_up(T var, unsigned int lane_delta) const#

Shuffle up operation on group level.

Exchanging variables between threads without use of shared memory. Shuffle up operation is copy of var from thread with thread ID of group relative lower with lane_delta to caller thread ID.

Template Parameters:

T – The type can be a 32-bit integer or single-precision floating point.

Parameters:
  • var – [in] The source variable to copy.

  • lane_delta – [in] The lane_delta is the relative thread ID difference between caller thread ID and source of copy thread ID. sourceID = (threadID - lane_delta) % size()

template<class T>
T shfl_xor(T var, unsigned int laneMask) const#

Shuffle xor operation on group level.

Exchanging variables between threads without use of shared memory. Shuffle xor operation is copy of var from thread with thread ID of group based on laneMask XOR of the caller thread ID.

Template Parameters:

T – The type can be a 32-bit integer or single-precision floating point.

Parameters:
  • var – [in] The source variable to copy.

  • laneMask – [in] The laneMask is the mask for XOR operation. sourceID = threadID ^ laneMask

unsigned long long ballot(int pred) const#

Ballot function on group level.

Returns a bit mask with the Nth bit set to one if the Nth thread predicate evaluates true.

Parameters:

pred – [in] The predicate to evaluate on group threads.

int any(int pred) const#

Any function on group level.

Returns non-zero if a predicate evaluates true for any threads.

Parameters:

pred – [in] The predicate to evaluate on group threads.

int all(int pred) const#

All function on group level.

Returns non-zero if a predicate evaluates true for all threads.

Parameters:

pred – [in] The predicate to evaluate on group threads.

template<typename T>
unsigned long long match_any(T value) const#

Match any function on group level.

Returns a bit mask containing a 1-bit for every participating thread if that thread has the same value in value as the caller thread.

Parameters:

value – [in] The value to examine on the current thread in group.

template<typename T>
unsigned long long match_all(T value, int &pred) const#

Match all function on group level.

Returns a bit mask containing a 1-bit for every participating thread if they all have the same value in value as the caller thread. The predicate pred is set to true if all participating threads have the same value in value.

Parameters:
  • value – [in] The value to examine on the current thread in group.

  • pred – [out] The predicate is set to true if all participating threads in the thread group have the same value.

class coalesced_group : public cooperative_groups::thread_group#

The coalesced_group cooperative group type.

Represents an active thread group in a wavefront. This group type also supports sub-wave level intrinsics.

Note

This is implemented on Linux and is under development on Microsoft Windows.

Cooperative groups construct functions#

The following functions are used to construct different group-type instances on the device side.

multi_grid_group this_multi_grid()#

User-exposed API interface to construct grid cooperative group type object - multi_grid_group.

User is not allowed to construct an object of type multi_grid_group directly. Instead, they should construct it through this API function.

Note

This multi-grid cooperative API type is implemented on Linux, under development on Microsoft Windows.

grid_group this_grid()#

User-exposed API interface to construct grid cooperative group type object - grid_group.

User is not allowed to construct an object of type grid_group directly. Instead, they should construct it through this API function.

Note

This function is implemented on Linux and is under development on Microsoft Windows.

thread_block this_thread_block()#

User-exposed API interface to construct workgroup cooperative group type object - thread_block.

User is not allowed to construct an object of type thread_block directly. Instead, they should construct it through this API function.

Note

This function is implemented on Linux and is under development on Microsoft Windows.

coalesced_group coalesced_threads()#

User-exposed API to create coalesced groups.

A collective operation that groups all active lanes into a new thread group.

Note

This function is implemented on Linux and is under development on Microsoft Windows.

thread_group tiled_partition(const thread_group &parent, unsigned int tile_size)#

User-exposed API to partition groups.

A collective operation that partitions the parent group into a one-dimensional, row-major, tiling of subgroups.

template<unsigned int size, class ParentCGTy>
thread_block_tile<size, ParentCGTy> tiled_partition(const ParentCGTy &g)#

Create a partition.

This constructs a templated class derived from thread_group. The template defines the tile size of the new thread group at compile time.

Template Parameters:
  • size – The new size of the partition.

  • ParentCGTy – The cooperative group class template parameter of the input group.

Parameters:

g – [in] The coalesced group for split.

coalesced_group binary_partition(const coalesced_group &cgrp, bool pred)#

Binary partition.

This splits the input thread group into two partitions determined by predicate.

Parameters:
  • cgrp – [in] The coalesced group for split.

  • pred – [in] The predicate used during the group split up.

template<unsigned int size, class parent>
coalesced_group binary_partition(const thread_block_tile<size, parent> &tgrp, bool pred)#

Binary partition.

This splits the input thread group into two partitions determined by predicate.

Template Parameters:
  • size – The size of the input thread block tile group.

  • parent – The cooperative group class template parameter of the input group.

Parameters:
  • tgrp – [in] The thread block tile group for split.

  • pred – [in] The predicate used during the group split up.

Cooperative groups exposed API functions#

The following functions are the exposed API for different group-type instances on the device side.

template<class CGTy>
uint32_t group_size(CGTy const &g)#

Returns the size of the group.

Total number of threads in the thread group, and this serves the purpose for all derived cooperative group types because their size is directly saved during the construction.

Note

Implementation of publicly exposed wrapper API on top of basic cooperative group type APIs. This function is implemented on Linux and is under development on Microsoft Windows.

Template Parameters:

CGTy – The cooperative group class template parameter.

Parameters:

g – [in] The cooperative group for size returns.

template<class CGTy>
uint32_t thread_rank(CGTy const &g)#

Returns the rank of thread of the group.

Rank of the calling thread within [0, size() ).

Note

Implementation of publicly exposed wrapper API on top of basic cooperative group type APIs. This function is implemented on Linux and is under development on Microsoft Windows.

Template Parameters:

CGTy – The cooperative group class template parameter.

Parameters:

g – [in] The cooperative group for rank returns.

template<class CGTy>
bool is_valid(CGTy const &g)#

Returns true if the group has not violated any API constraints.

Note

Implementation of publicly exposed wrapper API on top of basic cooperative group type APIs. This function is implemented on Linux and is under development on Microsoft Windows.

Template Parameters:

CGTy – The cooperative group class template parameter.

Parameters:

g – [in] The cooperative group for validity check.

template<class CGTy>
void sync(CGTy const &g)#

Synchronizes the threads in the group.

Note

Implementation of publicly exposed wrapper API on top of basic cooperative group type APIs. This function is implemented on Linux and is under development on Microsoft Windows.

Template Parameters:

CGTy – The cooperative group class template parameter.

Parameters:

g – [in] The cooperative group for synchronization.