Adjacent difference
-
template<class T, unsigned int BlockSizeX, unsigned int BlockSizeY = 1, unsigned int BlockSizeZ = 1>
class block_adjacent_difference
The block_adjacent_difference class is a block-level parallel primitive that provides methods for applying binary functions to pairs of consecutive items partitioned across a thread block.
- Overview
There are two types of flags:
Head flags.
Tail flags.
The above flags are used to differentiate items from their predecessors or successors.
For example, head flags are convenient for differentiating disjoint data segments as part of a segmented reduction or scan.
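To make the idea of head flags concrete, here is a minimal host-side sketch in plain C++. It is not rocPRIM code: `head_flags_reference` is a hypothetical helper, the vector stands in for all items of a block in order, and the sketch assumes that the very first item always starts a segment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper (not part of rocPRIM).
// Writes 1 for the first item of each run of equal values and 0 otherwise.
// Assumption: the first item always starts a new segment.
template <typename T>
void head_flags_reference(const std::vector<T>& input, std::vector<int>& flags)
{
    assert(input.size() == flags.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        // An item is a head if it has no predecessor or differs from it.
        flags[i] = (i == 0 || input[i] != input[i - 1]) ? 1 : 0;
    }
}
```

With the input {1, 1, 2, 2, 2, 3} the flags come out as {1, 0, 1, 0, 0, 1}, marking the start of each of the three segments.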
- Examples
In this example, a discontinuity operation is performed on a block of 128 threads, using type int.
__global__ void example_kernel(...)
{
    // specialize discontinuity for int and a block of 128 threads
    using block_adjacent_difference_int = rocprim::block_adjacent_difference<int, 128>;
    // allocate storage in shared memory
    __shared__ block_adjacent_difference_int::storage_type storage;

    // segment of consecutive items to be used
    int input[8];
    ...
    int head_flags[8];
    block_adjacent_difference_int b_discontinuity;
    using flag_op_type = typename rocprim::greater<int>;
    b_discontinuity.flag_heads(head_flags, input, flag_op_type(), storage);
    ...
}
- Template Parameters:
T – the input type.
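BlockSizeX – the number of threads in the block's x dimension.
BlockSizeY – the number of threads in the block's y dimension; defaults to 1.
BlockSizeZ – the number of threads in the block's z dimension; defaults to 1.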
Public Types
-
using storage_type = storage_type_
Struct used to allocate the temporary memory required for thread communication during operations provided by the related parallel primitive.
Depending on the implementation, the operations exposed by the parallel primitive may require temporary storage for thread communication. The storage should be allocated using the keyword __shared__. It can be aliased to externally allocated memory, or be part of a union type with other storage types to increase shared memory reusability.
Public Functions
-
template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_left(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, storage_type &storage)
Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the left item.
The first item in the first thread is copied from the input; for the remaining items the following code applies.
// For each i in [1, block_size * ItemsPerThread) across threads in a block
output[i] = op(input[i], input[i-1]);
- Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Template Parameters:
Output – [inferred] the type of the output; must be assignable from the result of op
ItemsPerThread – [inferred] the number of items processed by each thread
BinaryFunction – [inferred] the type of the function to apply
- Parameters:
input – [in] array that data is loaded from partitioned across the threads in the block
output – [out] array where the result of function application will be written to
op – [in] binary function applied to the items. The signature of the function should be equivalent to the following:
bool f(const T &a, const T &b)
The signature does not need to have const &, but the function object must not modify the objects passed to it.
storage – reference to a temporary storage object of type storage_type
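The semantics of subtract_left can be checked against a sequential host-side reference. The sketch below is plain C++, not rocPRIM code: `subtract_left_reference` is a hypothetical name, and the vector stands in for all block_size * ItemsPerThread items of the block in order.

```cpp
#include <cassert>
#include <cstddef>
#include <functional> // std::minus, used in the usage note below
#include <vector>

// Hypothetical sequential reference (not a rocPRIM function).
template <typename T, typename Output, typename BinaryFunction>
void subtract_left_reference(const std::vector<T>& input,
                             std::vector<Output>& output,
                             BinaryFunction op)
{
    assert(input.size() == output.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        if (i == 0)
        {
            // The first item has no left neighbor, so it is copied.
            output[i] = input[i];
        }
        else
        {
            output[i] = op(input[i], input[i - 1]);
        }
    }
}
```

With the input {1, 3, 6, 10} and std::minus<int>() as op, the output is {1, 2, 3, 4}.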
-
template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_left(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, const T tile_predecessor, storage_type &storage)
Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the left item, with an explicit item before the tile.
// For the first item on the first thread use the tile predecessor
output[0] = op(input[0], tile_predecessor)
// For other items, i in [1, block_size * ItemsPerThread) across threads in a block
output[i] = op(input[i], input[i-1]);
- Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Template Parameters:
Output – [inferred] the type of the output; must be assignable from the result of op
ItemsPerThread – [inferred] the number of items processed by each thread
BinaryFunction – [inferred] the type of the function to apply
- Parameters:
input – [in] array that data is loaded from partitioned across the threads in the block
output – [out] array where the result of function application will be written to
op – [in] binary function applied to the items. The signature of the function should be equivalent to the following:
bool f(const T &a, const T &b)
The signature does not need to have const &, but the function object must not modify the objects passed to it.
tile_predecessor – [in] the item before the tile; used as the input of the first application of op
storage – reference to a temporary storage object of type storage_type
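The tile predecessor is what lets consecutive tiles be processed independently while still producing the differences of the whole sequence. A minimal host-side sketch in plain C++, assuming a hypothetical helper `subtract_left_with_predecessor` (not a rocPRIM function) whose vector argument represents one tile:

```cpp
#include <cassert>
#include <cstddef>
#include <functional> // std::minus, used in the usage note below
#include <vector>

// Hypothetical sequential reference (not a rocPRIM function).
template <typename T, typename Output, typename BinaryFunction>
void subtract_left_with_predecessor(const std::vector<T>& tile,
                                    std::vector<Output>& output,
                                    BinaryFunction op,
                                    const T& tile_predecessor)
{
    assert(tile.size() == output.size());
    for (std::size_t i = 0; i < tile.size(); ++i)
    {
        // The first item pairs with the item that precedes the tile.
        const T& left = (i == 0) ? tile_predecessor : tile[i - 1];
        output[i] = op(tile[i], left);
    }
}
```

If the sequence {1, 3, 6, 10, 15, 21} is split into two tiles of three items, the second tile {10, 15, 21} with tile_predecessor 6 and std::minus<int>() yields {4, 5, 6}, the same values a single pass over the whole sequence would produce at those positions.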
-
template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_left_partial(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, const unsigned int valid_items, storage_type &storage)
Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the left item, in a partial tile.
output[0] = input[0]
// For each item i in [1, valid_items) across threads in a block
output[i] = op(input[i], input[i-1]);
// Just copy "invalid" items in [valid_items, block_size * ItemsPerThread)
output[i] = input[i]
- Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Template Parameters:
Output – [inferred] the type of the output; must be assignable from the result of op
ItemsPerThread – [inferred] the number of items processed by each thread
BinaryFunction – [inferred] the type of the function to apply
- Parameters:
input – [in] array that data is loaded from partitioned across the threads in the block
output – [out] array where the result of function application will be written to
op – [in] binary function applied to the items. The signature of the function should be equivalent to the following:
bool f(const T &a, const T &b)
The signature does not need to have const &, but the function object must not modify the objects passed to it.
valid_items – [in] number of items in the block which are considered “valid” and will be used. Must be less than or equal to BlockSize * ItemsPerThread
storage – reference to a temporary storage object of type storage_type
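The partial-tile behavior can be written out as a sequential host-side reference. This is a plain C++ sketch, not rocPRIM code: `subtract_left_partial_reference` is a hypothetical name, and the vector stands in for all items of the block in order.

```cpp
#include <cassert>
#include <cstddef>
#include <functional> // std::minus, used in the usage note below
#include <vector>

// Hypothetical sequential reference (not a rocPRIM function).
template <typename T, typename Output, typename BinaryFunction>
void subtract_left_partial_reference(const std::vector<T>& input,
                                     std::vector<Output>& output,
                                     BinaryFunction op,
                                     std::size_t valid_items)
{
    assert(input.size() == output.size());
    assert(valid_items <= input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        if (i >= 1 && i < valid_items)
        {
            output[i] = op(input[i], input[i - 1]);
        }
        else
        {
            // Item 0 and every "invalid" item are copied unchanged.
            output[i] = input[i];
        }
    }
}
```

With the input {1, 3, 6, 10, 15}, std::minus<int>() and valid_items = 3, the output is {1, 2, 3, 10, 15}: only the first three items take part in the operation.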
-
template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_left_partial(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, const T tile_predecessor, const unsigned int valid_items, storage_type &storage)
Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the left item, in a partial tile with a predecessor.
This combines subtract_left_partial() with a tile predecessor.
- Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Template Parameters:
Output – [inferred] the type of the output; must be assignable from the result of op
ItemsPerThread – [inferred] the number of items processed by each thread
BinaryFunction – [inferred] the type of the function to apply
- Parameters:
input – [in] array that data is loaded from partitioned across the threads in the block
output – [out] array where the result of function application will be written to
op – [in] binary function applied to the items. The signature of the function should be equivalent to the following:
bool f(const T &a, const T &b)
The signature does not need to have const &, but the function object must not modify the objects passed to it.
tile_predecessor – [in] the item before the tile; used as the input of the first application of op
valid_items – [in] number of items in the block which are considered “valid” and will be used. Must be less than or equal to BlockSize * ItemsPerThread
storage – reference to a temporary storage object of type storage_type
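Since this overload is described only as a combination of the two variants, a sequential sketch may help. It is plain host-side C++ with a hypothetical name, `subtract_left_partial_with_predecessor`, and it rests on an assumption not stated in the text: the first item is combined with tile_predecessor only when it is itself valid (valid_items > 0), and is copied otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <functional> // std::minus, used in the usage note below
#include <vector>

// Hypothetical sequential reference (not a rocPRIM function).
// Assumption: item 0 uses the predecessor only if valid_items > 0.
template <typename T, typename Output, typename BinaryFunction>
void subtract_left_partial_with_predecessor(const std::vector<T>& input,
                                            std::vector<Output>& output,
                                            BinaryFunction op,
                                            const T& tile_predecessor,
                                            std::size_t valid_items)
{
    assert(input.size() == output.size());
    assert(valid_items <= input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        if (i < valid_items)
        {
            const T& left = (i == 0) ? tile_predecessor : input[i - 1];
            output[i] = op(input[i], left);
        }
        else
        {
            // "Invalid" items are copied unchanged.
            output[i] = input[i];
        }
    }
}
```

With the input {3, 6, 10, 15}, tile_predecessor 1, std::minus<int>() and valid_items = 2, the output is {2, 3, 10, 15}.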
-
template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_right(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, storage_type &storage)
Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the right item.
The last item in the last thread is copied from the input; for the remaining items the following code applies.
// For each i in [0, block_size * ItemsPerThread - 1) across threads in a block
output[i] = op(input[i], input[i+1]);
- Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Template Parameters:
Output – [inferred] the type of the output; must be assignable from the result of op
ItemsPerThread – [inferred] the number of items processed by each thread
BinaryFunction – [inferred] the type of the function to apply
- Parameters:
input – [in] array that data is loaded from partitioned across the threads in the block
output – [out] array where the result of function application will be written to
op – [in] binary function applied to the items. The signature of the function should be equivalent to the following:
bool f(const T &a, const T &b)
The signature does not need to have const &, but the function object must not modify the objects passed to it.
storage – reference to a temporary storage object of type storage_type
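The right-hand variant mirrors subtract_left, pairing each item with its successor. A minimal sequential sketch in plain host-side C++, with the hypothetical name `subtract_right_reference` (not a rocPRIM function) and a vector standing in for all items of the block in order:

```cpp
#include <cassert>
#include <cstddef>
#include <functional> // std::minus, used in the usage note below
#include <vector>

// Hypothetical sequential reference (not a rocPRIM function).
template <typename T, typename Output, typename BinaryFunction>
void subtract_right_reference(const std::vector<T>& input,
                              std::vector<Output>& output,
                              BinaryFunction op)
{
    assert(input.size() == output.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        if (i + 1 < input.size())
        {
            output[i] = op(input[i], input[i + 1]);
        }
        else
        {
            // The last item has no right neighbor, so it is copied.
            output[i] = input[i];
        }
    }
}
```

With the input {10, 6, 3, 1} and std::minus<int>(), the output is {4, 3, 2, 1}.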
-
template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_right(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, const T tile_successor, storage_type &storage)
Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the right item, with an explicit item after the tile.
// For each item i in [0, block_size * ItemsPerThread - 1) across threads in a block
output[i] = op(input[i], input[i+1]);
// For the last item on the last thread use the tile successor
output[block_size * ItemsPerThread - 1] = op(input[block_size * ItemsPerThread - 1], tile_successor)
- Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Template Parameters:
Output – [inferred] the type of the output; must be assignable from the result of op
ItemsPerThread – [inferred] the number of items processed by each thread
BinaryFunction – [inferred] the type of the function to apply
- Parameters:
input – [in] array that data is loaded from partitioned across the threads in the block
output – [out] array where the result of function application will be written to
op – [in] binary function applied to the items. The signature of the function should be equivalent to the following:
bool f(const T &a, const T &b)
The signature does not need to have const &, but the function object must not modify the objects passed to it.
tile_successor – [in] the item after the tile; used as the input of the last application of op
storage – reference to a temporary storage object of type storage_type
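As a sequential illustration of the tile successor, the following plain host-side C++ sketch pairs the last item of a tile with the item that follows it. `subtract_right_with_successor` is a hypothetical name, not a rocPRIM function, and the vector represents one tile.

```cpp
#include <cassert>
#include <cstddef>
#include <functional> // std::minus, used in the usage note below
#include <vector>

// Hypothetical sequential reference (not a rocPRIM function).
template <typename T, typename Output, typename BinaryFunction>
void subtract_right_with_successor(const std::vector<T>& tile,
                                   std::vector<Output>& output,
                                   BinaryFunction op,
                                   const T& tile_successor)
{
    assert(tile.size() == output.size());
    for (std::size_t i = 0; i < tile.size(); ++i)
    {
        // The last item pairs with the item that follows the tile.
        const T& right = (i + 1 < tile.size()) ? tile[i + 1] : tile_successor;
        output[i] = op(tile[i], right);
    }
}
```

With the tile {10, 6, 3}, tile_successor 1 and std::minus<int>(), the output is {4, 3, 2}.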
-
template<typename Output, unsigned int ItemsPerThread, typename BinaryFunction>
__device__ inline void subtract_right_partial(const T (&input)[ItemsPerThread], Output (&output)[ItemsPerThread], const BinaryFunction op, const unsigned int valid_items, storage_type &storage)
Apply a function to each consecutive pair of elements partitioned across threads in the block and write the output to the position of the right item, in a partial tile.
// For each item i in [0, valid_items) across threads in a block
output[i] = op(input[i], input[i + 1]);
// Just copy "invalid" items in [valid_items, block_size * ItemsPerThread)
output[i] = input[i]
- Storage reuse
A synchronization barrier should be placed before storage is reused or repurposed: __syncthreads() or rocprim::syncthreads().
- Template Parameters:
Output – [inferred] the type of the output; must be assignable from the result of op
ItemsPerThread – [inferred] the number of items processed by each thread
BinaryFunction – [inferred] the type of the function to apply
- Parameters:
input – [in] array that data is loaded from partitioned across the threads in the block
output – [out] array where the result of function application will be written to
op – [in] binary function applied to the items. The signature of the function should be equivalent to the following:
bool f(const T &a, const T &b)
The signature does not need to have const &, but the function object must not modify the objects passed to it.
valid_items – [in] number of items in the block which are considered “valid” and will be used. Must be less than or equal to BlockSize * ItemsPerThread
storage – reference to a temporary storage object of type storage_type
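Read literally, the pseudocode for subtract_right_partial applies op at i = valid_items - 1, which would read input[valid_items], an item outside the valid range. The host-side sketch below therefore makes an explicit assumption, by symmetry with subtract_left_partial: the last valid item has no valid right neighbor and is copied. This should be checked against the implementation before relying on it. `subtract_right_partial_reference` is a hypothetical name, not a rocPRIM function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional> // std::minus, used in the usage note below
#include <vector>

// Hypothetical sequential reference (not a rocPRIM function).
// Assumption: op is applied only where both items of the pair are valid,
// so the last valid item is copied.
template <typename T, typename Output, typename BinaryFunction>
void subtract_right_partial_reference(const std::vector<T>& input,
                                      std::vector<Output>& output,
                                      BinaryFunction op,
                                      std::size_t valid_items)
{
    assert(input.size() == output.size());
    assert(valid_items <= input.size());
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        if (i + 1 < valid_items)
        {
            output[i] = op(input[i], input[i + 1]);
        }
        else
        {
            // The last valid item and all "invalid" items are copied.
            output[i] = input[i];
        }
    }
}
```

Under that assumption, the input {10, 6, 3, 1, 7} with std::minus<int>() and valid_items = 3 gives {4, 3, 3, 1, 7}.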