/home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-composable-kernel/checkouts/develop/include/ck/wrapper/operations/gemm.hpp File Reference



Functions

template<typename DataType, index_t BlockSize, typename GemmTraits, typename ATensorType, typename BTensorType, typename CTensorType>
__device__ void blockwise_gemm_xdl(const ATensorType& a_local_tile_tensor, const BTensorType& b_local_tile_tensor, CTensorType& c_reg_tensor)
 Perform a blockwise XDL GEMM on tensors stored in LDS. The result is stored in VGPRs. The A data layout must be (MPerBlock, KPerBlock) or (K0PerBlock, MPerBlock, K1), and the B data layout must be (NPerBlock, KPerBlock) or (K0PerBlock, NPerBlock, K1).
 
template<typename DataType, typename ATileLayout, typename BTileLayout, index_t BlockSize, typename GemmTraits, typename CTensorType>
__host__ constexpr __device__ auto make_blockwise_gemm_xdl_c_local_partition(CTensorType& c_local_tile_tensor)
 Create a per-thread local partition of the C tensor.
 
template<typename DataType, typename ATileLayout, typename BTileLayout, index_t BlockSize, typename GemmTraits>
__host__ constexpr __device__ auto make_blockwise_gemm_xdl_c_vgpr()
 Create the per-thread VGPR tensor for C.
 

Function Documentation

◆ blockwise_gemm_xdl()

template<typename DataType, index_t BlockSize, typename GemmTraits, typename ATensorType, typename BTensorType, typename CTensorType>
__device__ void blockwise_gemm_xdl(const ATensorType& a_local_tile_tensor,
                                   const BTensorType& b_local_tile_tensor,
                                   CTensorType& c_reg_tensor)

Perform a blockwise XDL GEMM on tensors stored in LDS. The result is stored in VGPRs. The A data layout must be (MPerBlock, KPerBlock) or (K0PerBlock, MPerBlock, K1), and the B data layout must be (NPerBlock, KPerBlock) or (K0PerBlock, NPerBlock, K1).

Note
C output VGPR layout (8D):
  • MXdlPerWave - The number of MFMA instructions run by a single wave in the M dimension per tile.
  • NXdlPerWave - The number of MFMA instructions run by a single wave in the N dimension per tile.
  • MWave - Equal to 1, since this is for a single wave.
  • NWave - Equal to 1, since this is for a single wave.
  • NumGroupsPerBlock - MFMA instruction internal layout (depends on the instruction size).
  • NumInputsBlock - MFMA instruction internal layout (depends on the instruction size).
  • GroupSize - MFMA instruction internal layout (depends on the instruction size).
  • NumThreadsPerBlock - MFMA instruction internal layout (depends on the instruction size).
Template Parameters
DataType - Input data types.
BlockSize - Number of threads in block.
GemmTraits - Traits of the GEMM XDL operation.
Parameters
a_local_tile_tensor - A tensor in LDS memory for blockwise GEMM, in (MPerBlock, KPerBlock) or (K0PerBlock, MPerBlock, K1) layout.
b_local_tile_tensor - B tensor in LDS memory for blockwise GEMM, in (NPerBlock, KPerBlock) or (K0PerBlock, NPerBlock, K1) layout.
c_reg_tensor - C tensor in VGPR memory for blockwise GEMM.
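
The two accepted input layouts can be compared with a small host-side reference. The sketch below is plain C++ and does not use the `ck::wrapper` API; every function name in it is hypothetical. It assumes that the split of K follows `k = k0 * K1 + k1` (so `KPerBlock == K0PerBlock * K1`), which this page does not state explicitly, and that tiles are stored packed in row-major order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference; none of these names exist in ck::wrapper.

// C(m, n) = sum over k of A(m, k) * B(n, k), with A stored as
// (MPerBlock, KPerBlock) and B as (NPerBlock, KPerBlock), both row-major.
std::vector<float> reference_gemm_mk(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t m_per_block,
                                     std::size_t n_per_block,
                                     std::size_t k_per_block)
{
    assert(a.size() == m_per_block * k_per_block);
    assert(b.size() == n_per_block * k_per_block);
    std::vector<float> c(m_per_block * n_per_block, 0.0f);
    for(std::size_t m = 0; m < m_per_block; ++m)
        for(std::size_t n = 0; n < n_per_block; ++n)
            for(std::size_t k = 0; k < k_per_block; ++k)
                c[m * n_per_block + n] += a[m * k_per_block + k] * b[n * k_per_block + k];
    return c;
}

// Repack a (Rows, KPerBlock) tile into (K0PerBlock, Rows, K1).
// Assumption: k = k0 * K1 + k1, so KPerBlock == K0PerBlock * K1.
std::vector<float> pack_to_k0_rows_k1(const std::vector<float>& src,
                                      std::size_t rows,
                                      std::size_t k0_per_block,
                                      std::size_t k1)
{
    const std::size_t k_per_block = k0_per_block * k1;
    assert(src.size() == rows * k_per_block);
    std::vector<float> dst(src.size());
    for(std::size_t k0 = 0; k0 < k0_per_block; ++k0)
        for(std::size_t r = 0; r < rows; ++r)
            for(std::size_t i = 0; i < k1; ++i)
                dst[(k0 * rows + r) * k1 + i] = src[r * k_per_block + k0 * k1 + i];
    return dst;
}

// Same product, reading A as (K0PerBlock, MPerBlock, K1)
// and B as (K0PerBlock, NPerBlock, K1).
std::vector<float> reference_gemm_k0_rows_k1(const std::vector<float>& a,
                                             const std::vector<float>& b,
                                             std::size_t m_per_block,
                                             std::size_t n_per_block,
                                             std::size_t k0_per_block,
                                             std::size_t k1)
{
    assert(a.size() == k0_per_block * m_per_block * k1);
    assert(b.size() == k0_per_block * n_per_block * k1);
    std::vector<float> c(m_per_block * n_per_block, 0.0f);
    for(std::size_t m = 0; m < m_per_block; ++m)
        for(std::size_t n = 0; n < n_per_block; ++n)
            for(std::size_t k0 = 0; k0 < k0_per_block; ++k0)
                for(std::size_t i = 0; i < k1; ++i)
                    c[m * n_per_block + n] += a[(k0 * m_per_block + m) * k1 + i] *
                                              b[(k0 * n_per_block + n) * k1 + i];
    return c;
}
```

Under these assumptions, packing A and B with `pack_to_k0_rows_k1` and calling `reference_gemm_k0_rows_k1` gives the same C tile as `reference_gemm_mk` on the unpacked inputs. Note that B is indexed as (N, K) in both variants, matching the layouts listed above.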

◆ make_blockwise_gemm_xdl_c_local_partition()

template<typename DataType, typename ATileLayout, typename BTileLayout, index_t BlockSize, typename GemmTraits, typename CTensorType>
__host__ constexpr __device__ auto make_blockwise_gemm_xdl_c_local_partition(CTensorType& c_local_tile_tensor)

Create a per-thread local partition of the C tensor.

Note
C output global memory layout (8D):
  • MXdlPerWave - The number of MFMA instructions run by a single wave in the M dimension.
  • NXdlPerWave - The number of MFMA instructions run by a single wave in the N dimension.
  • MWave - The number of waves in the M dimension of a single tile.
  • NWave - The number of waves in the N dimension of a single tile.
  • NumGroupsPerBlock - MFMA instruction internal layout (depends on the instruction size).
  • NumInputsBlock - MFMA instruction internal layout (depends on the instruction size).
  • GroupSize - MFMA instruction internal layout (depends on the instruction size).
  • NumThreadsPerBlock - MFMA instruction internal layout (depends on the instruction size).
Template Parameters
DataType - Input data types.
ATileLayout - A tensor layout.
BTileLayout - B tensor layout.
BlockSize - Number of threads in block.
GemmTraits - Traits of the GEMM XDL operation.
Parameters
c_local_tile_tensor - C tensor in LDS memory for blockwise GEMM, in (MPerBlock, NPerBlock) layout.
Returns
Partitioned C tensor for blockwise GEMM.
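
To make the 8D layout in the note more concrete, the sketch below models the eight extents as a plain struct and checks how they might relate to a (MPerBlock, NPerBlock) tile. It is an illustration only: the struct, the helper names and the example values are hypothetical, and the split used here (five extents covering M, three covering N) is an assumption inferred from the dimension names, not something this page states.

```cpp
#include <cstddef>

// Hypothetical model of the eight extents listed in the note, in the same order.
struct CLayout8D
{
    std::size_t m_xdl_per_wave;
    std::size_t n_xdl_per_wave;
    std::size_t m_wave;
    std::size_t n_wave;
    std::size_t num_groups_per_block;
    std::size_t num_inputs_block;
    std::size_t group_size;
    std::size_t num_threads_per_block;
};

// Assumption: the M extent of the tile is covered by MXdlPerWave, MWave,
// NumGroupsPerBlock, NumInputsBlock and GroupSize.
constexpr std::size_t tile_m(const CLayout8D& l)
{
    return l.m_xdl_per_wave * l.m_wave * l.num_groups_per_block * l.num_inputs_block *
           l.group_size;
}

// Assumption: the N extent of the tile is covered by NXdlPerWave, NWave
// and NumThreadsPerBlock.
constexpr std::size_t tile_n(const CLayout8D& l)
{
    return l.n_xdl_per_wave * l.n_wave * l.num_threads_per_block;
}

// Total number of C elements addressed by the 8D layout.
constexpr std::size_t element_count(const CLayout8D& l)
{
    return tile_m(l) * tile_n(l);
}

// Illustrative values only; the last four depend on the MFMA instruction in use.
constexpr CLayout8D example_layout{2, 2, 2, 2, 4, 2, 4, 32};
static_assert(tile_m(example_layout) == 128, "M extent of the example tile");
static_assert(tile_n(example_layout) == 128, "N extent of the example tile");
```

With these example values the eight extents multiply out to a 128 x 128 tile. For a real kernel, the instruction-dependent extents should be taken from the library rather than hard-coded.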

◆ make_blockwise_gemm_xdl_c_vgpr()

template<typename DataType, typename ATileLayout, typename BTileLayout, index_t BlockSize, typename GemmTraits>
__host__ constexpr __device__ auto make_blockwise_gemm_xdl_c_vgpr()

Create the per-thread VGPR tensor for C.

Note
C output VGPR layout (8D):
  • MXdlPerWave - The number of MFMA instructions run by a single wave in the M dimension per tile.
  • NXdlPerWave - The number of MFMA instructions run by a single wave in the N dimension per tile.
  • MWave - Equal to 1, since this is for a single wave.
  • NWave - Equal to 1, since this is for a single wave.
  • NumGroupsPerBlock - MFMA instruction internal layout (depends on the instruction size).
  • NumInputsBlock - MFMA instruction internal layout (depends on the instruction size).
  • GroupSize - MFMA instruction internal layout (depends on the instruction size).
  • NumThreadsPerBlock - MFMA instruction internal layout (depends on the instruction size).
Template Parameters
DataType - Input data types.
ATileLayout - A tensor layout.
BTileLayout - B tensor layout.
BlockSize - Number of threads in block.
GemmTraits - Traits of the GEMM XDL operation.
Returns
VGPR C tensor for blockwise GEMM.
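
A quick way to sanity-check the size of the per-thread VGPR tensor is to divide the C tile evenly among the threads of the block. The helpers below are hypothetical and not part of `ck::wrapper`; they assume that every thread holds an equal share of the (MPerBlock, NPerBlock) tile, which is the usual arrangement for this kind of kernel but is not spelled out on this page.

```cpp
#include <cstddef>

// Hypothetical helper: true when the C tile can be divided evenly among threads.
constexpr bool splits_evenly(std::size_t m_per_block,
                             std::size_t n_per_block,
                             std::size_t block_size)
{
    return block_size != 0 && (m_per_block * n_per_block) % block_size == 0;
}

// Hypothetical helper: C accumulator elements held by each thread,
// assuming an even split of the tile across block_size threads.
constexpr std::size_t c_elements_per_thread(std::size_t m_per_block,
                                            std::size_t n_per_block,
                                            std::size_t block_size)
{
    return (m_per_block * n_per_block) / block_size;
}

// Example: a 128 x 128 tile computed by 256 threads.
static_assert(splits_evenly(128, 128, 256), "tile divides evenly");
static_assert(c_elements_per_thread(128, 128, 256) == 64, "64 accumulators per thread");
```

If the element count of the tensor returned for a given configuration differs from this estimate, the even-split assumption does not hold for that configuration and the estimate should not be relied on.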