Device memory#
Device memory is random access memory that is physically located on a GPU. In general it is memory with a bandwidth that is an order of magnitude higher compared to RAM available to the host. That high bandwidth is only available to on-device accesses, accesses from the host or other devices have to go over a special interface which is considerably slower, usually the PCIe bus or the AMD Infinity Fabric.
On certain architectures like APUs, the GPU and CPU share the same physical memory.
There is also a special local data share on-chip directly accessible to the compute units, that can be used for shared memory.
The physical device memory can be used to back up several different memory spaces in HIP, as described in the following.
Global memory#
Global memory is the general read-write accessible memory visible to all threads
on a given device. Since variables located in global memory have to be marked
with the __device__
qualifier, this memory space is also referred to as
device memory.
Without explicitly copying it, it can only be accessed by the threads within a kernel operating on the device, however Unified memory management can be used to let the runtime manage this, if desired.
Allocating global memory#
This memory needs to be explicitly allocated.
It can be allocated from the host via the HIP runtime memory management
functions like hipMalloc()
, or can be
defined using the __device__
qualifier on variables.
It can also be allocated within a kernel using malloc
or new
.
The specified amount of memory is allocated by each thread that executes the
instructions. The recommended way to allocate the memory depends on the use
case. If the memory is intended to be shared between the threads of a block, it
is generally beneficial to allocate one large block of memory, due to the way
the memory is accessed.
Note
Memory allocated within a kernel can only be freed in kernels, not by the HIP
runtime on the host, like hipFree()
. It is also not possible to
free device memory allocated on the host, with hipMalloc()
for
example, in a kernel.
An example for how to share memory allocated within a kernel by only one thread is given in the following example. In case the device memory is only needed for communication between the threads in a single block, Shared memory is the better option, but is also limited in size.
__global__ void kernel_memory_allocation(TYPE* pointer){
// The pointer is stored in shared memory, so that all
// threads of the block can access the pointer
__shared__ int *memory;
size_t blockSize = blockDim.x;
constexpr size_t elementsPerThread = 1024;
if(threadIdx.x == 0){
// allocate memory in one contiguous block
memory = new int[blockDim.x * elementsPerThread];
}
__syncthreads();
// load pointer into thread-local variable to avoid
// unnecessary accesses to shared memory
int *localPtr = memory;
// work with allocated memory, e.g. initialization
for(int i = 0; i < elementsPerThread; ++i){
// access in a contiguous way
localPtr[i * blockSize + threadIdx.x] = i;
}
// synchronize to make sure no thread is accessing the memory before freeing
__syncthreads();
if(threadIdx.x == 0){
delete[] memory;
}
}
Copying between device and host#
When not using Unified memory management, memory has to be explicitly copied between the device and the host, using the HIP runtime API.
size_t elements = 1 << 20;
size_t size_bytes = elements * sizeof(int);
// allocate host and device memory
int *host_pointer = new int[elements];
int *device_input, *device_result;
HIP_CHECK(hipMalloc(&device_input, size_bytes));
HIP_CHECK(hipMalloc(&device_result, size_bytes));
// copy from host to the device
HIP_CHECK(hipMemcpy(device_input, host_pointer, size_bytes, hipMemcpyHostToDevice));
// Use memory on the device, i.e. execute kernels
// copy from device to host, to e.g. get results from the kernel
HIP_CHECK(hipMemcpy(host_pointer, device_result, size_bytes, hipMemcpyDeviceToHost));
// free memory when not needed any more
HIP_CHECK(hipFree(device_result));
HIP_CHECK(hipFree(device_input));
delete[] host_pointer;
Constant memory#
Constant memory is read-only storage visible to all threads on a given device. It is a limited segment backed by device memory, that takes a different caching route than normal device memory accesses. It needs to be set by the host before kernel execution.
In order to get the highest bandwidth from the constant memory, all threads of a warp have to access the same memory address. If they access different addresses, the accesses get serialized and the bandwidth is therefore reduced.
Using constant memory#
Constant memory can not be dynamically allocated, and the size has to be specified during compile time. If the values can not be specified during compile time, they have to be set by the host before the kernel, that accesses the constant memory, is called.
constexpr size_t const_array_size = 32;
__constant__ double const_array[const_array_size];
void set_constant_memory(double* values){
hipMemcpyToSymbol(const_array, values, const_array_size * sizeof(double));
}
__global__ void kernel_using_const_memory(double* array){
int warpIdx = threadIdx.x / warpSize;
// uniform access of warps to const_array for best performance
array[blockDim.x] *= const_array[warpIdx];
}
Texture memory#
Texture memory is special read-only memory visible to all threads on a given device and accessible through additional APIs. Its origins come from graphics APIs, and provides performance benefits when accessing memory in a pattern where the addresses are close to each other in a 2D or 3D representation of the memory. It also provides additional features like filtering and addressing for out-of-bounds accesses, which are further explained in Texture fetching.
The original use of the texture cache was also to take pressure off the global memory and other caches, however on modern GPUs, that support textures, the L1 cache and texture cache are combined, so the main purpose is to make use of the texture specific features.
To find out whether textures are supported on a device, query
hipDeviceAttributeImageSupport
.
Using texture memory#
Textures are more complex than just a region of memory, so their layout has to
be specified. They are represented by hipTextureObject_t
and created using
hipCreateTextureObject()
.
The underlying memory is a 1D, 2D or 3D hipArray_t
, that needs to be
allocated using hipMallocArray()
.
On the device side, texture objects are accessed using the tex1D/2D/3D
functions.
The texture management functions can be found in the Texture management API reference
A full example for how to use textures can be found in the ROCm texture management example
Surface memory#
A read-write version of texture memory. It is created in the same way as a
texture, but with hipCreateSurfaceObject()
.
Since surfaces are also cached in the read-only texture cache, the changes written back to the surface can’t be observed in the same kernel. A new kernel has to be launched in order to see the updated surface.
The corresponding functions are listed in the Surface object API reference.