ROCm Libraries

rocBLAS

Refer to the rocBLAS User Guide for the updated rocBLAS user manual.

A BLAS implementation on top of AMD’s Radeon Open Compute ROCm runtime and toolchains. rocBLAS is implemented in the HIP programming language and optimized for AMD’s latest discrete GPUs.

Prerequisites

  • A ROCm-enabled platform; more information is available here.

  • Base software stack, which includes HIP.

Installing pre-built packages

Download pre-built packages either from ROCm's package servers or manually from the GitHub releases tab, which may carry newer versions. Release notes are available for each release on the releases tab.

sudo apt update && sudo apt install rocblas

Quickstart rocBLAS build

Bash helper build script (Ubuntu only)

The root of this repository has a helper bash script, install.sh, to build and install rocBLAS on Ubuntu with a single command. It accepts only a few options and hard-codes configuration that could otherwise be specified by invoking cmake directly, but it is a quick way to get started and can serve as an example of how to build and install. A few commands in the script need sudo access, so it may prompt you for a password.

./install.sh -h  -- shows help
./install.sh -id -- build library, build dependencies and install (-d flag only needs to be passed once on a system)

Manual build (all supported platforms)

If you use a distro other than Ubuntu, or would like more control over the build process, the rocblas build wiki has helpful information on how to configure cmake and manually build.

Functions supported

A list of exported functions from rocBLAS can be found on the wiki.

rocBLAS interface examples

In general, the rocBLAS interface is compatible with the CPU-oriented Netlib BLAS and the cuBLAS-v2 API, with the explicit exception that traditional BLAS interfaces do not accept handles. The cuBLAS cublasHandle_t is replaced with rocblas_handle everywhere. Thus, porting a CUDA application that originally calls the cuBLAS API to a HIP application calling the rocBLAS API should be relatively straightforward. For example, the rocBLAS SGEMV interface is

GEMV API

rocblas_status
rocblas_sgemv(rocblas_handle handle,
              rocblas_operation trans,
              rocblas_int m, rocblas_int n,
              const float* alpha,
              const float* A, rocblas_int lda,
              const float* x, rocblas_int incx,
              const float* beta,
              float* y, rocblas_int incy);

Batched and strided GEMM API

rocBLAS GEMM can process matrices in batches with regular strides. There are several permutations of these APIs; the following example takes the full set of arguments.

rocblas_status
rocblas_sgemm_strided_batched(
    rocblas_handle handle,
    rocblas_operation transa, rocblas_operation transb,
    rocblas_int m, rocblas_int n, rocblas_int k,
    const float* alpha,
    const float* A, rocblas_int ls_a, rocblas_int ld_a, rocblas_int bs_a,
    const float* B, rocblas_int ls_b, rocblas_int ld_b, rocblas_int bs_b,
    const float* beta,
          float* C, rocblas_int ls_c, rocblas_int ld_c, rocblas_int bs_c,
    rocblas_int batch_count )

rocBLAS assumes that matrices (such as A) and vectors (such as x and y) are allocated in GPU memory and already filled with data. Users are responsible for copying data between host and device memory; HIP provides memcpy-style APIs to facilitate this data management.
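
As a minimal illustration of this workflow, the following sketch allocates device buffers with HIP, copies the operands over, calls rocblas_sgemv, and copies the result back. It is only a sketch: the header path (rocblas.h versus rocblas/rocblas.h) depends on the ROCm version, and checking of the returned rocblas_status and hipError_t values is omitted.

#include <hip/hip_runtime.h>
#include <rocblas.h>
#include <vector>

int main()
{
    // y := alpha * A * x + beta * y, with A stored column-major (m x n).
    const rocblas_int m = 3, n = 2, lda = m;
    const float alpha = 1.0f, beta = 0.0f;

    std::vector<float> hA(size_t(lda) * n, 1.0f), hx(n, 1.0f), hy(m, 0.0f);

    float *dA, *dx, *dy;
    hipMalloc(&dA, sizeof(float) * hA.size());
    hipMalloc(&dx, sizeof(float) * hx.size());
    hipMalloc(&dy, sizeof(float) * hy.size());

    // The user copies input data to the device before calling rocBLAS.
    hipMemcpy(dA, hA.data(), sizeof(float) * hA.size(), hipMemcpyHostToDevice);
    hipMemcpy(dx, hx.data(), sizeof(float) * hx.size(), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), sizeof(float) * hy.size(), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    rocblas_sgemv(handle, rocblas_operation_none, m, n,
                  &alpha, dA, lda, dx, 1, &beta, dy, 1);

    // Copy the result back; this blocks until the GEMV has completed.
    hipMemcpy(hy.data(), dy, sizeof(float) * hy.size(), hipMemcpyDeviceToHost);

    rocblas_destroy_handle(handle);
    hipFree(dA);
    hipFree(dx);
    hipFree(dy);
    return 0;
}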

Asynchronous API

Except for a few routines (such as TRSM) that allocate memory internally and therefore cannot be fully asynchronous, most library routines (such as BLAS-1 SCAL, BLAS-2 GEMV, BLAS-3 GEMM) operate asynchronously with respect to the CPU, meaning these library functions return immediately.
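
For example, a sketch (assuming handle, the matrix dimensions m/n/k with leading dimensions lda/ldb/ldc, the device buffers dA/dB/dC, and the host scalars alpha/beta are already set up as above):

hipStream_t stream;
hipStreamCreate(&stream);
rocblas_set_stream(handle, stream);   // subsequent rocBLAS work is queued on this stream

// Returns immediately; the GEMM runs asynchronously on the GPU.
rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
              m, n, k, &alpha, dA, lda, dB, ldb, &beta, dC, ldc);

// Block the host until all work queued on the stream has finished.
hipStreamSynchronize(stream);
hipStreamDestroy(stream);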

For more information regarding the rocBLAS library and the corresponding API documentation, refer to rocBLAS.

API

This section provides details of the library API.

Types

Definitions
rocblas_int
typedef int32_t rocblas_int

Specifies whether int32 or int64 is used.

rocblas_stride
typedef int64_t rocblas_stride
rocblas_half
struct rocblas_half

Represents a 16 bit floating point number.

rocblas_handle
typedef struct _rocblas_handle *rocblas_handle

rocblas_handle is a structure holding the rocblas library context. It must be initialized using rocblas_create_handle() and the returned handle must be passed to all subsequent library function calls. It should be destroyed at the end using rocblas_destroy_handle().
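
A typical handle lifetime looks like the following sketch (error handling abbreviated):

rocblas_handle handle;
rocblas_status status = rocblas_create_handle(&handle);
if(status != rocblas_status_success)
{
    // creation failed: report the status and stop before making further calls
}

// ... rocBLAS calls using handle ...

rocblas_destroy_handle(handle);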

Enums

Enumeration constants have numbering that is consistent with CBLAS, ACML and most standard C BLAS libraries.

rocblas_operation
enum rocblas_operation

Used to specify whether the matrix is to be transposed or not.

Parameter constants; numbering is consistent with CBLAS, ACML, and most standard C BLAS libraries.

Values:

enumerator rocblas_operation_none = 111

Operate with the matrix.

enumerator rocblas_operation_transpose = 112

Operate with the transpose of the matrix.

enumerator rocblas_operation_conjugate_transpose = 113

Operate with the conjugate transpose of the matrix.

rocblas_fill
enum rocblas_fill

Used by the Hermitian, symmetric and triangular matrix routines to specify whether the upper or lower triangle is being referenced.

Values:

enumerator rocblas_fill_upper = 121

Upper triangle.

enumerator rocblas_fill_lower = 122

Lower triangle.

enumerator rocblas_fill_full = 123
rocblas_diagonal
enum rocblas_diagonal

It is used by the triangular matrix routines to specify whether the matrix is unit triangular.

Values:

enumerator rocblas_diagonal_non_unit = 131

Non-unit triangular.

enumerator rocblas_diagonal_unit = 132

Unit triangular.

rocblas_side
enum rocblas_side

Indicates the side on which matrix A is located relative to matrix B during multiplication.

Values:

enumerator rocblas_side_left = 141

Multiply general matrix by symmetric, Hermitian or triangular matrix on the left.

enumerator rocblas_side_right = 142

Multiply general matrix by symmetric, Hermitian or triangular matrix on the right.

enumerator rocblas_side_both = 143
rocblas_status
enum rocblas_status

rocblas status codes definition

Values:

enumerator rocblas_status_success = 0

success

enumerator rocblas_status_invalid_handle = 1

handle not initialized, invalid or null

enumerator rocblas_status_not_implemented = 2

function is not implemented

enumerator rocblas_status_invalid_pointer = 3

invalid pointer argument

enumerator rocblas_status_invalid_size = 4

invalid size argument

enumerator rocblas_status_memory_error = 5

failed internal memory allocation, copy or dealloc

enumerator rocblas_status_internal_error = 6

other internal library failure

enumerator rocblas_status_perf_degraded = 7

performance degraded due to low device memory

enumerator rocblas_status_size_query_mismatch = 8

unmatched start/stop size query

enumerator rocblas_status_size_increased = 9

queried device memory size increased

enumerator rocblas_status_size_unchanged = 10

queried device memory size unchanged

enumerator rocblas_status_invalid_value = 11

passed argument not valid

enumerator rocblas_status_continue = 12

nothing preventing the function from proceeding

rocblas_datatype
enum rocblas_datatype

Indicates the precision width of data stored in a BLAS type.

Values:

enumerator rocblas_datatype_f16_r = 150

16 bit floating point, real

enumerator rocblas_datatype_f32_r = 151

32 bit floating point, real

enumerator rocblas_datatype_f64_r = 152

64 bit floating point, real

enumerator rocblas_datatype_f16_c = 153

16 bit floating point, complex

enumerator rocblas_datatype_f32_c = 154

32 bit floating point, complex

enumerator rocblas_datatype_f64_c = 155

64 bit floating point, complex

enumerator rocblas_datatype_i8_r = 160

8 bit signed integer, real

enumerator rocblas_datatype_u8_r = 161

8 bit unsigned integer, real

enumerator rocblas_datatype_i32_r = 162

32 bit signed integer, real

enumerator rocblas_datatype_u32_r = 163

32 bit unsigned integer, real

enumerator rocblas_datatype_i8_c = 164

8 bit signed integer, complex

enumerator rocblas_datatype_u8_c = 165

8 bit unsigned integer, complex

enumerator rocblas_datatype_i32_c = 166

32 bit signed integer, complex

enumerator rocblas_datatype_u32_c = 167

32 bit unsigned integer, complex

enumerator rocblas_datatype_bf16_r = 168

16 bit bfloat, real

enumerator rocblas_datatype_bf16_c = 169

16 bit bfloat, complex

rocblas_pointer_mode
enum rocblas_pointer_mode

Indicates whether a pointer is a device pointer or a host pointer. This is typically used for scalars such as alpha and beta.

Values:

enumerator rocblas_pointer_mode_host = 0

Scalar values affected by this variable will be located on the host.

enumerator rocblas_pointer_mode_device = 1

Scalar values affected by this variable will be located on the device.
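
For instance, the following sketch (assuming a valid handle and a device vector of n floats) places alpha in device memory and switches the pointer mode accordingly; d_alpha and d_x are illustrative names, not part of the API:

float h_alpha = 2.0f;
float *d_alpha;
hipMalloc(&d_alpha, sizeof(float));
hipMemcpy(d_alpha, &h_alpha, sizeof(float), hipMemcpyHostToDevice);

// Scalars are now read from device memory.
rocblas_set_pointer_mode(handle, rocblas_pointer_mode_device);
rocblas_sscal(handle, n, d_alpha, d_x, 1);

// Restore the default so host scalars can be passed again.
rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);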

rocblas_layer_mode
enum rocblas_layer_mode

Indicates which logging layers are active, as a bitmask.

Values:

enumerator rocblas_layer_mode_none = 0b0000000000

No logging will take place.

enumerator rocblas_layer_mode_log_trace = 0b0000000001

A line containing the function name and value of arguments passed will be printed with each rocBLAS function call.

enumerator rocblas_layer_mode_log_bench = 0b0000000010

Outputs a line each time a rocBLAS function is called; this line can be used with rocblas-bench to make the same call again.

enumerator rocblas_layer_mode_log_profile = 0b0000000100

Outputs a YAML description of each rocBLAS function called, along with its arguments and number of times it was called.
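
As a sketch, and under the assumption that these bits are supplied through the ROCBLAS_LAYER environment variable, which rocBLAS reads when logging is initialized (as documented for recent releases), trace and bench logging can be enabled programmatically before the handle is created:

#include <stdlib.h>

// 0b01 (trace) | 0b10 (bench) = 3; set before rocblas_create_handle so the
// logging layers are picked up when the handle initializes logging.
setenv("ROCBLAS_LAYER", "3", 1);

rocblas_handle handle;
rocblas_create_handle(&handle);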

rocblas_gemm_algo
enum rocblas_gemm_algo

Indicates the GEMM algorithm to use.

Values:

enumerator rocblas_gemm_algo_standard = 0b0000000000

Functions

Level 1 BLAS
rocblas_<type>scal()
rocblas_status rocblas_dscal(rocblas_handle handle, rocblas_int n, const double *alpha, double *x, rocblas_int incx)
rocblas_status rocblas_sscal(rocblas_handle handle, rocblas_int n, const float *alpha, float *x, rocblas_int incx)

BLAS Level 1 API.

scal scales each element of vector x with scalar alpha.

x := alpha * x

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x.

  • [in] alpha: device pointer or host pointer for the scalar alpha.

  • [inout] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

rocblas_status rocblas_cscal(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, rocblas_float_complex *x, rocblas_int incx)
rocblas_status rocblas_zscal(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, rocblas_double_complex *x, rocblas_int incx)
rocblas_status rocblas_csscal(rocblas_handle handle, rocblas_int n, const float *alpha, rocblas_float_complex *x, rocblas_int incx)
rocblas_status rocblas_zdscal(rocblas_handle handle, rocblas_int n, const double *alpha, rocblas_double_complex *x, rocblas_int incx)
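
A sketch of scaling a device vector in place (handle and a device buffer d_x of n floats are assumed to exist; d_x is an illustrative name):

const float alpha = 0.5f;

// x := 0.5 * x, operating directly on the device buffer (host pointer mode).
rocblas_status status = rocblas_sscal(handle, n, &alpha, d_x, 1);
if(status != rocblas_status_success)
{
    // report the error
}
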
rocblas_<type>scal_batched()
rocblas_status rocblas_sscal_batched(rocblas_handle handle, rocblas_int n, const float *alpha, float *const x[], rocblas_int incx, rocblas_int batch_count)

BLAS Level 1 API.

scal_batched scales each element of vector x_i with scalar alpha, for i = 1, … , batch_count.

 x_i := alpha * x_i

where (x_i) is the i-th instance of the batch.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i.

  • [in] alpha: host pointer or device pointer for the scalar alpha.

  • [inout] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [in] batch_count: [rocblas_int] specifies the number of batches in x.

rocblas_status rocblas_dscal_batched(rocblas_handle handle, rocblas_int n, const double *alpha, double *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_cscal_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_zscal_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_csscal_batched(rocblas_handle handle, rocblas_int n, const float *alpha, rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_zdscal_batched(rocblas_handle handle, rocblas_int n, const double *alpha, rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count)
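
The batched form expects a device array of device pointers. A sketch of setting that up (assuming batch_count device vectors of n floats whose device addresses are held in the host array h_xptr; all names are illustrative):

// h_xptr[i] holds the device address of vector x_i.
std::vector<float*> h_xptr(batch_count);
// ... fill h_xptr with previously allocated device buffers ...

// Copy the pointer array itself to the device.
float **d_xptr;
hipMalloc(&d_xptr, sizeof(float*) * batch_count);
hipMemcpy(d_xptr, h_xptr.data(), sizeof(float*) * batch_count, hipMemcpyHostToDevice);

const float alpha = 2.0f;
rocblas_sscal_batched(handle, n, &alpha, d_xptr, 1, batch_count);
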
rocblas_<type>scal_strided_batched()
rocblas_status rocblas_sscal_strided_batched(rocblas_handle handle, rocblas_int n, const float *alpha, float *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)

BLAS Level 1 API.

scal_strided_batched scales each element of vector x_i with scalar alpha, for i = 1, … , batch_count.

 x_i := alpha * x_i ,

where (x_i) is the i-th instance of the batch.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i.

  • [in] alpha: host pointer or device pointer for the scalar alpha.

  • [inout] x: device pointer to the first vector (x_1) in the batch.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [in] stride_x: [rocblas_stride] stride from the start of one vector (x_i) to the start of the next one (x_i+1). There are no restrictions placed on stride_x; however, the user should take care to ensure that stride_x is of an appropriate size. For a typical case this means stride_x >= n * incx.

  • [in] batch_count: [rocblas_int] specifies the number of batches in x.

rocblas_status rocblas_dscal_strided_batched(rocblas_handle handle, rocblas_int n, const double *alpha, double *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_cscal_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_zscal_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_csscal_strided_batched(rocblas_handle handle, rocblas_int n, const float *alpha, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_zdscal_strided_batched(rocblas_handle handle, rocblas_int n, const double *alpha, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
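
With the strided form all vectors live in one allocation. A sketch assuming packed vectors, i.e. stride_x = n * incx (names are illustrative):

const rocblas_int    incx     = 1;
const rocblas_stride stride_x = rocblas_stride(n) * incx;   // contiguous, packed vectors

float *d_x;   // holds batch_count vectors back to back
hipMalloc(&d_x, sizeof(float) * stride_x * batch_count);
// ... copy the batch of vectors into d_x ...

const float alpha = 2.0f;
rocblas_sscal_strided_batched(handle, n, &alpha, d_x, incx, stride_x, batch_count);
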
rocblas_<type>copy()
rocblas_status rocblas_dcopy(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *y, rocblas_int incy)
rocblas_status rocblas_scopy(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *y, rocblas_int incy)

BLAS Level 1 API.

copy copies each element x[i] into y[i], for i = 1 , … , n

y := x,

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x to be copied to y.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [out] y: device pointer storing vector y.

  • [in] incy: [rocblas_int] specifies the increment for the elements of y.

rocblas_status rocblas_ccopy(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *y, rocblas_int incy)
rocblas_status rocblas_zcopy(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *y, rocblas_int incy)
rocblas_<type>copy_batched()
rocblas_status rocblas_scopy_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, float *const y[], rocblas_int incy, rocblas_int batch_count)

BLAS Level 1 API.

copy_batched copies each element x_i[j] into y_i[j], for j = 1 , … , n; i = 1 , … , batch_count

y_i := x_i,

where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i to be copied to y_i.

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each vector x_i.

  • [out] y: device array of device pointers storing each vector y_i.

  • [in] incy: [rocblas_int] specifies the increment for the elements of each vector y_i.

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dcopy_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, double *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_ccopy_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_zcopy_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_<type>copy_strided_batched()
rocblas_status rocblas_scopy_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, float *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)

BLAS Level 1 API.

copy_strided_batched copies each element x_i[j] into y_i[j], for j = 1 , … , n; i = 1 , … , batch_count

y_i := x_i,

where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i to be copied to y_i.

  • [in] x: device pointer to the first vector (x_1) in the batch.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each vector x_i.

  • [in] stridex: [rocblas_stride] stride from the start of one vector (x_i) to the start of the next one (x_i+1). There are no restrictions placed on stride_x; however, the user should take care to ensure that stride_x is of an appropriate size. For a typical case this means stride_x >= n * incx.

  • [out] y: device pointer to the first vector (y_1) in the batch.

  • [in] incy: [rocblas_int] specifies the increment for the elements of vectors y_i.

  • [in] stridey: [rocblas_stride] stride from the start of one vector (y_i) to the start of the next one (y_i+1). There are no restrictions placed on stride_y; however, the user should take care to ensure that stride_y is of an appropriate size. For a typical case this means stride_y >= n * incy. stridey should be non-zero.

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dcopy_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, double *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_ccopy_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_zcopy_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_<type>dot()
rocblas_status rocblas_ddot(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *result)
rocblas_status rocblas_sdot(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *result)

BLAS Level 1 API.

dot(u) performs the dot product of vectors x and y

result = x * y;

dotc performs the dot product of the conjugate of complex vector x and complex vector y

result = conjugate (x) * y;

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x and y.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [in] y: device pointer storing vector y.

  • [in] incy: [rocblas_int] specifies the increment for the elements of y.

  • [inout] result: device pointer or host pointer to store the dot product. return is 0.0 if n <= 0.

rocblas_status rocblas_hdot(rocblas_handle handle, rocblas_int n, const rocblas_half *x, rocblas_int incx, const rocblas_half *y, rocblas_int incy, rocblas_half *result)
rocblas_status rocblas_bfdot(rocblas_handle handle, rocblas_int n, const rocblas_bfloat16 *x, rocblas_int incx, const rocblas_bfloat16 *y, rocblas_int incy, rocblas_bfloat16 *result)
rocblas_status rocblas_cdotu(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, const rocblas_float_complex *y, rocblas_int incy, rocblas_float_complex *result)
rocblas_status rocblas_cdotc(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, const rocblas_float_complex *y, rocblas_int incy, rocblas_float_complex *result)
rocblas_status rocblas_zdotu(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, const rocblas_double_complex *y, rocblas_int incy, rocblas_double_complex *result)
rocblas_status rocblas_zdotc(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, const rocblas_double_complex *y, rocblas_int incy, rocblas_double_complex *result)
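
A sketch of a single dot product with the result returned to the host (default host pointer mode; d_x and d_y are illustrative device vectors of n floats):

float result = 0.0f;

// With a host result pointer the call blocks until result is valid on the host.
rocblas_sdot(handle, n, d_x, 1, d_y, 1, &result);
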
rocblas_<type>dot_batched()
rocblas_status rocblas_sdot_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, const float *const y[], rocblas_int incy, rocblas_int batch_count, float *result)

BLAS Level 1 API.

dot_batched(u) performs a batch of dot products of vectors x and y

result_i = x_i * y_i;

dotc_batched performs a batch of dot products of the conjugate of complex vector x and complex vector y

result_i = conjugate (x_i) * y_i;

where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors, for i = 1, …, batch_count

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i and y_i.

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [in] y: device array of device pointers storing each vector y_i.

  • [in] incy: [rocblas_int] specifies the increment for the elements of each y_i.

  • [in] batch_count: [rocblas_int] number of instances in the batch

  • [inout] result: device array or host array of batch_count size to store the dot products of each batch. return 0.0 for each element if n <= 0.

rocblas_status rocblas_ddot_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, const double *const y[], rocblas_int incy, rocblas_int batch_count, double *result)
rocblas_status rocblas_hdot_batched(rocblas_handle handle, rocblas_int n, const rocblas_half *const x[], rocblas_int incx, const rocblas_half *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_half *result)
rocblas_status rocblas_bfdot_batched(rocblas_handle handle, rocblas_int n, const rocblas_bfloat16 *const x[], rocblas_int incx, const rocblas_bfloat16 *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_bfloat16 *result)
rocblas_status rocblas_cdotu_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, const rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_float_complex *result)
rocblas_status rocblas_cdotc_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, const rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_float_complex *result)
rocblas_status rocblas_zdotu_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, const rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_double_complex *result)
rocblas_status rocblas_zdotc_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, const rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_double_complex *result)
rocblas_<type>dot_strided_batched()
rocblas_status rocblas_sdot_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, const float *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, float *result)

BLAS Level 1 API.

dot_strided_batched(u) performs a batch of dot products of vectors x and y

result_i = x_i * y_i;

dotc_strided_batched performs a batch of dot products of the conjugate of complex vector x and complex vector y

result_i = conjugate (x_i) * y_i;

where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors, for i = 1, …, batch_count

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i and y_i.

  • [in] x: device pointer to the first vector (x_1) in the batch.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [in] stridex: [rocblas_stride] stride from the start of one vector (x_i) to the start of the next one (x_i+1)

  • [in] y: device pointer to the first vector (y_1) in the batch.

  • [in] incy: [rocblas_int] specifies the increment for the elements of each y_i.

  • [in] stridey: [rocblas_stride] stride from the start of one vector (y_i) to the start of the next one (y_i+1)

  • [in] batch_count: [rocblas_int] number of instances in the batch

  • [inout] result: device array or host array of batch_count size to store the dot products of each batch. return 0.0 for each element if n <= 0.

rocblas_status rocblas_ddot_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, const double *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, double *result)
rocblas_status rocblas_hdot_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_half *x, rocblas_int incx, rocblas_stride stridex, const rocblas_half *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_half *result)
rocblas_status rocblas_bfdot_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_bfloat16 *x, rocblas_int incx, rocblas_stride stridex, const rocblas_bfloat16 *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_bfloat16 *result)
rocblas_status rocblas_cdotu_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_float_complex *result)
rocblas_status rocblas_cdotc_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_float_complex *result)
rocblas_status rocblas_zdotu_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_double_complex *result)
rocblas_status rocblas_zdotc_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_double_complex *result)
rocblas_<type>swap()
rocblas_status rocblas_sswap(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, float *y, rocblas_int incy)

BLAS Level 1 API.

swap interchanges vectors x and y.

y := x; x := y

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x and y.

  • [inout] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [inout] y: device pointer storing vector y.

  • [in] incy: [rocblas_int] specifies the increment for the elements of y.

rocblas_status rocblas_dswap(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, double *y, rocblas_int incy)
rocblas_status rocblas_cswap(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *y, rocblas_int incy)
rocblas_status rocblas_zswap(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *y, rocblas_int incy)
rocblas_<type>swap_batched()
rocblas_status rocblas_sswap_batched(rocblas_handle handle, rocblas_int n, float *x[], rocblas_int incx, float *y[], rocblas_int incy, rocblas_int batch_count)

BLAS Level 1 API.

swap_batched interchanges vectors x_i and y_i, for i = 1 , … , batch_count

y_i := x_i; x_i := y_i

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i and y_i.

  • [inout] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [inout] y: device array of device pointers storing each vector y_i.

  • [in] incy: [rocblas_int] specifies the increment for the elements of each y_i.

  • [in] batch_count: [rocblas_int] number of instances in the batch.

rocblas_status rocblas_dswap_batched(rocblas_handle handle, rocblas_int n, double *x[], rocblas_int incx, double *y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_cswap_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x[], rocblas_int incx, rocblas_float_complex *y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_zswap_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x[], rocblas_int incx, rocblas_double_complex *y[], rocblas_int incy, rocblas_int batch_count)
rocblas_<type>swap_strided_batched()
rocblas_status rocblas_sswap_strided_batched(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, rocblas_stride stridex, float *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)

BLAS Level 1 API.

swap_strided_batched interchanges vectors x_i and y_i, for i = 1 , … , batch_count

y_i := x_i; x_i := y_i

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i and y_i.

  • [inout] x: device pointer to the first vector x_1.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [in] stridex: [rocblas_stride] stride from the start of one vector (x_i) to the start of the next one (x_i+1). There are no restrictions placed on stride_x; however, the user should take care to ensure that stride_x is of an appropriate size. For a typical case this means stride_x >= n * incx.

  • [inout] y: device pointer to the first vector y_1.

  • [in] incy: [rocblas_int] specifies the increment for the elements of y.

  • [in] stridey: [rocblas_stride] stride from the start of one vector (y_i) to the start of the next one (y_i+1). There are no restrictions placed on stride_y; however, the user should take care to ensure that stride_y is of an appropriate size. For a typical case this means stride_y >= n * incy. stridey should be non-zero.

  • [in] batch_count: [rocblas_int] number of instances in the batch.

rocblas_status rocblas_dswap_strided_batched(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, rocblas_stride stridex, double *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_cswap_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_zswap_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_<type>axpy()
rocblas_status rocblas_daxpy(rocblas_handle handle, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, double *y, rocblas_int incy)
rocblas_status rocblas_saxpy(rocblas_handle handle, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, float *y, rocblas_int incy)

BLAS Level 1 API.

axpy computes constant alpha multiplied by vector x, plus vector y

y := alpha * x + y

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x and y.

  • [in] alpha: device pointer or host pointer to specify the scalar alpha.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [inout] y: device pointer storing vector y.

  • [in] incy: [rocblas_int] specifies the increment for the elements of y.

rocblas_status rocblas_haxpy(rocblas_handle handle, rocblas_int n, const rocblas_half *alpha, const rocblas_half *x, rocblas_int incx, rocblas_half *y, rocblas_int incy)
rocblas_status rocblas_caxpy(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *y, rocblas_int incy)
rocblas_status rocblas_zaxpy(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *y, rocblas_int incy)
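
A sketch of saxpy on device vectors d_x and d_y of n floats (illustrative names; host pointer mode):

const float alpha = 3.0f;

// y := 3.0 * x + y
rocblas_saxpy(handle, n, &alpha, d_x, 1, d_y, 1);
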
rocblas_<type>asum()
rocblas_status rocblas_dasum(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)
rocblas_status rocblas_sasum(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result)

BLAS Level 1 API.

asum computes the sum of the magnitudes of elements of a real vector x, or the sum of magnitudes of the real and imaginary parts of elements if x is a complex vector

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x and y.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x. incx must be > 0.

  • [inout] result: device pointer or host pointer to store the asum result. return is 0.0 if n <= 0.

rocblas_status rocblas_scasum(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, float *result)
rocblas_status rocblas_dzasum(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, double *result)
rocblas_<type>asum_batched()
rocblas_status rocblas_sasum_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, rocblas_int batch_count, float *results)

BLAS Level 1 API.

asum_batched computes the sum of the magnitudes of the elements in a batch of real vectors x_i, or the sum of magnitudes of the real and imaginary parts of elements if x_i is a complex vector, for i = 1, …, batch_count

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in each vector x_i

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • [out] results: device array or host array of batch_count size for results. return is 0.0 if n, incx<=0.

  • [in] batch_count: [rocblas_int] number of instances in the batch.

rocblas_status rocblas_dasum_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, rocblas_int batch_count, double *results)
rocblas_status rocblas_scasum_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count, float *results)
rocblas_status rocblas_dzasum_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count, double *results)
rocblas_<type>asum_strided_batched()
rocblas_status rocblas_sasum_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, float *results)

BLAS Level 1 API.

asum_strided_batched computes the sum of the magnitudes of elements of a real vectors x_i, or the sum of magnitudes of the real and imaginary parts of elements if x_i is a complex vector, for i = 1, …, batch_count

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in each vector x_i

  • [in] x: device pointer to the first vector x_1.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • [in] stridex: [rocblas_stride] stride from the start of one vector (x_i) to the start of the next one (x_i+1). There are no restrictions placed on stride_x; however, the user should take care to ensure that stride_x is of an appropriate size. For a typical case this means stride_x >= n * incx.

  • [out] results: device pointer or host pointer to array for storing contiguous batch_count results. return is 0.0 if n, incx<=0.

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dasum_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, double *results)
rocblas_status rocblas_scasum_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, float *results)
rocblas_status rocblas_dzasum_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, double *results)
rocblas_<type>nrm2()
rocblas_status rocblas_dnrm2(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)
rocblas_status rocblas_snrm2(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result)

BLAS Level 1 API.

nrm2 computes the euclidean norm of a real or complex vector

      result := sqrt( x'*x ) for real vectors
      result := sqrt( x**H*x ) for complex vectors

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [inout] result: device pointer or host pointer to store the nrm2 result. return is 0.0 if n, incx<=0.

rocblas_status rocblas_scnrm2(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, float *result)
rocblas_status rocblas_dznrm2(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, double *result)
rocblas_<type>nrm2_batched()
rocblas_status rocblas_snrm2_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, rocblas_int batch_count, float *results)

BLAS Level 1 API.

nrm2_batched computes the euclidean norm over a batch of real or complex vectors

      result := sqrt( x_i'*x_i ) for real vectors x, for i = 1, ..., batch_count
      result := sqrt( x_i**H*x_i ) for complex vectors x, for i = 1, ..., batch_count

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in each x_i.

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • [in] batch_count: [rocblas_int] number of instances in the batch

  • [out] results: device pointer or host pointer to array of batch_count size for nrm2 results. return is 0.0 for each element if n <= 0, incx<=0.

rocblas_status rocblas_dnrm2_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, rocblas_int batch_count, double *results)
rocblas_status rocblas_scnrm2_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count, float *results)
rocblas_status rocblas_dznrm2_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count, double *results)
rocblas_<type>nrm2_strided_batched()
rocblas_status rocblas_snrm2_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, float *results)

BLAS Level 1 API.

nrm2_strided_batched computes the euclidean norm over a batch of real or complex vectors

      result := sqrt( x_i'*x_i ) for real vectors x, for i = 1, ..., batch_count
      result := sqrt( x_i**H*x_i ) for complex vectors, for i = 1, ..., batch_count

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in each x_i.

  • [in] x: device pointer to the first vector x_1.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • [in] stridex: [rocblas_stride] stride from the start of one vector (x_i) to the start of the next one (x_i+1). There are no restrictions placed on stride_x; however, the user should take care to ensure that stride_x is of an appropriate size. For a typical case this means stride_x >= n * incx.

  • [in] batch_count: [rocblas_int] number of instances in the batch

  • [out] results: device pointer or host pointer to array for storing contiguous batch_count results. return is 0.0 for each element if n <= 0, incx<=0.

rocblas_status rocblas_dnrm2_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, double *results)
rocblas_status rocblas_scnrm2_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, float *results)
rocblas_status rocblas_dznrm2_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, double *results)
rocblas_i<type>amax()
rocblas_status rocblas_idamax(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)
rocblas_status rocblas_isamax(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result)

BLAS Level 1 API.

amax finds the first index of the element of maximum magnitude of a vector x.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [inout] result: device pointer or host pointer to store the amax index. return is 0 if n, incx<=0.

rocblas_status rocblas_icamax(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_int *result)
rocblas_status rocblas_izamax(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_int *result)
rocblas_i<type>amax_batched()
rocblas_status rocblas_isamax_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)

BLAS Level 1 API.

amax_batched finds the first index of the element of maximum magnitude of each vector x_i in a batch, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in each vector x_i

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • [in] batch_count: [rocblas_int] number of instances in the batch, must be > 0.

  • [out] result: device pointer or host pointer to an array of batch_count size for results. return is 0 if n, incx<=0.

rocblas_status rocblas_idamax_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_icamax_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_izamax_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_i<type>amax_strided_batched()
rocblas_status rocblas_isamax_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)

BLAS Level 1 API.

amax_strided_batched finds the first index of the element of maximum magnitude of each vector x_i in a batch, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in each vector x_i

  • [in] x: device pointer to the first vector x_1.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • [in] stridex: [rocblas_stride] specifies the pointer increment between one x_i and the next x_(i + 1).

  • [in] batch_count: [rocblas_int] number of instances in the batch

  • [out] result: device or host pointer for storing contiguous batch_count results. return is 0 if n <= 0, incx<=0.

rocblas_status rocblas_idamax_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_icamax_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_izamax_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_i<type>amin()
rocblas_status rocblas_idamin(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)
rocblas_status rocblas_isamin(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result)

BLAS Level 1 API.

amin finds the first index of the element of minimum magnitude of a vector x.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [inout] result: device pointer or host pointer to store the amin index. return is 0 if n, incx<=0.

rocblas_status rocblas_icamin(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_int *result)
rocblas_status rocblas_izamin(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_int *result)
rocblas_i<type>amin_batched()
rocblas_status rocblas_isamin_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)

BLAS Level 1 API.

amin_batched finds the first index of the element of minimum magnitude of each vector x_i in a batch, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in each vector x_i

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • [in] batch_count: [rocblas_int] number of instances in the batch, must be > 0.

  • [out] result: device pointer or host pointer to an array of batch_count size for results. return is 0 if n, incx<=0.

rocblas_status rocblas_idamin_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_icamin_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_izamin_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_i<type>amin_strided_batched()
rocblas_status rocblas_isamin_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)

BLAS Level 1 API.

amin_strided_batched finds the first index of the element of minimum magnitude of each vector x_i in a batch, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in each vector x_i

  • [in] x: device pointer to the first vector x_1.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • [in] stridex: [rocblas_stride] specifies the pointer increment between one x_i and the next x_(i + 1)

  • [in] batch_count: [rocblas_int] number of instances in the batch

  • [out] result: device or host pointer to array for storing contiguous batch_count results. return is 0 if n <= 0, incx<=0.

rocblas_status rocblas_idamin_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_icamin_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_izamin_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_<type>rot()
rocblas_status rocblas_srot(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, float *y, rocblas_int incy, const float *c, const float *s)

BLAS Level 1 API.

rot applies the Givens rotation matrix defined by c=cos(alpha) and s=sin(alpha) to vectors x and y. Scalars c and s may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in the x and y vectors.

  • [inout] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment between elements of x.

  • [inout] y: device pointer storing vector y.

  • [in] incy: [rocblas_int] specifies the increment between elements of y.

  • [in] c: device pointer or host pointer storing scalar cosine component of the rotation matrix.

  • [in] s: device pointer or host pointer storing scalar sine component of the rotation matrix.

rocblas_status rocblas_drot(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, double *y, rocblas_int incy, const double *c, const double *s)
rocblas_status rocblas_crot(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *y, rocblas_int incy, const float *c, const rocblas_float_complex *s)
rocblas_status rocblas_csrot(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *y, rocblas_int incy, const float *c, const float *s)
rocblas_status rocblas_zrot(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *y, rocblas_int incy, const double *c, const rocblas_double_complex *s)
rocblas_status rocblas_zdrot(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *y, rocblas_int incy, const double *c, const double *s)
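
A sketch applying a rotation with host-resident c and s (default host pointer mode) to illustrative device vectors d_x and d_y of n floats:

const float c = 0.8f;   // cos(alpha)
const float s = 0.6f;   // sin(alpha)

rocblas_srot(handle, n, d_x, 1, d_y, 1, &c, &s);
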
rocblas_<type>rot_batched()
rocblas_status rocblas_srot_batched(rocblas_handle handle, rocblas_int n, float *const x[], rocblas_int incx, float *const y[], rocblas_int incy, const float *c, const float *s, rocblas_int batch_count)

BLAS Level 1 API.

rot_batched applies the Givens rotation matrix defined by c=cos(alpha) and s=sin(alpha) to batched vectors x_i and y_i, for i = 1, …, batch_count. Scalars c and s may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in each x_i and y_i vectors.

  • [inout] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment between elements of each x_i.

  • [inout] y: device array of device pointers storing each vector y_i.

  • [in] incy: [rocblas_int] specifies the increment between elements of each y_i.

  • [in] c: device pointer or host pointer to scalar cosine component of the rotation matrix.

  • [in] s: device pointer or host pointer to scalar sine component of the rotation matrix.

  • [in] batch_count: [rocblas_int] the number of x and y arrays, i.e. the number of batches.

rocblas_status rocblas_drot_batched(rocblas_handle handle, rocblas_int n, double *const x[], rocblas_int incx, double *const y[], rocblas_int incy, const double *c, const double *s, rocblas_int batch_count)
rocblas_status rocblas_crot_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const y[], rocblas_int incy, const float *c, const rocblas_float_complex *s, rocblas_int batch_count)
rocblas_status rocblas_csrot_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const y[], rocblas_int incy, const float *c, const float *s, rocblas_int batch_count)
rocblas_status rocblas_zrot_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const y[], rocblas_int incy, const double *c, const rocblas_double_complex *s, rocblas_int batch_count)
rocblas_status rocblas_zdrot_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const y[], rocblas_int incy, const double *c, const double *s, rocblas_int batch_count)
rocblas_<type>rot_strided_batched()
rocblas_status rocblas_srot_strided_batched(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, rocblas_stride stride_x, float *y, rocblas_int incy, rocblas_stride stride_y, const float *c, const float *s, rocblas_int batch_count)

BLAS Level 1 API.

rot_strided_batched applies the Givens rotation matrix defined by c=cos(alpha) and s=sin(alpha) to strided batched vectors x_i and y_i, for i = 1, …, batch_count. Scalars c and s may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in each x_i and y_i vectors.

  • [inout] x: device pointer to the first vector x_1.

  • [in] incx: [rocblas_int] specifies the increment between elements of each x_i.

  • [in] stride_x: [rocblas_stride] specifies the increment from the beginning of x_i to the beginning of x_(i+1)

  • [inout] y: device pointer to the first vector y_1.

  • [in] incy: [rocblas_int] specifies the increment between elements of each y_i.

  • [in] stride_y: [rocblas_stride] specifies the increment from the beginning of y_i to the beginning of y_(i+1)

  • [in] c: device pointer or host pointer to scalar cosine component of the rotation matrix.

  • [in] s: device pointer or host pointer to scalar sine component of the rotation matrix.

  • [in] batch_count: [rocblas_int] the number of x and y arrays, i.e. the number of batches.

rocblas_status rocblas_drot_strided_batched(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, rocblas_stride stride_x, double *y, rocblas_int incy, rocblas_stride stride_y, const double *c, const double *s, rocblas_int batch_count)
rocblas_status rocblas_crot_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stride_y, const float *c, const rocblas_float_complex *s, rocblas_int batch_count)
rocblas_status rocblas_csrot_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stride_y, const float *c, const float *s, rocblas_int batch_count)
rocblas_status rocblas_zrot_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stride_y, const double *c, const rocblas_double_complex *s, rocblas_int batch_count)
rocblas_status rocblas_zdrot_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stride_y, const double *c, const double *s, rocblas_int batch_count)
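
A strided batch needs only one contiguous buffer per operand. The sketch below (assumptions: rocblas.h include path, packed stride of n elements, default host pointer mode, no error checking) calls rocblas_srot_strided_batched on three packed vector pairs.

#include <hip/hip_runtime.h>
#include <rocblas.h>
#include <vector>

int main()
{
    const rocblas_int n = 4, incx = 1, incy = 1, batch_count = 3;
    const rocblas_stride stride_x = n, stride_y = n;   // batches packed back to back
    const float c = 1.0f, s = 0.0f;                    // identity rotation, just to exercise the call

    rocblas_handle handle;
    rocblas_create_handle(&handle);                    // default pointer mode is host

    std::vector<float> hx(n * batch_count, 1.0f), hy(n * batch_count, 2.0f);
    float *dx, *dy;
    hipMalloc((void**)&dx, hx.size() * sizeof(float));
    hipMalloc((void**)&dy, hy.size() * sizeof(float));
    hipMemcpy(dx, hx.data(), hx.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), hy.size() * sizeof(float), hipMemcpyHostToDevice);

    rocblas_srot_strided_batched(handle, n, dx, incx, stride_x,
                                 dy, incy, stride_y, &c, &s, batch_count);

    hipMemcpy(hx.data(), dx, hx.size() * sizeof(float), hipMemcpyDeviceToHost);  // blocking copy
    hipMemcpy(hy.data(), dy, hy.size() * sizeof(float), hipMemcpyDeviceToHost);

    hipFree(dx); hipFree(dy);
    rocblas_destroy_handle(handle);
    return 0;
}
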
rocblas_<type>rotg()
rocblas_status rocblas_srotg(rocblas_handle handle, float *a, float *b, float *c, float *s)

BLAS Level 1 API.

rotg creates the Givens rotation matrix for the vector (a b). Scalars c and s and arrays a and b may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode. If the pointer mode is set to rocblas_pointer_mode_host, this function blocks the CPU until the GPU has finished and the results are available in host memory. If the pointer mode is set to rocblas_pointer_mode_device, this function returns immediately and synchronization is required to read the results.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [inout] a: device pointer or host pointer to input vector element, overwritten with r.

  • [inout] b: device pointer or host pointer to input vector element, overwritten with z.

  • [inout] c: device pointer or host pointer to cosine element of Givens rotation.

  • [inout] s: device pointer or host pointer to sine element of Givens rotation.

rocblas_status rocblas_drotg(rocblas_handle handle, double *a, double *b, double *c, double *s)
rocblas_status rocblas_crotg(rocblas_handle handle, rocblas_float_complex *a, rocblas_float_complex *b, float *c, rocblas_float_complex *s)
rocblas_status rocblas_zrotg(rocblas_handle handle, rocblas_double_complex *a, rocblas_double_complex *b, double *c, rocblas_double_complex *s)
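
Because rotg can run entirely with host pointers, a tiny host-pointer-mode sketch is enough to see the blocking behaviour described above. The values chosen here are arbitrary, the include path is an assumption, and error checking is omitted.

#include <rocblas.h>
#include <cstdio>

int main()
{
    rocblas_handle handle;
    rocblas_create_handle(&handle);
    rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);  // blocks until results are ready

    float a = 3.0f, b = 4.0f, c = 0.0f, s = 0.0f;
    rocblas_srotg(handle, &a, &b, &c, &s);   // a is overwritten with r, b with z

    // For (3, 4) the reference BLAS definition gives r = 5, c = 0.6, s = 0.8.
    printf("r = %f  z = %f  c = %f  s = %f\n", a, b, c, s);

    rocblas_destroy_handle(handle);
    return 0;
}
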
rocblas_<type>rotg_batched()
rocblas_status rocblas_srotg_batched(rocblas_handle handle, float *const a[], float *const b[], float *const c[], float *const s[], rocblas_int batch_count)

BLAS Level 1 API.

rotg_batched creates the Givens rotation matrix for the batched vectors (a_i b_i), for i = 1, …, batch_count. a, b, c, and s may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode. If the pointer mode is set to rocblas_pointer_mode_host, this function blocks the CPU until the GPU has finished and the results are available in host memory. If the pointer mode is set to rocblas_pointer_mode_device, this function returns immediately and synchronization is required to read the results.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [inout] a: device array of device pointers storing each single input vector element a_i, overwritten with r_i.

  • [inout] b: device array of device pointers storing each single input vector element b_i, overwritten with z_i.

  • [inout] c: device array of device pointers storing each cosine element of Givens rotation for the batch.

  • [inout] s: device array of device pointers storing each sine element of Givens rotation for the batch.

  • [in] batch_count: [rocblas_int] number of batches (length of arrays a, b, c, and s).

rocblas_status rocblas_drotg_batched(rocblas_handle handle, double *const a[], double *const b[], double *const c[], double *const s[], rocblas_int batch_count)
rocblas_status rocblas_crotg_batched(rocblas_handle handle, rocblas_float_complex *const a[], rocblas_float_complex *const b[], float *const c[], rocblas_float_complex *const s[], rocblas_int batch_count)
rocblas_status rocblas_zrotg_batched(rocblas_handle handle, rocblas_double_complex *const a[], rocblas_double_complex *const b[], double *const c[], rocblas_double_complex *const s[], rocblas_int batch_count)
rocblas_<type>rotg_strided_batched()
rocblas_status rocblas_srotg_strided_batched(rocblas_handle handle, float *a, rocblas_stride stride_a, float *b, rocblas_stride stride_b, float *c, rocblas_stride stride_c, float *s, rocblas_stride stride_s, rocblas_int batch_count)

BLAS Level 1 API.

rotg_strided_batched creates the Givens rotation matrix for the strided batched vectors (a_i b_i), for i = 1, …, batch_count. a, b, c, and s may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode. If the pointer mode is set to rocblas_pointer_mode_host, this function blocks the CPU until the GPU has finished and the results are available in host memory. If the pointer mode is set to rocblas_pointer_mode_device, this function returns immediately and synchronization is required to read the results.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [inout] a: device strided_batched pointer or host strided_batched pointer to first single input vector element a_1, overwritten with r.

  • [in] stride_a: [rocblas_stride] distance between elements of a in batch (distance between a_i and a_(i + 1))

  • [inout] b: device strided_batched pointer or host strided_batched pointer to first single input vector element b_1, overwritten with z.

  • [in] stride_b: [rocblas_stride] distance between elements of b in batch (distance between b_i and b_(i + 1))

  • [inout] c: device strided_batched pointer or host strided_batched pointer to first cosine element of Givens rotations c_1.

  • [in] stride_c: [rocblas_stride] distance between elements of c in batch (distance between c_i and c_(i + 1))

  • [inout] s: device strided_batched pointer or host strided_batched pointer to sine element of Givens rotations s_1.

  • [in] stride_s: [rocblas_stride] distance between elements of s in batch (distance between s_i and s_(i + 1))

  • [in] batch_count: [rocblas_int] number of batches (length of arrays a, b, c, and s).

rocblas_status rocblas_drotg_strided_batched(rocblas_handle handle, double *a, rocblas_stride stride_a, double *b, rocblas_stride stride_b, double *c, rocblas_stride stride_c, double *s, rocblas_stride stride_s, rocblas_int batch_count)
rocblas_status rocblas_crotg_strided_batched(rocblas_handle handle, rocblas_float_complex *a, rocblas_stride stride_a, rocblas_float_complex *b, rocblas_stride stride_b, float *c, rocblas_stride stride_c, rocblas_float_complex *s, rocblas_stride stride_s, rocblas_int batch_count)
rocblas_status rocblas_zrotg_strided_batched(rocblas_handle handle, rocblas_double_complex *a, rocblas_stride stride_a, rocblas_double_complex *b, rocblas_stride stride_b, double *c, rocblas_stride stride_c, rocblas_double_complex *s, rocblas_stride stride_s, rocblas_int batch_count)
rocblas_<type>rotm()
rocblas_status rocblas_srotm(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, float *y, rocblas_int incy, const float *param)

BLAS Level 1 API.

rotm applies the modified Givens rotation matrix defined by param to vectors x and y.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in the x and y vectors.

  • [inout] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment between elements of x.

  • [inout] y: device pointer storing vector y.

  • [in] incy: [rocblas_int] specifies the increment between elements of y.

  • [in] param: device vector or host vector of 5 elements defining the rotation.
    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    The flag parameter defines the form of H:
    flag = -1 => H = ( H11  H12 )
                     ( H21  H22 )
    flag =  0 => H = ( 1.0  H12 )
                     ( H21  1.0 )
    flag =  1 => H = ( H11  1.0 )
                     ( -1.0 H22 )
    flag = -2 => H = ( 1.0  0.0 )
                     ( 0.0  1.0 )
    param may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode.

rocblas_status rocblas_drotm(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, double *y, rocblas_int incy, const double *param)
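
A small sketch of rocblas_srotm follows, using flag = -1 with an identity H so the vectors pass through unchanged; the include path, sizes, and the default host pointer mode for param are assumptions of the sketch, and error checking is omitted.

#include <hip/hip_runtime.h>
#include <rocblas.h>
#include <vector>

int main()
{
    const rocblas_int n = 4, incx = 1, incy = 1;

    // flag = -1 selects the full matrix form; H11 = 1, H21 = 0, H12 = 0, H22 = 1 (identity).
    const float param[5] = {-1.0f, 1.0f, 0.0f, 0.0f, 1.0f};

    rocblas_handle handle;
    rocblas_create_handle(&handle);   // default host pointer mode lets param stay on the host

    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    float *dx, *dy;
    hipMalloc((void**)&dx, n * sizeof(float));
    hipMalloc((void**)&dy, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

    rocblas_srotm(handle, n, dx, incx, dy, incy, param);

    hipMemcpy(hx.data(), dx, n * sizeof(float), hipMemcpyDeviceToHost);
    hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(dx); hipFree(dy);
    rocblas_destroy_handle(handle);
    return 0;
}
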
rocblas_<type>rotm_batched()
rocblas_status rocblas_srotm_batched(rocblas_handle handle, rocblas_int n, float *const x[], rocblas_int incx, float *const y[], rocblas_int incy, const float *const param[], rocblas_int batch_count)

BLAS Level 1 API.

rotm_batched applies the modified Givens rotation matrix defined by param_i to batched vectors x_i and y_i, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in the x and y vectors.

  • [inout] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment between elements of each x_i.

  • [inout] y: device array of device pointers storing each vector y_i.

  • [in] incy: [rocblas_int] specifies the increment between elements of each y_i.

  • [in] param: device array of device vectors of 5 elements defining the rotation.
    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    The flag parameter defines the form of H:
    flag = -1 => H = ( H11  H12 )
                     ( H21  H22 )
    flag =  0 => H = ( 1.0  H12 )
                     ( H21  1.0 )
    flag =  1 => H = ( H11  1.0 )
                     ( -1.0 H22 )
    flag = -2 => H = ( 1.0  0.0 )
                     ( 0.0  1.0 )
    param may ONLY be stored on the device for the batched version of this function.

  • [in] batch_count: [rocblas_int] the number of x and y arrays, i.e. the number of batches.

rocblas_status rocblas_drotm_batched(rocblas_handle handle, rocblas_int n, double *const x[], rocblas_int incx, double *const y[], rocblas_int incy, const double *const param[], rocblas_int batch_count)
rocblas_<type>rotm_strided_batched()
rocblas_status rocblas_srotm_strided_batched(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, rocblas_stride stride_x, float *y, rocblas_int incy, rocblas_stride stride_y, const float *param, rocblas_stride stride_param, rocblas_int batch_count)

BLAS Level 1 API.

rotm_strided_batched applies the modified Givens rotation matrix defined by param_i to strided batched vectors x_i and y_i, for i = 1, …, batch_count

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] number of elements in the x and y vectors.

  • [inout] x: device pointer pointing to first strided batched vector x_1.

  • [in] incx: [rocblas_int] specifies the increment between elements of each x_i.

  • [in] stride_x: [rocblas_stride] specifies the increment between the beginning of x_i and x_(i + 1)

  • [inout] y: device pointer pointing to first strided batched vector y_1.

  • [in] incy: [rocblas_int] specifies the increment between elements of each y_i.

  • [in] stride_y: [rocblas_stride] specifies the increment between the beginning of y_i and y_(i + 1)

  • [in] param: device pointer pointing to first array of 5 elements defining the rotation (param_1).
    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    The flag parameter defines the form of H:
    flag = -1 => H = ( H11  H12 )
                     ( H21  H22 )
    flag =  0 => H = ( 1.0  H12 )
                     ( H21  1.0 )
    flag =  1 => H = ( H11  1.0 )
                     ( -1.0 H22 )
    flag = -2 => H = ( 1.0  0.0 )
                     ( 0.0  1.0 )
    param may ONLY be stored on the device for the strided_batched version of this function.

  • [in] stride_param: [rocblas_stride] specifies the increment between the beginning of param_i and param_(i + 1)

  • [in] batch_count: [rocblas_int] the number of x and y arrays, i.e. the number of batches.

rocblas_status rocblas_drotm_strided_batched(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, rocblas_stride stride_x, double *y, rocblas_int incy, rocblas_stride stride_y, const double *param, rocblas_stride stride_param, rocblas_int batch_count)
rocblas_<type>rotmg()
rocblas_status rocblas_srotmg(rocblas_handle handle, float *d1, float *d2, float *x1, const float *y1, float *param)

BLAS Level 1 API.

rotmg creates the modified Givens rotation matrix for the vector (d1 * x1, d2 * y1). Parameters may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode. If the pointer mode is set to rocblas_pointer_mode_host, this function blocks the CPU until the GPU has finished and the results are available in host memory. If the pointer mode is set to rocblas_pointer_mode_device, this function returns immediately and synchronization is required to read the results.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [inout] d1: device pointer or host pointer to input scalar that is overwritten.

  • [inout] d2: device pointer or host pointer to input scalar that is overwritten.

  • [inout] x1: device pointer or host pointer to input scalar that is overwritten.

  • [in] y1: device pointer or host pointer to input scalar.

  • [out] param: device vector or host vector of 5 elements defining the rotation.
    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    The flag parameter defines the form of H:
    flag = -1 => H = ( H11  H12 )
                     ( H21  H22 )
    flag =  0 => H = ( 1.0  H12 )
                     ( H21  1.0 )
    flag =  1 => H = ( H11  1.0 )
                     ( -1.0 H22 )
    flag = -2 => H = ( 1.0  0.0 )
                     ( 0.0  1.0 )
    param may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode.

rocblas_status rocblas_drotmg(rocblas_handle handle, double *d1, double *d2, double *x1, const double *y1, double *param)
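
Since rotmg works on scalars, the sketch below keeps everything in host memory via rocblas_pointer_mode_host; the chosen inputs and include path are assumptions of the sketch, and error checking is omitted.

#include <rocblas.h>
#include <cstdio>

int main()
{
    rocblas_handle handle;
    rocblas_create_handle(&handle);
    rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);  // results land in host memory

    float d1 = 1.0f, d2 = 1.0f, x1 = 3.0f;
    const float y1 = 4.0f;
    float param[5] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f};

    rocblas_srotmg(handle, &d1, &d2, &x1, &y1, param);  // d1, d2 and x1 are overwritten

    printf("flag = %f  d1 = %f  d2 = %f  x1 = %f\n", param[0], d1, d2, x1);

    rocblas_destroy_handle(handle);
    return 0;
}
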
rocblas_<type>rotmg_batched()
rocblas_status rocblas_srotmg_batched(rocblas_handle handle, float *const d1[], float *const d2[], float *const x1[], const float *const y1[], float *const param[], rocblas_int batch_count)

BLAS Level 1 API.

rotmg_batched creates the modified Givens rotation matrix for the batched vectors (d1_i * x1_i, d2_i * y1_i), for i = 1, …, batch_count. Parameters may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode. If the pointer mode is set to rocblas_pointer_mode_host, this function blocks the CPU until the GPU has finished and the results are available in host memory. If the pointer mode is set to rocblas_pointer_mode_device, this function returns immediately and synchronization is required to read the results.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [inout] d1: device batched array or host batched array of input scalars that is overwritten.

  • [inout] d2: device batched array or host batched array of input scalars that is overwritten.

  • [inout] x1: device batched array or host batched array of input scalars that is overwritten.

  • [in] y1: device batched array or host batched array of input scalars.

  • [out] param: device batched array or host batched array of vectors of 5 elements defining the rotation.
    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    The flag parameter defines the form of H:
    flag = -1 => H = ( H11  H12 )
                     ( H21  H22 )
    flag =  0 => H = ( 1.0  H12 )
                     ( H21  1.0 )
    flag =  1 => H = ( H11  1.0 )
                     ( -1.0 H22 )
    flag = -2 => H = ( 1.0  0.0 )
                     ( 0.0  1.0 )
    param may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode.

  • [in] batch_count: [rocblas_int] the number of instances in the batch.

rocblas_status rocblas_drotmg_batched(rocblas_handle handle, double *const d1[], double *const d2[], double *const x1[], const double *const y1[], double *const param[], rocblas_int batch_count)
rocblas_<type>rotmg_strided_batched()
rocblas_status rocblas_srotmg_strided_batched(rocblas_handle handle, float *d1, rocblas_stride stride_d1, float *d2, rocblas_stride stride_d2, float *x1, rocblas_stride stride_x1, const float *y1, rocblas_stride stride_y1, float *param, rocblas_stride stride_param, rocblas_int batch_count)

BLAS Level 1 API.

rotmg_strided_batched creates the modified Givens rotation matrix for the strided batched vectors (d1_i * x1_i, d2_i * y1_i), for i = 1, …, batch_count. Parameters may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode. If the pointer mode is set to rocblas_pointer_mode_host, this function blocks the CPU until the GPU has finished and the results are available in host memory. If the pointer mode is set to rocblas_pointer_mode_device, this function returns immediately and synchronization is required to read the results.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [inout] d1: device strided_batched array or host strided_batched array of input scalars that is overwritten.

  • [in] stride_d1: [rocblas_stride] specifies the increment between the beginning of d1_i and d1_(i+1)

  • [inout] d2: device strided_batched array or host strided_batched array of input scalars that is overwritten.

  • [in] stride_d2: [rocblas_stride] specifies the increment between the beginning of d2_i and d2_(i+1)

  • [inout] x1: device strided_batched array or host strided_batched array of input scalars that is overwritten.

  • [in] stride_x1: [rocblas_stride] specifies the increment between the beginning of x1_i and x1_(i+1)

  • [in] y1: device strided_batched array or host strided_batched array of input scalars.

  • [in] stride_y1: [rocblas_stride] specifies the increment between the beginning of y1_i and y1_(i+1)

  • [out] param: device strided_batched array or host strided_batched array of vectors of 5 elements defining the rotation.
    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    The flag parameter defines the form of H:
    flag = -1 => H = ( H11  H12 )
                     ( H21  H22 )
    flag =  0 => H = ( 1.0  H12 )
                     ( H21  1.0 )
    flag =  1 => H = ( H11  1.0 )
                     ( -1.0 H22 )
    flag = -2 => H = ( 1.0  0.0 )
                     ( 0.0  1.0 )
    param may be stored in either host or device memory; the location is specified by calling rocblas_set_pointer_mode.

  • [in] stride_param: [rocblas_stride] specifies the increment between the beginning of param_i and param_(i + 1)

  • [in] batch_count: [rocblas_int] the number of instances in the batch.

rocblas_status rocblas_drotmg_strided_batched(rocblas_handle handle, double *d1, rocblas_stride stride_d1, double *d2, rocblas_stride stride_d2, double *x1, rocblas_stride stride_x1, const double *y1, rocblas_stride stride_y1, double *param, rocblas_stride stride_param, rocblas_int batch_count)
Level 2 BLAS
rocblas_<type>gemv()
rocblas_status rocblas_dgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *x, rocblas_int incx, const double *beta, double *y, rocblas_int incy)
rocblas_status rocblas_sgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *x, rocblas_int incx, const float *beta, float *y, rocblas_int incy)

BLAS Level 2 API.

xGEMV performs one of the matrix-vector operations

y := alpha*A*x    + beta*y,   or
y := alpha*A**T*x + beta*y,   or
y := alpha*A**H*x + beta*y,

where alpha and beta are scalars, x and y are vectors and A is an m by n matrix.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] trans: [rocblas_operation] indicates whether matrix A is transposed (conjugated) or not

  • [in] m: [rocblas_int] number of rows of matrix A

  • [in] n: [rocblas_int] number of columns of matrix A

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] A: device pointer storing matrix A.

  • [in] lda: [rocblas_int] specifies the leading dimension of A.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [in] beta: device pointer or host pointer to scalar beta.

  • [inout] y: device pointer storing vector y.

  • [in] incy: [rocblas_int] specifies the increment for the elements of y.

rocblas_status rocblas_cgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *x, rocblas_int incx, const rocblas_float_complex *beta, rocblas_float_complex *y, rocblas_int incy)
rocblas_status rocblas_zgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *x, rocblas_int incx, const rocblas_double_complex *beta, rocblas_double_complex *y, rocblas_int incy)
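
To make the leading-dimension and column-major conventions concrete, here is a hedged sketch of a single rocblas_sgemv call on a 3x2 matrix; the include path and values are assumptions, and error checking is omitted.

#include <hip/hip_runtime.h>
#include <rocblas.h>
#include <vector>

int main()
{
    const rocblas_int m = 3, n = 2, lda = m, incx = 1, incy = 1;
    const float alpha = 1.0f, beta = 0.0f;

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // Column-major A (m x n), x of length n, y of length m for the non-transposed case.
    std::vector<float> hA = {1, 2, 3,   4, 5, 6};   // columns (1,2,3) and (4,5,6)
    std::vector<float> hx = {1, 1}, hy(m, 0.0f);

    float *dA, *dx, *dy;
    hipMalloc((void**)&dA, hA.size() * sizeof(float));
    hipMalloc((void**)&dx, hx.size() * sizeof(float));
    hipMalloc((void**)&dy, hy.size() * sizeof(float));
    hipMemcpy(dA, hA.data(), hA.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dx, hx.data(), hx.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), hy.size() * sizeof(float), hipMemcpyHostToDevice);

    // y := alpha*A*x + beta*y
    rocblas_sgemv(handle, rocblas_operation_none, m, n, &alpha, dA, lda,
                  dx, incx, &beta, dy, incy);

    hipMemcpy(hy.data(), dy, hy.size() * sizeof(float), hipMemcpyDeviceToHost);  // expect (5, 7, 9)
    hipFree(dA); hipFree(dx); hipFree(dy);
    rocblas_destroy_handle(handle);
    return 0;
}
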
rocblas_<type>hemv()
rocblas_status rocblas_chemv(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *x, rocblas_int incx, const rocblas_float_complex *beta, rocblas_float_complex *y, rocblas_int incy)

BLAS Level 2 API.

xHEMV performs one of the matrix-vector operations

y := alpha*A*x + beta*y

where alpha and beta are scalars, x and y are n element vectors and A is an n by n hermitian matrix.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: the upper triangular part of A is referenced. rocblas_fill_lower: the lower triangular part of A is referenced.

  • [in] n: [rocblas_int] the order of the matrix A.

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] A: device pointer storing matrix A. Of dimension (lda, n). if uplo == rocblas_fill_upper: The upper triangular part of A must contain the upper triangular part of a hermitian matrix. The lower triangular part of A will not be referenced. if uplo == rocblas_fill_lower: The lower triangular part of A must contain the lower triangular part of a hermitian matrix. The upper triangular part of A will not be referenced. As a hermitian matrix, the imaginary part of the main diagonal of A will not be referenced and is assumed to be == 0.

  • [in] lda: [rocblas_int] specifies the leading dimension of A. must be >= max(1, n)

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [in] beta: device pointer or host pointer to scalar beta.

  • [inout] y: device pointer storing vector y.

  • [in] incy: [rocblas_int] specifies the increment for the elements of y.

rocblas_status rocblas_zhemv(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *x, rocblas_int incx, const rocblas_double_complex *beta, rocblas_double_complex *y, rocblas_int incy)
rocblas_<type>hemv_batched()
rocblas_status rocblas_chemv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *const x[], rocblas_int incx, const rocblas_float_complex *beta, rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count)

BLAS Level 2 API.

xHEMV_BATCHED performs one of the matrix-vector operations

y_i := alpha*A_i*x_i + beta*y_i

where alpha and beta are scalars, x_i and y_i are n element vectors and A_i is an n by n hermitian matrix, for each batch in i = [1, batch_count].

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: the upper triangular part of each A_i is referenced. rocblas_fill_lower: the lower triangular part of each A_i is referenced.

  • [in] n: [rocblas_int] the order of each matrix A_i.

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] A: device array of device pointers storing each matrix A_i of dimension (lda, n). if uplo == rocblas_fill_upper: The upper triangular part of each A_i must contain the upper triangular part of a hermitian matrix. The lower triangular part of each A_i will not be referenced. if uplo == rocblas_fill_lower: The lower triangular part of each A_i must contain the lower triangular part of a hermitian matrix. The upper triangular part of each A_i will not be referenced. As a hermitian matrix, the imaginary part of the main diagonal of each A_i will not be referenced and is assumed to be == 0.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i. must be >= max(1, n)

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [in] beta: device pointer or host pointer to scalar beta.

  • [inout] y: device array of device pointers storing each vector y_i.

  • [in] incy: [rocblas_int] specifies the increment for the elements of each y_i.

  • [in] batch_count: [rocblas_int] number of instances in the batch.

rocblas_status rocblas_zhemv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *const x[], rocblas_int incx, const rocblas_double_complex *beta, rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_<type>hemv_strided_batched()
rocblas_status rocblas_chemv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, const rocblas_float_complex *beta, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count)

BLAS Level 2 API.

xHEMV_STRIDED_BATCHED performs one of the matrix-vector operations

y_i := alpha*A_i*x_i + beta*y_i

where alpha and beta are scalars, x_i and y_i are n element vectors and A_i is an n by n hermitian matrix, for each batch in i = [1, batch_count].

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: the upper triangular part of each A_i is referenced. rocblas_fill_lower: the lower triangular part of each A_i is referenced.

  • [in] n: [rocblas_int] the order of each matrix A_i.

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] A: device pointer to the first matrix A_1 of dimension (lda, n). if uplo == rocblas_fill_upper: The upper triangular part of each A_i must contain the upper triangular part of a hermitian matrix. The lower triangular part of each A_i will not be referenced. if uplo == rocblas_fill_lower: The lower triangular part of each A_i must contain the lower triangular part of a hermitian matrix. The upper triangular part of each A_i will not be referenced. As a hermitian matrix, the imaginary part of the main diagonal of each A_i will not be referenced and is assumed to be == 0.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i. must be >= max(1, n)

  • [in] stride_A: [rocblas_stride] stride from the start of one matrix (A_i) to the next (A_(i+1)).

  • [in] x: device pointer to the first vector x_1.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [in] stride_x: [rocblas_stride] stride from the start of one vector (x_i) to the next (x_(i+1)).

  • [in] beta: device pointer or host pointer to scalar beta.

  • [inout] y: device pointer to the first vector y_1.

  • [in] incy: [rocblas_int] specifies the increment for the elements of each y_i.

  • [in] stride_y: [rocblas_stride] stride from the start of one vector (y_i) to the next (y_(i+1)).

  • [in] batch_count: [rocblas_int] number of instances in the batch.

rocblas_status rocblas_zhemv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, const rocblas_double_complex *beta, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count)
rocblas_<type>gemv_batched()
rocblas_status rocblas_sgemv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const float *alpha, const float *const A[], rocblas_int lda, const float *const x[], rocblas_int incx, const float *beta, float *const y[], rocblas_int incy, rocblas_int batch_count)

BLAS Level 2 API.

xGEMV_BATCHED performs a batch of matrix-vector operations

y_i := alpha*A_i*x_i    + beta*y_i,   or
y_i := alpha*A_i**T*x_i + beta*y_i,   or
y_i := alpha*A_i**H*x_i + beta*y_i,

where (A_i, x_i, y_i) is the i-th instance of the batch. alpha and beta are scalars, x_i and y_i are vectors and A_i is an m by n matrix, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] trans: [rocblas_operation] indicates whether matrices A_i are transposed (conjugated) or not

  • [in] m: [rocblas_int] number of rows of each matrix A_i

  • [in] n: [rocblas_int] number of columns of each matrix A_i

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] A: device array of device pointers storing each matrix A_i.

  • [in] lda: [rocblas_int] specifies the leading dimension of each matrix A_i.

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each vector x_i.

  • [in] beta: device pointer or host pointer to scalar beta.

  • [inout] y: device array of device pointers storing each vector y_i.

  • [in] incy: [rocblas_int] specifies the increment for the elements of each vector y_i.

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dgemv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const double *alpha, const double *const A[], rocblas_int lda, const double *const x[], rocblas_int incx, const double *beta, double *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_cgemv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *const x[], rocblas_int incx, const rocblas_float_complex *beta, rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_zgemv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *const x[], rocblas_int incx, const rocblas_double_complex *beta, rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_<type>gemv_strided_batched()
rocblas_status rocblas_sgemv_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, rocblas_stride strideA, const float *x, rocblas_int incx, rocblas_stride stridex, const float *beta, float *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)

BLAS Level 2 API.

xGEMV_STRIDED_BATCHED performs a batch of matrix-vector operations

y_i := alpha*A_i*x_i    + beta*y_i,   or
y_i := alpha*A_i**T*x_i + beta*y_i,   or
y_i := alpha*A_i**H*x_i + beta*y_i,

where (A_i, x_i, y_i) is the i-th instance of the batch. alpha and beta are scalars, x_i and y_i are vectors and A_i is an m by n matrix, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] transA: [rocblas_operation] indicates whether matrices A_i are transposed (conjugated) or not

  • [in] m: [rocblas_int] number of rows of matrices A_i

  • [in] n: [rocblas_int] number of columns of matrices A_i

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] A: device pointer to the first matrix (A_1) in the batch.

  • [in] lda: [rocblas_int] specifies the leading dimension of matrices A_i.

  • [in] strideA: [rocblas_stride] stride from the start of one matrix (A_i) and the next one (A_i+1)

  • [in] x: device pointer to the first vector (x_1) in the batch.

  • [in] incx: [rocblas_int] specifies the increment for the elements of vectors x_i.

  • [in] stridex: [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1). There are no restrictions placed on stride_x, however the user should take care to ensure that stride_x is of appropriate size. When trans equals rocblas_operation_none this typically means stride_x >= n * incx, otherwise stride_x >= m * incx.

  • [in] beta: device pointer or host pointer to scalar beta.

  • [inout] y: device pointer to the first vector (y_1) in the batch.

  • [in] incy: [rocblas_int] specifies the increment for the elements of vectors y_i.

  • [in] stridey: [rocblas_stride] stride from the start of one vector (y_i) and the next one (y_i+1). There are no restrictions placed on stride_y, however the user should take care to ensure that stride_y is of appropriate size. When trans equals rocblas_operation_none this typically means stride_y >= m * incy, otherwise stride_y >= n * incy. stridey should be non zero.

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dgemv_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, rocblas_stride strideA, const double *x, rocblas_int incx, rocblas_stride stridex, const double *beta, double *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_cgemv_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride strideA, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_float_complex *beta, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_zgemv_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride strideA, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_double_complex *beta, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
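
The strided batched variant only adds the three strides to the plain gemv call. The sketch below (two 2x2 matrices packed back to back; include path and stride choices are assumptions; no error checking) shows one way to lay the buffers out.

#include <hip/hip_runtime.h>
#include <rocblas.h>
#include <vector>

int main()
{
    const rocblas_int m = 2, n = 2, lda = 2, incx = 1, incy = 1, batch_count = 2;
    const rocblas_stride strideA = lda * n, stridex = n, stridey = m;
    const float alpha = 1.0f, beta = 0.0f;

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // Two 2x2 column-major matrices packed back to back, and matching x/y batches.
    std::vector<float> hA = {1, 0, 0, 1,   2, 0, 0, 2};   // identity, then 2*identity
    std::vector<float> hx = {1, 2,   3, 4};
    std::vector<float> hy(m * batch_count, 0.0f);

    float *dA, *dx, *dy;
    hipMalloc((void**)&dA, hA.size() * sizeof(float));
    hipMalloc((void**)&dx, hx.size() * sizeof(float));
    hipMalloc((void**)&dy, hy.size() * sizeof(float));
    hipMemcpy(dA, hA.data(), hA.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dx, hx.data(), hx.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), hy.size() * sizeof(float), hipMemcpyHostToDevice);

    // y_i := A_i * x_i for each of the two instances.
    rocblas_sgemv_strided_batched(handle, rocblas_operation_none, m, n, &alpha,
                                  dA, lda, strideA, dx, incx, stridex,
                                  &beta, dy, incy, stridey, batch_count);

    hipMemcpy(hy.data(), dy, hy.size() * sizeof(float), hipMemcpyDeviceToHost);  // expect (1,2, 6,8)
    hipFree(dA); hipFree(dx); hipFree(dy);
    rocblas_destroy_handle(handle);
    return 0;
}
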
rocblas_<type>trmv()
rocblas_status rocblas_strmv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const float *A, rocblas_int lda, float *x, rocblas_int incx)

BLAS Level 2 API.

trmv performs one of the matrix-vector operations

 x = A*x or x = A**T*x,

where x is an n element vector and A is an n by n unit, or non-unit, upper or lower triangular matrix.

The vector x is overwritten.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.

  • [in] transA: [rocblas_operation]

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of A. m >= 0.

  • [in] A: device pointer storing matrix A, of dimension ( lda, m )

  • [in] lda: [rocblas_int] specifies the leading dimension of A. lda >= max( 1, m ).

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

rocblas_status rocblas_dtrmv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const double *A, rocblas_int lda, double *x, rocblas_int incx)
rocblas_status rocblas_ctrmv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const rocblas_float_complex *A, rocblas_int lda, rocblas_float_complex *x, rocblas_int incx)
rocblas_status rocblas_ztrmv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const rocblas_double_complex *A, rocblas_int lda, rocblas_double_complex *x, rocblas_int incx)
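
A short sketch of rocblas_strmv on a 3x3 upper triangular matrix stored in full column-major form follows; the values, include path, and omission of error checking are assumptions of the sketch.

#include <hip/hip_runtime.h>
#include <rocblas.h>
#include <vector>

int main()
{
    const rocblas_int m = 3, lda = 3, incx = 1;

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // Upper triangular A (column-major): ones on and above the diagonal.
    std::vector<float> hA = {1, 0, 0,   1, 1, 0,   1, 1, 1};
    std::vector<float> hx = {1, 2, 3};

    float *dA, *dx;
    hipMalloc((void**)&dA, hA.size() * sizeof(float));
    hipMalloc((void**)&dx, hx.size() * sizeof(float));
    hipMemcpy(dA, hA.data(), hA.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dx, hx.data(), hx.size() * sizeof(float), hipMemcpyHostToDevice);

    // x := A * x, with A treated as non-unit upper triangular.
    rocblas_strmv(handle, rocblas_fill_upper, rocblas_operation_none,
                  rocblas_diagonal_non_unit, m, dA, lda, dx, incx);

    hipMemcpy(hx.data(), dx, hx.size() * sizeof(float), hipMemcpyDeviceToHost);  // expect (6, 5, 3)
    hipFree(dA); hipFree(dx);
    rocblas_destroy_handle(handle);
    return 0;
}
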
rocblas_<type>trmv_batched()
rocblas_status rocblas_strmv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const float *const *A, rocblas_int lda, float *const *x, rocblas_int incx, rocblas_int batch_count)

BLAS Level 2 API.

trmv_batched performs one of the matrix-vector operations

 x_i = A_i*x_i or x_i = A_i**T*x_i, 0 <= i < batch_count

where x_i is an n element vector and A_i is an n by n (unit, or non-unit, upper or lower triangular matrix)

The vectors x_i are overwritten.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: A_i is an upper triangular matrix. rocblas_fill_lower: A_i is a lower triangular matrix.

  • [in] transA: [rocblas_operation]

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: A_i is assumed to be unit triangular. rocblas_diagonal_non_unit: A_i is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of matrices A_i. m >= 0.

  • [in] A: device array of device pointers storing each matrix A_i, of dimension ( lda, m )

  • [in] lda: [rocblas_int] specifies the leading dimension of A_i. lda >= max( 1, m ).

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of vectors x_i.

  • [in] batch_count: [rocblas_int] The number of batched matrices/vectors.

rocblas_status rocblas_dtrmv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const double *const *A, rocblas_int lda, double *const *x, rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_ctrmv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const rocblas_float_complex *const *A, rocblas_int lda, rocblas_float_complex *const *x, rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_ztrmv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const rocblas_double_complex *const *A, rocblas_int lda, rocblas_double_complex *const *x, rocblas_int incx, rocblas_int batch_count)
rocblas_<type>trmv_strided_batched()
rocblas_status rocblas_strmv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const float *A, rocblas_int lda, rocblas_stride stridea, float *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count)

BLAS Level 2 API.

trmv_strided_batched performs one of the matrix-vector operations

 x_i = A_i*x_i or x_i = A_i**T*x_i, 0 <= i < batch_count

where x_i is an n element vector and A_i is an n by n (unit, or non-unit, upper or lower triangular matrix), with strides specifying how to retrieve x_i (resp. A_i) from x_(i-1) (resp. A_(i-1)).

The vectors x_i are overwritten.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: A_i is an upper triangular matrix. rocblas_fill_lower: A_i is a lower triangular matrix.

  • [in] transA: [rocblas_operation]

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: A_i is assumed to be unit triangular. rocblas_diagonal_non_unit: A_i is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of matrices A_i. m >= 0.

  • [in] A: device pointer of the matrix A_0, of dimension ( lda, m )

  • [in] lda: [rocblas_int] specifies the leading dimension of A_i. lda >= max( 1, m ).

  • [in] stride_a: [rocblas_stride] stride from the start of one A_i matrix to the next A_{i + 1}

  • [in] x: device pointer storing the vector x_0.

  • [in] incx: [rocblas_int] specifies the increment for the elements of one vector x.

  • [in] stride_x: [rocblas_stride] stride from the start of one x_i vector to the next x_{i + 1}

  • [in] batch_count: [rocblas_int] The number of batched matrices/vectors.

rocblas_status rocblas_dtrmv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const double *A, rocblas_int lda, rocblas_stride stridea, double *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count)
rocblas_status rocblas_ctrmv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stridea, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count)
rocblas_status rocblas_ztrmv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stridea, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count)
rocblas_<type>tbmv()
rocblas_status rocblas_stbmv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const float *A, rocblas_int lda, float *x, rocblas_int incx)

BLAS Level 2 API.

xTBMV performs one of the matrix-vector operations

x := A*x      or
x := A**T*x   or
x := A**H*x,

x is a vector and A is a banded m by m matrix (see description below).

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: A is an upper banded triangular matrix. rocblas_fill_lower: A is a lower banded triangular matrix.

  • [in] trans: [rocblas_operation] indicates whether matrix A is transposed (conjugated) or not.

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: The main diagonal of A is assumed to consist of only 1’s and is not referenced. rocblas_diagonal_non_unit: No assumptions are made of A’s main diagonal.

  • [in] m: [rocblas_int] the number of rows and columns of the matrix represented by A.

  • [in] k: [rocblas_int] if uplo == rocblas_fill_upper, k specifies the number of super-diagonals of the matrix A. if uplo == rocblas_fill_lower, k specifies the number of sub-diagonals of the matrix A. k must satisfy k > 0 && k < lda.

  • [in] A: device pointer storing banded triangular matrix A.
    if uplo == rocblas_fill_upper: The matrix represented is an upper banded triangular matrix with the main diagonal and k super-diagonals; everything else can be assumed to be 0. The matrix is compacted so that the main diagonal resides on the k’th row, the first super-diagonal resides on the RHS of the k-1’th row, etc, with the k’th diagonal on the RHS of the 0’th row.
    Ex: (rocblas_fill_upper; m = 5; k = 2)
      1 6 9 0 0              0 0 9 8 7
      0 2 7 8 0              0 6 7 8 9
      0 0 3 8 7      ->      1 2 3 4 5
      0 0 0 4 9              0 0 0 0 0
      0 0 0 0 5              0 0 0 0 0
    if uplo == rocblas_fill_lower: The matrix represented is a lower banded triangular matrix with the main diagonal and k sub-diagonals; everything else can be assumed to be 0. The matrix is compacted so that the main diagonal resides on the 0’th row, working up to the k’th diagonal residing on the LHS of the k’th row.
    Ex: (rocblas_fill_lower; m = 5; k = 2)
      1 0 0 0 0              1 2 3 4 5
      6 2 0 0 0              6 7 8 9 0
      9 7 3 0 0      ->      9 8 7 0 0
      0 8 8 4 0              0 0 0 0 0
      0 0 7 9 5              0 0 0 0 0

  • [in] lda: [rocblas_int] specifies the leading dimension of A. lda must satisfy lda > k.

  • [inout] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

rocblas_status rocblas_dtbmv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const double *A, rocblas_int lda, double *x, rocblas_int incx)
rocblas_status rocblas_ctbmv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const rocblas_float_complex *A, rocblas_int lda, rocblas_float_complex *x, rocblas_int incx)
rocblas_status rocblas_ztbmv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const rocblas_double_complex *A, rocblas_int lda, rocblas_double_complex *x, rocblas_int incx)
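
The compacted band storage is the easy thing to get wrong, so the sketch below builds a 3x3 lower bidiagonal matrix (k = 1, lda = 2) in band form and applies rocblas_stbmv to it; the include path and values are assumptions, and error checking is omitted.

#include <hip/hip_runtime.h>
#include <rocblas.h>
#include <vector>

int main()
{
    const rocblas_int m = 3, k = 1, lda = 2, incx = 1;   // lower bidiagonal, one sub-diagonal

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // Dense A = [2 0 0; 1 2 0; 0 1 2]. In lower band storage each column holds
    // (diagonal, sub-diagonal): col 0 -> {2,1}, col 1 -> {2,1}, col 2 -> {2, unused}.
    std::vector<float> hA = {2, 1,   2, 1,   2, 0};
    std::vector<float> hx = {1, 1, 1};

    float *dA, *dx;
    hipMalloc((void**)&dA, hA.size() * sizeof(float));
    hipMalloc((void**)&dx, hx.size() * sizeof(float));
    hipMemcpy(dA, hA.data(), hA.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dx, hx.data(), hx.size() * sizeof(float), hipMemcpyHostToDevice);

    // x := A * x using the banded representation.
    rocblas_stbmv(handle, rocblas_fill_lower, rocblas_operation_none,
                  rocblas_diagonal_non_unit, m, k, dA, lda, dx, incx);

    hipMemcpy(hx.data(), dx, hx.size() * sizeof(float), hipMemcpyDeviceToHost);  // expect (2, 3, 3)
    hipFree(dA); hipFree(dx);
    rocblas_destroy_handle(handle);
    return 0;
}
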
rocblas_<type>tbmv_batched()
rocblas_status rocblas_stbmv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const float *const A[], rocblas_int lda, float *const x[], rocblas_int incx, rocblas_int batch_count)

BLAS Level 2 API.

xTBMV_BATCHED performs one of the matrix-vector operations

x_i := A_i*x_i      or
x_i := A_i**T*x_i   or
x_i := A_i**H*x_i,

where (A_i, x_i) is the i-th instance of the batch. x_i is a vector and A_i is an m by m matrix, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: each A_i is an upper banded triangular matrix. rocblas_fill_lower: each A_i is a lower banded triangular matrix.

  • [in] trans: [rocblas_operation] indicates whether each matrix A_i is transposed (conjugated) or not.

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: The main diagonal of each A_i is assumed to consist of only 1’s and is not referenced. rocblas_diagonal_non_unit: No assumptions are made of each A_i’s main diagonal.

  • [in] m: [rocblas_int] the number of rows and columns of the matrix represented by each A_i.

  • [in] k: [rocblas_int] if uplo == rocblas_fill_upper, k specifies the number of super-diagonals of each matrix A_i. if uplo == rocblas_fill_lower, k specifies the number of sub-diagonals of each matrix A_i. k must satisfy k > 0 && k < lda.

  • [in] A: device array of device pointers storing each banded triangular matrix A_i.
    if uplo == rocblas_fill_upper: The matrix represented is an upper banded triangular matrix with the main diagonal and k super-diagonals; everything else can be assumed to be 0. The matrix is compacted so that the main diagonal resides on the k’th row, the first super-diagonal resides on the RHS of the k-1’th row, etc, with the k’th diagonal on the RHS of the 0’th row.
    Ex: (rocblas_fill_upper; m = 5; k = 2)
      1 6 9 0 0              0 0 9 8 7
      0 2 7 8 0              0 6 7 8 9
      0 0 3 8 7      ->      1 2 3 4 5
      0 0 0 4 9              0 0 0 0 0
      0 0 0 0 5              0 0 0 0 0
    if uplo == rocblas_fill_lower: The matrix represented is a lower banded triangular matrix with the main diagonal and k sub-diagonals; everything else can be assumed to be 0. The matrix is compacted so that the main diagonal resides on the 0’th row, working up to the k’th diagonal residing on the LHS of the k’th row.
    Ex: (rocblas_fill_lower; m = 5; k = 2)
      1 0 0 0 0              1 2 3 4 5
      6 2 0 0 0              6 7 8 9 0
      9 7 3 0 0      ->      9 8 7 0 0
      0 8 8 4 0              0 0 0 0 0
      0 0 7 9 5              0 0 0 0 0

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i. lda must satisfy lda > k.

  • [inout] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [in] batch_count: [rocblas_int] number of instances in the batch.

rocblas_status rocblas_dtbmv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const double *const A[], rocblas_int lda, double *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_ctbmv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const rocblas_float_complex *const A[], rocblas_int lda, rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_ztbmv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const rocblas_double_complex *const A[], rocblas_int lda, rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_<type>tbmv_strided_batched()
rocblas_status rocblas_stbmv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const float *A, rocblas_int lda, rocblas_stride stride_A, float *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)

BLAS Level 2 API.

xTBMV_STRIDED_BATCHED performs one of the matrix-vector operations

x_i := A_i*x_i      or
x_i := A_i**T*x_i   or
x_i := A_i**H*x_i,

where (A_i, x_i) is the i-th instance of the batch. x_i is a vector and A_i is an m by m matrix, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: each A_i is an upper banded triangular matrix. rocblas_fill_lower: each A_i is a lower banded triangular matrix.

  • [in] trans: [rocblas_operation] indicates whether each matrix A_i is transposed (conjugated) or not.

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: The main diagonal of each A_i is assumed to consist of only 1’s and is not referenced. rocblas_diagonal_non_unit: No assumptions are made of each A_i’s main diagonal.

  • [in] m: [rocblas_int] the number of rows and columns of the matrix represented by each A_i.

  • [in] k: [rocblas_int] if uplo == rocblas_fill_upper, k specifies the number of super-diagonals of each matrix A_i. if uplo == rocblas_fill_lower, k specifies the number of sub-diagonals of each matrix A_i. k must satisfy k > 0 && k < lda.

  • [in] A: device pointer to the first banded triangular matrix A_1 of the batch. Stores each banded triangular matrix A_i.
    if uplo == rocblas_fill_upper: The matrix represented is an upper banded triangular matrix with the main diagonal and k super-diagonals; everything else can be assumed to be 0. The matrix is compacted so that the main diagonal resides on the k’th row, the first super-diagonal resides on the RHS of the k-1’th row, etc, with the k’th diagonal on the RHS of the 0’th row.
    Ex: (rocblas_fill_upper; m = 5; k = 2)
      1 6 9 0 0              0 0 9 8 7
      0 2 7 8 0              0 6 7 8 9
      0 0 3 8 7      ->      1 2 3 4 5
      0 0 0 4 9              0 0 0 0 0
      0 0 0 0 5              0 0 0 0 0
    if uplo == rocblas_fill_lower: The matrix represented is a lower banded triangular matrix with the main diagonal and k sub-diagonals; everything else can be assumed to be 0. The matrix is compacted so that the main diagonal resides on the 0’th row, working up to the k’th diagonal residing on the LHS of the k’th row.
    Ex: (rocblas_fill_lower; m = 5; k = 2)
      1 0 0 0 0              1 2 3 4 5
      6 2 0 0 0              6 7 8 9 0
      9 7 3 0 0      ->      9 8 7 0 0
      0 8 8 4 0              0 0 0 0 0
      0 0 7 9 5              0 0 0 0 0

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i. lda must satisfy lda > k.

  • [in] stride_A: [rocblas_stride] stride from the start of one A_i matrix to the next A_(i + 1).

  • [inout] x: device pointer to the first vector x_1 of the batch.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [in] stride_x: [rocblas_stride] stride from the start of one x_i vector to the next x_(i + 1).

  • [in] batch_count: [rocblas_int] number of instances in the batch.

rocblas_status rocblas_dtbmv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const double *A, rocblas_int lda, rocblas_stride stride_A, double *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_ctbmv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_A, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_ztbmv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation trans, rocblas_diagonal diag, rocblas_int m, rocblas_int k, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_A, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_<type>trsv()
rocblas_status rocblas_dtrsv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const double *A, rocblas_int lda, double *x, rocblas_int incx)
rocblas_status rocblas_strsv(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const float *A, rocblas_int lda, float *x, rocblas_int incx)

BLAS Level 2 API.

trsv solves

 A*x = b or A**T*x = b,

where x and b are vectors and A is a triangular matrix.

The vector x is overwritten on b.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.

  • [in] transA: [rocblas_operation]

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of b. m >= 0.

  • [in] A: device pointer storing matrix A, of dimension ( lda, m )

  • [in] lda: [rocblas_int] specifies the leading dimension of A. lda >= max( 1, m ).

  • [inout] x: device pointer storing vector b on input, overwritten with the solution vector x on output.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.
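
As a sketch of the in-place solve, the program below loads the right-hand side b into x, calls rocblas_strsv on a 2x2 upper triangular system, and reads the solution back; the include path and values are assumptions, and error checking is omitted.

#include <hip/hip_runtime.h>
#include <rocblas.h>
#include <vector>

int main()
{
    const rocblas_int m = 2, lda = 2, incx = 1;

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // Upper triangular A = [2 1; 0 1] in column-major order, right-hand side b = (3, 1).
    std::vector<float> hA = {2, 0,   1, 1};
    std::vector<float> hb = {3, 1};

    float *dA, *dx;
    hipMalloc((void**)&dA, hA.size() * sizeof(float));
    hipMalloc((void**)&dx, hb.size() * sizeof(float));
    hipMemcpy(dA, hA.data(), hA.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dx, hb.data(), hb.size() * sizeof(float), hipMemcpyHostToDevice);   // b is passed in x

    // Solve A * x = b; the solution overwrites b in place.
    rocblas_strsv(handle, rocblas_fill_upper, rocblas_operation_none,
                  rocblas_diagonal_non_unit, m, dA, lda, dx, incx);

    hipMemcpy(hb.data(), dx, hb.size() * sizeof(float), hipMemcpyDeviceToHost);   // expect x = (1, 1)
    hipFree(dA); hipFree(dx);
    rocblas_destroy_handle(handle);
    return 0;
}
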

rocblas_<type>trsv_batched()
rocblas_status rocblas_strsv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const float *const A[], rocblas_int lda, float *const x[], rocblas_int incx, rocblas_int batch_count)

BLAS Level 2 API.

trsv_batched solves

 A_i*x_i = b_i or A_i**T*x_i = b_i,

where (A_i, x_i, b_i) is the i-th instance of the batch. x_i and b_i are vectors and A_i is an m by m triangular matrix.

Each vector x_i is overwritten on b_i.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.

  • [in] transA: [rocblas_operation]

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of b. m >= 0.

  • [in] A: device array of device pointers storing each matrix A_i.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i. lda >= max(1, m)

  • [inout] x: device array of device pointers storing each vector x_i (each holds b_i on input and is overwritten with the solution on output).

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dtrsv_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const double *const A[], rocblas_int lda, double *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_<type>trsv_strided_batched()
rocblas_status rocblas_strsv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const float *A, rocblas_int lda, rocblas_stride stride_A, float *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)

BLAS Level 2 API.

trsv_strided_batched solves

 A_i*x_i = b_i or A_i**T*x_i = b_i,

where (A_i, x_i, b_i) is the i-th instance of the batch. x_i and b_i are vectors and A_i is an m by m triangular matrix, for i = 1, …, batch_count.

The vector x is overwritten on b.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.

  • [in] transA: [rocblas_operation]

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of each b_i. m >= 0.

  • [in] A: device pointer to the first matrix (A_1) in the batch, of dimension ( lda, m )

  • [in] stride_A: [rocblas_stride] stride from the start of one A_i matrix to the next A_(i + 1)

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i. lda = max( 1, m ).

  • [inout] x: device pointer to the first vector (x_1) in the batch.

  • [in] stride_x: [rocblas_stride] stride from the start of one x_i vector to the next x_(i + 1)

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dtrsv_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, const double *A, rocblas_int lda, rocblas_stride stride_A, double *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_<type>ger()
rocblas_status rocblas_dger(rocblas_handle handle, rocblas_int m, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *A, rocblas_int lda)
rocblas_status rocblas_sger(rocblas_handle handle, rocblas_int m, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *A, rocblas_int lda)

BLAS Level 2 API.

xGER performs the matrix-vector operations

A := A + alpha*x*y**T

where alpha is a scalar, x and y are vectors, and A is an m by n matrix.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] m: [rocblas_int] the number of rows of the matrix A.

  • [in] n: [rocblas_int] the number of columns of the matrix A.

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [in] y: device pointer storing vector y.

  • [in] incy: [rocblas_int] specifies the increment for the elements of y.

  • [inout] A: device pointer storing matrix A.

  • [in] lda: [rocblas_int] specifies the leading dimension of A.
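
For illustration, a hypothetical wrapper around rocblas_sger is sketched below; the handle and the device buffers dx, dy, dA are assumed to be already allocated and filled by the caller.

// Hypothetical sketch: rank-1 update A := A + alpha*x*y**T on device data.
#include "rocblas.h"

rocblas_status rank1_update(rocblas_handle handle, rocblas_int m, rocblas_int n,
                            const float* dx, const float* dy, float* dA, rocblas_int lda)
{
    const float alpha = 1.0f; // host scalar; a device pointer is also accepted
    return rocblas_sger(handle, m, n, &alpha, dx, 1, dy, 1, dA, lda);
}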

rocblas_<type>ger_batched()
rocblas_status rocblas_sger_batched(rocblas_handle handle, rocblas_int m, rocblas_int n, const float *alpha, const float *const x[], rocblas_int incx, const float *const y[], rocblas_int incy, float *const A[], rocblas_int lda, rocblas_int batch_count)

BLAS Level 2 API.

xGER_BATCHED performs a batch of the matrix-vector operations

A_i := A_i + alpha*x_i*y_i**T

where (A_i, x_i, y_i) is the i-th instance of the batch. alpha is a scalar, x_i and y_i are vectors and A_i is an m by n matrix, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] m: [rocblas_int] the number of rows of each matrix A_i.

  • [in] n: [rocblas_int] the number of columns of each matrix A_i.

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each vector x_i.

  • [in] y: device array of device pointers storing each vector y_i.

  • [in] incy: [rocblas_int] specifies the increment for the elements of each vector y_i.

  • [inout] A: device array of device pointers storing each matrix A_i.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i.

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dger_batched(rocblas_handle handle, rocblas_int m, rocblas_int n, const double *alpha, const double *const x[], rocblas_int incx, const double *const y[], rocblas_int incy, double *const A[], rocblas_int lda, rocblas_int batch_count)
rocblas_<type>ger_strided_batched()
rocblas_status rocblas_sger_strided_batched(rocblas_handle handle, rocblas_int m, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, rocblas_stride stridex, const float *y, rocblas_int incy, rocblas_stride stridey, float *A, rocblas_int lda, rocblas_stride strideA, rocblas_int batch_count)

BLAS Level 2 API.

xGER_STRIDED_BATCHED performs the matrix-vector operations

A_i := A_i + alpha*x_i*y_i**T

where (A_i, x_i, y_i) is the i-th instance of the batch. alpha is a scalar, x_i and y_i are vectors and A_i is an m by n matrix, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] m: [rocblas_int] the number of rows of each matrix A_i.

  • [in] n: [rocblas_int] the number of columns of each matrix A_i.

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] x: device pointer to the first vector (x_1) in the batch.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each vector x_i.

  • [in] stridex: [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_i+1). There are no restrictions placed on stride_x; however, the user should take care to ensure that stride_x is of appropriate size. For a typical case this means stride_x >= m * incx.

  • [in] y: device pointer to the first vector (y_1) in the batch.

  • [in] incy: [rocblas_int] specifies the increment for the elements of each vector y_i.

  • [in] stridey: [rocblas_stride] stride from the start of one vector (y_i) to the next one (y_i+1). There are no restrictions placed on stride_y; however, the user should take care to ensure that stride_y is of appropriate size. For a typical case this means stride_y >= n * incy.

  • [inout] A: device pointer to the first matrix (A_1) in the batch.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i.

  • [in] strideA: [rocblas_stride] stride from the start of one matrix (A_i) to the next one (A_i+1)

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dger_strided_batched(rocblas_handle handle, rocblas_int m, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, rocblas_stride stridex, const double *y, rocblas_int incy, rocblas_stride stridey, double *A, rocblas_int lda, rocblas_stride strideA, rocblas_int batch_count)
rocblas_<type>syr()
rocblas_status rocblas_dsyr(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, double *A, rocblas_int lda)
rocblas_status rocblas_ssyr(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, float *A, rocblas_int lda)

BLAS Level 2 API.

xSYR performs the matrix-vector operations

A := A + alpha*x*x**T

where alpha is a scalar, x is a vector, and A is an n by n symmetric matrix.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] specifies whether the upper (‘rocblas_fill_upper’) or lower (‘rocblas_fill_lower’) triangle of A is stored. If rocblas_fill_upper, the lower part of A is not referenced; if rocblas_fill_lower, the upper part of A is not referenced.

  • [in] n: [rocblas_int] the number of rows and columns of matrix A.

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [inout] A: device pointer storing matrix A.

  • [in] lda: [rocblas_int] specifies the leading dimension of A.
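
A hypothetical usage sketch follows; the handle, dx, and dA are assumed to be set up elsewhere, and only the upper triangle of A is updated.

// Hypothetical sketch: symmetric rank-1 update A := A + alpha*x*x**T (upper triangle stored).
#include "rocblas.h"

rocblas_status symmetric_rank1(rocblas_handle handle, rocblas_int n,
                               const float* dx, float* dA, rocblas_int lda)
{
    const float alpha = 1.0f;
    // With rocblas_fill_upper, the lower part of dA is not referenced.
    return rocblas_ssyr(handle, rocblas_fill_upper, n, &alpha, dx, 1, dA, lda);
}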

rocblas_<type>syr_batched()
rocblas_status rocblas_ssyr_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const float *alpha, const float *const x[], rocblas_int incx, float *const A[], rocblas_int lda, rocblas_int batch_count)

BLAS Level 2 API.

xSYR_batched performs a batch of matrix-vector operations

A[i] := A[i] + alpha*x[i]*x[i]**T

where alpha is a scalar, x is an array of vectors, and A is an array of n by n symmetric matrices, for i = 1 , … , batch_count

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] specifies whether the upper (‘rocblas_fill_upper’) or lower (‘rocblas_fill_lower’) triangle of each A_i is stored. If rocblas_fill_upper, the lower part of A is not referenced; if rocblas_fill_lower, the upper part of A is not referenced.

  • [in] n: [rocblas_int] the number of rows and columns of matrix A.

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [inout] A: device array of device pointers storing each matrix A_i.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i.

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dsyr_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const double *alpha, const double *const x[], rocblas_int incx, double *const A[], rocblas_int lda, rocblas_int batch_count)
rocblas_<type>syr_strided_batched()
rocblas_status rocblas_ssyr_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, rocblas_stride stridex, float *A, rocblas_int lda, rocblas_stride strideA, rocblas_int batch_count)

BLAS Level 2 API.

xSYR_strided_batched performs the matrix-vector operations

A[i] := A[i] + alpha*x[i]*x[i]**T

where alpha is a scalar, x is an array of vectors, and A is an array of n by n symmetric matrices, for i = 1 , … , batch_count

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] specifies whether the upper (‘rocblas_fill_upper’) or lower (‘rocblas_fill_lower’) triangle of each A_i is stored. If rocblas_fill_upper, the lower part of A is not referenced; if rocblas_fill_lower, the upper part of A is not referenced.

  • [in] n: [rocblas_int] the number of rows and columns of each matrix A_i.

  • [in] alpha: device pointer or host pointer to scalar alpha.

  • [in] x: device pointer to the first vector x_1.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [in] stridex: [rocblas_stride] specifies the pointer increment between vectors (x_i) and (x_i+1).

  • [inout] A: device pointer to the first matrix A_1.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i.

  • [in] strideA: [rocblas_stride] stride from the start of one matrix (A_i) to the next one (A_i+1)

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dsyr_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, rocblas_stride stridex, double *A, rocblas_int lda, rocblas_stride strideA, rocblas_int batch_count)
Level 3 BLAS
rocblas_<type>trtri()
rocblas_status rocblas_strtri(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const float *A, rocblas_int lda, float *invA, rocblas_int ldinvA)

BLAS Level 3 API.

trtri computes the inverse of a triangular matrix A and writes the result into invA.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] specifies whether the upper (‘rocblas_fill_upper’) or lower (‘rocblas_fill_lower’) triangle of A is stored. If rocblas_fill_upper, the lower part of A is not referenced; if rocblas_fill_lower, the upper part of A is not referenced.

  • [in] diag: [rocblas_diagonal] = ‘rocblas_diagonal_non_unit’, A is non-unit triangular; = ‘rocblas_diagonal_unit’, A is unit triangular;

  • [in] n: [rocblas_int] size of matrix A and invA

  • [in] A: device pointer storing matrix A.

  • [in] lda: [rocblas_int] specifies the leading dimension of A.

  • [out] invA: device pointer storing matrix invA.

  • [in] ldinvA: [rocblas_int] specifies the leading dimension of invA.

rocblas_status rocblas_dtrtri(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const double *A, rocblas_int lda, double *invA, rocblas_int ldinvA)
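
A hypothetical out-of-place inversion sketch follows; the handle and the device buffers dA and dInvA are assumed to be allocated with the stated leading dimensions.

// Hypothetical sketch: invert a lower-triangular n x n matrix, writing the result to dInvA.
#include "rocblas.h"

rocblas_status invert_lower_triangular(rocblas_handle handle, rocblas_int n,
                                       const float* dA, rocblas_int lda,
                                       float* dInvA, rocblas_int ldinvA)
{
    return rocblas_strtri(handle, rocblas_fill_lower, rocblas_diagonal_non_unit,
                          n, dA, lda, dInvA, ldinvA);
}
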
rocblas_<type>trtri_batched()
rocblas_status rocblas_strtri_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const float *const A[], rocblas_int lda, float *invA[], rocblas_int ldinvA, rocblas_int batch_count)

BLAS Level 3 API.

trtri_batched computes the inverse of each A_i and writes it into invA_i, where A_i and invA_i are the i-th matrices in the batch, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] specifies whether the upper (‘rocblas_fill_upper’) or lower (‘rocblas_fill_lower’) triangle of each A_i is stored.

  • [in] diag: [rocblas_diagonal] = ‘rocblas_diagonal_non_unit’, A is non-unit triangular; = ‘rocblas_diagonal_unit’, A is unit triangular;

  • [in] n: [rocblas_int] size of each matrix A_i and invA_i.

  • [in] A: device array of device pointers storing each matrix A_i.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i.

  • [out] invA: device array of device pointers storing the inverse of each matrix A_i. Partial inplace operation is supported, see below. If UPLO = ‘U’, the leading N-by-N upper triangular part of the invA will store the inverse of the upper triangular matrix, and the strictly lower triangular part of invA is cleared. If UPLO = ‘L’, the leading N-by-N lower triangular part of the invA will store the inverse of the lower triangular matrix, and the strictly upper triangular part of invA is cleared.

  • [in] ldinvA: [rocblas_int] specifies the leading dimension of each invA_i.

  • [in] batch_count: [rocblas_int] number of matrices in the batch

rocblas_status rocblas_dtrtri_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const double *const A[], rocblas_int lda, double *invA[], rocblas_int ldinvA, rocblas_int batch_count)
rocblas_<type>trtri_strided_batched()
rocblas_status rocblas_strtri_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const float *A, rocblas_int lda, rocblas_stride stride_a, float *invA, rocblas_int ldinvA, rocblas_stride stride_invA, rocblas_int batch_count)

BLAS Level 3 API.

trtri_strided_batched computes the inverse of each A_i and writes it into invA_i, where A_i and invA_i are the i-th matrices in the batch, for i = 1, …, batch_count.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] uplo: [rocblas_fill] specifies whether the upper (‘rocblas_fill_upper’) or lower (‘rocblas_fill_lower’) triangle of each A_i is stored.

  • [in] diag: [rocblas_diagonal] = ‘rocblas_diagonal_non_unit’, A is non-unit triangular; = ‘rocblas_diagonal_unit’, A is unit triangular;

  • [in] n: [rocblas_int] size of each matrix A_i and invA_i.

  • [in] A: device pointer pointing to address of first matrix A_1.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i.

  • [in] stride_a: [rocblas_stride] “batch stride a”: stride from the start of one A_i matrix to the next A_(i + 1).

  • [out] invA: device pointer storing the inverses of each matrix A_i. Partial inplace operation is supported, see below. If UPLO = ‘U’, the leading N-by-N upper triangular part of the invA will store the inverse of the upper triangular matrix, and the strictly lower triangular part of invA is cleared. If UPLO = ‘L’, the leading N-by-N lower triangular part of the invA will store the inverse of the lower triangular matrix, and the strictly upper triangular part of invA is cleared.

  • [in] ldinvA: [rocblas_int] specifies the leading dimension of each invA_i.

  • [in] stride_invA: [rocblas_stride] “batch stride invA”: stride from the start of one invA_i matrix to the next invA_(i + 1).

  • [in] batch_count: [rocblas_int] number of matrices in the batch

rocblas_status rocblas_dtrtri_strided_batched(rocblas_handle handle, rocblas_fill uplo, rocblas_diagonal diag, rocblas_int n, const double *A, rocblas_int lda, rocblas_stride stride_a, double *invA, rocblas_int ldinvA, rocblas_stride stride_invA, rocblas_int batch_count)
rocblas_<type>trsm()
rocblas_status rocblas_dtrsm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, double *B, rocblas_int ldb)
rocblas_status rocblas_strsm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, float *B, rocblas_int ldb)

BLAS Level 3 API.

trsm solves

op(A)*X = alpha*B or  X*op(A) = alpha*B,

where alpha is a scalar, X and B are m by n matrices, A is triangular matrix and op(A) is one of

op( A ) = A   or   op( A ) = A^T   or   op( A ) = A^H.

The matrix X is overwritten on B.

Note about memory allocation: When trsm is launched with a k evenly divisible by the internal block size of 128, and is no larger than 10 of these blocks, the API takes advantage of utilizing pre-allocated memory found in the handle to increase overall performance. This memory can be managed by using the environment variable WORKBUF_TRSM_B_CHNK. When this variable is not set the device memory used for temporary storage will default to 1 MB and may result in chunking, which in turn may reduce performance. Under these circumstances it is recommended that WORKBUF_TRSM_B_CHNK be set to the desired chunk of right hand sides to be used at a time.

(where k is m when rocblas_side_left and is n when rocblas_side_right)

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] side: [rocblas_side] rocblas_side_left: op(A)*X = alpha*B. rocblas_side_right: X*op(A) = alpha*B.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.

  • [in] transA: [rocblas_operation] rocblas_operation_none: op(A) = A. rocblas_operation_transpose: op(A) = A^T. rocblas_operation_conjugate_transpose: op(A) = A^H.

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of B. m >= 0.

  • [in] n: [rocblas_int] n specifies the number of columns of B. n >= 0.

  • [in] alpha: device pointer or host pointer specifying the scalar alpha. When alpha is &zero then A is not referenced and B need not be set before entry.

  • [in] A: device pointer storing matrix A, of dimension ( lda, k ), where k is m when side is rocblas_side_left and n when side is rocblas_side_right. Only the upper/lower triangular part is accessed.

  • [in] lda: [rocblas_int] lda specifies the first dimension of A. if side = rocblas_side_left, lda >= max( 1, m ), if side = rocblas_side_right, lda >= max( 1, n ).

  • [inout] B: device pointer storing matrix B.

  • [in] ldb: [rocblas_int] ldb specifies the first dimension of B. ldb >= max( 1, m ).
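
A hypothetical sketch of a left-sided, lower-triangular solve follows; the handle, dA, and dB are assumed to be device data prepared by the caller.

// Hypothetical sketch: solve A * X = alpha * B for X; the solution X overwrites dB.
#include "rocblas.h"

rocblas_status solve_left_lower(rocblas_handle handle, rocblas_int m, rocblas_int n,
                                const float* dA, rocblas_int lda,
                                float* dB, rocblas_int ldb)
{
    const float alpha = 1.0f;
    return rocblas_strsm(handle, rocblas_side_left, rocblas_fill_lower,
                         rocblas_operation_none, rocblas_diagonal_non_unit,
                         m, n, &alpha, dA, lda, dB, ldb);
}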

rocblas_<type>trsm_batched()
rocblas_status rocblas_strsm_batched(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const float *alpha, const float *const A[], rocblas_int lda, float *B[], rocblas_int ldb, rocblas_int batch_count)

BLAS Level 3 API.

trsm_batched performs the following batched operation:

op(A_i)*X_i = alpha*B_i or  X_i*op(A_i) = alpha*B_i, for i = 1, ..., batch_count.

where alpha is a scalar, X and B are batched m by n matrices, A is triangular batched matrix and op(A) is one of

op( A ) = A   or   op( A ) = A^T   or   op( A ) = A^H.

Each matrix X_i is overwritten on B_i for i = 1, …, batch_count.

Note about memory allocation: When trsm is launched with a k evenly divisible by the internal block size of 128, and is no larger than 10 of these blocks, the API takes advantage of utilizing pre-allocated memory found in the handle to increase overall performance. This memory can be managed by using the environment variable WORKBUF_TRSM_B_CHNK. When this variable is not set the device memory used for temporary storage will default to 1 MB and may result in chunking, which in turn may reduce performance. Under these circumstances it is recommended that WORKBUF_TRSM_B_CHNK be set to the desired chunk of right hand sides to be used at a time. (where k is m when rocblas_side_left and is n when rocblas_side_right)

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] side: [rocblas_side] rocblas_side_left: op(A)*X = alpha*B. rocblas_side_right: X*op(A) = alpha*B.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: each A_i is an upper triangular matrix. rocblas_fill_lower: each A_i is a lower triangular matrix.

  • [in] transA: [rocblas_operation] rocblas_operation_none: op(A) = A. rocblas_operation_transpose: op(A) = A^T. rocblas_operation_conjugate_transpose: op(A) = A^H.

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: each A_i is assumed to be unit triangular. rocblas_diagonal_non_unit: each A_i is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of each B_i. m >= 0.

  • [in] n: [rocblas_int] n specifies the number of columns of each B_i. n >= 0.

  • [in] alpha: device pointer or host pointer specifying the scalar alpha. When alpha is &zero then A is not referenced and B need not be set before entry.

  • [in] A: device array of device pointers storing each matrix A_i on the GPU. Matrices are of dimension ( lda, k ), where k is m when side is rocblas_side_left and n when side is rocblas_side_right. Only the upper/lower triangular part is accessed.

  • [in] lda: [rocblas_int] lda specifies the first dimension of each A_i. if side = rocblas_side_left, lda >= max( 1, m ), if side = rocblas_side_right, lda >= max( 1, n ).

  • [inout] B: device array of device pointers storing each matrix B_i on the GPU.

  • [in] ldb: [rocblas_int] ldb specifies the first dimension of each B_i. ldb >= max( 1, m ).

  • [in] batch_count: [rocblas_int] number of trsm operations in the batch.

rocblas_status rocblas_dtrsm_batched(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const double *alpha, const double *const A[], rocblas_int lda, double *B[], rocblas_int ldb, rocblas_int batch_count)
rocblas_<type>trsm_strided_batched()
rocblas_status rocblas_strsm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, rocblas_stride stride_a, float *B, rocblas_int ldb, rocblas_stride stride_b, rocblas_int batch_count)

BLAS Level 3 API.

trsm_strided_batched performs the following strided batched operation:

op(A_i)*X_i = alpha*B_i or  X_i*op(A_i) = alpha*B_i, for i = 1, ..., batch_count.

where alpha is a scalar, X and B are strided batched m by n matrices, A is triangular strided batched matrix and op(A) is one of

op( A ) = A   or   op( A ) = A^T   or   op( A ) = A^H.

Each matrix X_i is overwritten on B_i for i = 1, …, batch_count.

Note about memory allocation: When trsm is launched with a k evenly divisible by the internal block size of 128, and is no larger than 10 of these blocks, the API takes advantage of utilizing pre-allocated memory found in the handle to increase overall performance. This memory can be managed by using the environment variable WORKBUF_TRSM_B_CHNK. When this variable is not set the device memory used for temporary storage will default to 1 MB and may result in chunking, which in turn may reduce performance. Under these circumstances it is recommended that WORKBUF_TRSM_B_CHNK be set to the desired chunk of right hand sides to be used at a time. (where k is m when rocblas_side_left and is n when rocblas_side_right)

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] side: [rocblas_side] rocblas_side_left: op(A)*X = alpha*B. rocblas_side_right: X*op(A) = alpha*B.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: each A_i is an upper triangular matrix. rocblas_fill_lower: each A_i is a lower triangular matrix.

  • [in] transA: [rocblas_operation] rocblas_operation_none: op(A) = A. rocblas_operation_transpose: op(A) = A^T. rocblas_operation_conjugate_transpose: op(A) = A^H.

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: each A_i is assumed to be unit triangular. rocblas_diagonal_non_unit: each A_i is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of each B_i. m >= 0.

  • [in] n: [rocblas_int] n specifies the number of columns of each B_i. n >= 0.

  • [in] alpha: device pointer or host pointer specifying the scalar alpha. When alpha is &zero then A is not referenced and B need not be set before entry.

  • [in] A: device pointer pointing to the first matrix A_1, of dimension ( lda, k ), where k is m when side is rocblas_side_left and n when side is rocblas_side_right. Only the upper/lower triangular part is accessed.

  • [in] lda: [rocblas_int] lda specifies the first dimension of each A_i. if side = rocblas_side_left, lda >= max( 1, m ), if side = rocblas_side_right, lda >= max( 1, n ).

  • [in] stride_a: [rocblas_stride] stride from the start of one A_i matrix to the next A_(i + 1).

  • [inout] B: device pointer pointing to the first matrix B_1.

  • [in] ldb: [rocblas_int] ldb specifies the first dimension of each B_i. ldb >= max( 1, m ).

  • [in] stride_b: [rocblas_stride] stride from the start of one B_i matrix to the next B_(i + 1).

  • [in] batch_count: [rocblas_int] number of trsm operations in the batch.

rocblas_status rocblas_dtrsm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, rocblas_stride stride_a, double *B, rocblas_int ldb, rocblas_stride stride_b, rocblas_int batch_count)
rocblas_<type>trmm()
rocblas_status rocblas_strmm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, float *B, rocblas_int ldb)

BLAS Level 3 API.

trmm performs one of the matrix-matrix operations

B := alpha*op( A )*B, or B := alpha*B*op( A )

where alpha is a scalar, B is an m by n matrix, A is a unit, or non-unit, upper or lower triangular matrix and op( A ) is one of

op( A ) = A   or   op( A ) = A^T   or   op( A ) = A^H.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] side: [rocblas_side] rocblas_side_left: B := alpha*op( A )*B. rocblas_side_right: B := alpha*B*op( A ).

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.

  • [in] transA: [rocblas_operation] rocblas_operation_none: op(A) = A. rocblas_operation_transpose: op(A) = A^T. rocblas_operation_conjugate_transpose: op(A) = A^H.

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of B. m >= 0.

  • [in] n: [rocblas_int] n specifies the number of columns of B. n >= 0.

  • [in] alpha: alpha specifies the scalar alpha. When alpha is zero then A is not referenced and B need not be set before entry.

  • [in] A: device pointer storing matrix A on the GPU, of dimension ( lda, k ), where k is m when side is rocblas_side_left and n when side is rocblas_side_right. Only the upper/lower triangular part is accessed.

  • [in] lda: [rocblas_int] lda specifies the first dimension of A. if side = rocblas_side_left, lda >= max( 1, m ), if side = rocblas_side_right, lda >= max( 1, n ).

  • [inout] B: device pointer storing matrix B on the GPU. B is overwritten by the result.

  • [in] ldb: [rocblas_int] ldb specifies the first dimension of B. ldb >= max( 1, m ).

rocblas_status rocblas_dtrmm(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, double *B, rocblas_int ldb)
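
A hypothetical sketch of a left-sided triangular multiply follows; the handle, dA, and dB are assumed to be prepared device data.

// Hypothetical sketch: B := alpha * A * B with A upper triangular; dB is overwritten.
#include "rocblas.h"

rocblas_status triangular_multiply(rocblas_handle handle, rocblas_int m, rocblas_int n,
                                   const float* dA, rocblas_int lda,
                                   float* dB, rocblas_int ldb)
{
    const float alpha = 1.0f;
    return rocblas_strmm(handle, rocblas_side_left, rocblas_fill_upper,
                         rocblas_operation_none, rocblas_diagonal_non_unit,
                         m, n, &alpha, dA, lda, dB, ldb);
}
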
rocblas_<type>gemm()
rocblas_status rocblas_dgemm(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, const double *B, rocblas_int ldb, const double *beta, double *C, rocblas_int ldc)
rocblas_status rocblas_sgemm(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, const float *B, rocblas_int ldb, const float *beta, float *C, rocblas_int ldc)

BLAS Level 3 API.

xGEMM performs one of the matrix-matrix operations

C = alpha*op( A )*op( B ) + beta*C,

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C an m by n matrix.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] transA: [rocblas_operation] specifies the form of op( A )

  • [in] transB: [rocblas_operation] specifies the form of op( B )

  • [in] m: [rocblas_int] number of rows of matrices op( A ) and C

  • [in] n: [rocblas_int] number of columns of matrices op( B ) and C

  • [in] k: [rocblas_int] number of columns of matrix op( A ) and number of rows of matrix op( B )

  • [in] alpha: device pointer or host pointer specifying the scalar alpha.

  • [in] A: device pointer storing matrix A.

  • [in] lda: [rocblas_int] specifies the leading dimension of A.

  • [in] B: device pointer storing matrix B.

  • [in] ldb: [rocblas_int] specifies the leading dimension of B.

  • [in] beta: device pointer or host pointer specifying the scalar beta.

  • [inout] C: device pointer storing matrix C on the GPU.

  • [in] ldc: [rocblas_int] specifies the leading dimension of C.

rocblas_status rocblas_hgemm(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, const rocblas_half *B, rocblas_int ldb, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc)
rocblas_status rocblas_cgemm(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *B, rocblas_int ldb, const rocblas_float_complex *beta, rocblas_float_complex *C, rocblas_int ldc)
rocblas_status rocblas_zgemm(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *B, rocblas_int ldb, const rocblas_double_complex *beta, rocblas_double_complex *C, rocblas_int ldc)
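
A hypothetical single-precision sketch follows; the handle and the column-major device matrices dA (m x k), dB (k x n), and dC (m x n) are assumed to be set up by the caller.

// Hypothetical sketch: C = alpha * A * B + beta * C with no transposition.
#include "rocblas.h"

rocblas_status gemm_nn(rocblas_handle handle, rocblas_int m, rocblas_int n, rocblas_int k,
                       const float* dA, rocblas_int lda,
                       const float* dB, rocblas_int ldb,
                       float* dC, rocblas_int ldc)
{
    const float alpha = 1.0f, beta = 0.0f;
    return rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                         m, n, k, &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
}
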
rocblas_<type>gemm_batched()
rocblas_status rocblas_sgemm_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *const A[], rocblas_int lda, const float *const B[], rocblas_int ldb, const float *beta, float *const C[], rocblas_int ldc, rocblas_int batch_count)

BLAS Level 3 API.

xGEMM_BATCHED performs one of the batched matrix-matrix operations

C_i = alpha*op( A_i )*op( B_i ) + beta*C_i, for i = 1, ..., batch_count.

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are batched matrices, with each op( A_i ) an m by k matrix, each op( B_i ) a k by n matrix, and each C_i an m by n matrix.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] transA: [rocblas_operation] specifies the form of op( A )

  • [in] transB: [rocblas_operation] specifies the form of op( B )

  • [in] m: [rocblas_int] matrix dimension m.

  • [in] n: [rocblas_int] matrix dimension n.

  • [in] k: [rocblas_int] matrix dimension k.

  • [in] alpha: device pointer or host pointer specifying the scalar alpha.

  • [in] A: device array of device pointers storing each matrix A_i.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i.

  • [in] B: device array of device pointers storing each matrix B_i.

  • [in] ldb: [rocblas_int] specifies the leading dimension of each B_i.

  • [in] beta: device pointer or host pointer specifying the scalar beta.

  • [inout] C: device array of device pointers storing each matrix C_i.

  • [in] ldc: [rocblas_int] specifies the leading dimension of each C_i.

  • [in] batch_count: [rocblas_int] number of gemm operations in the batch

rocblas_status rocblas_dgemm_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *const A[], rocblas_int lda, const double *const B[], rocblas_int ldb, const double *beta, double *const C[], rocblas_int ldc, rocblas_int batch_count)
rocblas_status rocblas_hgemm_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *const A[], rocblas_int lda, const rocblas_half *const B[], rocblas_int ldb, const rocblas_half *beta, rocblas_half *const C[], rocblas_int ldc, rocblas_int batch_count)
rocblas_status rocblas_cgemm_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_float_complex *alpha, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *const B[], rocblas_int ldb, const rocblas_float_complex *beta, rocblas_float_complex *const C[], rocblas_int ldc, rocblas_int batch_count)
rocblas_status rocblas_zgemm_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_double_complex *alpha, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *const B[], rocblas_int ldb, const rocblas_double_complex *beta, rocblas_double_complex *const C[], rocblas_int ldc, rocblas_int batch_count)
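
Because the batched API takes device arrays of device pointers, a hypothetical sketch of building those arrays and launching the batch follows; the handle and the host pointer arrays hA_ptrs, hB_ptrs, hC_ptrs (each holding batch_count previously allocated device matrices) are assumptions for illustration.

// Hypothetical sketch: upload the pointer arrays, then run C_i = A_i * B_i for each batch entry.
#include <hip/hip_runtime.h>
#include "rocblas.h"

rocblas_status gemm_batched_nn(rocblas_handle handle, rocblas_int m, rocblas_int n, rocblas_int k,
                               float* const hA_ptrs[], rocblas_int lda,
                               float* const hB_ptrs[], rocblas_int ldb,
                               float* const hC_ptrs[], rocblas_int ldc,
                               rocblas_int batch_count)
{
    float **dA_arr, **dB_arr, **dC_arr;
    size_t bytes = sizeof(float*) * batch_count;
    hipMalloc((void**)&dA_arr, bytes);
    hipMalloc((void**)&dB_arr, bytes);
    hipMalloc((void**)&dC_arr, bytes);
    hipMemcpy(dA_arr, hA_ptrs, bytes, hipMemcpyHostToDevice);
    hipMemcpy(dB_arr, hB_ptrs, bytes, hipMemcpyHostToDevice);
    hipMemcpy(dC_arr, hC_ptrs, bytes, hipMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    rocblas_status status = rocblas_sgemm_batched(handle, rocblas_operation_none,
                                                  rocblas_operation_none, m, n, k, &alpha,
                                                  dA_arr, lda, dB_arr, ldb,
                                                  &beta, dC_arr, ldc, batch_count);
    hipFree(dA_arr);
    hipFree(dB_arr);
    hipFree(dC_arr);
    return status;
}
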
rocblas_<type>gemm_strided_batched()
rocblas_status rocblas_dgemm_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_stride stride_a, const double *B, rocblas_int ldb, rocblas_stride stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)
rocblas_status rocblas_sgemm_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_stride stride_a, const float *B, rocblas_int ldb, rocblas_stride stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)

BLAS Level 3 API.

xGEMM_STRIDED_BATCHED performs one of the strided batched matrix-matrix operations

C_i = alpha*op( A_i )*op( B_i ) + beta*C_i, for i = 1, ..., batch_count.

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are strided batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) an k by n by batch_count strided_batched matrix and C an m by n by batch_count strided_batched matrix.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] transA: [rocblas_operation] specifies the form of op( A )

  • [in] transB: [rocblas_operation] specifies the form of op( B )

  • [in] m: [rocblas_int] matrix dimension m.

  • [in] n: [rocblas_int] matrix dimension n.

  • [in] k: [rocblas_int] matrix dimension k.

  • [in] alpha: device pointer or host pointer specifying the scalar alpha.

  • [in] A: device pointer pointing to the first matrix A_1.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i.

  • [in] stride_a: [rocblas_stride] stride from the start of one A_i matrix to the next A_(i + 1).

  • [in] B: device pointer pointing to the first matrix B_1.

  • [in] ldb: [rocblas_int] specifies the leading dimension of each B_i.

  • [in] stride_b: [rocblas_stride] stride from the start of one B_i matrix to the next B_(i + 1).

  • [in] beta: device pointer or host pointer specifying the scalar beta.

  • [inout] C: device pointer pointing to the first matrix C_1.

  • [in] ldc: [rocblas_int] specifies the leading dimension of each C_i.

  • [in] stride_c: [rocblas_stride] stride from the start of one C_i matrix to the next C_(i + 1).

  • [in] batch_count: [rocblas_int] number of gemm operations in the batch

rocblas_status rocblas_hgemm_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, rocblas_stride stride_a, const rocblas_half *B, rocblas_int ldb, rocblas_stride stride_b, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)
rocblas_status rocblas_cgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_a, const rocblas_float_complex *B, rocblas_int ldb, rocblas_stride stride_b, const rocblas_float_complex *beta, rocblas_float_complex *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)
rocblas_status rocblas_zgemm_strided_batched(rocblas_handle handle, rocblas_operation transa, rocblas_operation transb, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_a, const rocblas_double_complex *B, rocblas_int ldb, rocblas_stride stride_b, const rocblas_double_complex *beta, rocblas_double_complex *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)
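
A hypothetical sketch with contiguously packed batches follows; the handle and the packed device buffers dA, dB, dC are assumed to be prepared by the caller, and the strides are chosen so that consecutive matrices sit back to back.

// Hypothetical sketch: batch of GEMMs where A_(i+1) starts lda*k elements after A_i, and so on.
#include "rocblas.h"

rocblas_status gemm_strided_batched_nn(rocblas_handle handle,
                                       rocblas_int m, rocblas_int n, rocblas_int k,
                                       const float* dA, rocblas_int lda,
                                       const float* dB, rocblas_int ldb,
                                       float* dC, rocblas_int ldc,
                                       rocblas_int batch_count)
{
    const rocblas_stride stride_a = (rocblas_stride)lda * k; // A_i are m x k, column major
    const rocblas_stride stride_b = (rocblas_stride)ldb * n; // B_i are k x n
    const rocblas_stride stride_c = (rocblas_stride)ldc * n; // C_i are m x n
    const float alpha = 1.0f, beta = 0.0f;
    return rocblas_sgemm_strided_batched(handle, rocblas_operation_none, rocblas_operation_none,
                                         m, n, k, &alpha,
                                         dA, lda, stride_a,
                                         dB, ldb, stride_b,
                                         &beta, dC, ldc, stride_c,
                                         batch_count);
}
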
rocblas_<type>gemm_kernel_name()
rocblas_status rocblas_dgemm_kernel_name(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const double *alpha, const double *A, rocblas_int lda, rocblas_stride stride_a, const double *B, rocblas_int ldb, rocblas_stride stride_b, const double *beta, double *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)
rocblas_status rocblas_sgemm_kernel_name(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const float *alpha, const float *A, rocblas_int lda, rocblas_stride stride_a, const float *B, rocblas_int ldb, rocblas_stride stride_b, const float *beta, float *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)
rocblas_status rocblas_hgemm_kernel_name(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const rocblas_half *alpha, const rocblas_half *A, rocblas_int lda, rocblas_stride stride_a, const rocblas_half *B, rocblas_int ldb, rocblas_stride stride_b, const rocblas_half *beta, rocblas_half *C, rocblas_int ldc, rocblas_stride stride_c, rocblas_int batch_count)
rocblas_<type>geam()
rocblas_status rocblas_dgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *beta, const double *B, rocblas_int ldb, double *C, rocblas_int ldc)
rocblas_status rocblas_sgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *beta, const float *B, rocblas_int ldb, float *C, rocblas_int ldc)

BLAS Level 3 API.

xGEAM performs one of the matrix-matrix operations

C = alpha*op( A ) + beta*op( B ),

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are matrices, with op( A ) an m by n matrix, op( B ) an m by n matrix, and C an m by n matrix.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] transA: [rocblas_operation] specifies the form of op( A )

  • [in] transB: [rocblas_operation] specifies the form of op( B )

  • [in] m: [rocblas_int] matrix dimension m.

  • [in] n: [rocblas_int] matrix dimension n.

  • [in] alpha: device pointer or host pointer specifying the scalar alpha.

  • [in] A: device pointer storing matrix A.

  • [in] lda: [rocblas_int] specifies the leading dimension of A.

  • [in] beta: device pointer or host pointer specifying the scalar beta.

  • [in] B: device pointer storing matrix B.

  • [in] ldb: [rocblas_int] specifies the leading dimension of B.

  • [inout] C: device pointer storing matrix C.

  • [in] ldc: [rocblas_int] specifies the leading dimension of C.
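
A hypothetical sketch of an out-of-place matrix addition follows; the handle and the m by n device matrices dA, dB, dC are assumed to be set up elsewhere.

// Hypothetical sketch: C = alpha*A + beta*B (all m x n, column major, no transposition).
#include "rocblas.h"

rocblas_status add_matrices(rocblas_handle handle, rocblas_int m, rocblas_int n,
                            const float* dA, rocblas_int lda,
                            const float* dB, rocblas_int ldb,
                            float* dC, rocblas_int ldc)
{
    const float alpha = 1.0f, beta = 1.0f;
    return rocblas_sgeam(handle, rocblas_operation_none, rocblas_operation_none,
                         m, n, &alpha, dA, lda, &beta, dB, ldb, dC, ldc);
}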

BLAS Extensions
rocblas_gemm_ex()
rocblas_status rocblas_gemm_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)

BLAS EX API.

GEMM_EX performs one of the matrix-matrix operations

D = alpha*op( A )*op( B ) + beta*C,

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C and D are m by n matrices.

Supported types are as follows:

  • rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type

  • rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type

  • rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type

  • rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type

Below are restrictions for rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type:

  • k must be a multiple of 4

  • lda must be a multiple of 4 if transA == rocblas_operation_transpose

  • ldb must be a multiple of 4 if transB == rocblas_operation_none

  • for transA == rocblas_operation_none or transB == rocblas_operation_transpose, the matrices A and B must each have 4 consecutive values in the k dimension packed. This packing can be achieved with the following pseudo-code. The code assumes the original matrices are in A and B, and the packed matrices are A_packed and B_packed. The size of the A_packed matrix is the same as the size of the A matrix, and the size of the B_packed matrix is the same as the size of the B matrix.

if(transA == rocblas_operation_none)
{
    int nb = 4;
    for(int i_m = 0; i_m < m; i_m++)
    {
        for(int i_k = 0; i_k < k; i_k++)
        {
            A_packed[i_k % nb + (i_m + (i_k / nb) * lda) * nb] = A[i_m + i_k * lda];
        }
    }
}
else
{
    A_packed = A;
}
if(transB == rocblas_operation_transpose)
{
    // B is n by k in this case; pack groups of 4 consecutive k values of each row.
    int nb = 4;
    for(int i_n = 0; i_n < n; i_n++)
    {
        for(int i_k = 0; i_k < k; i_k++)
        {
            B_packed[i_k % nb + (i_n + (i_k / nb) * ldb) * nb] = B[i_n + i_k * ldb];
        }
    }
}
else
{
    B_packed = B;
}

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] transA: [rocblas_operation] specifies the form of op( A ).

  • [in] transB: [rocblas_operation] specifies the form of op( B ).

  • [in] m: [rocblas_int] matrix dimension m.

  • [in] n: [rocblas_int] matrix dimension n.

  • [in] k: [rocblas_int] matrix dimension k.

  • [in] alpha: [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • [in] a: [void *] device pointer storing matrix A.

  • [in] a_type: [rocblas_datatype] specifies the datatype of matrix A.

  • [in] lda: [rocblas_int] specifies the leading dimension of A.

  • [in] b: [void *] device pointer storing matrix B.

  • [in] b_type: [rocblas_datatype] specifies the datatype of matrix B.

  • [in] ldb: [rocblas_int] specifies the leading dimension of B.

  • [in] beta: [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • [in] c: [void *] device pointer storing matrix C.

  • [in] c_type: [rocblas_datatype] specifies the datatype of matrix C.

  • [in] ldc: [rocblas_int] specifies the leading dimension of C.

  • [out] d: [void *] device pointer storing matrix D.

  • [in] d_type: [rocblas_datatype] specifies the datatype of matrix D.

  • [in] ldd: [rocblas_int] specifies the leading dimension of D.

  • [in] compute_type: [rocblas_datatype] specifies the datatype of computation.

  • [in] algo: [rocblas_gemm_algo] enumerant specifying the algorithm type.

  • [in] solution_index: [int32_t] reserved for future use.

  • [in] flags: [uint32_t] reserved for future use.
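
A hypothetical mixed-precision sketch follows, using the f16 input / f32 compute combination from the supported-type list above; the handle and the rocblas_half device buffers dA, dB, dC, dD are assumed to be prepared by the caller.

// Hypothetical sketch: D = alpha*A*B + beta*C with f16 matrices and f32 accumulation.
#include "rocblas.h"

rocblas_status gemm_ex_f16_f32(rocblas_handle handle, rocblas_int m, rocblas_int n, rocblas_int k,
                               const rocblas_half* dA, rocblas_int lda,
                               const rocblas_half* dB, rocblas_int ldb,
                               const rocblas_half* dC, rocblas_int ldc,
                               rocblas_half* dD, rocblas_int ldd)
{
    const float alpha = 1.0f, beta = 0.0f; // same datatype as compute_type (f32)
    return rocblas_gemm_ex(handle, rocblas_operation_none, rocblas_operation_none,
                           m, n, k, &alpha,
                           dA, rocblas_datatype_f16_r, lda,
                           dB, rocblas_datatype_f16_r, ldb,
                           &beta,
                           dC, rocblas_datatype_f16_r, ldc,
                           dD, rocblas_datatype_f16_r, ldd,
                           rocblas_datatype_f32_r,
                           rocblas_gemm_algo_standard,
                           0, 0); // solution_index and flags are reserved
}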

rocblas_gemm_strided_batched_ex()
rocblas_status rocblas_gemm_strided_batched_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_stride stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_stride stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_stride stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_stride stride_d, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)

BLAS EX API.

GEMM_STRIDED_BATCHED_EX performs one of the strided_batched matrix-matrix operations

D_i = alpha*op(A_i)*op(B_i) + beta*C_i, for i = 1, ..., batch_count

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are strided_batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C and D are m by n by batch_count strided_batched matrices.

The strided_batched matrices are multiple matrices separated by a constant stride. The number of matrices is batch_count.

Supported types are as follows:

  • rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type

  • rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type

  • rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type

  • rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type

  • rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type

Below are restrictions for rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type:

  • k must be a multiple of 4

  • lda must be a multiple of 4 if transA == rocblas_operation_transpose

  • ldb must be a multiple of 4 if transB == rocblas_operation_none

  • for transA == rocblas_operation_none or transB == rocblas_operation_transpose, the matrices A and B must each have 4 consecutive values in the k dimension packed. This packing can be achieved with the following pseudo-code. The code assumes the original matrices are in A and B, and the packed matrices are A_packed and B_packed. The size of the A_packed matrix is the same as the size of the A matrix, and the size of the B_packed matrix is the same as the size of the B matrix.

if(transA == rocblas_operation_none)
{
    int nb = 4;
    for(int i_m = 0; i_m < m; i_m++)
    {
        for(int i_k = 0; i_k < k; i_k++)
        {
            A_packed[i_k % nb + (i_m + (i_k / nb) * lda) * nb] = A[i_m + i_k * lda];
        }
    }
}
else
{
    A_packed = A;
}
if(transB == rocblas_operation_transpose)
{
    // B is n by k in this case; pack groups of 4 consecutive k values of each row.
    int nb = 4;
    for(int i_n = 0; i_n < n; i_n++)
    {
        for(int i_k = 0; i_k < k; i_k++)
        {
            B_packed[i_k % nb + (i_n + (i_k / nb) * ldb) * nb] = B[i_n + i_k * ldb];
        }
    }
}
else
{
    B_packed = B;
}

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] transA: [rocblas_operation] specifies the form of op( A ).

  • [in] transB: [rocblas_operation] specifies the form of op( B ).

  • [in] m: [rocblas_int] matrix dimension m.

  • [in] n: [rocblas_int] matrix dimension n.

  • [in] k: [rocblas_int] matrix dimension k.

  • [in] alpha: [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.

  • [in] a: [void *] device pointer pointing to first matrix A_1.

  • [in] a_type: [rocblas_datatype] specifies the datatype of each matrix A_i.

  • [in] lda: [rocblas_int] specifies the leading dimension of each A_i.

  • [in] stride_a: [rocblas_stride] specifies stride from start of one A_i matrix to the next A_(i + 1).

  • [in] b: [void *] device pointer pointing to first matrix B_1.

  • [in] b_type: [rocblas_datatype] specifies the datatype of each matrix B_i.

  • [in] ldb: [rocblas_int] specifies the leading dimension of each B_i.

  • [in] stride_b: [rocblas_stride] specifies stride from start of one B_i matrix to the next B_(i + 1).

  • [in] beta: [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.

  • [in] c: [void *] device pointer pointing to first matrix C_1.

  • [in] c_type: [rocblas_datatype] specifies the datatype of each matrix C_i.

  • [in] ldc: [rocblas_int] specifies the leading dimension of each C_i.

  • [in] stride_c: [rocblas_stride] specifies stride from start of one C_i matrix to the next C_(i + 1).

  • [out] d: [void *] device pointer storing each matrix D_i.

  • [in] d_type: [rocblas_datatype] specifies the datatype of each matrix D_i.

  • [in] ldd: [rocblas_int] specifies the leading dimension of each D_i.

  • [in] stride_d: [rocblas_stride] specifies stride from start of one D_i matrix to the next D_(i + 1).

  • [in] batch_count: [rocblas_int] number of gemm operations in the batch.

  • [in] compute_type: [rocblas_datatype] specifies the datatype of computation.

  • [in] algo: [rocblas_gemm_algo] enumerant specifying the algorithm type.

  • [in] solution_index: [int32_t] reserved for future use.

  • [in] flags: [uint32_t] reserved for future use.

rocblas_trsm_ex()
rocblas_status rocblas_trsm_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, void *B, rocblas_int ldb, const void *invA, rocblas_int invA_size, rocblas_datatype compute_type)

BLAS EX API

TRSM_EX solves

op(A)*X = alpha*B or X*op(A) = alpha*B,

where alpha is a scalar, X and B are m by n matrices, A is triangular matrix and op(A) is one of

op( A ) = A   or   op( A ) = A^T   or   op( A ) = A^H.

The matrix X is overwritten on B.

TRSM_EX gives the user the ability to reuse the invA matrix between runs. If invA == NULL, rocblas_trsm_ex will automatically calculate invA on every run.

Setting up invA: The accepted invA matrix consists of the packed 128x128 inverses of the diagonal blocks of matrix A, followed by any smaller diagonal block that remains. To set up invA it is recommended that rocblas_trtri_batched be used with matrix A as the input.

Device memory of size 128 x k should be allocated for invA ahead of time, where k is m when rocblas_side_left and is n when rocblas_side_right. The actual number of elements in invA should be passed as invA_size.

To begin, rocblas_trtri_batched must be called on the full 128x128 sized diagonal blocks of matrix A. Below are the restricted parameters:

  • n = 128

  • ldinvA = 128

  • stride_invA = 128x128

  • batch_count = k / 128,

Then any remaining block may be added:

  • n = k % 128

  • invA = invA + stride_invA * previous_batch_count

  • ldinvA = 128

  • batch_count = 1

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] side: [rocblas_side] rocblas_side_left: op(A)*X = alpha*B. rocblas_side_right: X*op(A) = alpha*B.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: A is an upper triangular matrix. rocblas_fill_lower: A is a lower triangular matrix.

  • [in] transA: [rocblas_operation] rocblas_operation_none: op(A) = A. rocblas_operation_transpose: op(A) = A^T. rocblas_operation_conjugate_transpose: op(A) = A^H.

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: A is assumed to be unit triangular. rocblas_diagonal_non_unit: A is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of B. m >= 0.

  • [in] n: [rocblas_int] n specifies the number of columns of B. n >= 0.

  • [in] alpha: [void *] device pointer or host pointer specifying the scalar alpha. When alpha is &zero then A is not referenced, and B need not be set before entry.

  • [in] A: [void *] device pointer storing matrix A, of dimension ( lda, k ), where k is m when side is rocblas_side_left and n when side is rocblas_side_right. Only the upper/lower triangular part is accessed.

  • [in] lda: [rocblas_int] lda specifies the first dimension of A. if side = rocblas_side_left, lda >= max( 1, m ), if side = rocblas_side_right, lda >= max( 1, n ).

  • [inout] B: [void *] device pointer storing matrix B. B is of dimension ( ldb, n ). Before entry, the leading m by n part of the array B must contain the right-hand side matrix B, and on exit is overwritten by the solution matrix X.

  • [in] ldb: [rocblas_int] ldb specifies the first dimension of B. ldb >= max( 1, m ).

  • [in] invA: [void *] device pointer storing the inverse diagonal blocks of A. invA is of dimension ( ld_invA, k ), where k is m when rocblas_side_left and is n when rocblas_side_right. ld_invA must be equal to 128.

  • [in] invA_size: [rocblas_int] invA_size specifies the number of elements of device memory in invA.

  • [in] compute_type: [rocblas_datatype] specifies the datatype of computation
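
A hypothetical sketch follows in which invA is passed as NULL, so the library computes the inverse diagonal blocks internally on each call; the handle, dA, and dB are assumed to be prepared device data.

// Hypothetical sketch: solve A * X = alpha * B in single precision via rocblas_trsm_ex.
#include <stddef.h>
#include "rocblas.h"

rocblas_status trsm_ex_simple(rocblas_handle handle, rocblas_int m, rocblas_int n,
                              const float* dA, rocblas_int lda,
                              float* dB, rocblas_int ldb)
{
    const float alpha = 1.0f;
    // invA == NULL and invA_size == 0: the inverse diagonal blocks are built internally.
    return rocblas_trsm_ex(handle, rocblas_side_left, rocblas_fill_lower,
                           rocblas_operation_none, rocblas_diagonal_non_unit,
                           m, n, &alpha, dA, lda, dB, ldb,
                           NULL, 0, rocblas_datatype_f32_r);
}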

rocblas_trsm_batched_ex()
rocblas_status rocblas_trsm_batched_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, void *B, rocblas_int ldb, rocblas_int batch_count, const void *invA, rocblas_int invA_size, rocblas_datatype compute_type)

BLAS EX API

TRSM_BATCHED_EX solves

op(A_i)*X_i = alpha*B_i or X_i*op(A_i) = alpha*B_i,

for i = 1, …, batch_count; and where alpha is a scalar, X and B are arrays of m by n matrices, A is an array of triangular matrices, and each op(A_i) is one of

op( A_i ) = A_i   or   op( A_i ) = A_i^T   or   op( A_i ) = A_i^H.

Each matrix X_i is overwritten on B_i.

TRSM_BATCHED_EX gives the user the ability to reuse the invA matrices between runs. If invA == NULL, rocblas_trsm_batched_ex will automatically calculate each invA_i on every run.

Setting up invA: Each accepted invA_i matrix consists of the packed 128x128 inverses of the diagonal blocks of matrix A_i, followed by any smaller diagonal block that remains. To set up each invA_i it is recommended that rocblas_trtri_batched be used with matrix A_i as the input. invA is an array of pointers of batch_count length holding each invA_i.

Device memory of size 128 x k should be allocated for each invA_i ahead of time, where k is m when rocblas_side_left and is n when rocblas_side_right. The actual number of elements in each invA_i should be passed as invA_size.

To begin, rocblas_trtri_batched must be called on the full 128x128 sized diagonal blocks of each matrix A_i. Below are the restricted parameters:

  • n = 128

  • ldinvA = 128

  • stride_invA = 128x128

  • batch_count = k / 128,

Then any remaining block may be added:

  • n = k % 128

  • invA = invA + stride_invA * previous_batch_count

  • ldinvA = 128

  • batch_count = 1

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] side: [rocblas_side] rocblas_side_left: op(A)*X = alpha*B. rocblas_side_right: X*op(A) = alpha*B.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: each A_i is an upper triangular matrix. rocblas_fill_lower: each A_i is a lower triangular matrix.

  • [in] transA: [rocblas_operation] rocblas_operation_none: op(A) = A. rocblas_operation_transpose: op(A) = A^T. rocblas_operation_conjugate_transpose: op(A) = A^H.

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: each A_i is assumed to be unit triangular. rocblas_diagonal_non_unit: each A_i is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of each B_i. m >= 0.

  • [in] n: [rocblas_int] n specifies the number of columns of each B_i. n >= 0.

  • [in] alpha: [void *] device pointer or host pointer specifying the scalar alpha. When alpha is &zero then A is not referenced, and B need not be set before entry.

  • [in] A: [void *] device array of device pointers storing each matrix A_i. Each A_i is of dimension ( lda, k ), where k is m when rocblas_side_left and n when rocblas_side_right. Only the upper/lower triangular part of each A_i is accessed.

  • [in] lda: [rocblas_int] lda specifies the first dimension of each A_i. if side = rocblas_side_left, lda >= max( 1, m ), if side = rocblas_side_right, lda >= max( 1, n ).

  • [inout] B: [void *] device array of device pointers storing each matrix B_i. Each B_i is of dimension ( ldb, n ). Before entry, the leading m by n part of each B_i must contain the right-hand side matrix B_i, and on exit it is overwritten by the solution matrix X_i.

  • [in] ldb: [rocblas_int] ldb specifies the first dimension of each B_i. ldb >= max( 1, m ).

  • [in] batch_count: [rocblas_int] specifies how many batches.

  • [in] invA: [void *] device array of device pointers storing the inverse diagonal blocks of each A_i. each invA_i is of dimension ( ld_invA, k ), where k is m when rocblas_side_left and is n when rocblas_side_right. ld_invA must be equal to 128.

  • [in] invA_size: [rocblas_int] invA_size specifies the number of elements of device memory in each invA_i.

  • [in] compute_type: [rocblas_datatype] specifies the datatype of computation
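
Because the batched form takes device arrays of device pointers, one common setup is sketched below for the B argument. hB_ptrs is an assumed host array already holding batch_count device pointers, each pointing at a separately allocated ldb-by-n matrix; the same pattern applies to A and invA.

// Hedged sketch: upload a host array of device pointers so it can be passed
// as the B argument of rocblas_trsm_batched_ex.
float **dB_array = NULL;
hipMalloc((void **)&dB_array, batch_count * sizeof(float *));
hipMemcpy(dB_array, hB_ptrs, batch_count * sizeof(float *), hipMemcpyHostToDevice);
// dB_array is what rocblas_trsm_batched_ex expects for B (as a void * argument).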

rocblas_trsm_strided_batched_ex()
rocblas_status rocblas_trsm_strided_batched_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, rocblas_stride stride_A, void *B, rocblas_int ldb, rocblas_stride stride_B, rocblas_int batch_count, const void *invA, rocblas_int invA_size, rocblas_stride stride_invA, rocblas_datatype compute_type)

BLAS EX API

TRSM_STRIDED_BATCHED_EX solves

op(A_i)*X_i = alpha*B_i or X_i*op(A_i) = alpha*B_i,

for i = 1, …, batch_count; and where alpha is a scalar, X and B are strided batched m by n matrices, A is a strided batched triangular matrix and op(A_i) is one of

op( A_i ) = A_i   or   op( A_i ) = A_i^T   or   op( A_i ) = A_i^H.

Each matrix X_i is overwritten on B_i.

TRSM_STRIDED_BATCHED_EX gives the user the ability to reuse each invA_i matrix between runs. If invA == NULL, rocblas_trsm_strided_batched_ex will automatically calculate each invA_i on every run.

Setting up invA: Each accepted invA_i matrix consists of the packed 128x128 inverses of the diagonal blocks of matrix A_i, followed by any smaller diagonal block that remains. To set up invA_i it is recommended that rocblas_trtri_batched be used with matrix A_i as the input. invA is a contiguous piece of memory holding each invA_i.

Device memory of size 128 x k should be allocated for each invA_i ahead of time, where k is m when rocblas_side_left and is n when rocblas_side_right. The actual number of elements in each invA_i should be passed as invA_size.

To begin, rocblas_trtri_batched must be called on the full 128x128 sized diagonal blocks of each matrix A_i. Below are the restricted parameters:

  • n = 128

  • ldinvA = 128

  • stride_invA = 128x128

  • batch_count = k / 128,

Then any remaining block may be added:

  • n = k % 128

  • invA = invA + stride_invA * previous_batch_count

  • ldinvA = 128

  • batch_count = 1

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] side: [rocblas_side] rocblas_side_left: op(A)*X = alpha*B. rocblas_side_right: X*op(A) = alpha*B.

  • [in] uplo: [rocblas_fill] rocblas_fill_upper: each A_i is an upper triangular matrix. rocblas_fill_lower: each A_i is a lower triangular matrix.

  • [in] transA: [rocblas_operation] rocblas_operation_none: op(A) = A. rocblas_operation_transpose: op(A) = A^T. rocblas_operation_conjugate_transpose: op(A) = A^H.

  • [in] diag: [rocblas_diagonal] rocblas_diagonal_unit: each A_i is assumed to be unit triangular. rocblas_diagonal_non_unit: each A_i is not assumed to be unit triangular.

  • [in] m: [rocblas_int] m specifies the number of rows of each B_i. m >= 0.

  • [in] n: [rocblas_int] n specifies the number of columns of each B_i. n >= 0.

  • [in] alpha: [void *] device pointer or host pointer specifying the scalar alpha. When alpha is &zero then A is not referenced, and B need not be set before entry.

  • [in] A: [void *] device pointer pointing to the first matrix A_1. Each A_i is of dimension ( lda, k ), where k is m when rocblas_side_left and n when rocblas_side_right. Only the upper/lower triangular part of each A_i is accessed.

  • [in] lda: [rocblas_int] lda specifies the first dimension of A. if side = rocblas_side_left, lda >= max( 1, m ), if side = rocblas_side_right, lda >= max( 1, n ).

  • [in] stride_A: [rocblas_stride] The stride between each A matrix.

  • [inout] B: [void *] device pointer pointing to the first matrix B_1. Each B_i is of dimension ( ldb, n ). Before entry, the leading m by n part of each B_i must contain the right-hand side matrix B_i, and on exit it is overwritten by the solution matrix X_i.

  • [in] ldb: [rocblas_int] ldb specifies the first dimension of each B_i. ldb >= max( 1, m ).

  • [in] stride_B: [rocblas_stride] The stride between each B_i matrix.

  • [in] batch_count: [rocblas_int] specifies how many batches.

  • [in] invA: [void *] device pointer storing the inverse diagonal blocks of each A_i. invA points to the first invA_1. each invA_i is of dimension ( ld_invA, k ), where k is m when rocblas_side_left and is n when rocblas_side_right. ld_invA must be equal to 128.

  • [in] invA_size: [rocblas_int] invA_size specifies the number of elements of device memory in each invA_i.

  • [in] stride_invA: [rocblas_stride] The stride between each invA matrix.

  • [in] compute_type: [rocblas_datatype] specifies the datatype of computation
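
With the strided layout, a common contiguous back-to-back packing gives strides like the sketch below. It assumes single precision and rocblas_side_left (so k == m); other packings are possible as long as the strides are at least these sizes.

// Hedged sketch: element strides for matrices packed back to back.
rocblas_stride stride_A    = (rocblas_stride)lda * m;   // A_i is ( lda, m ) for rocblas_side_left
rocblas_stride stride_B    = (rocblas_stride)ldb * n;   // B_i is ( ldb, n )
rocblas_stride stride_invA = (rocblas_stride)128 * m;   // invA_i is ( 128, m ), ld_invA == 128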

Build Information
rocblas_get_version_string()
rocblas_status rocblas_get_version_string(char *buf, size_t len)

loads char* buf with the rocblas library version. size_t len is the maximum length of char* buf.

Parameters
  • [inout] buf: pointer to buffer for version string

  • [in] len: length of buf
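
A small usage sketch (buffer size chosen arbitrarily; printf assumes <stdio.h> is included):

// Hedged sketch: query and print the library version string.
char version[128];
rocblas_get_version_string(version, sizeof(version));
printf("rocBLAS version: %s\n", version);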

Auxiliary
rocblas_pointer_to_mode()
rocblas_pointer_mode rocblas_pointer_to_mode(void *ptr)

Indicates whether the pointer is on the host or device.

rocblas_create_handle()
rocblas_status rocblas_create_handle(rocblas_handle *handle)

create handle

rocblas_destroy_handle()
rocblas_status rocblas_destroy_handle(rocblas_handle handle)

destroy handle

rocblas_add_stream()
rocblas_status rocblas_add_stream(rocblas_handle handle, hipStream_t stream)

add stream to handle

rocblas_set_stream()
rocblas_status rocblas_set_stream(rocblas_handle handle, hipStream_t stream)

remove any streams from handle, and add one

rocblas_get_stream()
rocblas_status rocblas_get_stream(rocblas_handle handle, hipStream_t *stream)

get stream [0] from handle
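
For example, one way (sketched, assuming an existing handle) to attach a HIP stream and read it back:

// Hedged sketch: create a HIP stream, make the handle use it, then query it back.
hipStream_t stream;
hipStreamCreate(&stream);
rocblas_set_stream(handle, stream);     // replaces any stream previously set on the handle
hipStream_t current;
rocblas_get_stream(handle, &current);   // current now equals stream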

rocblas_set_pointer_mode()
rocblas_status rocblas_set_pointer_mode(rocblas_handle handle, rocblas_pointer_mode pointer_mode)

set rocblas_pointer_mode

rocblas_get_pointer_mode()
rocblas_status rocblas_get_pointer_mode(rocblas_handle handle, rocblas_pointer_mode *pointer_mode)

get rocblas_pointer_mode
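
A sketch of switching pointer modes on an existing handle; in device mode, scalars such as alpha and beta must be device pointers, while in host mode they are ordinary host addresses:

// Hedged sketch: read scalars from device memory, then switch back to host mode.
rocblas_set_pointer_mode(handle, rocblas_pointer_mode_device);
// ... calls whose alpha/beta arguments are device pointers ...
rocblas_set_pointer_mode(handle, rocblas_pointer_mode_host);
rocblas_pointer_mode mode;
rocblas_get_pointer_mode(handle, &mode);   // mode == rocblas_pointer_mode_host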

rocblas_set_vector()
rocblas_status rocblas_set_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)

copy vector from host to device

rocblas_set_vector_async()
rocblas_status rocblas_set_vector_async(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy, hipStream_t stream)

asynchronously copy vector from host to device

rocblas_set_vector_async copies a vector from pinned host memory to device memory asynchronously. Memory on the host must be allocated with hipHostMalloc or the transfer will be synchronous.

Parameters
  • [in] n: [rocblas_int] number of elements in the vector

  • [in] x: pointer to vector on the host

  • [in] incx: [rocblas_int] specifies the increment for the elements of the vector

  • [out] y: pointer to vector on the device

  • [in] incy: [rocblas_int] specifies the increment for the elements of the vector

  • [in] stream: specifies the stream into which this transfer request is queued
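
The pinned-memory requirement is sketched below; dx is an assumed device buffer of at least n floats and stream an existing HIP stream:

// Hedged sketch: host memory comes from hipHostMalloc so the copy is truly asynchronous.
float *hx = NULL;
hipHostMalloc((void **)&hx, n * sizeof(float), 0);
// ... fill hx on the host ...
rocblas_set_vector_async(n, sizeof(float), hx, 1, dx, 1, stream);
hipStreamSynchronize(stream);   // wait before reusing or freeing hx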

rocblas_get_vector()
rocblas_status rocblas_get_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)

copy vector from device to host

rocblas_get_vector_async()
rocblas_status rocblas_get_vector_async(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy, hipStream_t stream)

asynchronously copy vector from device to host

rocblas_get_vector_async copies a vector from device memory to pinned host memory asynchronously. Memory on the host must be allocated with hipHostMalloc or the transfer will be synchronous.

Parameters
  • [in] n: [rocblas_int] number of elements in the vector

  • [in] x: pointer to vector on the device

  • [in] incx: [rocblas_int] specifies the increment for the elements of the vector

  • [out] y: pointer to vector on the host

  • [in] incy: [rocblas_int] specifies the increment for the elements of the vector

  • [in] stream: specifies the stream into which this transfer request is queued

rocblas_set_matrix()
rocblas_status rocblas_set_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)

copy matrix from host to device

rocblas_get_matrix()
rocblas_status rocblas_get_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)

copy matrix from device to host
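
For example (sketch; hA is an assumed host matrix with leading dimension ld_host, dA a device buffer with leading dimension ld_dev, both holding single-precision values):

// Hedged sketch: copy an m-by-n tile to the device and back.
rocblas_set_matrix(m, n, sizeof(float), hA, ld_host, dA, ld_dev);   // host -> device
rocblas_get_matrix(m, n, sizeof(float), dA, ld_dev, hA, ld_host);   // device -> host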

rocblas_get_matrix_async()
rocblas_status rocblas_get_matrix_async(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb, hipStream_t stream)

asynchronously copy matrix from device to host

rocblas_get_matrix_async copies a matrix from device memory to pinned host memory asynchronously. Memory on the host must be allocated with hipHostMalloc or the transfer will be synchronous.

Parameters
  • [in] rows: [rocblas_int] number of rows in matrices

  • [in] cols: [rocblas_int] number of columns in matrices

  • [in] elem_size: [rocblas_int] number of bytes per element in the matrix

  • [in] a: pointer to matrix on the GPU

  • [in] lda: [rocblas_int] specifies the leading dimension of A

  • [out] b: pointer to matrix on the host

  • [in] ldb: [rocblas_int] specifies the leading dimension of B

  • [in] stream: specifies the stream into which this transfer request is queued

rocblas_start_device_memory_size_query()
rocblas_status rocblas_start_device_memory_size_query(rocblas_handle handle)

Indicates that subsequent rocBLAS kernel calls should collect the optimal device memory size in bytes for their given kernel arguments, and keep track of the maximum. Each kernel call can reuse temporary device memory on the same stream, so the maximum is collected. Returns rocblas_status_size_query_mismatch if another size query is already in progress; returns rocblas_status_success otherwise.

Parameters
  • [in] handle: rocblas handle

rocblas_stop_device_memory_size_query()
rocblas_status rocblas_stop_device_memory_size_query(rocblas_handle handle, size_t *size)

Stops collecting optimal device memory size information. Returns rocblas_status_size_query_mismatch if a collection is not underway; rocblas_status_invalid_handle if handle is nullptr; rocblas_status_invalid_pointer if size is nullptr; rocblas_status_success otherwise.

Parameters
  • [in] handle: rocblas handle

  • [out] size: maximum of the optimal sizes collected

rocblas_get_device_memory_size()
rocblas_status rocblas_get_device_memory_size(rocblas_handle handle, size_t *size)

Gets the current device memory size for the handle. Returns rocblas_status_invalid_handle if handle is nullptr; rocblas_status_invalid_pointer if size is nullptr; rocblas_status_success otherwise.

Parameters
  • [in] handle: rocblas handle

  • [out] size: current device memory size for the handle

rocblas_set_device_memory_size()
rocblas_status rocblas_set_device_memory_size(rocblas_handle handle, size_t size)

Changes the size of allocated device memory at runtime.

Any previously allocated device memory is freed.

If size > 0, sets the device memory size to the specified size (in bytes). If size == 0, frees the memory allocated so far and lets rocBLAS manage device memory in the future, expanding it when necessary. Returns rocblas_status_invalid_handle if handle is nullptr; rocblas_status_invalid_pointer if size is nullptr; rocblas_status_success otherwise.

Parameters
  • [in] handle: rocblas handle

  • [in] size: size of allocated device memory
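
Putting the three calls above together, the intended workflow looks roughly like the sketch below. The SGEMV call and its arguments (alpha, beta, dA, dx, dy, m, n, lda) are placeholders for whatever sequence of rocBLAS calls the application will actually run.

// Hedged sketch: measure the workspace needed by a set of calls, then allocate it once.
size_t required = 0;
rocblas_start_device_memory_size_query(handle);
rocblas_sgemv(handle, rocblas_operation_none, m, n,
              &alpha, dA, lda, dx, 1, &beta, dy, 1);    // size-query pass
rocblas_stop_device_memory_size_query(handle, &required);
rocblas_set_device_memory_size(handle, required);       // frees old workspace, allocates 'required' bytes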

rocblas_is_managing_device_memory()
bool rocblas_is_managing_device_memory(rocblas_handle handle)

Returns true when device memory in handle is managed by rocBLAS

Parameters
  • [in] handle: rocblas handle

All API

struct rocblas_half
#include <rocblas-types.h>

Represents a 16 bit floating point number.

Public Members

uint16_t data
namespace rocblas

Functions

void reinit_logs()
file rocblas-auxiliary.h
#include "rocblas-export.h"
#include "rocblas-types.h"

rocblas-auxiliary.h provides auxiliary functions in rocblas

Functions

rocblas_status rocblas_create_handle(rocblas_handle *handle)

create handle

rocblas_status rocblas_destroy_handle(rocblas_handle handle)

destroy handle

rocblas_status rocblas_add_stream(rocblas_handle handle, hipStream_t stream)

add stream to handle

rocblas_status rocblas_set_stream(rocblas_handle handle, hipStream_t stream)

remove any streams from handle, and add one

rocblas_status rocblas_get_stream(rocblas_handle handle, hipStream_t *stream)

get stream [0] from handle

rocblas_status rocblas_set_pointer_mode(rocblas_handle handle, rocblas_pointer_mode pointer_mode)

set rocblas_pointer_mode

rocblas_status rocblas_get_pointer_mode(rocblas_handle handle, rocblas_pointer_mode *pointer_mode)

get rocblas_pointer_mode

rocblas_pointer_mode rocblas_pointer_to_mode(void *ptr)

Indicates whether the pointer is on the host or device.

rocblas_status rocblas_set_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)

copy vector from host to device

rocblas_status rocblas_get_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)

copy vector from device to host

rocblas_status rocblas_set_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)

copy matrix from host to device

rocblas_status rocblas_get_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)

copy matrix from device to host

rocblas_status rocblas_set_vector_async(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy, hipStream_t stream)

asynchronously copy vector from host to device

rocblas_set_vector_async copies a vector from pinned host memory to device memory asynchronously. Memory on the host must be allocated with hipHostMalloc or the transfer will be synchronous.

Parameters
  • [in] n: [rocblas_int] number of elements in the vector

  • [in] x: pointer to vector on the host

  • [in] incx: [rocblas_int] specifies the increment for the elements of the vector

  • [out] y: pointer to vector on the device

  • [in] incy: [rocblas_int] specifies the increment for the elements of the vector

  • [in] stream: specifies the stream into which this transfer request is queued

rocblas_status rocblas_get_vector_async(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy, hipStream_t stream)

asynchronously copy vector from device to host

rocblas_get_vector_async copies a vector from device memory to pinned host memory asynchronously. Memory on the host must be allocated with hipHostMalloc or the transfer will be synchronous.

Parameters
  • [in] n: [rocblas_int] number of elements in the vector

  • [in] x: pointer to vector on the device

  • [in] incx: [rocblas_int] specifies the increment for the elements of the vector

  • [out] y: pointer to vector on the host

  • [in] incy: [rocblas_int] specifies the increment for the elements of the vector

  • [in] stream: specifies the stream into which this transfer request is queued

rocblas_status rocblas_set_matrix_async(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb, hipStream_t stream)

asynchronously copy matrix from host to device

rocblas_set_matrix_async copies a matrix from pinned host memory to device memory asynchronously. Memory on the host must be allocated with hipHostMalloc or the transfer will be synchronous.

Parameters
  • [in] rows: [rocblas_int] number of rows in matrices

  • [in] cols: [rocblas_int] number of columns in matrices

  • [in] elem_size: [rocblas_int] number of bytes per element in the matrix

  • [in] a: pointer to matrix on the host

  • [in] lda: [rocblas_int] specifies the leading dimension of A

  • [out] b: pointer to matrix on the GPU

  • [in] ldb: [rocblas_int] specifies the leading dimension of B

  • [in] stream: specifies the stream into which this transfer request is queued

rocblas_status rocblas_get_matrix_async(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb, hipStream_t stream)

asynchronously copy matrix from device to host

rocblas_get_matrix_async copies a matrix from device memory to pinned host memory asynchronously. Memory on the host must be allocated with hipHostMalloc or the transfer will be synchronous.

Parameters
  • [in] rows: [rocblas_int] number of rows in matrices

  • [in] cols: [rocblas_int] number of columns in matrices

  • [in] elem_size: [rocblas_int] number of bytes per element in the matrix

  • [in] a: pointer to matrix on the GPU

  • [in] lda: [rocblas_int] specifies the leading dimension of A

  • [out] b: pointer to matrix on the host

  • [in] ldb: [rocblas_int] specifies the leading dimension of B

  • [in] stream: specifies the stream into which this transfer request is queued

file rocblas-functions.h
#include "rocblas-export.h"
#include "rocblas-types.h"

rocblas-functions.h provides Basic Linear Algebra Subprograms of Level 1, 2, and 3, using HIP optimized for AMD HCC-based GPU hardware. This library can also run on CUDA-based NVIDIA GPUs. This file exposes the C89 BLAS interface.

Defines

ROCBLAS_VA_OPT_3RD_ARG(_1, _2, _3, ...)
ROCBLAS_VA_OPT_SUPPORTED(...)
ROCBLAS_VA_OPT_COUNT_IMPL(X, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, N, ...)
ROCBLAS_VA_OPT_COUNT(...)
ROCBLAS_VA_OPT_PRAGMA_SELECT0(...)
ROCBLAS_VA_OPT_PRAGMA_SELECTN(pragma, ...)
ROCBLAS_VA_OPT_PRAGMA_IMPL2(pragma, count)
ROCBLAS_VA_OPT_PRAGMA_IMPL(pragma, count)
ROCBLAS_VA_OPT_PRAGMA(pragma, ...)
rocblas_gemm_ex(handle, transA, transB, m, n, k, alpha, a, a_type, lda, b, b_type, ldb, beta, c, c_type, ldc, d, d_type, ldd, compute_type, algo, solution_index, flags, ...)
rocblas_gemm_strided_batched_ex(handle, transA, transB, m, n, k, alpha, a, a_type, lda, stride_a, b, b_type, ldb, stride_b, beta, c, c_type, ldc, stride_c, d, d_type, ldd, stride_d, batch_count, compute_type, algo, solution_index, flags, ...)
rocblas_trsm_ex(handle, side, uplo, transA, diag, m, n, alpha, A, lda, B, ldb, invA, invA_size, compute_type, ...)

Functions

rocblas_status rocblas_sscal(rocblas_handle handle, rocblas_int n, const float *alpha, float *x, rocblas_int incx)

BLAS Level 1 API.

scal scales each element of vector x with scalar alpha.

x := alpha * x

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x.

  • [in] alpha: device pointer or host pointer for the scalar alpha.

  • [inout] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.
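
For example (sketch; dx is an assumed device vector of n floats and the handle is in host pointer mode):

// Hedged sketch: scale every element of dx by 2.
float alpha = 2.0f;
rocblas_sscal(handle, n, &alpha, dx, 1);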

rocblas_status rocblas_dscal(rocblas_handle handle, rocblas_int n, const double *alpha, double *x, rocblas_int incx)
rocblas_status rocblas_cscal(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, rocblas_float_complex *x, rocblas_int incx)
rocblas_status rocblas_zscal(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, rocblas_double_complex *x, rocblas_int incx)
rocblas_status rocblas_csscal(rocblas_handle handle, rocblas_int n, const float *alpha, rocblas_float_complex *x, rocblas_int incx)
rocblas_status rocblas_zdscal(rocblas_handle handle, rocblas_int n, const double *alpha, rocblas_double_complex *x, rocblas_int incx)
rocblas_status rocblas_sscal_batched(rocblas_handle handle, rocblas_int n, const float *alpha, float *const x[], rocblas_int incx, rocblas_int batch_count)

BLAS Level 1 API.

scal_batched scales each element of vector x_i with scalar alpha, for i = 1, … , batch_count.

 x_i := alpha * x_i

where (x_i) is the i-th instance of the batch.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i.

  • [in] alpha: host pointer or device pointer for the scalar alpha.

  • [inout] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each x_i.

  • [in] batch_count: [rocblas_int] specifies the number of batches in x.

rocblas_status rocblas_dscal_batched(rocblas_handle handle, rocblas_int n, const double *alpha, double *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_cscal_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_zscal_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_csscal_batched(rocblas_handle handle, rocblas_int n, const float *alpha, rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_zdscal_batched(rocblas_handle handle, rocblas_int n, const double *alpha, rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_sscal_strided_batched(rocblas_handle handle, rocblas_int n, const float *alpha, float *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)

BLAS Level 1 API.

scal_strided_batched scales each element of vector x_i with scalar alpha, for i = 1, … , batch_count.

 x_i := alpha * x_i ,

where (x_i) is the i-th instance of the batch.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i.

  • [in] alpha: host pointer or device pointer for the scalar alpha.

  • [inout] x: device pointer to the first vector (x_1) in the batch.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [in] stride_x: [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_i+1). There are no restrictions placed on stride_x; however, the user should take care to ensure that stride_x is of appropriate size. For a typical case this means stride_x >= n * incx.

  • [in] batch_count: [rocblas_int] specifies the number of batches in x.

rocblas_status rocblas_dscal_strided_batched(rocblas_handle handle, rocblas_int n, const double *alpha, double *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_cscal_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_zscal_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_csscal_strided_batched(rocblas_handle handle, rocblas_int n, const float *alpha, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_zdscal_strided_batched(rocblas_handle handle, rocblas_int n, const double *alpha, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_scopy(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *y, rocblas_int incy)

BLAS Level 1 API.

copy copies each element x[i] into y[i], for i = 1 , … , n

y := x,

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x to be copied to y.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [out] y: device pointer storing vector y.

  • [in] incy: [rocblas_int] specifies the increment for the elements of y.

rocblas_status rocblas_dcopy(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *y, rocblas_int incy)
rocblas_status rocblas_ccopy(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *y, rocblas_int incy)
rocblas_status rocblas_zcopy(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *y, rocblas_int incy)
rocblas_status rocblas_scopy_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, float *const y[], rocblas_int incy, rocblas_int batch_count)

BLAS Level 1 API.

copy_batched copies each element x_i[j] into y_i[j], for j = 1 , … , n; i = 1 , … , batch_count

y_i := x_i,

where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i to be copied to y_i.

  • [in] x: device array of device pointers storing each vector x_i.

  • [in] incx: [rocblas_int] specifies the increment for the elements of each vector x_i.

  • [out] y: device array of device pointers storing each vector y_i.

  • [in] incy: [rocblas_int] specifies the increment for the elements of each vector y_i.

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dcopy_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, double *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_ccopy_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_zcopy_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_scopy_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, float *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)

BLAS Level 1 API.

copy_strided_batched copies each element x_i[j] into y_i[j], for j = 1 , … , n; i = 1 , … , batch_count

y_i := x_i,

where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors.

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in each x_i to be copied to y_i.

  • [in] x: device pointer to the first vector (x_1) in the batch.

  • [in] incx: [rocblas_int] specifies the increments for the elements of vectors x_i.

  • [in] stridex: [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_i+1). There are no restrictions placed on stridex; however, the user should take care to ensure that stridex is of appropriate size. For a typical case this means stridex >= n * incx.

  • [out] y: device pointer to the first vector (y_1) in the batch.

  • [in] incy: [rocblas_int] specifies the increment for the elements of vectors y_i.

  • [in] stridey: [rocblas_stride] stride from the start of one vector (y_i) to the next one (y_i+1). There are no restrictions placed on stridey; however, the user should take care to ensure that stridey is of appropriate size. For a typical case this means stridey >= n * incy. stridey should be non-zero.

  • [in] batch_count: [rocblas_int] number of instances in the batch

rocblas_status rocblas_dcopy_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, double *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_ccopy_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_zcopy_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_sdot(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *result)

BLAS Level 1 API.

dot(u) performs the dot product of vectors x and y

result = x * y;

dotc performs the dot product of the conjugate of complex vector x and complex vector y

result = conjugate (x) * y;

Parameters
  • [in] handle: [rocblas_handle] handle to the rocblas library context queue.

  • [in] n: [rocblas_int] the number of elements in x and y.

  • [in] x: device pointer storing vector x.

  • [in] incx: [rocblas_int] specifies the increment for the elements of x.

  • [in] y: device pointer storing vector y.

  • [in] incy: [rocblas_int] specifies the increment for the elements of y.

  • [inout] result: device pointer or host pointer to store the dot product. return is 0.0 if n <= 0.
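
For example (sketch; dx and dy are assumed device vectors of n floats, with the handle in host pointer mode so the result lands in a host variable):

// Hedged sketch: single-precision dot product written to a host scalar.
float result = 0.0f;
rocblas_sdot(handle, n, dx, 1, dy, 1, &result);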

rocblas_status rocblas_ddot(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx,