LLM inference performance validation on AMD Instinct MI300X#
2025-02-05
The ROCm vLLM Docker image offers a prebuilt, optimized environment for validating large language model (LLM) inference performance on the AMD Instinct™ MI300X accelerator. The image integrates vLLM and PyTorch tailored specifically for the MI300X accelerator; its tag, rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6, identifies the bundled ROCm (6.3.1), Ubuntu (22.04), Python (3.12), and vLLM (0.6.6) versions.
With this Docker image, you can quickly validate the expected inference performance numbers for the MI300X accelerator. This topic also provides tips on optimizing performance with popular AI models. For more information, see the lists of available models for MAD-integrated benchmarking and standalone benchmarking.
Note
vLLM is a toolkit and library for LLM inference and serving. AMD implements high-performance custom kernels and modules in vLLM to enhance performance. See vLLM inference and vLLM performance optimization for more information.
Getting started#
Use the following procedures to reproduce the benchmark results on an MI300X accelerator with the prebuilt vLLM Docker image.
Disable NUMA auto-balancing.
To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU might hang until the periodic balancing is finalized. For more information, see AMD Instinct MI300X system optimization.
# disable automatic NUMA balancing
sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
# check if NUMA balancing is disabled (returns 0 if disabled)
cat /proc/sys/kernel/numa_balancing
0
Download the ROCm vLLM Docker image.
Use the following command to pull the Docker image from Docker Hub.
docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
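The NUMA balancing check from the first step can also be scripted, for example as a pre-flight step before a long benchmark run. The following is a minimal Python sketch under stated assumptions: the helper name `numa_balancing_disabled` and its `path` parameter are illustrative and are not part of ROCm or MAD.

```python
from pathlib import Path

# Kernel setting written and read in the NUMA balancing step above.
NUMA_BALANCING_PATH = "/proc/sys/kernel/numa_balancing"


def numa_balancing_disabled(path: str = NUMA_BALANCING_PATH) -> bool:
    """Return True if the file at `path` contains 0 (auto-balancing disabled).

    Hypothetical helper for illustration. `path` is a parameter only so the
    function can be exercised against a regular file.
    """
    return Path(path).read_text().strip() == "0"


if __name__ == "__main__":
    if Path(NUMA_BALANCING_PATH).exists():
        state = "disabled" if numa_balancing_disabled() else "enabled"
        print(f"Automatic NUMA balancing is {state}")
    else:
        print("NUMA balancing setting not found on this system")
```

The script only reads the setting; changing it still requires the `sh -c 'echo 0 > ...'` command shown earlier, run with sufficient privileges.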
Once the setup is complete, choose between two options to reproduce the benchmark results: MAD-integrated benchmarking or standalone benchmarking.
MAD-integrated benchmarking#
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run a performance benchmark test of the Llama 3.1 8B model on one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
python3 tools/run_models.py --tags pyt_vllm_llama-3.1-8b --keep-model-dir --live-output --timeout 28800
ROCm MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-8b. The latency and throughput reports of the model are collected in the following path: ~/MAD/reports_float16/.
Although the following models are preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. Refer to the Standalone benchmarking section.
Available models#
Model name | Tag |
---|---|
Standalone benchmarking#
You can run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name vllm_v0.6.6 \
    rocm/vllm:rocm6.3.1_mi300_ubuntu22.04_py3.12_vllm_0.6.6
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
Command#
To start the benchmark, use the following command with the appropriate options. See Options for the list of options and their descriptions.
./vllm_benchmark_report.sh -s $test_option -m $model_repo -g $num_gpu -d $datatype
See the examples for more information.
Note
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
If you encounter the following error, set a Hugging Face token that is authorized to access the gated model.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
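The command template above takes four values, so a batch of runs can be generated rather than typed by hand. The following Python sketch prints the command lines for a small sweep as a dry run; it does not execute them. The model and data type pairs are the ones used in the examples in this topic, while the helper names `build_command` and `build_sweep` are hypothetical and not part of MAD.

```python
import os
import shlex

SCRIPT = "./vllm_benchmark_report.sh"

# Model/data type pairs taken from the examples in this topic.
MODELS = [
    ("meta-llama/Llama-3.1-70B-Instruct", "float16"),
    ("amd/Llama-3.1-70B-Instruct-FP8-KV", "float8"),
]


def build_command(test_option, model_repo, num_gpu, datatype):
    """Return the argument list for one vllm_benchmark_report.sh run."""
    return [SCRIPT, "-s", test_option, "-m", model_repo,
            "-g", str(num_gpu), "-d", datatype]


def build_sweep(models=MODELS, test_options=("latency", "throughput"), num_gpu=8):
    """Return one command per (test option, model) combination."""
    return [build_command(test, repo, num_gpu, dtype)
            for test in test_options
            for repo, dtype in models]


if __name__ == "__main__":
    if not os.environ.get("HF_TOKEN"):
        print("# HF_TOKEN is not set; gated models may fail to download")
    for cmd in build_sweep():
        print(shlex.join(cmd))  # dry run: print only, do not execute
```

Keeping each command as an argument list means it could later be passed to `subprocess.run` without shell quoting concerns, if executing the sweep from Python is preferred.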
Options and available models#
Name | Options | Description |
---|---|---|
$test_option | latency | Measure decoding token latency |
 | throughput | Measure token generation throughput |
 | all | Measure both throughput and latency |
$model_repo | For example, meta-llama/Llama-3.1-70B-Instruct (float16) or amd/Llama-3.1-70B-Instruct-FP8-KV (float8) | Hugging Face model repository |
$num_gpu | 1 or 8 | Number of GPUs |
$datatype | float16 or float8 | Data type |
Running the benchmark on the MI300X accelerator#
Here are some examples of running the benchmark with various options. See Options for the list of options and their descriptions.
Example 1: latency benchmark#
Use these commands to benchmark the latency of the Llama 3.1 70B model on eight GPUs with the float16 and float8 data types.
./vllm_benchmark_report.sh -s latency -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
./vllm_benchmark_report.sh -s latency -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8
Find the latency reports at:
./reports_float16/summary/Llama-3.1-70B-Instruct_latency_report.csv
./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_latency_report.csv
Example 2: throughput benchmark#
Use these commands to benchmark the throughput of the Llama 3.1 70B model on eight GPUs with the float16 and float8 data types.
./vllm_benchmark_report.sh -s throughput -m meta-llama/Llama-3.1-70B-Instruct -g 8 -d float16
./vllm_benchmark_report.sh -s throughput -m amd/Llama-3.1-70B-Instruct-FP8-KV -g 8 -d float8
Find the throughput reports at:
./reports_float16/summary/Llama-3.1-70B-Instruct_throughput_report.csv
./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_throughput_report.csv
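The latency and throughput reports are CSV files, so they can be inspected with a few lines of Python. The sketch below is a minimal example that assumes each report starts with a header row; the column names are not documented here, so the code prints whatever columns it finds. The helper name `load_report` is hypothetical.

```python
import csv
from pathlib import Path

# Report locations from the examples above, relative to MAD/scripts/vllm.
REPORTS = [
    "./reports_float16/summary/Llama-3.1-70B-Instruct_latency_report.csv",
    "./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_latency_report.csv",
    "./reports_float16/summary/Llama-3.1-70B-Instruct_throughput_report.csv",
    "./reports_float8/summary/Llama-3.1-70B-Instruct-FP8-KV_throughput_report.csv",
]


def load_report(path):
    """Return the rows of one report CSV as dicts keyed by the header row.

    Assumes the first line of the file is a header row.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    for report in REPORTS:
        if not Path(report).exists():
            print(f"{report}: not found (benchmark not run yet?)")
            continue
        print(report)
        for row in load_report(report):
            print("  " + ", ".join(f"{key}={value}" for key, value in row.items()))
```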
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\text{input length} + \text{output length}) / elapsed\_time\]
- \[throughput\_gen = requests \times \text{output length} / elapsed\_time\]
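As a worked illustration of the two formulas, the following Python sketch computes both values. The function names mirror the symbols in the formulas, and the request count, sequence lengths, and elapsed time are made-up numbers chosen for easy arithmetic; they are not measured MI300X results.

```python
def throughput_tot(requests, input_length, output_length, elapsed_time):
    """Total throughput: input plus output tokens processed per second."""
    if elapsed_time <= 0:
        raise ValueError("elapsed_time must be positive")
    return requests * (input_length + output_length) / elapsed_time


def throughput_gen(requests, output_length, elapsed_time):
    """Generation throughput: output tokens produced per second."""
    if elapsed_time <= 0:
        raise ValueError("elapsed_time must be positive")
    return requests * output_length / elapsed_time


if __name__ == "__main__":
    # Hypothetical run: 1000 requests, 128 input and 128 output tokens each,
    # finished in 50 seconds.
    print(throughput_tot(1000, 128, 128, 50.0))  # 5120.0 tokens/s
    print(throughput_gen(1000, 128, 50.0))       # 2560.0 tokens/s
```

With equal input and output lengths, the total throughput is exactly twice the generation throughput, which is a quick sanity check when reading a report.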
Further reading#
For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see AMD Instinct MI300X workload optimization.
To learn more about the options for latency and throughput benchmark scripts, see ROCm/vllm.
To learn more about system settings and management practices to configure your system for MI300X accelerators, see AMD Instinct MI300X system optimization.
To learn how to run LLMs from Hugging Face or your own model, see Running models from Hugging Face.
To learn how to optimize inference on LLMs, see Inference optimization.
To learn how to fine-tune LLMs, see Fine-tuning LLMs.
Previous versions#
This table lists previous versions of the ROCm vLLM Docker image for inference performance validation. For detailed information about available models for benchmarking, see the version-specific documentation.
ROCm version | vLLM version | PyTorch version | Resources |
---|---|---|---|
6.2.1 | 0.6.4 | 2.5.0 | |
6.2.0 | 0.4.3 | 2.4.0 | |