vLLM inference performance testing#
2025-08-14
The ROCm vLLM Docker image offers a prebuilt, optimized environment for validating large language model (LLM) inference performance on AMD Instinct™ MI300X series accelerators. It integrates vLLM and PyTorch builds tailored to MI300X series accelerators and includes the following components:
Software component | Version |
---|---|
ROCm | 6.4.1 |
vLLM | 0.10.0 (0.10.1.dev395+g340ea86df.rocm641) |
PyTorch | 2.7.0+gitf717b2a (2.7.0+gitf717b2a) |
hipBLASLt | 0.15 |
With this Docker image, you can quickly test the expected inference performance numbers for MI300X series accelerators.
What’s new#
The following is a summary of notable changes since the previous ROCm/vLLM Docker release.
Upgraded to vLLM v0.10.
FP8 KV cache support via AITER.
Full graph capture support via AITER.
Supported models#
The following models are supported for inference performance benchmarking with vLLM and ROCm. Some instructions, commands, and recommendations in this documentation might vary by model – select one to get started.
Note
See the model card on Hugging Face to learn more about your selected model. Model cards are available for Llama 3.1 8B, Llama 3.1 70B, Llama 3.1 405B, Llama 2 70B, Llama 3.1 8B FP8, Llama 3.1 70B FP8, Llama 3.1 405B FP8, Mixtral MoE 8x7B, Mixtral MoE 8x22B, Mixtral MoE 8x7B FP8, Mixtral MoE 8x22B FP8, QwQ-32B, and Phi-4. Some models require access authorization prior to use via an external license agreement through a third party.
Note
vLLM is a toolkit and library for LLM inference and serving. AMD implements high-performance custom kernels and modules in vLLM to enhance performance. See vLLM inference and vLLM performance optimization for more information.
Performance measurements#
The Performance results with AMD ROCm software page provides reference throughput and serving measurements for inference with popular AI models.
Important
The performance data presented in Performance results with AMD ROCm software only reflects the latest version of this inference benchmarking environment. The listed measurements should not be interpreted as the peak performance achievable by AMD Instinct MI325X and MI300X accelerators or ROCm software.
System validation#
Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.
If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before running inference benchmarks.
To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.
Pull the Docker image#
Download the ROCm vLLM Docker image. Use the following command to pull the Docker image from Docker Hub.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
Benchmarking#
Once the setup is complete, choose between two options to reproduce the benchmark results:
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 8B model using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_llama-3.1-8b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-8b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-3.1-8b_throughput.csv and pyt_vllm_llama-3.1-8b_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo meta-llama/Llama-3.1-8B-Instruct \
    <overrides>
Benchmark options
Name | Options | Description |
---|---|---|
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark. |
 | configs/extended.csv | |
 | configs/performance.csv | |
--benchmark | throughput | Measure offline end-to-end throughput. |
 | serving | Measure online serving performance. |
 | all | Measure both throughput and serving. |
<overrides> | See run.sh for more info. | Additional overrides to the config CSV. |
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
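As a minimal sketch of how the options above combine, the following Python snippet assembles a run.sh command line and rejects values that are not listed in the table. The helper name build_run_command is hypothetical, and the allowed values are copied from this page only; run.sh itself may accept more.

```python
import shlex

# Option values copied from the benchmark options table on this page.
CONFIGS = {"configs/default.csv", "configs/extended.csv", "configs/performance.csv"}
BENCHMARKS = {"throughput", "serving", "all"}


def build_run_command(config, model_repo, benchmark):
    """Return the run.sh argument list after checking the documented options.

    Hypothetical helper for illustration; it is not part of ROCm/MAD.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config}")
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark}")
    return [
        "./run.sh",
        "--config", config,
        "--model_repo", model_repo,
        "--benchmark", benchmark,
    ]


cmd = build_run_command(
    "configs/default.csv", "meta-llama/Llama-3.1-8B-Instruct", "throughput"
)
# Print a shell-safe, single-line form of the command.
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed directly to a process launcher without extra quoting.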
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
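To catch the gated-repo error before a long run starts, a small pre-flight check can confirm that HF_TOKEN is set. This is only a sketch: the helper name hf_token_status is hypothetical, and it checks that the variable is non-empty, not that the token is valid or authorized for the model.

```python
import os


def hf_token_status(env=None):
    """Report whether HF_TOKEN is present and non-empty in the environment.

    Hypothetical helper; it does not contact Hugging Face or validate the token.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return "set" if token else "missing"


print(f"HF_TOKEN is {hf_token_status()}")
```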
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Llama 3.1 8B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-8b
./run.sh \
    --config configs/default.csv \
    --model_repo meta-llama/Llama-3.1-8B-Instruct \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_llama-3.1-8b_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Llama 3.1 8B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-8b
./run.sh \
    --config configs/default.csv \
    --model_repo meta-llama/Llama-3.1-8B-Instruct \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_llama-3.1-8b_serving.csv.
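The reports are plain CSV files, so they can be inspected with the Python standard library. The sketch below reads whatever header row the file contains and makes no assumption about column names, which this page does not list. The helper name load_report is hypothetical.

```python
import csv
from pathlib import Path


def load_report(path):
    """Read a benchmark report CSV into a list of dicts keyed by its header row."""
    with Path(path).open(newline="") as handle:
        return list(csv.DictReader(handle))


# Report file names taken from the examples on this page.
for name in ("pyt_vllm_llama-3.1-8b_throughput.csv", "pyt_vllm_llama-3.1-8b_serving.csv"):
    report = Path(name)
    if report.exists():
        rows = load_report(report)
        columns = list(rows[0].keys()) if rows else []
        print(f"{name}: {len(rows)} rows, columns: {columns}")
    else:
        print(f"{name}: not found in the current directory")
```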
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
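As a worked example of the two formulas, the sketch below evaluates them for a made-up run. The numbers are illustrative, not measured results, and the units are an assumption: with lengths in tokens and elapsed time in seconds, both values come out in tokens per second.

```python
def throughput_tot(requests, input_len, output_len, elapsed_time):
    """Total throughput: input and output tokens processed per unit of time."""
    return requests * (input_len + output_len) / elapsed_time


def throughput_gen(requests, output_len, elapsed_time):
    """Generation throughput: output tokens produced per unit of time."""
    return requests * output_len / elapsed_time


# Hypothetical run: 1000 requests, 128 input tokens and 128 output tokens each,
# completed in 50 seconds.
print(throughput_tot(1000, 128, 128, 50.0))  # 5120.0
print(throughput_gen(1000, 128, 50.0))       # 2560.0
```

When input and output lengths are equal, as here, the generation throughput is exactly half of the total throughput.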
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 70B model using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_llama-3.1-70b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-70b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-3.1-70b_throughput.csv and pyt_vllm_llama-3.1-70b_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo meta-llama/Llama-3.1-70B-Instruct \
    <overrides>
Benchmark options
Name | Options | Description |
---|---|---|
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark. |
 | configs/extended.csv | |
 | configs/performance.csv | |
--benchmark | throughput | Measure offline end-to-end throughput. |
 | serving | Measure online serving performance. |
 | all | Measure both throughput and serving. |
<overrides> | See run.sh for more info. | Additional overrides to the config CSV. |
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Llama 3.1 70B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-70b
./run.sh \
    --config configs/default.csv \
    --model_repo meta-llama/Llama-3.1-70B-Instruct \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_llama-3.1-70b_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Llama 3.1 70B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-70b
./run.sh \
    --config configs/default.csv \
    --model_repo meta-llama/Llama-3.1-70B-Instruct \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_llama-3.1-70b_serving.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 405B model using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_llama-3.1-405b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-405b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-3.1-405b_throughput.csv and pyt_vllm_llama-3.1-405b_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo meta-llama/Llama-3.1-405B-Instruct \
    <overrides>
Benchmark options
Name | Options | Description |
---|---|---|
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark. |
 | configs/extended.csv | |
 | configs/performance.csv | |
--benchmark | throughput | Measure offline end-to-end throughput. |
 | serving | Measure online serving performance. |
 | all | Measure both throughput and serving. |
<overrides> | See run.sh for more info. | Additional overrides to the config CSV. |
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Llama 3.1 405B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-405b
./run.sh \
    --config configs/default.csv \
    --model_repo meta-llama/Llama-3.1-405B-Instruct \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_llama-3.1-405b_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Llama 3.1 405B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-405b
./run.sh \
    --config configs/default.csv \
    --model_repo meta-llama/Llama-3.1-405B-Instruct \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_llama-3.1-405b_serving.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 2 70B model using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_llama-2-70b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_llama-2-70b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-2-70b_throughput.csv and pyt_vllm_llama-2-70b_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo meta-llama/Llama-2-70b-chat-hf \
    <overrides>
Benchmark options
Name | Options | Description |
---|---|---|
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark. |
 | configs/extended.csv | |
 | configs/performance.csv | |
--benchmark | throughput | Measure offline end-to-end throughput. |
 | serving | Measure online serving performance. |
 | all | Measure both throughput and serving. |
<overrides> | See run.sh for more info. | Additional overrides to the config CSV. |
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Llama 2 70B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-2-70b
./run.sh \
    --config configs/default.csv \
    --model_repo meta-llama/Llama-2-70b-chat-hf \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_llama-2-70b_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Llama 2 70B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-2-70b
./run.sh \
    --config configs/default.csv \
    --model_repo meta-llama/Llama-2-70b-chat-hf \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_llama-2-70b_serving.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 8B FP8 model using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_llama-3.1-8b_fp8 \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-8b_fp8. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-3.1-8b_fp8_throughput.csv and pyt_vllm_llama-3.1-8b_fp8_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo amd/Llama-3.1-8B-Instruct-FP8-KV \
    <overrides>
Benchmark options
Name | Options | Description |
---|---|---|
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark. |
 | configs/extended.csv | |
 | configs/performance.csv | |
--benchmark | throughput | Measure offline end-to-end throughput. |
 | serving | Measure online serving performance. |
 | all | Measure both throughput and serving. |
<overrides> | See run.sh for more info. | Additional overrides to the config CSV. |
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Llama 3.1 8B FP8 model on eight GPUs with float8 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-8b_fp8
./run.sh \
    --config configs/default.csv \
    --model_repo amd/Llama-3.1-8B-Instruct-FP8-KV \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_llama-3.1-8b_fp8_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Llama 3.1 8B FP8 model on eight GPUs with float8 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-8b_fp8
./run.sh \
    --config configs/default.csv \
    --model_repo amd/Llama-3.1-8B-Instruct-FP8-KV \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_llama-3.1-8b_fp8_serving.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 70B FP8 model using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_llama-3.1-70b_fp8 \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-70b_fp8. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-3.1-70b_fp8_throughput.csv and pyt_vllm_llama-3.1-70b_fp8_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo amd/Llama-3.1-70B-Instruct-FP8-KV \
    <overrides>
Benchmark options
Name | Options | Description |
---|---|---|
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark. |
 | configs/extended.csv | |
 | configs/performance.csv | |
--benchmark | throughput | Measure offline end-to-end throughput. |
 | serving | Measure online serving performance. |
 | all | Measure both throughput and serving. |
<overrides> | See run.sh for more info. | Additional overrides to the config CSV. |
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Llama 3.1 70B FP8 model on eight GPUs with float8 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-70b_fp8
./run.sh \
    --config configs/default.csv \
    --model_repo amd/Llama-3.1-70B-Instruct-FP8-KV \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_llama-3.1-70b_fp8_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Llama 3.1 70B FP8 model on eight GPUs with float8 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-70b_fp8
./run.sh \
    --config configs/default.csv \
    --model_repo amd/Llama-3.1-70B-Instruct-FP8-KV \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_llama-3.1-70b_fp8_serving.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Llama 3.1 405B FP8 model using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_llama-3.1-405b_fp8 \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_llama-3.1-405b_fp8. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_llama-3.1-405b_fp8_throughput.csv and pyt_vllm_llama-3.1-405b_fp8_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo amd/Llama-3.1-405B-Instruct-FP8-KV \
    <overrides>
Benchmark options
Name | Options | Description |
---|---|---|
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark. |
 | configs/extended.csv | |
 | configs/performance.csv | |
--benchmark | throughput | Measure offline end-to-end throughput. |
 | serving | Measure online serving performance. |
 | all | Measure both throughput and serving. |
<overrides> | See run.sh for more info. | Additional overrides to the config CSV. |
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Llama 3.1 405B FP8 model on eight GPUs with float8 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-405b_fp8
./run.sh \
    --config configs/default.csv \
    --model_repo amd/Llama-3.1-405B-Instruct-FP8-KV \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_llama-3.1-405b_fp8_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Llama 3.1 405B FP8 model on eight GPUs with float8 precision.
export MAD_MODEL_NAME=pyt_vllm_llama-3.1-405b_fp8
./run.sh \
    --config configs/default.csv \
    --model_repo amd/Llama-3.1-405B-Instruct-FP8-KV \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_llama-3.1-405b_fp8_serving.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
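To make the two formulas concrete, the sketch below evaluates them with `awk`. It assumes every request has the same input and output length, that lengths are counted in tokens, and that elapsed time is in seconds. The numbers are hypothetical, not measured results.

```shell
# throughput_tot = requests * (input_len + output_len) / elapsed_time
throughput_tot() {
  awk -v r="$1" -v i="$2" -v o="$3" -v t="$4" \
    'BEGIN { printf "%.2f\n", r * (i + o) / t }'
}

# throughput_gen = requests * output_len / elapsed_time
throughput_gen() {
  awk -v r="$1" -v o="$2" -v t="$3" \
    'BEGIN { printf "%.2f\n", r * o / t }'
}

# Hypothetical run: 1000 requests, 128 input tokens, 2048 output tokens, 400 s.
throughput_tot 1000 128 2048 400   # prints 5440.00
throughput_gen 1000 2048 400       # prints 5120.00
```

The gap between the two values is the share of throughput spent on prompt tokens rather than generated tokens.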
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Mixtral MoE 8x7B model using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_mixtral-8x7b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x7b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_mixtral-8x7b_throughput.csv and pyt_vllm_mixtral-8x7b_serving.csv.
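After the run finishes, a quick check can confirm that both reports were written. This is a minimal sketch: the helper `missing_reports` is hypothetical, and the directory that holds the reports is an assumption (it defaults to the current directory).

```shell
# Hypothetical helper: print the report files that are missing for a MAD tag.
# Usage: missing_reports <tag> [directory]
missing_reports() {
  tag="$1"
  dir="${2:-.}"
  for kind in throughput serving; do
    f="${dir}/${tag}_${kind}.csv"
    # Print the path only when the file does not exist.
    [ -f "$f" ] || echo "$f"
  done
}

# Prints nothing when both reports exist in the current directory.
missing_reports pyt_vllm_mixtral-8x7b
```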
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo mistralai/Mixtral-8x7B-Instruct-v0.1 \
    <overrides>
Benchmark options
Name | Options | Description
---|---|---
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark.
 | configs/extended.csv |
 | configs/performance.csv |
--benchmark | throughput | Measure offline end-to-end throughput.
 | serving | Measure online serving performance.
 | all | Measure both throughput and serving.
<overrides> | See run.sh for more info. | Additional overrides to the config CSV.
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Mixtral MoE 8x7B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_mixtral-8x7b
./run.sh \
    --config configs/default.csv \
    --model_repo mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_mixtral-8x7b_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Mixtral MoE 8x7B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_mixtral-8x7b
./run.sh \
    --config configs/default.csv \
    --model_repo mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_mixtral-8x7b_serving.csv.
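The options table also lists an `all` mode that covers both benchmarks in one invocation. The sketch below only prints the command line instead of running it, so it can be reviewed first. The helper `bench_cmd` is hypothetical; the flags and mode names come from the options table.

```shell
# Hypothetical helper: print a run.sh command for a given model repo and mode.
# Usage: bench_cmd <model_repo> [throughput|serving|all] [config_csv]
bench_cmd() {
  model_repo="$1"
  mode="${2:-all}"
  config="${3:-configs/default.csv}"
  case "$mode" in
    throughput|serving|all) ;;
    *) echo "unknown benchmark mode: $mode" >&2; return 1 ;;
  esac
  printf './run.sh --config %s --model_repo %s --benchmark %s\n' \
    "$config" "$model_repo" "$mode"
}

# Print the combined throughput and serving command for Mixtral MoE 8x7B.
bench_cmd mistralai/Mixtral-8x7B-Instruct-v0.1 all
```

Rejecting unknown modes up front avoids starting a run with a mistyped `--benchmark` value.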
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Mixtral MoE 8x22B model using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_mixtral-8x22b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x22b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_mixtral-8x22b_throughput.csv and pyt_vllm_mixtral-8x22b_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo mistralai/Mixtral-8x22B-Instruct-v0.1 \
    <overrides>
Benchmark options
Name | Options | Description
---|---|---
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark.
 | configs/extended.csv |
 | configs/performance.csv |
--benchmark | throughput | Measure offline end-to-end throughput.
 | serving | Measure online serving performance.
 | all | Measure both throughput and serving.
<overrides> | See run.sh for more info. | Additional overrides to the config CSV.
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
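One way to apply the recommended setting is to export it in the shell that launches the benchmark. This sketch assumes the variable is read from the environment inherited by `./run.sh` inside the container.

```shell
# Enable the recommended setting for this shell session and its child processes.
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1

# Confirm the value that child processes will inherit.
printenv VLLM_V1_USE_PREFILL_DECODE_ATTENTION
```

Because the variable is exported rather than set inline, it applies to every benchmark started from the same shell.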
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Mixtral MoE 8x22B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_mixtral-8x22b
./run.sh \
    --config configs/default.csv \
    --model_repo mistralai/Mixtral-8x22B-Instruct-v0.1 \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_mixtral-8x22b_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Mixtral MoE 8x22B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_mixtral-8x22b
./run.sh \
    --config configs/default.csv \
    --model_repo mistralai/Mixtral-8x22B-Instruct-v0.1 \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_mixtral-8x22b_serving.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Mixtral MoE 8x7B FP8 model using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_mixtral-8x7b_fp8 \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x7b_fp8. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_mixtral-8x7b_fp8_throughput.csv and pyt_vllm_mixtral-8x7b_fp8_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
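Before starting the container, it can help to confirm that the device nodes passed with `--device` exist on the host. The following is a minimal sketch; the helper `check_devices` is hypothetical and only tests whether each path exists.

```shell
# Hypothetical helper: report whether each given device path exists.
# Returns non-zero when at least one path is missing.
check_devices() {
  status=0
  for dev in "$@"; do
    if [ -e "$dev" ]; then
      echo "ok: $dev"
    else
      echo "missing: $dev"
      status=1
    fi
  done
  return "$status"
}

# Check the two device paths used by the docker run command.
check_devices /dev/kfd /dev/dri || echo "At least one device node is missing on this host."
```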
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
    <overrides>
Benchmark options
Name | Options | Description
---|---|---
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark.
 | configs/extended.csv |
 | configs/performance.csv |
--benchmark | throughput | Measure offline end-to-end throughput.
 | serving | Measure online serving performance.
 | all | Measure both throughput and serving.
<overrides> | See run.sh for more info. | Additional overrides to the config CSV.
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Mixtral MoE 8x7B FP8 model on eight GPUs with float8 precision.
export MAD_MODEL_NAME=pyt_vllm_mixtral-8x7b_fp8
./run.sh \
    --config configs/default.csv \
    --model_repo amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_mixtral-8x7b_fp8_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Mixtral MoE 8x7B FP8 model on eight GPUs with float8 precision.
export MAD_MODEL_NAME=pyt_vllm_mixtral-8x7b_fp8
./run.sh \
    --config configs/default.csv \
    --model_repo amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_mixtral-8x7b_fp8_serving.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Mixtral MoE 8x22B FP8 model using one GPU with the float8 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_mixtral-8x22b_fp8 \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_mixtral-8x22b_fp8. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_mixtral-8x22b_fp8_throughput.csv and pyt_vllm_mixtral-8x22b_fp8_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV \
    <overrides>
Benchmark options
Name | Options | Description
---|---|---
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark.
 | configs/extended.csv |
 | configs/performance.csv |
--benchmark | throughput | Measure offline end-to-end throughput.
 | serving | Measure online serving performance.
 | all | Measure both throughput and serving.
<overrides> | See run.sh for more info. | Additional overrides to the config CSV.
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
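To see which preconfigured rows apply to a model, the config CSV can be filtered by model repo. This is a sketch under a stated assumption: that each row of the CSV contains the model repo string. The column layout is not described here, and the helper `config_rows_for` is hypothetical.

```shell
# Hypothetical helper: list the rows of a config CSV that mention a model repo.
# Assumption: the model repo string appears somewhere in each relevant row.
# Usage: config_rows_for <config_csv> <model_repo>
config_rows_for() {
  config="$1"
  model_repo="$2"
  [ -f "$config" ] || { echo "config not found: $config" >&2; return 1; }
  # -F treats the repo name as a fixed string, not a regular expression.
  grep -F -- "$model_repo" "$config"
}

# Example usage from MAD/scripts/vllm (commented out; requires the cloned repo):
# config_rows_for configs/default.csv amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
```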
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Mixtral MoE 8x22B FP8 model on eight GPUs with float8 precision.
export MAD_MODEL_NAME=pyt_vllm_mixtral-8x22b_fp8
./run.sh \
    --config configs/default.csv \
    --model_repo amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_mixtral-8x22b_fp8_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Mixtral MoE 8x22B FP8 model on eight GPUs with float8 precision.
export MAD_MODEL_NAME=pyt_vllm_mixtral-8x22b_fp8
./run.sh \
    --config configs/default.csv \
    --model_repo amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_mixtral-8x22b_fp8_serving.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the QwQ-32B model using one GPU with the float16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_qwq-32b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_qwq-32b. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_qwq-32b_throughput.csv and pyt_vllm_qwq-32b_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Note
For improved performance, consider enabling PyTorch TunableOp. TunableOp automatically explores different implementations and configurations of certain PyTorch operators to find the fastest one for your hardware.
By default, pyt_vllm_qwq-32b runs with TunableOp disabled (see ROCm/MAD). To enable it, include the --tunableop on argument in your run.
Enabling TunableOp triggers a two-pass run – a warm-up followed by the performance-collection run.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo Qwen/QwQ-32B \
    <overrides>
Benchmark options
Name | Options | Description
---|---|---
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark.
 | configs/extended.csv |
 | configs/performance.csv |
--benchmark | throughput | Measure offline end-to-end throughput.
 | serving | Measure online serving performance.
 | all | Measure both throughput and serving.
<overrides> | See run.sh for more info. | Additional overrides to the config CSV.
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the QwQ-32B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_qwq-32b
./run.sh \
    --config configs/default.csv \
    --model_repo Qwen/QwQ-32B \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_qwq-32b_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the QwQ-32B model on eight GPUs with float16 precision.
export MAD_MODEL_NAME=pyt_vllm_qwq-32b
./run.sh \
    --config configs/default.csv \
    --model_repo Qwen/QwQ-32B \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_qwq-32b_serving.csv.
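Once a report has been written, its first rows can be previewed from the shell. This is a minimal sketch; the helper `preview_report` is hypothetical and makes no assumption about the report's columns.

```shell
# Hypothetical helper: print the first rows of a benchmark report.
# Usage: preview_report <report_csv> [row_count]
preview_report() {
  file="$1"
  rows="${2:-5}"
  [ -f "$file" ] || { echo "report not found: $file" >&2; return 1; }
  head -n "$rows" "$file"
}

# Example usage after a serving run (commented out; requires the report file):
# preview_report ./pyt_vllm_qwq-32b_serving.csv
```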
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the Phi-4 model using one GPU on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_vllm_phi-4 \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_vllm_phi-4. The throughput and serving reports of the model are collected in the following paths: pyt_vllm_phi-4_throughput.csv and pyt_vllm_phi-4_serving.csv.
Although the available models are preconfigured to collect offline throughput and online serving performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the vLLM benchmark tool independently by starting the Docker container as shown in the following snippet.
docker pull rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.0_20250812
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/vllm.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/vllm
To start the benchmark, use the following command with the appropriate options.
./run.sh \
    --config $CONFIG_CSV \
    --model_repo microsoft/phi-4 \
    <overrides>
Benchmark options
Name | Options | Description
---|---|---
--config | configs/default.csv | Run configs from the CSV for the chosen model repo and benchmark.
 | configs/extended.csv |
 | configs/performance.csv |
--benchmark | throughput | Measure offline end-to-end throughput.
 | serving | Measure online serving performance.
 | all | Measure both throughput and serving.
<overrides> | See run.sh for more info. | Additional overrides to the config CSV.
The input sequence length, output sequence length, and tensor parallel (TP) are already configured. You don’t need to specify them with this script.
Note
For best performance, it’s recommended to run with VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1.
If you encounter the following error, pass your access-authorized Hugging Face token to the gated models.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
Benchmarking examples
Here are some examples of running the benchmark with various options:
Throughput benchmark
Use this command to benchmark the throughput of the Phi-4 model on eight GPUs.
export MAD_MODEL_NAME=pyt_vllm_phi-4
./run.sh \
    --config configs/default.csv \
    --model_repo microsoft/phi-4 \
    --benchmark throughput
Find the throughput benchmark report at ./pyt_vllm_phi-4_throughput.csv.
Serving benchmark
Use this command to benchmark the serving performance of the Phi-4 model on eight GPUs.
export MAD_MODEL_NAME=pyt_vllm_phi-4
./run.sh \
    --config configs/default.csv \
    --model_repo microsoft/phi-4 \
    --benchmark serving
Find the serving benchmark report at ./pyt_vllm_phi-4_serving.csv.
Note
Throughput is calculated as:
- \[throughput\_tot = requests \times (\mathsf{\text{input lengths}} + \mathsf{\text{output lengths}}) / elapsed\_time\]
- \[throughput\_gen = requests \times \mathsf{\text{output lengths}} / elapsed\_time\]
Advanced usage#
For information on experimental features and known issues related to ROCm optimization efforts on vLLM, see the developer’s guide at ROCm/vllm.
Reproducing the Docker image#
To reproduce this ROCm/vLLM Docker image release, follow these steps:
Clone the vLLM repository.
git clone https://github.com/ROCm/vllm.git
Check out the specific release commit.
cd vllm
git checkout 340ea86dfe5955d6f9a9e767d6abab5aacf2c978
Build the Docker image. Replace vllm-rocm with your desired image tag.
docker build -f docker/Dockerfile.rocm -t vllm-rocm .
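Before building, it may be worth confirming that the working tree is at the release commit. The sketch below compares `git rev-parse HEAD` against the hash from the checkout step; the helper `verify_checkout` is hypothetical.

```shell
# Commit hash taken from the checkout step above.
EXPECTED_COMMIT=340ea86dfe5955d6f9a9e767d6abab5aacf2c978

# Hypothetical helper: compare a repository's HEAD with an expected commit.
# Returns 0 on match, 1 on mismatch, 2 when HEAD cannot be read.
verify_checkout() {
  repo_dir="$1"
  expected="$2"
  actual="$(git -C "$repo_dir" rev-parse HEAD 2>/dev/null)" || {
    echo "cannot read HEAD in: $repo_dir" >&2
    return 2
  }
  if [ "$actual" = "$expected" ]; then
    echo "ok: $actual"
  else
    echo "mismatch: HEAD is $actual" >&2
    return 1
  fi
}

# Example usage from the directory that contains the clone (commented out):
# verify_checkout ./vllm "$EXPECTED_COMMIT" && \
#     docker build -f docker/Dockerfile.rocm -t vllm-rocm ./vllm
```

The short form of this hash, 340ea86df, also appears in the vLLM version string listed for this image.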
Further reading#
To learn more about the options for latency and throughput benchmark scripts, see ROCm/vllm.
To learn more about MAD and the madengine CLI, see the MAD usage guide.
To learn more about system settings and management practices to configure your system for AMD Instinct MI300X series accelerators, see AMD Instinct MI300X system optimization.
For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see AMD Instinct MI300X workload optimization.
To learn how to run community models from Hugging Face on AMD GPUs, see Running models from Hugging Face.
To learn how to fine-tune LLMs and optimize inference, see Fine-tuning LLMs and inference optimization.
For a list of other ready-made Docker images for AI with ROCm, see AMD Infinity Hub.
Previous versions#
See vLLM inference performance testing version history to find documentation for previous releases of the ROCm/vllm Docker image.