SGLang inference performance testing#
2025-08-01
SGLang is a high-performance inference and serving engine for large language models (LLMs) and vision models. The ROCm-enabled SGLang Docker image bundles SGLang with PyTorch and is optimized for AMD Instinct MI300X series accelerators.
System validation#
Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.
If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before you start benchmarking.
To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.
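One setting worth spot-checking before a benchmark run is NUMA auto-balancing, which the system optimization guide recommends disabling on MI300X systems. The following is a minimal sketch, assuming a Linux host; it only reports the current value and does not change it. Make any changes through the procedure in the guide.
# check_numa_balancing.py -- minimal sketch; assumes a Linux host.
# Reports whether NUMA auto-balancing is enabled (0 = disabled, recommended).
from pathlib import Path

NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")

if not NUMA_BALANCING.exists():
    print("numa_balancing sysctl not found; skipping check")
elif NUMA_BALANCING.read_text().strip() == "0":
    print("NUMA auto-balancing is disabled (recommended)")
else:
    print("NUMA auto-balancing is enabled; see the system optimization guide to disable it")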
Pull the Docker image#
Download the SGLang Docker image by pulling it from Docker Hub with the following command.
docker pull lmsysorg/sglang:v0.4.5-rocm630
Benchmarking#
Once the setup is complete, choose one of the following methods to benchmark inference performance with DeepSeek-R1-Distill-Qwen-32B.
Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
Use this command to run the performance benchmark test on the DeepSeek-R1-Distill-Qwen-32B model using one GPU with the bfloat16 data type on the host machine.
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags pyt_sglang_deepseek-r1-distill-qwen-32b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
MAD launches a Docker container with the name container_ci-pyt_sglang_deepseek-r1-distill-qwen-32b. The latency and throughput reports for the model are collected at ~/MAD/perf_DeepSeek-R1-Distill-Qwen-32B.csv.
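For a quick look at the collected results, you can load the report with pandas, as in the sketch below. pandas is an assumption here (install it with pip install pandas if needed), and the column names depend on the MAD report format, so the script simply prints whatever is present.
# inspect_mad_report.py -- minimal sketch for viewing the MAD performance report.
# Assumes pandas is installed; column names depend on the MAD report format.
from pathlib import Path

import pandas as pd

report = Path.home() / "MAD" / "perf_DeepSeek-R1-Distill-Qwen-32B.csv"
df = pd.read_csv(report)
print("Columns:", list(df.columns))
print(df.head())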
Although the DeepSeek-R1-Distill-Qwen-32B model is preconfigured to collect latency and throughput performance data, you can also change the benchmarking parameters. See the standalone benchmarking tab for more information.
Download the Docker image and required scripts
Run the SGLang benchmark script independently by starting the Docker container as shown in the following snippet.
docker pull lmsysorg/sglang:v0.4.5-rocm630
docker run -it \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --cap-add=SYS_PTRACE \
    -v $(pwd):/workspace \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    lmsysorg/sglang:v0.4.5-rocm630
In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at ~/MAD/scripts/sglang.
git clone https://github.com/ROCm/MAD
cd MAD/scripts/sglang
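Before starting a benchmark, you can confirm that the accelerators are visible inside the container. Because the image bundles PyTorch, a short check like the sketch below should work; ROCm builds of PyTorch expose devices through the standard torch.cuda API.
# check_gpus.py -- run inside the container to confirm accelerator visibility.
# ROCm builds of PyTorch report devices through the torch.cuda API.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No GPUs visible; check the --device flags on the docker run command")

for i in range(torch.cuda.device_count()):
    print(f"[{i}] {torch.cuda.get_device_name(i)}")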
To start the benchmark, use the following command with the appropriate options.
Benchmark options
| Name | Options | Description |
|------|---------|-------------|
| $test_option | latency | Measure decoding token latency |
| | throughput | Measure token generation throughput |
| | all | Measure both throughput and latency |
| $num_gpu | 8 | Number of GPUs |
| $datatype | bfloat16 | Data type |
| $dataset | random | Dataset |
The input sequence length, output sequence length, and tensor parallel (TP) size are already configured. You don’t need to specify them with this script.
Command:
./sglang_benchmark_report.sh -s $test_option -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B -g $num_gpu -d $datatype [-a $dataset]
Note
If you encounter the following error, provide a Hugging Face token that has access to the gated model.
OSError: You are trying to access a gated repo.
# pass your HF_TOKEN
export HF_TOKEN=$your_personal_hf_token
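To confirm the token is picked up before a long benchmark run, a quick check like the sketch below can help. It assumes huggingface_hub is available in the environment; install it with pip if it is not.
# check_hf_token.py -- minimal sketch; assumes huggingface_hub is installed.
import os

from huggingface_hub import whoami

token = os.environ.get("HF_TOKEN")
if not token:
    raise SystemExit("HF_TOKEN is not set; export it before running the benchmark")

# whoami() raises if the token is invalid; otherwise it returns account details.
user = whoami(token=token)
print("Authenticated to Hugging Face as:", user.get("name", "unknown"))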
Benchmarking examples
Here are some examples of running the benchmark with various options:
Latency benchmark
Use this command to benchmark the latency of the DeepSeek-R1-Distill-Qwen-32B model on eight GPUs with bfloat16 precision.
./sglang_benchmark_report.sh \
    -s latency \
    -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    -g 8 \
    -d bfloat16
Find the latency report at ./reports_bfloat16/summary/DeepSeek-R1-Distill-Qwen-32B_latency_report.csv.
Throughput benchmark
Use this command to benchmark the throughput of the DeepSeek-R1-Distill-Qwen-32B model on eight GPUs with bfloat16 precision.
./sglang_benchmark_report.sh \
    -s throughput \
    -m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    -g 8 \
    -d bfloat16 \
    -a random
Find the throughput report at ./reports_bfloat16/summary/DeepSeek-R1-Distill-Qwen-32B_throughput_report.csv.
Note
Throughput is calculated as:
\[throughput\_tot = requests \times (\text{input lengths} + \text{output lengths}) / elapsed\_time\]
\[throughput\_gen = requests \times \text{output lengths} / elapsed\_time\]
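The sketch below applies these formulas to purely illustrative numbers (the request count, sequence lengths, and elapsed time are made up, not measured results) to show how the two throughput figures are derived.
# throughput_example.py -- worked example of the throughput formulas above.
# All inputs are illustrative values, not measured results.
requests = 1000
input_length = 1024     # tokens per request
output_length = 512     # tokens per request
elapsed_time = 600.0    # seconds

throughput_tot = requests * (input_length + output_length) / elapsed_time
throughput_gen = requests * output_length / elapsed_time

print(f"throughput_tot: {throughput_tot:.1f} tokens/s")  # 2560.0
print(f"throughput_gen: {throughput_gen:.1f} tokens/s")  # 853.3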
Further reading#
To learn more about the options for latency and throughput benchmark scripts, see sgl-project/sglang.
To learn more about MAD and the madengine CLI, see the MAD usage guide.
To learn more about system settings and management practices to configure your system for MI300X series accelerators, see AMD Instinct MI300X system optimization.
For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see AMD Instinct MI300X workload optimization.
To learn how to run community models from Hugging Face on AMD GPUs, see Running models from Hugging Face.
To learn how to fine-tune LLMs and optimize inference, see Fine-tuning LLMs and inference optimization.
For a list of other ready-made Docker images for AI with ROCm, see AMD Infinity Hub.
Previous versions#
See SGLang inference performance testing version history to find documentation for previous releases of SGLang inference performance testing.