SGLang distributed inference with Mooncake#
2025-09-16
9 min read time
As LLM inference increasingly demands handling massive models and dynamic workloads, efficient distributed inference becomes essential. Traditional co-located architectures face bottlenecks because memory and compute resources are tightly coupled, which limits scalability and flexibility. Disaggregated inference splits LLM inference into distinct phases, typically prefill and decode, that run on separate nodes. This architecture, facilitated by libraries like Mooncake, uses high-bandwidth RDMA to transfer the Key-Value (KV) cache between prefill and decode nodes. Each phase can then be scaled and optimized independently, improving efficiency and throughput.
SGLang is a high-performance inference and serving engine for large language models (LLMs) and vision models. The ROCm-enabled SGLang base Docker image bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X series accelerators. It includes the following software components:
| Software component | Version |
|---|---|
| ROCm | 7.0.0 |
| SGLang | v0.5.2rc1 |
| pytorch-triton-rocm | 3.4.0+rocm7.0.0.gitf9e5bf54 |
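If you want to confirm these versions yourself, you can open a shell in the base image and query them. This is only a quick sanity-check sketch; the image tag is the one referenced in the Build the Docker image section below, and the exact commands might vary with your setup.
# Start an interactive shell in the base image (no GPU access is needed just to check versions)
docker run -it --rm lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x bash
# Inside the container, confirm the bundled versions
python3 -c "import sglang; print(sglang.__version__)"
python3 -c "import torch; print(torch.__version__, torch.version.hip)"
pip show pytorch-triton-rocm | grep Version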
The following sections guide you through setting up and running SGLang and Mooncake for disaggregated distributed inference on a Slurm cluster using AMD Instinct MI300X series accelerators backed by Mellanox CX-7 NICs.
Prerequisites#
Before starting, ensure you have:
A Slurm cluster with at least three nodes: one for the proxy, one or more for prefill (xP), and one or more for decode (yD). The total number of nodes is xP + yD + 1.
A Dockerized environment with SGLang, Mooncake, etcd, and NIC drivers built in. See Build the Docker image for instructions.
A shared filesystem for storing models, scripts, and logs (cluster-specific).
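Before launching anything, it can help to sanity-check the allocation. The commands below are generic examples; the shared filesystem path is a placeholder, and your cluster tooling might differ.
# Check that the required number of nodes is available
sinfo -N -l
# On each GPU node, verify the RDMA NICs are visible (requires rdma-core tools)
ibv_devinfo | grep -E "hca_id|state"
# Verify the AMD Instinct accelerators are visible
rocm-smi --showproductname
# Confirm the shared filesystem is mounted (path is an example)
df -h /shared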
Supported models#
The following models are supported for SGLang disaggregated prefill/decode inference. Some instructions, commands, and recommendations in this documentation might vary by selected model.
Note
Some models require access authorization prior to use through an external license agreement with a third party. To learn more about any of these models, see its model card on Hugging Face:
Llama 3.1 8B Instruct
Llama 3.1 405B FP8 KV
Llama 3.3 70B FP8 KV
Qwen3 32B
DeepSeek V3
Mixtral 8x7B v0.1
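For gated models, you typically need to authenticate with Hugging Face and place the weights on the shared filesystem before launching the service. The commands below are an example only; the target directory is a placeholder, and your cluster might handle model downloads differently.
# Authenticate once with a Hugging Face token that has been granted access to the model
huggingface-cli login
# Download the model to the shared filesystem (example path and model ID)
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
    --local-dir /shared/models/Llama-3.1-8B-Instruct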
Build the Docker image#
Get the Dockerfile located in ROCm/MAD. It uses lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x as the base Docker image and installs the necessary components for Mooncake, etcd, and Mellanox network drivers.
git clone https://github.com/ROCm/MAD.git
cd MAD/docker
docker build \
-t sglang_disagg_pd_image \
-f sglang_disagg_inference.ubuntu.amd.Dockerfile .
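Every node in the Slurm allocation needs access to this image. How you distribute it depends on your cluster; the snippet below sketches two common options with placeholder names.
# Option 1: push the image to a registry reachable from all nodes (registry host is a placeholder)
docker tag sglang_disagg_pd_image my-registry:5000/sglang_disagg_pd_image
docker push my-registry:5000/sglang_disagg_pd_image
# Option 2: export the image to the shared filesystem and load it on each node
docker save sglang_disagg_pd_image -o /shared/images/sglang_disagg_pd_image.tar
docker load -i /shared/images/sglang_disagg_pd_image.tar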
Benchmarking#
The ROCm/MAD repository contains scripts to launch SGLang inference with prefill/decode disaggregation via Mooncake for supported models.
scripts/sglang_disagg/run_xPyD_models.slurm – the main Slurm batch script to launch Docker containers on all nodes using sbatch or salloc.
scripts/sglang_disagg/sglang_disagg_server.sh – the entrypoint script that runs inside each container to start the correct service – proxy, prefill, or decode (see the sketch after this list).
scripts/sglang_disagg/benchmark_xPyD.sh – the benchmark script to run the GSM8K accuracy benchmark and the SGLang benchmarking tool for performance measurement.
scripts/sglang_disagg/benchmark_parser.py – the log parser script to run on the concurrency benchmark log file to generate tabulated data.
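The following is a simplified, hypothetical sketch of how an entrypoint like sglang_disagg_server.sh can dispatch the three roles. The --disaggregation-mode and --disaggregation-transfer-backend flags follow SGLang's upstream prefill/decode disaggregation documentation; the load-balancer module path and the MODEL_PATH, PREFILL_ADDR, and DECODE_ADDR variables are assumptions for illustration, and the actual script in ROCm/MAD may differ.
#!/bin/bash
# Hypothetical role dispatcher; ROLE is one of: proxy | prefill | decode
ROLE=$1
case "$ROLE" in
  prefill)
    # Prefill server: computes the prompt KV cache and ships it to decode nodes via Mooncake
    python3 -m sglang.launch_server \
      --model-path "$MODEL_PATH" \
      --disaggregation-mode prefill \
      --disaggregation-transfer-backend mooncake \
      --host 0.0.0.0 --port 30000
    ;;
  decode)
    # Decode server: receives the KV cache and generates tokens
    python3 -m sglang.launch_server \
      --model-path "$MODEL_PATH" \
      --disaggregation-mode decode \
      --disaggregation-transfer-backend mooncake \
      --host 0.0.0.0 --port 30000
    ;;
  proxy)
    # Load balancer that pairs each request with a prefill and a decode server.
    # The module path is an assumption -- check the SGLang version bundled in the image.
    python3 -m sglang.srt.disaggregation.launch_lb \
      --prefill "http://${PREFILL_ADDR}:30000" \
      --decode "http://${DECODE_ADDR}:30000" \
      --host 0.0.0.0 --port 30000
    ;;
esac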
Launch the service#
The service is deployed using a Slurm batch script that orchestrates the containers across the allocated nodes.
# Clone the MAD repo if you haven't already and
# navigate to the scripts directory
git clone https://github.com/ROCm/MAD.git
cd MAD/scripts/sglang_disagg/
# Slurm sbatch run command
export DOCKER_IMAGE_NAME=sglang_disagg_pd_image
export xP=<num_prefill_nodes>
export yD=<num_decode_nodes>
export MODEL_NAME=Llama-3.1-8B-Instruct
# num_nodes = xP + yD + 1
sbatch -N <num_nodes> -n <num_nodes> --nodelist=<Nodes> run_xPyD_models.slurm
The same sequence of commands applies to the other supported models. Only the MODEL_NAME export changes:
export MODEL_NAME=Llama-3.1-405B-Instruct-FP8-KV
export MODEL_NAME=amd-Llama-3.3-70B-Instruct-FP8-KV
export MODEL_NAME=Qwen3-32B
export MODEL_NAME=DeepSeek-V3
export MODEL_NAME=Mixtral-8x7B-v0.1
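After submitting the job, you can monitor it with standard Slurm commands and follow the per-node logs on the shared filesystem. The log layout is described in the next section; the path placeholders below are examples based on that layout.
# Check the state of the submitted job
squeue -u $USER
# Follow the main log of a server node once the job is running
# (<LOG_PATH> and <job_id> are placeholders for the values used by the Slurm script)
tail -f <LOG_PATH>/<job_id>/pd_sglang_bench_serving.sh_NODE<...>.log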
Post-run logs and testing#
Logs are stored in your shared filesystem in the directory specified by the LOG_PATH variable in the Slurm script. A new directory named after the Slurm job ID is created for each run. Inside that directory, you can access various logs:
pd_sglang_bench_serving.sh_NODE<...>.log – the main log for each server node.
etcd_NODE<...>.log – logs for the etcd services.
prefill_NODE<...>.log – logs for the prefill services.
decode_NODE<...>.log – logs for the decode services.
Run the benchmark parser script on a concurrency benchmark log file to generate tabulated data.
python3 benchmark_parser.py <log_path/benchmark_XXX_CONCURRENCY.log>
To verify that the service is responsive, send a curl request to the launched server from inside the Docker container on the proxy node. For example:
curl -X POST http://127.0.0.1:30000/generate \
-H "Content-Type: application/json" \
-d '{ "text": "Let me tell you a story ", "sampling_params": { "temperature": 0.3 } }'
Known issues#
When running larger models, such as DeepSeek-V3 and Llama-3.1-405B-Instruct-FP8-KV, at higher concurrency levels (512+), the following error might occur:
<TransferEncodingError: 400, message:
Not enough data to satisfy transfer length header.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
...
This leads to dropped requests and lower throughput.
Further reading#
To learn about Mooncake, see Welcome to Mooncake.
To learn more about the options for latency and throughput benchmark scripts, see sgl-project/sglang.
See the base upstream Docker image on Docker Hub.
To learn more about system settings and management practices to configure your system for MI300X series accelerators, see AMD Instinct MI300X system optimization.
For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see AMD Instinct MI300X workload optimization.
To learn how to run community models from Hugging Face on AMD GPUs, see Running models from Hugging Face.
To learn how to fine-tune LLMs and optimize inference, see Fine-tuning LLMs and inference optimization.
For a list of other ready-made Docker images for AI with ROCm, see AMD Infinity Hub.
Previous versions#
See SGLang inference performance testing version history to find documentation for previous releases of SGLang inference performance testing.