SGLang distributed inference with Mooncake#
2025-09-16
9 min read time
As LLM inference increasingly demands handling massive models and dynamic workloads, efficient distributed inference becomes essential. Traditional co-located architectures face bottlenecks because memory and compute resources are tightly coupled, which limits scalability and flexibility. Disaggregated inference splits LLM inference into distinct phases, typically prefill and decode, that run on separate nodes. This architecture, facilitated by libraries like Mooncake, uses high-bandwidth RDMA to transfer the Key-Value (KV) cache between prefill and decode nodes. Each phase can then be scaled and optimized independently, improving efficiency and throughput.
SGLang is a high-performance inference and serving engine for large language models (LLMs) and vision models. The ROCm-enabled SGLang base Docker image bundles SGLang with PyTorch, which is optimized for AMD Instinct MI300X series accelerators. It includes the following software components:
| Software component | Version |
|---|---|
| ROCm | 7.0.0 |
| SGLang | v0.5.2rc1 |
| pytorch-triton-rocm | 3.4.0+rocm7.0.0.gitf9e5bf54 |
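If you want to confirm these versions yourself, you can open a shell in the base image and query them. This is only a quick sanity-check sketch; the image tag is the one referenced in the Build the Docker image section below, and the exact commands might vary with your setup.
# Start an interactive shell in the base image (no GPU access is needed just to check versions)
docker run -it --rm lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x bash
# Inside the container, confirm the bundled versions
python3 -c "import sglang; print(sglang.__version__)"
python3 -c "import torch; print(torch.__version__, torch.version.hip)"
pip show pytorch-triton-rocm | grep Version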
The following sections guide you through setting up and running SGLang and Mooncake for disaggregated distributed inference on a Slurm cluster using AMD Instinct MI300X series accelerators backed by Mellanox CX-7 NICs.
Prerequisites#
Before starting, ensure you have:
A Slurm cluster with at least three nodes: one for the proxy, one or more for prefill (xP), and one or more for decode (yD). The total number of nodes is xP + yD + 1.
A Dockerized environment with SGLang, Mooncake, etcd, and NIC drivers built in. See Build the Docker image for instructions.
A shared filesystem for storing models, scripts, and logs (cluster-specific).
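Before launching anything, it can help to sanity-check the allocation. The commands below are generic examples; the shared filesystem path is a placeholder, and your cluster tooling might differ.
# Check that the required number of nodes is available
sinfo -N -l
# On each GPU node, verify the RDMA NICs are visible (requires rdma-core tools)
ibv_devinfo | grep -E "hca_id|state"
# Verify the AMD Instinct accelerators are visible
rocm-smi --showproductname
# Confirm the shared filesystem is mounted (path is an example)
df -h /shared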
Supported models#
The following models are supported for SGLang disaggregated prefill/decode inference. Some instructions, commands, and recommendations in this documentation might vary by selected model.
Note
Some models require access authorization prior to use through an external license agreement with a third party. To learn more about any of these models, see its model card on Hugging Face:
Llama 3.1 8B Instruct
Llama 3.1 405B FP8 KV
Llama 3.3 70B FP8 KV
Qwen3 32B
DeepSeek V3
Mixtral 8x7B v0.1
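For gated models, you typically need to authenticate with Hugging Face and place the weights on the shared filesystem before launching the service. The commands below are an example only; the target directory is a placeholder, and your cluster might handle model downloads differently.
# Authenticate once with a Hugging Face token that has been granted access to the model
huggingface-cli login
# Download the model to the shared filesystem (example path and model ID)
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
    --local-dir /shared/models/Llama-3.1-8B-Instruct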
Build the Docker image#
Get the Dockerfile located in ROCm/MAD. It uses lmsysorg/sglang:v0.5.2rc1-rocm700-mi30x as the base Docker image and installs the necessary components for Mooncake, etcd, and Mellanox network drivers.
git clone https://github.com/ROCm/MAD.git
cd MAD/docker
docker build \
-t sglang_disagg_pd_image \
-f sglang_disagg_inference.ubuntu.amd.Dockerfile .
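Every node in the Slurm allocation needs access to this image. How you distribute it depends on your cluster; the snippet below sketches two common options with placeholder names.
# Option 1: push the image to a registry reachable from all nodes (registry host is a placeholder)
docker tag sglang_disagg_pd_image my-registry:5000/sglang_disagg_pd_image
docker push my-registry:5000/sglang_disagg_pd_image
# Option 2: export the image to the shared filesystem and load it on each node
docker save sglang_disagg_pd_image -o /shared/images/sglang_disagg_pd_image.tar
docker load -i /shared/images/sglang_disagg_pd_image.tar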
Benchmarking#
The ROCm/MAD repository contains scripts to launch SGLang inference with prefill/decode disaggregation via Mooncake for supported models.
scripts/sglang_disagg/run_xPyD_models.slurm – the main Slurm batch script to launch Docker containers on all nodes using sbatch or salloc.
scripts/sglang_disagg/sglang_disagg_server.sh – the entrypoint script that runs inside each container to start the correct service – proxy, prefill, or decode (see the sketch after this list).
scripts/sglang_disagg/benchmark_xPyD.sh – the benchmark script to run the GSM8K accuracy benchmark and the SGLang benchmarking tool for performance measurement.
scripts/sglang_disagg/benchmark_parser.py – the log parser script to run on the concurrency benchmark log file to generate tabulated data.
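The following is a simplified, hypothetical sketch of how an entrypoint like sglang_disagg_server.sh can dispatch the three roles. The --disaggregation-mode and --disaggregation-transfer-backend flags follow SGLang's upstream prefill/decode disaggregation documentation; the load-balancer module path and the MODEL_PATH, PREFILL_ADDR, and DECODE_ADDR variables are assumptions for illustration, and the actual script in ROCm/MAD may differ.
#!/bin/bash
# Hypothetical role dispatcher; ROLE is one of: proxy | prefill | decode
ROLE=$1
case "$ROLE" in
  prefill)
    # Prefill server: computes the prompt KV cache and ships it to decode nodes via Mooncake
    python3 -m sglang.launch_server \
      --model-path "$MODEL_PATH" \
      --disaggregation-mode prefill \
      --disaggregation-transfer-backend mooncake \
      --host 0.0.0.0 --port 30000
    ;;
  decode)
    # Decode server: receives the KV cache and generates tokens
    python3 -m sglang.launch_server \
      --model-path "$MODEL_PATH" \
      --disaggregation-mode decode \
      --disaggregation-transfer-backend mooncake \
      --host 0.0.0.0 --port 30000
    ;;
  proxy)
    # Load balancer that pairs each request with a prefill and a decode server.
    # The module path is an assumption -- check the SGLang version bundled in the image.
    python3 -m sglang.srt.disaggregation.launch_lb \
      --prefill "http://${PREFILL_ADDR}:30000" \
      --decode "http://${DECODE_ADDR}:30000" \
      --host 0.0.0.0 --port 30000
    ;;
esac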
Launch the service#
The service is deployed using a Slurm batch script that orchestrates the containers across the allocated nodes.
# Clone the MAD repo if you haven't already and
# navigate to the scripts directory
git clone https://github.com/ROCm/MAD.git
cd MAD/scripts/sglang_disagg/
# Slurm sbatch run command
export DOCKER_IMAGE_NAME=sglang_disagg_pd_image
export xP=<num_prefill_nodes>
export yD=<num_decode_nodes>
export MODEL_NAME=Llama-3.1-8B-Instruct
# num_nodes = xP + yD + 1
sbatch -N <num_nodes> -n <num_nodes> --nodelist=<Nodes> run_xPyD_models.slurm
The same sequence of commands applies to the other supported models. Only the MODEL_NAME export changes:
export MODEL_NAME=Llama-3.1-405B-Instruct-FP8-KV
export MODEL_NAME=amd-Llama-3.3-70B-Instruct-FP8-KV
export MODEL_NAME=Qwen3-32B
export MODEL_NAME=DeepSeek-V3
export MODEL_NAME=Mixtral-8x7B-v0.1
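After submitting the job, you can monitor it with standard Slurm commands and follow the per-node logs on the shared filesystem. The log layout is described in the next section; the path placeholders below are examples based on that layout.
# Check the state of the submitted job
squeue -u $USER
# Follow the main log of a server node once the job is running
# (<LOG_PATH> and <job_id> are placeholders for the values used by the Slurm script)
tail -f <LOG_PATH>/<job_id>/pd_sglang_bench_serving.sh_NODE<...>.log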
Post-run logs and testing#
Logs are stored in your shared filesystem in the directory specified by the LOG_PATH variable in the Slurm script. A new directory named after the Slurm job ID is created for each run. Inside that directory, you can access various logs:
pd_sglang_bench_serving.sh_NODE<...>.log – the main log for each server node.
etcd_NODE<...>.log – logs for the etcd services.
prefill_NODE<...>.log – logs for the prefill services.
decode_NODE<...>.log – logs for the decode services.
Run the benchmark parser script on a concurrency benchmark log file to generate tabulated data.
python3 benchmark_parser.py <log_path/benchmark_XXX_CONCURRENCY.log>
To verify that the service is responsive, send a curl request to the launched server from inside the Docker container on the proxy node. For example:
curl -X POST http://127.0.0.1:30000/generate \
-H "Content-Type: application/json" \
-d '{ "text": "Let me tell you a story ", "sampling_params": { "temperature": 0.3 } }'
Known issues#
When running larger models, such as DeepSeek-V3 and Llama-3.1-405B-Instruct-FP8-KV, at higher concurrency levels (512+), the following error might occur:
<TransferEncodingError: 400, message:
Not enough data to satisfy transfer length header.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
...
This leads to dropped requests and lower throughput.
Further reading#
To learn about Mooncake, see Welcome to Mooncake.
To learn more about the options for latency and throughput benchmark scripts, see sgl-project/sglang.
See the base upstream Docker image on Docker Hub.
To learn more about system settings and management practices to configure your system for MI300X series accelerators, see AMD Instinct MI300X system optimization.
For application performance optimization strategies for HPC and AI workloads, including inference with vLLM, see AMD Instinct MI300X workload optimization.
To learn how to run community models from Hugging Face on AMD GPUs, see Running models from Hugging Face.
To learn how to fine-tune LLMs and optimize inference, see Fine-tuning LLMs and inference optimization.
For a list of other ready-made Docker images for AI with ROCm, see AMD Infinity Hub.
Previous versions#
See SGLang inference performance testing version history to find documentation for previous releases of SGLang inference performance testing.