Training a model with JAX MaxText on ROCm

2025-12-18

Applies to Linux

The MaxText for ROCm training Docker image provides a prebuilt environment for training on AMD Instinct MI355X, MI350X, MI325X, and MI300X GPUs, including essential components like JAX, XLA, ROCm libraries, and MaxText utilities. It includes the following software components:

rocm/jax-training:maxtext-v25.11

Software component	Version
ROCm	7.1.0
JAX	0.7.1
Python	3.12
Transformer Engine	2.4.0.dev0+281042de
hipBLASLt	1.2.x

Note

The rocm/jax-training:maxtext-v25.9 image has been updated to rocm/jax-training:maxtext-v25.9.1. This revision is intended to fix segmentation faults during launch. See the versioned documentation.

MaxText on ROCm provides the following key features to train large language models efficiently:

Transformer Engine (TE)
Flash Attention (FA) 3 – with or without sequence input packing
GEMM tuning
Multi-node support
NANOO FP8 (for MI300X series GPUs) and FP8 (for MI355X and MI350X) quantization support

Supported models#

The following models are pre-optimized for performance on AMD Instinct GPUs. Some instructions, commands, and available training configurations in this documentation might vary by model – select one to get started.

Model	Variants
Meta Llama	Llama 2 7B, Llama 2 70B, Llama 3 8B (multi-node), Llama 3 70B (multi-node), Llama 3.1 8B, Llama 3.1 70B, Llama 3.3 70B
DeepSeek	DeepSeek-V2-Lite (16B)
Mistral AI	Mixtral 8x7B

Note

Some models, such as Llama 3, require an external license agreement through a third party (for example, Meta).

System validation#

Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before starting training.

To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.
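
As a quick illustration of the NUMA auto-balancing check mentioned above, the following sketch reads the kernel's setting from /proc/sys/kernel/numa_balancing, where 0 means disabled and any other value means enabled. The function name numa_balancing_status and its optional path argument are assumptions made for this example. Defer to the System validation and optimization guide for the recommended setting and how to change it.

```shell
#!/usr/bin/env bash
# Report whether NUMA auto-balancing is enabled.
# numa_balancing_status is a hypothetical helper name for this sketch.
# The optional path argument exists only so the function can be tried
# against a sample file; by default it reads the kernel's procfs entry.
numa_balancing_status() {
    local path="${1:-/proc/sys/kernel/numa_balancing}"
    if [ ! -r "$path" ]; then
        # The entry is absent on kernels built without NUMA balancing.
        echo "unknown"
        return 0
    fi
    local value
    value="$(cat "$path")"
    if [ "$value" = "0" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

echo "NUMA auto-balancing: $(numa_balancing_status)"
```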

Environment setup#

This Docker image is optimized for specific model configurations outlined as follows. Performance can vary for other training workloads, as AMD doesn’t validate configurations and run conditions outside those described.

Pull the Docker image#

Use the following command to pull the Docker image from Docker Hub.

docker pull rocm/jax-training:maxtext-v25.11

Multi-node configuration#

See Multi-node setup for AI workloads to configure your environment for multi-node training.

Benchmarking#

Once the setup is complete, choose between two options to reproduce the benchmark results:

MAD-integrated benchmarking

The following run command is tailored to Llama 2 7B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```

Use this command to run the performance benchmark test on the Llama 2 7B model using one GPU with the bf16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags jax_maxtext_train_llama-2-7b \
    --keep-model-dir \
    --live-output \
    --timeout 28800

MAD launches a Docker container named container_ci-jax_maxtext_train_llama-2-7b. The model's latency and throughput reports are collected in ~/MAD/perf.csv.
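
To glance at the collected results from the command line, a minimal sketch like the following prints the CSV header and the last row. It assumes the most recent run is appended as the final row, which this page does not state, and it makes no assumption about the column names. The helper name show_latest_result is hypothetical.

```shell
# Print the header line and the last row of the MAD results file.
# show_latest_result is a hypothetical helper, not part of MAD.
# Assumption: the most recent run is the last row of the file.
show_latest_result() {
    local csv="${1:-$HOME/MAD/perf.csv}"
    if [ ! -r "$csv" ]; then
        echo "no results file at $csv" >&2
        return 1
    fi
    head -n 1 "$csv"   # column names
    tail -n 1 "$csv"   # last recorded row
}
```

Call it as show_latest_result with no arguments to read ~/MAD/perf.csv, or pass another path. If the file contains only a header, that line is printed twice.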

Standalone benchmarking

The following commands are optimized for Llama 2 7B. See Supported models to switch to another available model. Some instructions and resources might not be available for all models and configurations.

Download the Docker image and required scripts

To run the JAX MaxText benchmark tool independently, first pull the Docker image as shown in the following snippet.

docker pull rocm/jax-training:maxtext-v25.11

Single node training

Set up environment variables.
```
export MAD_SECRETS_HFTOKEN=<Your Hugging Face token>
export HF_HOME=<Location of saved/cached Hugging Face models>
```
MAD_SECRETS_HFTOKEN is your Hugging Face access token to access models, tokenizers, and data. See User access tokens.

HF_HOME is where huggingface_hub will store local data. See huggingface_hub CLI. If you already have downloaded or cached Hugging Face artifacts, set this variable to that path. Downloaded files typically get cached to ~/.cache/huggingface.
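
Because the container launch passes both variables through, it can help to confirm they are set before starting Docker. The following is a minimal sketch, not part of MAD or MaxText. The helper name check_hf_env is hypothetical, and the fallback path is the typical cache location mentioned above.

```shell
# Pre-flight check for the variables used when launching the container.
# check_hf_env is a hypothetical helper name for this sketch.
check_hf_env() {
    if [ -z "${MAD_SECRETS_HFTOKEN:-}" ]; then
        echo "MAD_SECRETS_HFTOKEN is not set" >&2
        return 1
    fi
    # Fall back to the typical huggingface_hub cache location if unset.
    : "${HF_HOME:=$HOME/.cache/huggingface}"
    export HF_HOME
    if [ ! -d "$HF_HOME" ]; then
        echo "HF_HOME directory does not exist: $HF_HOME" >&2
        return 1
    fi
    echo "HF_HOME=$HF_HOME"
}
```

Running check_hf_env before the docker run step prints the resolved cache path, or returns a non-zero status with a message explaining what is missing.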

Launch the Docker container.

docker run -it \
    --device=/dev/dri \
    --device=/dev/kfd \
    --network host \
    --ipc host \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --privileged \
    -v $HOME:$HOME \
    -v $HOME/.ssh:/root/.ssh \
    -v $HF_HOME:/hf_cache \
    -e HF_HOME=/hf_cache \
    -e MAD_SECRETS_HFTOKEN=$MAD_SECRETS_HFTOKEN \
    --shm-size 64G \
    --name training_env \
    rocm/jax-training:maxtext-v25.11

In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at MAD/scripts/jax-maxtext.
```
git clone https://github.com/ROCm/MAD
cd MAD/scripts/jax-maxtext
```
Run the setup scripts to install libraries and datasets needed for benchmarking.
```
./jax-maxtext_benchmark_setup.sh -m Llama-2-7B
```
To run the training benchmark without quantization, use the following command:
```
./jax-maxtext_benchmark_report.sh -m Llama-2-7B
```
For quantized training, run the script with the appropriate option for your Instinct GPU.

MI355X and MI350X

For fp8 quantized training on MI355X and MI350X GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m Llama-2-7B -q fp8

MI325X and MI300X

For nanoo_fp8 quantized training on MI300X series GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m Llama-2-7B -q nanoo_fp8
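
The GPU-to-option mapping above can be captured in a small lookup, sketched here for models that document both options. The function name quant_option_for_gpu is hypothetical, and the GPU name is passed in by hand rather than detected.

```shell
# Map an Instinct GPU name to the -q value documented above.
# quant_option_for_gpu is a hypothetical helper for this sketch.
quant_option_for_gpu() {
    case "$1" in
        MI355X|MI350X) echo "fp8" ;;
        MI325X|MI300X) echo "nanoo_fp8" ;;
        *)
            echo "no documented quantization option for: $1" >&2
            return 1
            ;;
    esac
}

# Example use inside MAD/scripts/jax-maxtext (not executed here):
#   ./jax-maxtext_benchmark_report.sh -m Llama-2-7B -q "$(quant_option_for_gpu MI300X)"
```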

Multi-node training

The following examples use SLURM to run on multiple nodes.

Note

The following scripts will launch the Docker container and run the benchmark. Run them outside of any Docker container.

Make sure $HF_HOME is set before running the test. See ROCm benchmarking for more details on downloading the Llama models before running the benchmark.
To run multi-node training for Llama 2 7B, use the multi-node training script under the scripts/jax-maxtext/gpu-rocm/ directory.

Run the multi-node training benchmark script.

sbatch -N <num_nodes> llama2_7b_multinode.sh
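
Since the job depends on $HF_HOME and a valid node count, a dry-run wrapper can catch mistakes before anything is queued. The sketch below only prints the sbatch command instead of submitting it. The helper name multinode_submit_cmd is hypothetical.

```shell
# Validate inputs, then print the sbatch command without submitting it.
# multinode_submit_cmd is a hypothetical helper for this sketch.
multinode_submit_cmd() {
    local nodes="$1" script="$2"
    case "$nodes" in
        ''|*[!0-9]*)
            echo "node count must be a positive integer" >&2
            return 1
            ;;
    esac
    if [ "$nodes" -lt 1 ]; then
        echo "node count must be at least 1" >&2
        return 1
    fi
    if [ -z "$script" ]; then
        echo "missing multi-node script name" >&2
        return 1
    fi
    if [ -z "${HF_HOME:-}" ]; then
        echo "HF_HOME is not set" >&2
        return 1
    fi
    echo "sbatch -N $nodes $script"
}
```

For example, multinode_submit_cmd 2 llama2_7b_multinode.sh prints the command to run from the scripts/jax-maxtext/gpu-rocm/ directory once the checks pass.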

Profiling with rocprofv3

If you need to collect a trace and the JAX profiler isn’t working, use rocprofv3 provided by the ROCprofiler-SDK as a workaround. For example:

# -d sets the output directory for the traces.
# Replace the command after "--" with the command you want to profile.
rocprofv3 \
    --hip-trace \
    --kernel-trace \
    --memory-copy-trace \
    --rccl-trace \
    --output-format pftrace \
    -d ./v3_traces \
    -- ./jax-maxtext_benchmark_report.sh -m Llama-2-7B

Use -d <TRACE_DIRECTORY> to set the directory where the trace files are saved. The resulting traces can be opened in Perfetto: https://ui.perfetto.dev/.

MAD-integrated benchmarking

The following run command is tailored to Llama 2 70B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```

Use this command to run the performance benchmark test on the Llama 2 70B model using one GPU with the bf16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags jax_maxtext_train_llama-2-70b \
    --keep-model-dir \
    --live-output \
    --timeout 28800

MAD launches a Docker container named container_ci-jax_maxtext_train_llama-2-70b. The model's latency and throughput reports are collected in ~/MAD/perf.csv.

Standalone benchmarking

The following commands are optimized for Llama 2 70B. See Supported models to switch to another available model. Some instructions and resources might not be available for all models and configurations.

Download the Docker image and required scripts

To run the JAX MaxText benchmark tool independently, first pull the Docker image as shown in the following snippet.

docker pull rocm/jax-training:maxtext-v25.11

Single node training

Set up environment variables.
```
export MAD_SECRETS_HFTOKEN=<Your Hugging Face token>
export HF_HOME=<Location of saved/cached Hugging Face models>
```
MAD_SECRETS_HFTOKEN is your Hugging Face access token to access models, tokenizers, and data. See User access tokens.

HF_HOME is where huggingface_hub will store local data. See huggingface_hub CLI. If you already have downloaded or cached Hugging Face artifacts, set this variable to that path. Downloaded files typically get cached to ~/.cache/huggingface.

Launch the Docker container.

docker run -it \
    --device=/dev/dri \
    --device=/dev/kfd \
    --network host \
    --ipc host \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --privileged \
    -v $HOME:$HOME \
    -v $HOME/.ssh:/root/.ssh \
    -v $HF_HOME:/hf_cache \
    -e HF_HOME=/hf_cache \
    -e MAD_SECRETS_HFTOKEN=$MAD_SECRETS_HFTOKEN \
    --shm-size 64G \
    --name training_env \
    rocm/jax-training:maxtext-v25.11

In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at MAD/scripts/jax-maxtext.
```
git clone https://github.com/ROCm/MAD
cd MAD/scripts/jax-maxtext
```
Run the setup scripts to install libraries and datasets needed for benchmarking.
```
./jax-maxtext_benchmark_setup.sh -m Llama-2-70B
```
To run the training benchmark without quantization, use the following command:
```
./jax-maxtext_benchmark_report.sh -m Llama-2-70B
```
For quantized training, run the script with the appropriate option for your Instinct GPU.

MI355X and MI350X

For fp8 quantized training on MI355X and MI350X GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m Llama-2-70B -q fp8

MI325X and MI300X

For nanoo_fp8 quantized training on MI300X series GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m Llama-2-70B -q nanoo_fp8

Multi-node training

The following examples use SLURM to run on multiple nodes.

Note

The following scripts will launch the Docker container and run the benchmark. Run them outside of any Docker container.

Make sure $HF_HOME is set before running the test. See ROCm benchmarking for more details on downloading the Llama models before running the benchmark.
To run multi-node training for Llama 2 70B, use the multi-node training script under the scripts/jax-maxtext/gpu-rocm/ directory.

Run the multi-node training benchmark script.

sbatch -N <num_nodes> llama2_70b_multinode.sh

Profiling with rocprofv3

If you need to collect a trace and the JAX profiler isn’t working, use rocprofv3 provided by the ROCprofiler-SDK as a workaround. For example:

# -d sets the output directory for the traces.
# Replace the command after "--" with the command you want to profile.
rocprofv3 \
    --hip-trace \
    --kernel-trace \
    --memory-copy-trace \
    --rccl-trace \
    --output-format pftrace \
    -d ./v3_traces \
    -- ./jax-maxtext_benchmark_report.sh -m Llama-2-70B

Use -d <TRACE_DIRECTORY> to set the directory where the trace files are saved. The resulting traces can be opened in Perfetto: https://ui.perfetto.dev/.

Standalone benchmarking

The following commands are optimized for Llama 3 8B (multi-node). See Supported models to switch to another available model. Some instructions and resources might not be available for all models and configurations.

Download the Docker image and required scripts

To run the JAX MaxText benchmark tool independently, first pull the Docker image as shown in the following snippet.

docker pull rocm/jax-training:maxtext-v25.11

Multi-node training

The following examples use SLURM to run on multiple nodes.

Note

The following scripts will launch the Docker container and run the benchmark. Run them outside of any Docker container.

Make sure $HF_HOME is set before running the test. See ROCm benchmarking for more details on downloading the Llama models before running the benchmark.
To run multi-node training for Llama 3 8B, use the multi-node training script under the scripts/jax-maxtext/gpu-rocm/ directory.

Run the multi-node training benchmark script.

sbatch -N <num_nodes> llama3_8b_multinode.sh

Profiling with rocprofv3

If you need to collect a trace and the JAX profiler isn’t working, use rocprofv3 provided by the ROCprofiler-SDK as a workaround. For example:

# -d sets the output directory for the traces.
# Replace <model> with your model name, or replace the whole command after "--" with the command you want to profile.
rocprofv3 \
    --hip-trace \
    --kernel-trace \
    --memory-copy-trace \
    --rccl-trace \
    --output-format pftrace \
    -d ./v3_traces \
    -- ./jax-maxtext_benchmark_report.sh -m <model>

Use -d <TRACE_DIRECTORY> to set the directory where the trace files are saved. The resulting traces can be opened in Perfetto: https://ui.perfetto.dev/.

Standalone benchmarking

The following commands are optimized for Llama 3 70B (multi-node). See Supported models to switch to another available model. Some instructions and resources might not be available for all models and configurations.

Download the Docker image and required scripts

To run the JAX MaxText benchmark tool independently, first pull the Docker image as shown in the following snippet.

docker pull rocm/jax-training:maxtext-v25.11

Multi-node training

The following examples use SLURM to run on multiple nodes.

Note

The following scripts will launch the Docker container and run the benchmark. Run them outside of any Docker container.

Make sure $HF_HOME is set before running the test. See ROCm benchmarking for more details on downloading the Llama models before running the benchmark.
To run multi-node training for Llama 3 70B, use the multi-node training script under the scripts/jax-maxtext/gpu-rocm/ directory.

Run the multi-node training benchmark script.

sbatch -N <num_nodes> llama3_70b_multinode.sh

Profiling with rocprofv3

If you need to collect a trace and the JAX profiler isn’t working, use rocprofv3 provided by the ROCprofiler-SDK as a workaround. For example:

# -d sets the output directory for the traces.
# Replace <model> with your model name, or replace the whole command after "--" with the command you want to profile.
rocprofv3 \
    --hip-trace \
    --kernel-trace \
    --memory-copy-trace \
    --rccl-trace \
    --output-format pftrace \
    -d ./v3_traces \
    -- ./jax-maxtext_benchmark_report.sh -m <model>

Use -d <TRACE_DIRECTORY> to set the directory where the trace files are saved. The resulting traces can be opened in Perfetto: https://ui.perfetto.dev/.

MAD-integrated benchmarking

The following run command is tailored to Llama 3.1 8B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```

Use this command to run the performance benchmark test on the Llama 3.1 8B model using one GPU with the bf16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags jax_maxtext_train_llama-3.1-8b \
    --keep-model-dir \
    --live-output \
    --timeout 28800

MAD launches a Docker container named container_ci-jax_maxtext_train_llama-3.1-8b. The model's latency and throughput reports are collected in ~/MAD/perf.csv.

Standalone benchmarking

The following commands are optimized for Llama 3.1 8B. See Supported models to switch to another available model. Some instructions and resources might not be available for all models and configurations.

Download the Docker image and required scripts

To run the JAX MaxText benchmark tool independently, first pull the Docker image as shown in the following snippet.

docker pull rocm/jax-training:maxtext-v25.11

Single node training

Set up environment variables.
```
export MAD_SECRETS_HFTOKEN=<Your Hugging Face token>
export HF_HOME=<Location of saved/cached Hugging Face models>
```
MAD_SECRETS_HFTOKEN is your Hugging Face access token to access models, tokenizers, and data. See User access tokens.

HF_HOME is where huggingface_hub will store local data. See huggingface_hub CLI. If you already have downloaded or cached Hugging Face artifacts, set this variable to that path. Downloaded files typically get cached to ~/.cache/huggingface.

Launch the Docker container.

docker run -it \
    --device=/dev/dri \
    --device=/dev/kfd \
    --network host \
    --ipc host \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --privileged \
    -v $HOME:$HOME \
    -v $HOME/.ssh:/root/.ssh \
    -v $HF_HOME:/hf_cache \
    -e HF_HOME=/hf_cache \
    -e MAD_SECRETS_HFTOKEN=$MAD_SECRETS_HFTOKEN \
    --shm-size 64G \
    --name training_env \
    rocm/jax-training:maxtext-v25.11

In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at MAD/scripts/jax-maxtext.
```
git clone https://github.com/ROCm/MAD
cd MAD/scripts/jax-maxtext
```
Run the setup scripts to install libraries and datasets needed for benchmarking.
```
./jax-maxtext_benchmark_setup.sh -m Llama-3.1-8B
```
To run the training benchmark without quantization, use the following command:
```
./jax-maxtext_benchmark_report.sh -m Llama-3.1-8B
```
For quantized training, run the script with the appropriate option for your Instinct GPU.

MI355X and MI350X

For fp8 quantized training on MI355X and MI350X GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m Llama-3.1-8B -q fp8

MI325X and MI300X

For nanoo_fp8 quantized training on MI300X series GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m Llama-3.1-8B -q nanoo_fp8

Multi-node training

For multi-node training examples, choose a model from Supported models with an available multi-node training script.

MAD-integrated benchmarking

The following run command is tailored to Llama 3.1 70B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```

Use this command to run the performance benchmark test on the Llama 3.1 70B model using one GPU with the bf16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags jax_maxtext_train_llama-3.1-70b \
    --keep-model-dir \
    --live-output \
    --timeout 28800

MAD launches a Docker container named container_ci-jax_maxtext_train_llama-3.1-70b. The model's latency and throughput reports are collected in ~/MAD/perf.csv.

Standalone benchmarking

The following commands are optimized for Llama 3.1 70B. See Supported models to switch to another available model. Some instructions and resources might not be available for all models and configurations.

Download the Docker image and required scripts

To run the JAX MaxText benchmark tool independently, first pull the Docker image as shown in the following snippet.

docker pull rocm/jax-training:maxtext-v25.11

Single node training

Set up environment variables.
```
export MAD_SECRETS_HFTOKEN=<Your Hugging Face token>
export HF_HOME=<Location of saved/cached Hugging Face models>
```
MAD_SECRETS_HFTOKEN is your Hugging Face access token to access models, tokenizers, and data. See User access tokens.

HF_HOME is where huggingface_hub will store local data. See huggingface_hub CLI. If you already have downloaded or cached Hugging Face artifacts, set this variable to that path. Downloaded files typically get cached to ~/.cache/huggingface.

Launch the Docker container.

docker run -it \
    --device=/dev/dri \
    --device=/dev/kfd \
    --network host \
    --ipc host \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --privileged \
    -v $HOME:$HOME \
    -v $HOME/.ssh:/root/.ssh \
    -v $HF_HOME:/hf_cache \
    -e HF_HOME=/hf_cache \
    -e MAD_SECRETS_HFTOKEN=$MAD_SECRETS_HFTOKEN \
    --shm-size 64G \
    --name training_env \
    rocm/jax-training:maxtext-v25.11

In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at MAD/scripts/jax-maxtext.
```
git clone https://github.com/ROCm/MAD
cd MAD/scripts/jax-maxtext
```
Run the setup scripts to install libraries and datasets needed for benchmarking.
```
./jax-maxtext_benchmark_setup.sh -m Llama-3.1-70B
```
To run the training benchmark without quantization, use the following command:
```
./jax-maxtext_benchmark_report.sh -m Llama-3.1-70B
```
For quantized training, run the script with the appropriate option for your Instinct GPU.

MI355X and MI350X

For fp8 quantized training on MI355X and MI350X GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m Llama-3.1-70B -q fp8

Multi-node training

For multi-node training examples, choose a model from Supported models with an available multi-node training script.

MAD-integrated benchmarking

The following run command is tailored to Llama 3.3 70B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```

Use this command to run the performance benchmark test on the Llama 3.3 70B model using one GPU with the bf16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags jax_maxtext_train_llama-3.3-70b \
    --keep-model-dir \
    --live-output \
    --timeout 28800

MAD launches a Docker container named container_ci-jax_maxtext_train_llama-3.3-70b. The model's latency and throughput reports are collected in ~/MAD/perf.csv.

Standalone benchmarking

The following commands are optimized for Llama 3.3 70B. See Supported models to switch to another available model. Some instructions and resources might not be available for all models and configurations.

Download the Docker image and required scripts

To run the JAX MaxText benchmark tool independently, first pull the Docker image as shown in the following snippet.

docker pull rocm/jax-training:maxtext-v25.11

Single node training

Set up environment variables.
```
export MAD_SECRETS_HFTOKEN=<Your Hugging Face token>
export HF_HOME=<Location of saved/cached Hugging Face models>
```
MAD_SECRETS_HFTOKEN is your Hugging Face access token to access models, tokenizers, and data. See User access tokens.

HF_HOME is where huggingface_hub will store local data. See huggingface_hub CLI. If you already have downloaded or cached Hugging Face artifacts, set this variable to that path. Downloaded files typically get cached to ~/.cache/huggingface.

Launch the Docker container.

docker run -it \
    --device=/dev/dri \
    --device=/dev/kfd \
    --network host \
    --ipc host \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --privileged \
    -v $HOME:$HOME \
    -v $HOME/.ssh:/root/.ssh \
    -v $HF_HOME:/hf_cache \
    -e HF_HOME=/hf_cache \
    -e MAD_SECRETS_HFTOKEN=$MAD_SECRETS_HFTOKEN \
    --shm-size 64G \
    --name training_env \
    rocm/jax-training:maxtext-v25.11

In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at MAD/scripts/jax-maxtext.
```
git clone https://github.com/ROCm/MAD
cd MAD/scripts/jax-maxtext
```
Run the setup scripts to install libraries and datasets needed for benchmarking.
```
./jax-maxtext_benchmark_setup.sh -m Llama-3.3-70B
```
To run the training benchmark without quantization, use the following command:
```
./jax-maxtext_benchmark_report.sh -m Llama-3.3-70B
```
For quantized training, run the script with the appropriate option for your Instinct GPU.

MI355X and MI350X

For fp8 quantized training on MI355X and MI350X GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m Llama-3.3-70B -q fp8

Multi-node training

For multi-node training examples, choose a model from Supported models with an available multi-node training script.

MAD-integrated benchmarking

The following run command is tailored to DeepSeek-V2-Lite (16B). See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```

Use this command to run the performance benchmark test on the DeepSeek-V2-Lite (16B) model using one GPU with the bf16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags jax_maxtext_train_deepseek-v2-lite-16b \
    --keep-model-dir \
    --live-output \
    --timeout 28800

MAD launches a Docker container named container_ci-jax_maxtext_train_deepseek-v2-lite-16b. The model's latency and throughput reports are collected in ~/MAD/perf.csv.

Standalone benchmarking

The following commands are optimized for DeepSeek-V2-Lite (16B). See Supported models to switch to another available model. Some instructions and resources might not be available for all models and configurations.

Download the Docker image and required scripts

To run the JAX MaxText benchmark tool independently, first pull the Docker image as shown in the following snippet.

docker pull rocm/jax-training:maxtext-v25.11

Single node training

Set up environment variables.
```
export MAD_SECRETS_HFTOKEN=<Your Hugging Face token>
export HF_HOME=<Location of saved/cached Hugging Face models>
```
MAD_SECRETS_HFTOKEN is your Hugging Face access token to access models, tokenizers, and data. See User access tokens.

HF_HOME is where huggingface_hub will store local data. See huggingface_hub CLI. If you already have downloaded or cached Hugging Face artifacts, set this variable to that path. Downloaded files typically get cached to ~/.cache/huggingface.

Launch the Docker container.

docker run -it \
    --device=/dev/dri \
    --device=/dev/kfd \
    --network host \
    --ipc host \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --privileged \
    -v $HOME:$HOME \
    -v $HOME/.ssh:/root/.ssh \
    -v $HF_HOME:/hf_cache \
    -e HF_HOME=/hf_cache \
    -e MAD_SECRETS_HFTOKEN=$MAD_SECRETS_HFTOKEN \
    --shm-size 64G \
    --name training_env \
    rocm/jax-training:maxtext-v25.11

In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at MAD/scripts/jax-maxtext.
```
git clone https://github.com/ROCm/MAD
cd MAD/scripts/jax-maxtext
```
Run the setup scripts to install libraries and datasets needed for benchmarking.
```
./jax-maxtext_benchmark_setup.sh -m DeepSeek-V2-lite
```
To run the training benchmark without quantization, use the following command:
```
./jax-maxtext_benchmark_report.sh -m DeepSeek-V2-lite
```
For quantized training, run the script with the appropriate option for your Instinct GPU.

MI355X and MI350X

For fp8 quantized training on MI355X and MI350X GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m DeepSeek-V2-lite -q fp8

MI325X and MI300X

For nanoo_fp8 quantized training on MI300X series GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m DeepSeek-V2-lite -q nanoo_fp8

Multi-node training

For multi-node training examples, choose a model from Supported models with an available multi-node training script.

MAD-integrated benchmarking

The following run command is tailored to Mixtral 8x7B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```

Use this command to run the performance benchmark test on the Mixtral 8x7B model using one GPU with the bf16 data type on the host machine.

export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags jax_maxtext_train_mixtral-8x7b \
    --keep-model-dir \
    --live-output \
    --timeout 28800

MAD launches a Docker container named container_ci-jax_maxtext_train_mixtral-8x7b. The model's latency and throughput reports are collected in ~/MAD/perf.csv.

Standalone benchmarking

The following commands are optimized for Mixtral 8x7B. See Supported models to switch to another available model. Some instructions and resources might not be available for all models and configurations.

Download the Docker image and required scripts

To run the JAX MaxText benchmark tool independently, first pull the Docker image as shown in the following snippet.

docker pull rocm/jax-training:maxtext-v25.11

Single node training

Set up environment variables.
```
export MAD_SECRETS_HFTOKEN=<Your Hugging Face token>
export HF_HOME=<Location of saved/cached Hugging Face models>
```
MAD_SECRETS_HFTOKEN is your Hugging Face access token to access models, tokenizers, and data. See User access tokens.

HF_HOME is where huggingface_hub will store local data. See huggingface_hub CLI. If you already have downloaded or cached Hugging Face artifacts, set this variable to that path. Downloaded files typically get cached to ~/.cache/huggingface.

Launch the Docker container.

docker run -it \
    --device=/dev/dri \
    --device=/dev/kfd \
    --network host \
    --ipc host \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --privileged \
    -v $HOME:$HOME \
    -v $HOME/.ssh:/root/.ssh \
    -v $HF_HOME:/hf_cache \
    -e HF_HOME=/hf_cache \
    -e MAD_SECRETS_HFTOKEN=$MAD_SECRETS_HFTOKEN \
    --shm-size 64G \
    --name training_env \
    rocm/jax-training:maxtext-v25.11

In the Docker container, clone the ROCm MAD repository and navigate to the benchmark scripts directory at MAD/scripts/jax-maxtext.
```
git clone https://github.com/ROCm/MAD
cd MAD/scripts/jax-maxtext
```
Run the setup scripts to install libraries and datasets needed for benchmarking.
```
./jax-maxtext_benchmark_setup.sh -m Mixtral-8x7B
```
To run the training benchmark without quantization, use the following command:
```
./jax-maxtext_benchmark_report.sh -m Mixtral-8x7B
```
For quantized training, run the script with the appropriate option for your Instinct GPU.

MI355X and MI350X

For fp8 quantized training on MI355X and MI350X GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m Mixtral-8x7B -q fp8

MI325X and MI300X

For nanoo_fp8 quantized training on MI300X series GPUs, use the following command:

./jax-maxtext_benchmark_report.sh -m Mixtral-8x7B -q nanoo_fp8

Multi-node training

For multi-node training examples, choose a model from Supported models with an available multi-node training script.

Known issues#

Minor performance regression (< 4%) for BF16 quantization in Llama models and Mixtral 8x7B.
You might see minor loss spikes, or the loss curve might converge to slightly higher end values than with the previous jax-training image.
For FP8 training on MI355X GPUs, many models display a warning message like: Warning: Latency not found for MI_M=16, MI_N=16, MI_K=128, mi_input_type=BFloat8Float8_fnuz. Returning latency value of 32 (really slow). The compile step might take longer than usual, but training will run. This will be fixed in a future release.
The built-in JAX profiler isn’t working. To collect a trace, use rocprofv3, provided by the ROCprofiler-SDK, as a workaround. For example:
```
# -d sets the output directory for the traces.
# Replace <model> with your model name, or replace the whole command after "--" with the command you want to profile.
rocprofv3 \
    --hip-trace \
    --kernel-trace \
    --memory-copy-trace \
    --rccl-trace \
    --output-format pftrace \
    -d ./v3_traces \
    -- ./jax-maxtext_benchmark_report.sh -m <model>
```
Use -d <TRACE_DIRECTORY> to set the directory where the trace files are saved. The resulting traces can be opened in Perfetto: https://ui.perfetto.dev/.

Previous versions#

See JAX MaxText training performance testing version history to find documentation for previous releases of the ROCm/jax-training Docker image.
