Training a model with Primus and PyTorch

2025-12-29

Applies to Linux

Primus is a unified and flexible LLM training framework that streamlines training on AMD Instinct GPUs using a modular, reproducible configuration paradigm. Primus now supports the PyTorch torchtitan backend.

Note

For a unified training solution on AMD GPUs with ROCm, the rocm/pytorch-training Docker Hub registry will be deprecated soon in favor of rocm/primus. The rocm/primus Docker containers will cover PyTorch training ecosystem frameworks, including torchtitan and Megatron-LM.

Primus with the PyTorch torchtitan backend is designed to replace the ROCm PyTorch training workflow. See Training a model with PyTorch on ROCm for steps to run workloads without Primus.

AMD provides a ready-to-use Docker image for MI355X, MI350X, MI325X, and MI300X GPUs. It contains the essential components for Primus and PyTorch training, with Primus Turbo optimizations.

rocm/primus:v25.11

Software component	Version
ROCm	7.1.0
PyTorch	2.10.0.dev20251112+rocm7.1
Python	3.10
Transformer Engine	2.4.0.dev0+32e2d1d4
Flash Attention	2.8.3
hipBLASLt	1.2.0-09ab7153e2

Supported models

The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X GPUs. Some instructions, commands, and training recommendations in this documentation might vary by model – select one to get started.

Model	Variant
Meta Llama	Llama 3.1 8B, Llama 3.1 70B
DeepSeek	DeepSeek V3 16B

System validation

Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before starting training.
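As a quick illustration of one such check, the sketch below reads the Linux kernel's NUMA auto-balancing setting. It assumes the guide's recommendation is to have auto-balancing disabled (value `0`). The function name `check_numa_balancing` and its optional path argument are hypothetical helpers, not part of any ROCm tool. The System validation and optimization guide remains the authoritative procedure.

```shell
# Report whether NUMA auto-balancing is disabled (assumed recommended state: 0).
# The optional argument is a hypothetical convenience so the check can be
# pointed at a different file; by default it reads the kernel setting.
check_numa_balancing() {
    numa_file="${1:-/proc/sys/kernel/numa_balancing}"
    if [ ! -r "$numa_file" ]; then
        echo "unknown: $numa_file is not readable"
        return 2
    fi
    value="$(cat "$numa_file")"
    if [ "$value" = "0" ]; then
        echo "disabled"
    else
        echo "enabled (value=$value)"
        return 1
    fi
}
```

Calling `check_numa_balancing` with no arguments inspects the running host. A non-zero exit status means the setting is enabled or could not be read.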

To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.

This Docker image is optimized for specific model configurations outlined below. Performance can vary for other training workloads, as AMD doesn’t test configurations and run conditions outside those described.

Pull the Docker image

Use the following command to pull the Docker image from Docker Hub.

docker pull rocm/primus:v25.11
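After the pull, you might want to start an interactive container from the image. The sketch below only prints a candidate `docker run` command so you can review it first. The `/dev/kfd` and `/dev/dri` device flags and the `video` group follow common ROCm container usage. The container name `primus_training_env`, the `64G` shared-memory size, and the helper name `primus_docker_cmd` are illustrative assumptions, not values taken from this page.

```shell
# Print (do not execute) a docker run command for the Primus image.
# Container name and --shm-size are illustrative assumptions; adjust as needed.
primus_docker_cmd() {
    image="${1:-rocm/primus:v25.11}"
    printf '%s\n' "docker run -it --device /dev/kfd --device /dev/dri --group-add video --ipc host --shm-size 64G --name primus_training_env $image"
}
```

Run `primus_docker_cmd` to see the command, adapt it to your system, and then run the result manually.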

Run training

Once the setup is complete, choose one of the following two workflows to benchmark training. For fine-tuning workloads and multi-node training examples, see Training a model with PyTorch on ROCm (without Primus). For best performance on MI325X, MI350X, and MI355X GPUs, you might need to adjust some configuration values, such as batch sizes.

MAD-integrated benchmarking

The following run command is tailored to Llama 3.1 8B. See Supported models to switch to another available model.

Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.
```
git clone https://github.com/ROCm/MAD
cd MAD
pip install -r requirements.txt
```
On the host machine, use the following command to run the performance benchmark on the Llama 3.1 8B model, using one node with the BF16 data type.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags primus_pyt_train_llama-3.1-8b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-primus_pyt_train_llama-3.1-8b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.
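To glance at the collected results without opening the whole file, a small helper like the one below can print the header row and the most recent row. This is a sketch that assumes the first line of `perf.csv` is a header and that new results are appended at the end. It makes no assumption about column names. The function name `show_latest_perf` is hypothetical.

```shell
# Print the header row and the last data row of a MAD perf.csv file.
# Assumes line 1 is a header and newer results are appended at the end;
# columns are shown as-is, with no assumption about their names.
show_latest_perf() {
    perf_file="${1:-$HOME/MAD/perf.csv}"
    if [ ! -s "$perf_file" ]; then
        echo "no results found at $perf_file" >&2
        return 1
    fi
    head -n 1 "$perf_file"
    # Skip the header, then keep only the final row.
    tail -n +2 "$perf_file" | tail -n 1
}
```

With no argument, `show_latest_perf` reads `~/MAD/perf.csv`. Pass a path to inspect a results file stored elsewhere.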

The following run command is tailored to Llama 3.1 70B. See Supported models to switch to another available model.

On the host machine, use the following command to run the performance benchmark on the Llama 3.1 70B model, using one node with the BF16 data type.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags primus_pyt_train_llama-3.1-70b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-primus_pyt_train_llama-3.1-70b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run command is tailored to DeepSeek V3 16B. See Supported models to switch to another available model.

On the host machine, use the following command to run the performance benchmark on the DeepSeek V3 16B model, using one node with the BF16 data type.
```
export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
madengine run \
    --tags primus_pyt_train_deepseek-v3-16b \
    --keep-model-dir \
    --live-output \
    --timeout 28800
```
MAD launches a Docker container with the name container_ci-primus_pyt_train_deepseek-v3-16b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.
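Because the three commands differ only in their `--tags` value, they can be wrapped in a loop. The following is a minimal sketch, assuming it is run from the MAD directory with `madengine` available. The `DRY_RUN` variable and the function name `run_primus_benchmarks` are hypothetical conveniences, not MAD features. The tags and flags are copied from the commands above.

```shell
# Tags copied from the per-model commands above.
PRIMUS_TAGS="primus_pyt_train_llama-3.1-8b primus_pyt_train_llama-3.1-70b primus_pyt_train_deepseek-v3-16b"

# Run each benchmark in turn, or only print the commands when DRY_RUN=1.
# DRY_RUN is a hypothetical switch defined by this sketch, not by madengine.
run_primus_benchmarks() {
    if [ -z "${MAD_SECRETS_HFTOKEN:-}" ]; then
        echo "MAD_SECRETS_HFTOKEN is not set" >&2
        return 1
    fi
    for tag in $PRIMUS_TAGS; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "madengine run --tags $tag --keep-model-dir --live-output --timeout 28800"
        else
            # Stop at the first failing benchmark.
            madengine run --tags "$tag" --keep-model-dir --live-output --timeout 28800 || return 1
        fi
    done
}
```

Try `DRY_RUN=1 run_primus_benchmarks` first to confirm the commands, then call `run_primus_benchmarks` without it. The early token check avoids starting a long run that would fail on a gated model.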

Primus benchmarking

Previous versions

See PyTorch training performance testing version history to find documentation for previous releases of the rocm/pytorch-training Docker image.
