Training a model with Primus and PyTorch

2025-09-16

Applies to Linux

Primus is a unified and flexible LLM training framework that streamlines training on AMD Instinct accelerators through a modular, reproducible configuration paradigm. Primus now supports the PyTorch torchtitan backend.

Note

Primus with the PyTorch torchtitan backend is intended to supersede the ROCm PyTorch training workflow. To run workloads without Primus, see Training a model with PyTorch on ROCm.

For ease of use, AMD provides a ready-to-use Docker image, rocm/pytorch-training:v25.8, for MI300X series accelerators. The image contains the essential components for Primus and PyTorch training with Primus Turbo optimizations.

| Software component | Version |
|---|---|
| ROCm | 6.4.3 |
| PyTorch | 2.8.0a0+gitd06a406 |
| Python | 3.10.18 |
| Transformer Engine | 2.2.0.dev0+a1e66aae |
| Flash Attention | 3.0.0.post1 |
| hipBLASLt | 1.1.0-d1b517fc7a |
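
To confirm that a running container matches the versions listed above, a quick check along the following lines may help. This is a minimal sketch, not part of the image: `version_matches` and `report_torch_version` are hypothetical helper names, and the script assumes `python3` and `torch` are importable, which should hold inside the container but not necessarily on the host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the reported version starts with the
# expected prefix, so "2.8.0" matches "2.8.0a0+gitd06a406".
version_matches() {
    local expected="$1" actual="$2"
    [[ "$actual" == "$expected"* ]]
}

# Print the PyTorch and HIP versions only when torch can be imported.
report_torch_version() {
    if python3 -c "import torch" 2>/dev/null; then
        python3 -c "import torch; print('torch', torch.__version__, 'hip', torch.version.hip)"
    else
        echo "torch not importable here; run this inside the container"
    fi
}

report_torch_version

# Compare the interpreter version against the documented 3.10 series.
if version_matches "3.10." "$(python3 --version 2>/dev/null | awk '{print $2}')"; then
    echo "Python matches the documented 3.10 series"
else
    echo "Python differs from the documented 3.10 series"
fi
```

A prefix comparison is used so that local build suffixes such as `a0+gitd06a406` don't cause a false mismatch.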

Supported models

The following models are pre-optimized for performance on the AMD Instinct MI325X and MI300X accelerators. Some instructions, commands, and training recommendations in this documentation might vary by model – select one to get started.

Model
Llama 3.1 8B
Llama 3.1 70B

See also

For additional workloads, including Llama 3.3, Llama 3.2, Llama 2, GPT OSS, Qwen, and Flux models, see Training a model with PyTorch on ROCm (without Primus).

System validation

Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.

If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before starting training.

To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.
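
As one example of a setting worth checking, the current NUMA auto-balancing state can be read from the standard Linux procfs entry `/proc/sys/kernel/numa_balancing`. The sketch below only reports the state; `numa_balancing_status` is a hypothetical helper name, and the System validation and optimization guide remains the authority on which value to use and how to change it.

```bash
#!/usr/bin/env bash
# Report the NUMA auto-balancing state from the kernel's procfs entry.
# The optional argument overrides the path, which is handy for testing.
numa_balancing_status() {
    local path="${1:-/proc/sys/kernel/numa_balancing}"
    if [[ ! -r "$path" ]]; then
        echo "unknown"
        return 0
    fi
    case "$(tr -d '[:space:]' < "$path")" in
        0)   echo "disabled" ;;
        1|2) echo "enabled" ;;
        *)   echo "unknown" ;;
    esac
}

echo "NUMA auto-balancing: $(numa_balancing_status)"
```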

This Docker image is optimized for specific model configurations outlined below. Performance can vary for other training workloads, as AMD doesn’t test configurations and run conditions outside those described.

Pull the Docker image

Use the following command to pull the Docker image from Docker Hub.

docker pull rocm/pytorch-training:v25.8

Run training

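Once the setup is complete, choose one of the following two workflows to start benchmarking training: run the benchmark through MAD (madengine) from the host machine, or run the benchmark scripts yourself inside the Docker container. For fine-tuning workloads and multi-node training examples, see Training a model with PyTorch on ROCm (without Primus).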

The following run command is tailored to Llama 3.1 8B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Run the following command on the host machine to benchmark the Llama 3.1 8B model on one node with the BF16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags primus_pyt_train_llama-3.1-8b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

    MAD launches a Docker container with the name container_ci-primus_pyt_train_llama-3.1-8b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.
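
The column layout of perf.csv is not described here, so the following sketch makes no assumption about column names. It pairs each header field with the value from the most recent (last) row. `summarize_perf_csv` is a hypothetical helper, and it assumes plain comma-separated fields without embedded commas or quotes.

```bash
#!/usr/bin/env bash
# Print each column name alongside its value in the last row of a CSV file.
# Assumes simple comma-separated fields without embedded commas or quotes.
summarize_perf_csv() {
    local file="${1:-$HOME/MAD/perf.csv}"
    if [[ ! -f "$file" ]]; then
        echo "no report found at $file" >&2
        return 1
    fi
    awk -F',' '
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
        NF > 0  { for (i = 1; i <= NF; i++) last[i] = $i; n = NF }
        END     { for (i = 1; i <= n; i++) printf "%s: %s\n", name[i], last[i] }
    ' "$file"
}

# Defaults to ~/MAD/perf.csv; a missing file is reported but not fatal.
summarize_perf_csv || true
```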

The following run command is tailored to Llama 3.1 70B. See Supported models to switch to another available model.

  1. Clone the ROCm Model Automation and Dashboarding (ROCm/MAD) repository to a local directory and install the required packages on the host machine.

    git clone https://github.com/ROCm/MAD
    cd MAD
    pip install -r requirements.txt
    
  2. Run the following command on the host machine to benchmark the Llama 3.1 70B model on one node with the BF16 data type.

    export MAD_SECRETS_HFTOKEN="your personal Hugging Face token to access gated models"
    madengine run \
        --tags primus_pyt_train_llama-3.1-70b \
        --keep-model-dir \
        --live-output \
        --timeout 28800
    

    MAD launches a Docker container with the name container_ci-primus_pyt_train_llama-3.1-70b. The latency and throughput reports of the model are collected in ~/MAD/perf.csv.

The following run commands are tailored to Llama 3.1 8B. See Supported models to switch to another available model.

Download the Docker image and required packages

  1. Use the following command to pull the Docker image from Docker Hub.

    docker pull rocm/pytorch-training:v25.8
    
  2. Run the Docker container.

    docker run -it \
        --device /dev/dri \
        --device /dev/kfd \
        --network host \
        --ipc host \
        --group-add video \
        --cap-add SYS_PTRACE \
        --security-opt seccomp=unconfined \
        --privileged \
        -v $HOME:$HOME \
        -v $HOME/.ssh:/root/.ssh \
        --shm-size 64G \
        --name training_env \
        rocm/pytorch-training:v25.8
    

    Use these commands if you exit the training_env container and need to return to it.

    docker start training_env
    docker exec -it training_env bash
    
  3. In the Docker container, clone the ROCm/MAD repository and navigate to the benchmark scripts directory /workspace/MAD/scripts/pytorch_train.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/pytorch_train
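
Before launching the container in step 2, it can help to confirm on the host that the device nodes passed with `--device` actually exist. The following is a minimal sketch; `check_device_nodes` is a hypothetical helper, and the mere presence of a node does not guarantee that the driver is healthy.

```bash
#!/usr/bin/env bash
# Check that every path given as an argument exists; with no arguments,
# check the two device nodes passed to `docker run` via --device.
check_device_nodes() {
    local missing=0 node
    if [[ $# -eq 0 ]]; then
        set -- /dev/kfd /dev/dri
    fi
    for node in "$@"; do
        if [[ -e "$node" ]]; then
            echo "found: $node"
        else
            echo "missing: $node"
            missing=1
        fi
    done
    return "$missing"
}

check_device_nodes || echo "GPU device nodes are missing; the container would not see the accelerators"
```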
    

Prepare training datasets and dependencies

  1. The following benchmarking examples require downloading models and datasets from Hugging Face. To ensure successful access to gated repos, set your HF_TOKEN.

    export HF_TOKEN=$your_personal_hugging_face_access_token
    
  2. Run the setup script to install libraries and datasets needed for benchmarking.

    ./pytorch_benchmark_setup.sh
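
Because the token placeholder in step 1 is written as a shell variable, leaving it unreplaced expands to an empty string and the export silently sets nothing. A small guard such as the following can catch that before the setup script starts downloading. This is a sketch under that assumption; `require_hf_token` is a hypothetical helper and it does not verify that the token is valid.

```bash
#!/usr/bin/env bash
# Succeed only when HF_TOKEN is set to a non-empty value.
# The token itself is never printed, only its length.
require_hf_token() {
    if [[ -z "${HF_TOKEN:-}" ]]; then
        echo "HF_TOKEN is not set; gated model downloads are likely to fail" >&2
        return 1
    fi
    echo "HF_TOKEN is set (${#HF_TOKEN} characters)"
}

require_hf_token || echo "export HF_TOKEN before running ./pytorch_benchmark_setup.sh"
```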
    

Pretraining

To start the pretraining benchmark, use the following command with the appropriate options. The options are described after the command.

./pytorch_benchmark_report.sh -t pretrain \
    -m Llama-3.1-8B \
    -p $datatype \
    -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $datatype | BF16 or FP8 | Currently, only Llama 3.1 8B supports FP8 precision. |
| $sequence_length | Between 2048 and 8192. 8192 by default. | Sequence length for the language model. |
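
The constraints in the options table above can be checked before a long run is started. The sketch below validates the values and prints the resulting command rather than executing it, so it is safe to try anywhere. `validate_pretrain_args` and `pretrain_command` are hypothetical helpers that encode only the rules stated in this section; the benchmark script may apply checks of its own.

```bash
#!/usr/bin/env bash
# Validate the -m, -p, and -s values against the options table.
validate_pretrain_args() {
    local model="$1" datatype="$2" seq_len="$3"
    case "$datatype" in
        BF16) ;;
        FP8)
            if [[ "$model" != "Llama-3.1-8B" ]]; then
                echo "FP8 is listed only for Llama-3.1-8B" >&2
                return 1
            fi ;;
        *)
            echo "unsupported datatype: $datatype" >&2
            return 1 ;;
    esac
    # Reject non-numeric input first, then enforce the documented range.
    if ! [[ "$seq_len" =~ ^[0-9]+$ ]] || (( 10#$seq_len < 2048 || 10#$seq_len > 8192 )); then
        echo "sequence length must be between 2048 and 8192" >&2
        return 1
    fi
}

# Print the command instead of running it.
pretrain_command() {
    validate_pretrain_args "$@" || return 1
    echo "./pytorch_benchmark_report.sh -t pretrain -m $1 -p $2 -s $3"
}

pretrain_command Llama-3.1-8B FP8 4096
```

With the arguments shown, the last line prints `./pytorch_benchmark_report.sh -t pretrain -m Llama-3.1-8B -p FP8 -s 4096`.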

Benchmarking examples

Use the following command to train Llama 3.1 8B with BF16 precision using Primus torchtitan.

./pytorch_benchmark_report.sh -m Llama-3.1-8B

To train Llama 3.1 8B with FP8 precision, use the following command.

./pytorch_benchmark_report.sh -m Llama-3.1-8B -p FP8

The following run commands are tailored to Llama 3.1 70B. See Supported models to switch to another available model.

Download the Docker image and required packages

  1. Use the following command to pull the Docker image from Docker Hub.

    docker pull rocm/pytorch-training:v25.8
    
  2. Run the Docker container.

    docker run -it \
        --device /dev/dri \
        --device /dev/kfd \
        --network host \
        --ipc host \
        --group-add video \
        --cap-add SYS_PTRACE \
        --security-opt seccomp=unconfined \
        --privileged \
        -v $HOME:$HOME \
        -v $HOME/.ssh:/root/.ssh \
        --shm-size 64G \
        --name training_env \
        rocm/pytorch-training:v25.8
    

    Use these commands if you exit the training_env container and need to return to it.

    docker start training_env
    docker exec -it training_env bash
    
  3. In the Docker container, clone the ROCm/MAD repository and navigate to the benchmark scripts directory /workspace/MAD/scripts/pytorch_train.

    git clone https://github.com/ROCm/MAD
    cd MAD/scripts/pytorch_train
    

Prepare training datasets and dependencies

  1. The following benchmarking examples require downloading models and datasets from Hugging Face. To ensure successful access to gated repos, set your HF_TOKEN.

    export HF_TOKEN=$your_personal_hugging_face_access_token
    
  2. Run the setup script to install libraries and datasets needed for benchmarking.

    ./pytorch_benchmark_setup.sh
    

Pretraining

To start the pretraining benchmark, use the following command with the appropriate options. The options are described after the command.

./pytorch_benchmark_report.sh -t pretrain \
    -m Llama-3.1-70B \
    -p $datatype \
    -s $sequence_length

| Name | Options | Description |
|---|---|---|
| $datatype | BF16 | Currently, only Llama 3.1 8B supports FP8 precision. |
| $sequence_length | Between 2048 and 8192. 8192 by default. | Sequence length for the language model. |

Benchmarking examples

Use the following command to train Llama 3.1 70B with BF16 precision using Primus torchtitan.

./pytorch_benchmark_report.sh -m Llama-3.1-70B

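To train Llama 3.1 70B with FP8 precision, use the following command. The options listed above for this model include only BF16, so confirm that your image supports FP8 for Llama 3.1 70B before relying on this run.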

./pytorch_benchmark_report.sh -m Llama-3.1-70B -p FP8

Further reading

Previous versions

See PyTorch training performance testing version history to find documentation for previous releases of the rocm/pytorch-training Docker image.