Training a model with Primus and Megatron-LM#
2025-09-19
15 min read time
Primus is a unified and flexible LLM training framework. It streamlines LLM training on AMD Instinct accelerators using a modular, reproducible configuration paradigm. Primus is backend-agnostic and supports multiple training engines, including Megatron.
Note
Primus with Megatron supersedes the ROCm Megatron-LM training workflow. To learn how to migrate workloads from Megatron-LM to Primus with Megatron, see Migrating workloads to Primus (Megatron backend) from Megatron-LM.
For ease of use, AMD provides a ready-to-use Docker image for MI300 series accelerators containing the essential components for Primus and Megatron-LM. This Docker image uses Primus Turbo optimizations for performance; this release adds support for Primus Turbo with optimized attention and grouped GEMM kernels.
Note
This Docker environment is based on Python 3.10 and Ubuntu 22.04. For an alternative environment with Python 3.12 and Ubuntu 24.04, see the previous ROCm Megatron-LM v25.6 Docker release.
Software component | Version
---|---
ROCm | 6.4.3
Primus | 927a717
PyTorch | 2.8.0a0+gitd06a406
Python | 3.10
Transformer Engine | 2.2.0.dev0+54dd2bdc
hipBLASLt | d1b517fc7a
Triton | 3.3.0
RCCL | 2.22.3
Supported models#
The following models are pre-optimized for performance on AMD Instinct MI300X series accelerators. Some instructions, commands, and training examples in this documentation might vary by model.
Note
Some models, such as Llama, require an external license agreement through a third party (for example, Meta).
System validation#
Before running AI workloads, it’s important to validate that your AMD hardware is configured correctly and performing optimally.
If you have already validated your system settings, including aspects like NUMA auto-balancing, you can skip this step. Otherwise, complete the procedures in the System validation and optimization guide to properly configure your system settings before starting training.
To test for optimal performance, consult the recommended System health benchmarks. This suite of tests will help you verify and fine-tune your system’s configuration.
Environment setup#
Use the following instructions to set up the environment, configure the script to train models, and
reproduce the benchmark results on MI300X series accelerators with the rocm/megatron-lm:v25.8_py310
image.
Download the Docker image#
Use the following command to pull the Docker image from Docker Hub.
docker pull rocm/megatron-lm:v25.8_py310
Launch the Docker container.
docker run -it \
--device /dev/dri \
--device /dev/kfd \
--device /dev/infiniband \
--network host --ipc host \
--group-add video \
--cap-add SYS_PTRACE \
--security-opt seccomp=unconfined \
--privileged \
-v $HOME:$HOME \
--shm-size 128G \
--name primus_training_env \
rocm/megatron-lm:v25.8_py310
Use these commands if you exit the primus_training_env container and need to return to it.
docker start primus_training_env
docker exec -it primus_training_env bash
The Docker container includes verified commit 927a717 of the Primus repository.
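The container relies on the /dev/kfd and /dev/dri devices passed through by docker run, so a quick check from inside the container can catch a missing --device flag early. The following is a minimal sketch: check_rocm_devices is a hypothetical helper, not part of Primus or ROCm, and it only tests that the device nodes exist, not that the accelerators are healthy.

```shell
# Hypothetical helper (not part of Primus or ROCm): verify that the device
# nodes passed with --device are visible. The optional first argument is the
# device root and defaults to /dev.
check_rocm_devices() {
    local root="${1:-/dev}"
    local rc=0
    if [ ! -e "$root/kfd" ]; then
        echo "missing: $root/kfd (was --device /dev/kfd passed?)" >&2
        rc=1
    fi
    if [ ! -d "$root/dri" ]; then
        echo "missing: $root/dri (was --device /dev/dri passed?)" >&2
        rc=1
    fi
    return "$rc"
}
```

Inside the container, calling check_rocm_devices with no arguments returns a non-zero status and prints the missing path if either device is absent.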
Configuration#
Primus defines a training configuration in YAML for each model in examples/megatron/configs.
To update training parameters for a model, edit its configuration file:
- Llama 3.3 70B: examples/megatron/configs/llama3.3_70B-pretrain.yaml
- Llama 3.1 70B: examples/megatron/configs/llama3.1_70B-pretrain.yaml
- Llama 3.1 8B: examples/megatron/configs/llama3.1_8B-pretrain.yaml
- Llama 2 7B: examples/megatron/configs/llama2_7B-pretrain.yaml
- Llama 2 70B: examples/megatron/configs/llama2_70B-pretrain.yaml
- DeepSeek-V3 (proxy): examples/megatron/configs/deepseek_v3-pretrain.yaml
- DeepSeek-V2-Lite: examples/megatron/configs/deepseek_v2_lite-pretrain.yaml
- Mixtral 8x7B: examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml
- Mixtral 8x22B (proxy): examples/megatron/configs/mixtral_8x22B_v0.1-pretrain.yaml
- Qwen 2.5 7B: examples/megatron/configs/primus_qwen2.5_7B-pretrain.yaml
- Qwen 2.5 72B: examples/megatron/configs/qwen2.5_72B-pretrain.yaml
Training configuration YAML files for other models follow this naming convention.
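The naming convention used by these configuration files can be sketched as a small shell helper. This is an illustration only: primus_config_path is a hypothetical function, not part of Primus, and it assumes the pattern <model>-pretrain.yaml under examples/megatron/configs seen in the paths above.

```shell
# Hypothetical helper (not part of Primus): build the config path for a model
# name, assuming the <model>-pretrain.yaml naming convention.
# $1: model name as it appears in the file name, for example llama3.1_8B
# $2: optional config directory (defaults to examples/megatron/configs)
primus_config_path() {
    local model="$1"
    local config_dir="${2:-examples/megatron/configs}"
    printf '%s/%s-pretrain.yaml\n' "$config_dir" "$model"
}
```

For example, primus_config_path llama3.1_8B prints examples/megatron/configs/llama3.1_8B-pretrain.yaml. Listing the directory with ls examples/megatron/configs shows which files actually exist in your checkout.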
Note
See Key options for more information on configuration options.
Dataset options#
You can use either mock data or real data for training.
Mock data can be useful for testing and validation. Use the mock_data field to toggle between mock and real data. The default value is true (enabled).
mock_data: true
If you’re using a real dataset, update the train_data_path field to point to the location of your dataset.
mock_data: false
train_data_path: /path/to/your/dataset
Ensure that the files are accessible inside the Docker container.
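One way to confirm that the dataset is visible is to run a small check from a shell inside the container before starting training. The sketch below is not part of Primus: check_dataset_path is a hypothetical helper, and it assumes that train_data_path may be a file, a directory, or a file-name prefix shared by several files, so it accepts any readable path that starts with the given value.

```shell
# Hypothetical pre-flight check (not part of Primus): succeed if at least one
# readable file or directory starts with the given train_data_path value.
check_dataset_path() {
    local path="$1"
    local candidate
    if [ -z "$path" ]; then
        echo "usage: check_dataset_path <train_data_path>" >&2
        return 2
    fi
    # If the glob matches nothing it stays literal and the -r test fails.
    for candidate in "$path"*; do
        if [ -r "$candidate" ]; then
            echo "ok: found $candidate"
            return 0
        fi
    done
    echo "error: nothing readable at $path; check the container volume mounts" >&2
    return 1
}
```

Run it with the same value you set for train_data_path, for example check_dataset_path /path/to/your/dataset. A failure usually means the host directory was not mounted with -v when the container was created.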
Tokenizer#
Set the HF_TOKEN environment variable to a token with the right permissions to access the tokenizer for each model.
# Export your HF_TOKEN in the workspace
export HF_TOKEN=<your_hftoken>
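Because tokenizer access depends on HF_TOKEN, it can be useful to fail fast when the variable is missing instead of discovering it partway through a launch. A minimal sketch, assuming a POSIX-style shell with printenv available; require_env is a hypothetical helper, not part of Primus.

```shell
# Hypothetical helper (not part of Primus): return non-zero if any of the
# named environment variables is unset, empty, or not exported.
require_env() {
    local name
    local missing=0
    for name in "$@"; do
        # printenv only sees exported variables, which is what child
        # processes such as the training launcher will inherit.
        if [ -z "$(printenv "$name")" ]; then
            echo "error: $name is not set or not exported" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, running require_env HF_TOKEN before a training command stops early with a clear message if the token was set but not exported.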
Note
In Primus, each model uses a tokenizer from Hugging Face. For example, the Llama 3.1 8B model uses tokenizer_model: meta-llama/Llama-3.1-8B and tokenizer_type: Llama3Tokenizer, as defined in the llama3.1-8B model definition.
Run training#
Use the following example commands to set up the environment, configure key options, and run training on MI300X series accelerators with the AMD Megatron-LM environment.
Single node training#
To run training on a single node, navigate to /workspace/Primus and use the following setup commands:
pip install -r requirements.txt
export HSA_NO_SCRATCH_RECLAIM=1
export NVTE_CK_USES_BWD_V3=1
Once setup is complete, run the appropriate training command for your model.
To run pre-training for Llama 3.3 70B BF16, run:
EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \
bash ./examples/run_pretrain.sh \
--micro_batch_size 2 \
--global_batch_size 16 \
--train_iters 50
To run pre-training for Llama 3.1 8B FP8, run:
EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \
bash ./examples/run_pretrain.sh \
--train_iters 50 \
--fp8 hybrid
For Llama 3.1 8B BF16, use the following command:
EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \
bash ./examples/run_pretrain.sh --train_iters 50
To run pre-training for Llama 3.1 70B BF16, run:
EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
bash ./examples/run_pretrain.sh \
--train_iters 50
To run training on a single node for Llama 3.1 70B FP8 with a 40-layer proxy, use the following command:
EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
bash ./examples/run_pretrain.sh \
--train_iters 50 \
--num_layers 40 \
--fp8 hybrid
Note
Use two or more nodes to run the full Llama 70B model with FP8 precision.
To run pre-training for Llama 2 7B FP8, run:
EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \
bash ./examples/run_pretrain.sh \
--train_iters 50 \
--fp8 hybrid
To run pre-training for Llama 2 7B BF16, run:
EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \
bash ./examples/run_pretrain.sh --train_iters 50
To run pre-training for Llama 2 70B BF16, run:
EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \
bash ./examples/run_pretrain.sh --train_iters 50
To run training on a single node for DeepSeek-V3 (MoE with expert parallel) with 3-layer proxy, use the following command:
EXP=examples/megatron/configs/deepseek_v3-pretrain.yaml \
bash examples/run_pretrain.sh \
--num_layers 3 \
--moe_layer_freq 1 \
--train_iters 50
To run training on a single node for DeepSeek-V2-Lite (MoE with expert parallel), use the following command:
EXP=examples/megatron/configs/deepseek_v2_lite-pretrain.yaml \
bash examples/run_pretrain.sh \
--global_batch_size 256 \
--train_iters 50
To run training on a single node for Mixtral 8x7B (MoE with expert parallel), use the following command:
EXP=examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml \
bash examples/run_pretrain.sh --train_iters 50
To run training on a single node for Mixtral 8x22B (MoE with expert parallel) with 4-layer proxy, use the following command:
EXP=examples/megatron/configs/mixtral_8x22B_v0.1-pretrain.yaml \
bash examples/run_pretrain.sh \
--num_layers 4 \
--pipeline_model_parallel_size 1 \
--micro_batch_size 1 \
--global_batch_size 16 \
--train_iters 50
To run training on a single node for Qwen 2.5 7B BF16, use the following command:
EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \
bash examples/run_pretrain.sh --train_iters 50
For FP8, use the following command.
EXP=examples/megatron/configs/qwen2.5_7B-pretrain.yaml \
bash examples/run_pretrain.sh \
--train_iters 50 \
--fp8 hybrid
To run the training on a single node for Qwen 2.5 72B BF16, use the following command.
EXP=examples/megatron/configs/qwen2.5_72B-pretrain.yaml \
bash examples/run_pretrain.sh --train_iters 50
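When running several of the single-node commands above back to back, it can help to keep one log file per run. The following is a minimal sketch that assumes it is run from /workspace/Primus. The function run_pretrain_logged and the PRIMUS_LOG_DIR variable are hypothetical names introduced here; they are not part of Primus.

```shell
# Hypothetical wrapper (not part of Primus): run examples/run_pretrain.sh for
# one config and save its output to a timestamped log file.
# $1: path to the config YAML; remaining arguments are passed through as-is.
run_pretrain_logged() {
    if [ "$#" -lt 1 ]; then
        echo "usage: run_pretrain_logged <config.yaml> [options...]" >&2
        return 2
    fi
    local config="$1"
    shift
    local log_dir="${PRIMUS_LOG_DIR:-logs}"
    local log_file
    log_file="$log_dir/$(basename "$config" .yaml)-$(date +%Y%m%d-%H%M%S).log"
    mkdir -p "$log_dir" || return 1
    echo "logging to $log_file"
    # The function's exit status is the exit status of the training script.
    EXP="$config" bash ./examples/run_pretrain.sh "$@" > "$log_file" 2>&1
}
```

For example, run_pretrain_logged examples/megatron/configs/llama3.1_8B-pretrain.yaml --train_iters 50 --fp8 hybrid runs the FP8 command shown earlier and writes its output under logs/. Follow progress from another shell with tail -f on the log file.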
Multi-node training examples#
To run training on multiple nodes, use the run_slurm_pretrain.sh script to launch the multi-node workload. Use the following steps to set up your environment:
cd /workspace/Primus/
export DOCKER_IMAGE=rocm/megatron-lm:v25.8_py310
export HF_TOKEN=<your_HF_token>
export HSA_NO_SCRATCH_RECLAIM=1
export NVTE_CK_USES_BWD_V3=1
export NCCL_IB_HCA=<your_NCCL_IB_HCA> # specify which RDMA interfaces to use for communication
export NCCL_SOCKET_IFNAME=<your_NCCL_SOCKET_IFNAME> # your Network Interface
export GLOO_SOCKET_IFNAME=<your_GLOO_SOCKET_IFNAME> # your Network Interface
export NCCL_IB_GID_INDEX=3 # InfiniBand GID index for NCCL communication. Default is 3 for RoCE
Note
- Make sure the correct network drivers are installed on the nodes. If running inside a Docker container, either install the drivers inside the container or pass the network drivers from the host when creating the container.
- If NCCL_IB_HCA and NCCL_SOCKET_IFNAME are not set, Primus will try to auto-detect them. However, since NICs can vary across clusters, it is encouraged to explicitly export your NCCL parameters for the cluster.
- To find your network interface, you can use ip a.
- To find RDMA interfaces, you can use ibv_devices to get the list of all the RDMA/IB devices.
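As an illustration of turning the ibv_devices listing into a value for NCCL_IB_HCA, the sketch below extracts the device names and joins them with commas. It is not part of Primus or NCCL: ib_hca_list is a hypothetical helper, and it assumes the usual ibv_devices layout of a header line, a dashed separator line, and then one device per line with the name in the first column.

```shell
# Hypothetical helper (not part of Primus or NCCL): read ibv_devices output on
# stdin and print the device names as a comma-separated list.
# Assumes a "device ... node GUID" header and a dashed separator line.
ib_hca_list() {
    awk 'NF && $1 != "device" && $1 !~ /^-+$/ { print $1 }' | paste -sd, -
}
```

A possible use is export NCCL_IB_HCA="$(ibv_devices | ib_hca_list)". Review the result before relying on it, because not every listed device is necessarily intended for training traffic on your cluster.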
To train Llama 3.3 70B FP8 on 8 nodes, run:
NNODES=8 EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 1 \
--global_batch_size 256 \
--recompute_num_layers 80 \
--fp8 hybrid
To train Llama 3.3 70B BF16 on 8 nodes, run:
NNODES=8 EXP=examples/megatron/configs/llama3.3_70B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 1 \
--global_batch_size 256 \
--recompute_num_layers 12
To train Llama 3.1 8B FP8 on 8 nodes, run:
# Adjust the training parameters. For example, set global_batch_size to 8 * <single-node batch size> when running on 8 nodes.
NNODES=8 EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml \
bash ./examples/run_slurm_pretrain.sh \
--global_batch_size 1024 \
--fp8 hybrid
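The scaling rule in the comment above can be worked through with shell arithmetic. The sketch below is illustrative only: both functions are hypothetical helpers, the single-node batch size of 128 is an assumed value chosen so that eight nodes give the 1024 used in the command, and the data-parallel size of 64 assumes 8 nodes with 8 accelerators each and no model parallelism. The divisibility check reflects the Megatron-LM requirement that the global batch size be a multiple of micro batch size times data-parallel size.

```shell
# Hypothetical helpers (not part of Primus) for sanity-checking batch sizes.

# Scale a single-node global batch size by the number of nodes.
scaled_global_batch_size() {
    local nnodes="$1"
    local single_node_bs="$2"
    echo $(( $nnodes * $single_node_bs ))
}

# Check that global_batch_size is divisible by micro_batch_size * data_parallel_size.
# $1: global batch size, $2: micro batch size, $3: data-parallel size
check_batch_sizes() {
    local gbs="$1"
    local mbs="$2"
    local dp="$3"
    local per_step=$(( $mbs * $dp ))
    if [ $(( $gbs % $per_step )) -eq 0 ]; then
        echo "ok: $(( $gbs / $per_step )) micro-batch(es) per rank per iteration"
    else
        echo "error: $gbs is not divisible by $mbs * $dp" >&2
        return 1
    fi
}
```

With these assumed values, scaled_global_batch_size 8 128 prints 1024, and check_batch_sizes 1024 2 64 reports 8 micro-batches per rank per iteration.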
To train Llama 3.1 70B FP8 on 8 nodes, run:
NNODES=8 EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 1 \
--global_batch_size 256 \
--recompute_num_layers 80 \
--fp8 hybrid
To train Llama 3.1 70B BF16 on 8 nodes, run:
NNODES=8 EXP=examples/megatron/configs/llama3.1_70B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 1 \
--global_batch_size 256 \
--recompute_num_layers 12
To train Llama 2 7B FP8 on 8 nodes, run:
# Adjust the training parameters. For example, set global_batch_size to 8 * <single-node batch size> when running on 8 nodes.
NNODES=8 EXP=examples/megatron/configs/llama2_7B-pretrain.yaml \
bash ./examples/run_slurm_pretrain.sh \
--global_batch_size 2048 \
--fp8 hybrid
To train Llama 2 70B FP8 on 8 nodes, run:
NNODES=8 EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 2 \
--global_batch_size 256 \
--recompute_num_layers 80 \
--fp8 hybrid
To train Llama 2 70B BF16 on 8 nodes, run:
NNODES=8 EXP=examples/megatron/configs/llama2_70B-pretrain.yaml \
bash ./examples/run_slurm_pretrain.sh \
--micro_batch_size 2 \
--global_batch_size 1536 \
--recompute_num_layers 12
To train Mixtral 8x7B BF16 on 8 nodes, run:
NNODES=8 EXP=examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 2 \
--global_batch_size 256
To train Qwen 2.5 72B FP8 on 8 nodes, run:
NNODES=8 EXP=examples/megatron/configs/qwen2.5_72B-pretrain.yaml \
bash examples/run_slurm_pretrain.sh \
--micro_batch_size 4 \
--global_batch_size 256 \
--recompute_num_layers 80 \
--fp8 hybrid
Key options#
The following are key options to take note of:
- fp8
hybrid enables FP8 GEMMs.
- use_torch_fsdp2
use_torch_fsdp2: 1 enables torch fsdp-v2. If FSDP is enabled, set use_distributed_optimizer and overlap_param_gather to false.
- profile
To enable PyTorch profiling, set these parameters:
profile: true
use_pytorch_profiler: true
profile_step_end: 7
profile_step_start: 6
- train_iters
The total number of iterations (default: 50).
- mock_data
True by default.
- micro_batch_size
Micro batch size.
- global_batch_size
Global batch size.
- recompute_granularity
For activation checkpointing.
- num_layers
For using a reduced number of layers as with proxy models.
Further reading#
For an introduction to Primus, see Primus: A Lightweight, Unified Training Framework for Large Models on AMD GPUs.
To learn more about system settings and management practices to configure your system for AMD Instinct MI300X series accelerators, see AMD Instinct MI300X system optimization.
For a list of other ready-made Docker images for AI with ROCm, see AMD Infinity Hub.
Previous versions#
See Megatron-LM training performance testing version history to find documentation for previous releases of the ROCm/megatron-lm Docker image.
This training environment now uses Primus with Megatron as the primary configuration. Limited support for the legacy ROCm Megatron-LM is still available; see the Training a model with Megatron-LM on ROCm documentation.