FlashInfer compatibility#

2026-02-04

4 min read time

Applies to Linux

FlashInfer is a library and kernel generator for Large Language Models (LLMs) that provides high-performance implementations of GPU (graphics processing unit) kernels. FlashInfer focuses on LLM serving and inference, and delivers high performance across diverse scenarios.

FlashInfer features highly efficient attention kernels, load-balanced scheduling, and memory-optimized techniques, and supports customized attention variants. It's compatible with torch.compile and offers high-performance LLM-specific operators that integrate easily through its PyTorch and C++ APIs.

Note

The ROCm port of FlashInfer is under active development, and some features are not yet available. For the latest feature compatibility matrix, refer to the README of the ROCm/flashinfer repository.

Support overview#

Compatibility matrix#

AMD validates and publishes FlashInfer images with ROCm backends on Docker Hub. The following Docker image tags and their associated software inventories represent the latest FlashInfer releases available from the official rocm/flashinfer repository on Docker Hub.

| Docker image | ROCm | FlashInfer | PyTorch | Ubuntu | Python | GPU |
|---|---|---|---|---|---|---|
| rocm/flashinfer | 7.1.1 | v0.2.5 | 2.8.0 | 24.04 | 3.12 | MI325X, MI300X |
| rocm/flashinfer | 6.4.1 | v0.2.5 | 2.7.1 | 24.04 | 3.12 | MI300X |

Use cases and recommendations#

FlashInfer on ROCm enables you to perform LLM inference for both prefill and decode: during prefill, your model efficiently processes input prompts to build KV caches and internal activations; during decode, it generates tokens sequentially based on prior outputs and context. Use the attention mode supported upstream (Multi-Head Attention, Grouped-Query Attention, or Multi-Query Attention) that matches your model configuration.
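
The following minimal sketch illustrates both phases using the upstream FlashInfer Python API (single_prefill_with_kv_cache and single_decode_with_kv_cache). Because the ROCm port is under active development, check the repository's feature matrix before relying on these entry points; the shapes below are illustrative and assume a Grouped-Query Attention configuration.

```python
# Illustrative sketch of single-request prefill and decode attention,
# assuming the upstream FlashInfer Python API is available in the ROCm port.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128  # GQA: 4 query heads per KV head
prompt_len = 512

# Prefill: all prompt tokens attend causally, producing the initial KV state.
q = torch.randn(prompt_len, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(prompt_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(prompt_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
o_prefill = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)

# Decode: a single new query token attends over the cached keys and values.
q_step = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
o_decode = flashinfer.single_decode_with_kv_cache(q_step, k, v)
```

Note that on ROCm, PyTorch exposes AMD GPUs through the "cuda" device string, so the example is unchanged relative to upstream.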

FlashInfer on ROCm also includes load balancing, sparse and dense attention optimizations, and single and batch decode alongside prefill, for high-performance execution on MI300X GPUs.
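
As a hedged sketch of the batch-decode path, the example below uses the upstream BatchDecodeWithPagedKVCacheWrapper over a paged KV cache. The page-table values (pages_per_req, page_size) are illustrative assumptions, and availability in the ROCm port should be verified against the feature matrix in the ROCm/flashinfer README.

```python
# Hedged sketch of batch decode over a paged KV cache, assuming the upstream
# BatchDecodeWithPagedKVCacheWrapper API is available in the ROCm port.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size, pages_per_req = 4, 8  # illustrative: each request owns 8 full pages

# Reusable workspace buffer shared across plan/run calls.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, kv_layout="NHD")

# Page-table metadata: CSR-style indptr over each request's list of page indices.
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_req
kv_indices = torch.arange(batch_size * pages_per_req, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

wrapper.plan(kv_indptr, kv_indices, kv_last_page_len,
             num_qo_heads, num_kv_heads, head_dim, page_size)

# One query token per request; NHD cache layout is [pages, 2, page_size, heads, dim].
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
kv_cache = torch.randn(batch_size * pages_per_req, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
out = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
```

Planning once and running repeatedly amortizes scheduling work across decode steps, which is the load-balanced scheduling the wrapper is designed around.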

For currently supported use cases and recommendations, refer to the AMD ROCm blog, where you can search for examples and best practices to optimize your workloads on AMD GPUs.

Previous versions#

See the FlashInfer version history (install/3rd-party/previous-versions/flashinfer-history) in the ROCm installation for Linux documentation to find documentation for previous releases of the ROCm/flashinfer Docker image.