FlashInfer compatibility#
2026-02-04
FlashInfer is a library and kernel generator for Large Language Models (LLMs) that provides high-performance implementations of graphics processing unit (GPU) kernels. FlashInfer focuses on LLM serving and inference, delivering high performance across diverse scenarios.
FlashInfer features highly efficient attention kernels, load-balanced scheduling, and memory-optimized
techniques, and it supports customized attention variants. It's compatible with torch.compile and
offers high-performance LLM-specific operators that integrate easily through its PyTorch and C++ APIs.
Note
The ROCm port of FlashInfer is under active development, and some features are not yet available.
For the latest feature compatibility matrix, refer to the README of the
ROCm/flashinfer repository.
Support overview#
The ROCm-supported version of FlashInfer is maintained in the official ROCm/flashinfer repository, which differs from the flashinfer-ai/flashinfer upstream repository.
To get started and install FlashInfer on ROCm, use the prebuilt Docker images, which include ROCm, FlashInfer, and all required dependencies.
See the ROCm FlashInfer installation guide for installation and setup instructions.
You can also consult the upstream Installation guide for additional context.
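Once a container started from the prebuilt image is running, a quick sanity check confirms that PyTorch can see the AMD GPU and that the FlashInfer Python package imports cleanly. The following is a minimal sketch, assuming the container was started with GPU access and that the package is importable as `flashinfer`:

```python
# Minimal sanity check inside a rocm/flashinfer container (illustrative sketch).
import torch
import flashinfer  # the Python package shipped in the prebuilt images

# On ROCm, PyTorch exposes AMD GPUs through the torch.cuda namespace (HIP backend).
assert torch.cuda.is_available(), "No AMD GPU visible to PyTorch"
print("GPU:", torch.cuda.get_device_name(0))
print("FlashInfer module:", flashinfer.__file__)
```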
Compatibility matrix#
AMD validates and publishes FlashInfer images with ROCm backends on Docker Hub. The following Docker image tag and associated inventories represent the latest available FlashInfer version from the official Docker Hub. Click to view the image on Docker Hub.
| Docker image | ROCm | FlashInfer | PyTorch | Ubuntu | Python | GPU |
|---|---|---|---|---|---|---|
| rocm/flashinfer | | | | 24.04 | | MI325X, MI300X |
| rocm/flashinfer | | | | 24.04 | | MI300X |
Use cases and recommendations#
FlashInfer on ROCm enables LLM inference for both the prefill and decode phases. During prefill, the model efficiently processes the input prompt to build the KV cache and internal activations; during decode, it generates tokens sequentially based on prior outputs and context. Use the attention mode supported upstream (Multi-Head Attention, Grouped-Query Attention, or Multi-Query Attention) that matches your model configuration.
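To illustrate, the sketch below runs single-request prefill and decode attention with Grouped-Query Attention shapes (more query heads than KV heads). It assumes the upstream operators `flashinfer.single_prefill_with_kv_cache` and `flashinfer.single_decode_with_kv_cache` are available in the ROCm port; check the feature matrix in the ROCm/flashinfer README before relying on them.

```python
# Illustrative sketch: single-request prefill and decode attention with GQA shapes.
# Assumes the upstream single_prefill_with_kv_cache / single_decode_with_kv_cache
# operators are available in the ROCm port; verify in the ROCm/flashinfer README.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128   # GQA: 4 query heads per KV head
prompt_len = 2048
device, dtype = "cuda", torch.float16                # AMD GPUs use the torch.cuda namespace on ROCm

# Prefill: all prompt tokens attend over the prompt with a causal mask.
q = torch.randn(prompt_len, num_qo_heads, head_dim, dtype=dtype, device=device)
k = torch.randn(prompt_len, num_kv_heads, head_dim, dtype=dtype, device=device)
v = torch.randn(prompt_len, num_kv_heads, head_dim, dtype=dtype, device=device)
prefill_out = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)
print(prefill_out.shape)  # [prompt_len, num_qo_heads, head_dim]

# Decode: a single new query token attends over the accumulated KV cache.
q_decode = torch.randn(num_qo_heads, head_dim, dtype=dtype, device=device)
decode_out = flashinfer.single_decode_with_kv_cache(q_decode, k, v)
print(decode_out.shape)   # [num_qo_heads, head_dim]
```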
FlashInfer on ROCm also includes capabilities such as load balancing, sparse and dense attention optimizations, and single and batch decode, alongside prefill for high‑performance execution on MI300X GPUs.
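For batch decode over a paged KV cache, upstream FlashInfer provides `BatchDecodeWithPagedKVCacheWrapper`, which plans a load-balanced schedule once per batch and then runs the decode kernel. The sketch below follows the newer upstream plan/run API (older releases use begin_forward/forward) and assumes the wrapper is supported by the ROCm port; treat it as an outline rather than a verified recipe.

```python
# Illustrative sketch: batch decode over a paged KV cache.
# Assumes the upstream BatchDecodeWithPagedKVCacheWrapper (plan/run API) is
# available in the ROCm port; older FlashInfer releases use begin_forward/forward.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
page_size, max_num_pages, batch_size = 16, 64, 4
device, dtype = "cuda", torch.float16

# Workspace buffer used internally by the scheduler.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device=device)
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Toy page table: each of the 4 requests owns 16 full pages.
kv_page_indptr = torch.tensor([0, 16, 32, 48, 64], dtype=torch.int32, device=device)
kv_page_indices = torch.arange(max_num_pages, dtype=torch.int32, device=device)
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device=device)

# Paged KV cache laid out as [num_pages, 2 (K/V), page_size, num_kv_heads, head_dim].
kv_cache = torch.randn(max_num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=dtype, device=device)

# Plan once per batch (load-balanced scheduling), then run the decode kernel.
wrapper.plan(kv_page_indptr, kv_page_indices, kv_last_page_len,
             num_qo_heads, num_kv_heads, head_dim, page_size,
             data_type=dtype)
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=dtype, device=device)
out = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
```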
For currently supported use cases and recommendations, refer to the AMD ROCm blog, where you can search for examples and best practices to optimize your workloads on AMD GPUs.
Previous versions#
See rocm-install-on-linux:install/3rd-party/previous-versions/flashinfer-history to find documentation for previous releases
of the ROCm/flashinfer Docker image.