[ROCm] Enable FP8 inference on gfx1201 AMD RDNA4 (Radeon AI PRO R9700) with aiter kernels#36659
vllmellm wants to merge 7 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request extends vLLM's support for ROCm gfx12x (RDNA4) GPUs by introducing specific detection for gfx12x architectures and integrating it into various components. Key changes include enabling AITER support for gfx12x, implementing a conditional `mha_v3` Triton kernel for `flash_attn_varlen_func` on gfx12x for performance, and extending FP8 quantization support to gfx12x for fused and batched Mixture of Experts (MoE) layers. Additionally, new Triton tuning configurations for fused MoE on `AMD_Radeon_AI_PRO_R9700` (a gfx12x device) have been added. A potential issue was identified where the condition `on_gfx1x() and on_gfx12x()` in `get_vit_attn_backend` will always be false, inadvertently disabling the Triton Flash Attention backend for both gfx11 and gfx12x devices in that specific function.
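The always-false condition flagged in the review arises because a single device reports at most one architecture family, so a conjunction of two family checks can never hold. A minimal sketch of the issue (helper names follow the review text; exact signatures in vLLM may differ):

```python
def on_gfx11x(arch: str) -> bool:
    # True only for RDNA3 (gfx11xx) targets.
    return arch.startswith("gfx11")

def on_gfx12x(arch: str) -> bool:
    # True only for RDNA4 (gfx12xx) targets.
    return arch.startswith("gfx12")

def triton_fa_enabled_buggy(arch: str) -> bool:
    # A device is never both gfx11 and gfx12,
    # so this conjunction is always False.
    return on_gfx11x(arch) and on_gfx12x(arch)

def triton_fa_enabled_fixed(arch: str) -> bool:
    # A disjunction enables the backend on either family.
    return on_gfx11x(arch) or on_gfx12x(arch)
```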
Cherry-picked and adapted from 4 open PRs:
- vllm-project#34740 (laudney): Replace `on_gfx9()`/`on_mi3xx()` FP8 gates with `supports_fp8()`, unblocking FP8 on RDNA4/gfx12
- vllm-project#34709 (laudney): Enable wvSplitK/wvSplitKQ skinny GEMM kernels for RDNA4 decode (~15% improvement), wave32 DPP reduction
- vllm-project#34741 (laudney): FP8 KV-cache for RDNA4 custom paged attention via software dequantization
- vllm-project#36659 (vllmellm): Tuned FP8 MoE Triton configs for AMD Radeon AI PRO R9700, AITER mha_v3 attention on gfx12x
If someone is interested in also validating MoE tuning for Int4, the configuration file is attached: `E=128,N=768,device_name=AMD_Radeon_AI_PRO_R9700,dtype=int4_w4a16.json`
Purpose
No tuned Triton FP8 MoE configuration existed for the AMD Radeon AI PRO R9700 (gfx1201, RDNA4). vLLM selects fused MoE tiling parameters by an `(E, N, device_name, dtype)` key at runtime; without an R9700 entry, it falls back to untuned defaults, leaving significant performance on the table. Related: #28649.
Changes:
1. FP8 MoE Path for gfx12x (`vllm/model_executor/layers/fused_moe/fused_moe.py`)

   `device_supports_fp8` in `TritonExperts._supports_quant_scheme()` was gated on `is_rocm_on_gfx9`. Extended to include `is_rocm_on_gfx12` (via the existing `on_gfx12x()` platform helper), enabling the FP8 expert linear kernel path for RDNA4. Without this, FP8 weight quantization is skipped and the MoE falls back to BF16/FP16 computation.

2. Tuned Triton FP8 MoE Config for R9700 (`vllm/model_executor/layers/fused_moe/configs/`)

   Added `E=64,N=768,device_name=AMD_Radeon_AI_PRO_R9700,dtype=fp8_w8a8,block_shape=[128,128].json`. vLLM selects fused MoE tiling parameters by an `(E, N, device_name, dtype)` key at runtime. No R9700 entry existed previously, causing fallback to untuned defaults. This config covers Qwen3-30B-A3B-FP8 (E=64 routed experts, moe_intermediate_size=768).

Test Plan
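A tuned config of this kind maps token batch sizes (M) to Triton tiling parameters, and at runtime the entry nearest the actual M is chosen. A hedged sketch of the shape and selection logic; the parameter values below are made up for illustration, not the actual R9700 numbers from this PR:

```python
# Illustrative excerpt of a tuned-config file's structure
# (keys are batch sizes M; values are Triton launch parameters).
example_config = {
    "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,
           "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 1,
           "num_warps": 4, "num_stages": 2},
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128,
           "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 8,
           "num_warps": 8, "num_stages": 2},
}

def pick_tiling(config: dict, M: int) -> dict:
    # Select the entry whose batch-size key is closest to M,
    # approximating the nearest-key selection used for these files.
    key = min(config, key=lambda k: abs(int(k) - M))
    return config[key]
```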
Test Result
Comparison: Default (no tuned config, fallback tiling) vs Tuned MoE (with this config).
Qwen3-30B-A3B-FP8 (MoE)
Mean TTFT (s) — lower is better
Mean TPOT (s) — lower is better
Total Token Throughput (tok/s) — higher is better
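For reference, the three reported metrics can be derived from per-request timestamps. A minimal sketch of the definitions (not the benchmark script itself; function names are illustrative):

```python
def ttft(first_token_time: float, request_start: float) -> float:
    # Time To First Token: latency until the first generated token arrives.
    return first_token_time - request_start

def tpot(request_end: float, first_token_time: float,
         num_output_tokens: int) -> float:
    # Time Per Output Token: mean inter-token latency after the first token.
    return (request_end - first_token_time) / max(num_output_tokens - 1, 1)

def total_token_throughput(total_tokens: int, wall_time: float) -> float:
    # Tokens (prompt + generated) processed per second over the whole run.
    return total_tokens / wall_time
```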