[ROCm][Perf] Add MXFP4 linear method and enable shared expert fusion#37800
ChuanLi1101 wants to merge 2 commits into
Code Review
This pull request introduces performance optimizations for ROCm by implementing a dedicated Mxfp4LinearMethod for MXFP4 quantized linear layers and enabling shared expert fusion by default for MoE models. The new linear method leverages AITER's Triton FP4 GEMM on ROCm and the Marlin FP4 kernel on CUDA, replacing the previous fallback to unquantized methods. The changes appear correct and well-aligned with the goal of improving performance on ROCm. I have one suggestion to improve the clarity of a comment in the new Mxfp4LinearMethod to enhance maintainability.
# Transpose scale so that triton_fp4_gemm_dynamic_qaunt's
# internal .T produces the [N, K/32] layout the kernel expects.
The comment is confusing. It refers to an "internal .T" in triton_fp4_gemm_dynamic_qaunt. This function is defined in vllm/_aiter_ops.py and explicitly transposes weight_scale before passing it to the gemm_afp4wfp4 kernel. The current implementation correctly pre-transposes the scale to cancel out this operation, but the comment is misleading and could cause confusion during future maintenance. A clearer comment would improve maintainability and prevent potential bugs.
Suggested change:
- # Transpose scale so that triton_fp4_gemm_dynamic_qaunt's
- # internal .T produces the [N, K/32] layout the kernel expects.
+ # The `triton_fp4_gemm_dynamic_qaunt` function transposes `weight_scale`.
+ # We pre-transpose it here to cancel that out.
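To make the point of the suggested comment concrete, here is a minimal pure-Python sketch of why pre-transposing cancels the wrapper's own transpose. The helper `wrapper_scale_view` is a hypothetical stand-in for the transpose that the review says happens inside `triton_fp4_gemm_dynamic_qaunt`; it is not the real function, and the tiny shapes are illustrative only.

```python
def transpose(matrix):
    """Return the transpose of a 2-D list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def wrapper_scale_view(weight_scale):
    """Hypothetical stand-in: the wrapper transposes the scale
    before passing it on to the GEMM kernel."""
    return transpose(weight_scale)


N, K_GROUPS = 4, 2  # K_GROUPS plays the role of K/32

# Scale in the [N, K/32] layout the kernel is said to expect.
scale_nk = [[n * 10 + g for g in range(K_GROUPS)] for n in range(N)]

# What the linear method would store: the pre-transposed [K/32, N] scale.
stored = transpose(scale_nk)

# The wrapper's transpose undoes the pre-transpose.
seen_by_kernel = wrapper_scale_view(stored)
assert seen_by_kernel == scale_nk
```

Because the two transposes compose to the identity, the kernel sees the same `[N, K/32]` values it would have seen without either one; only the stored layout differs.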
Implement Mxfp4LinearMethod to replace the UnquantizedLinearMethod fallback for MXFP4-quantized linear layers, addressing the TODO in the existing code.

On ROCm, this uses AITER Triton FP4 GEMM (gemm_afp4wfp4) with dynamic activation quantization (matching the ATOM kernel path). On CUDA, it uses the Marlin FP4 kernel.

Also enable VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS by default to match ATOM optimized defaults for MoE model performance.

Made-with: Cursor
Signed-off-by: Li <chuali@amd.com>
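The backend choice described in the commit message can be sketched as a small dispatch function. This is an illustration only: `Platform`, `select_linear_backend`, and the returned string labels are hypothetical names, not vLLM APIs, and the fallback branch for other platforms is an assumption based on the previous behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """Hypothetical description of the runtime platform."""
    is_rocm: bool = False
    is_cuda: bool = False
    aiter_enabled: bool = False


def select_linear_backend(platform: Platform) -> str:
    """Pick a linear-layer backend label for MXFP4 weights."""
    if platform.is_rocm and platform.aiter_enabled:
        # ROCm with AITER: Triton FP4 GEMM with dynamic activation quantization.
        return "aiter_triton_fp4_gemm"
    if platform.is_cuda:
        # CUDA: Marlin FP4 kernel.
        return "marlin_fp4"
    # Assumed fallback: keep the unquantized path everywhere else.
    return "unquantized"


rocm_choice = select_linear_backend(Platform(is_rocm=True, aiter_enabled=True))
cuda_choice = select_linear_backend(Platform(is_cuda=True))
other_choice = select_linear_backend(Platform(is_rocm=True, aiter_enabled=False))
```

Keeping the decision in one function makes it easy to see which platforms change behavior and which still take the old path.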
943f2f6 to 7018937
cc @zejunchen-zejun Could you take a look at this PR? It implements MXFP4 linear method using AITER Triton FP4 GEMM on ROCm (matching the ATOM kernel path) and enables shared expert fusion by default.
cc @wuhuikx Could you also help review this PR? It leverages the same AITER FP4 GEMM kernel path as ATOM for MXFP4 linear layers and enables shared expert fusion by default on ROCm.
Hi reviewers, could someone please help add the
Hi @ChuanLi1101, will this help for the gpt-oss 20B model on an RTX 6000 Pro?
Hi @ChuanLi1101, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
this PR does not make sense. You should be updating the The gpt-oss model does not quantize the linear layers, so this PR will break.
robertgshaw2-redhat left a comment
see comment above
@ChuanLi1101 echoing with @robertgshaw2-redhat ,
This pull request has merge conflicts that must be resolved before it can be |
Summary
- Implement `Mxfp4LinearMethod` to replace the `UnquantizedLinearMethod` fallback for MXFP4-quantized linear layers (addressing the existing TODO in `mxfp4.py`). On ROCm, this uses AITER's Triton FP4 GEMM (`gemm_afp4wfp4`) with dynamic activation quantization, the same kernel path used by ROCm/ATOM. On CUDA, it uses the Marlin FP4 kernel.
- Enable `VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS` by default to match ATOM's optimized configuration for MoE models (DeepSeek, GPT-OSS). This fuses shared expert computation with routed experts to reduce kernel launch overhead.
Motivation
Comparing vLLM's ROCm/AITER integration with ROCm/ATOM revealed several performance gaps:
- vLLM used `UnquantizedLinearMethod()` for all linear layers under `Mxfp4Config`, meaning attention projections and other dense layers ran in full BF16 precision even when the model checkpoint contained MXFP4-quantized weights. ATOM uses `aiter.gemm_a4w4` / `gemm_afp4wfp4` for these layers.
- vLLM disabled shared expert fusion by default (`VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=False`) while ATOM enables it, reducing kernel launch overhead in MoE inference.
Changes
vllm/model_executor/layers/quantization/mxfp4.py
- Add `Mxfp4LinearMethod(LinearMethodBase)` class with:
  - `create_weights()`: Allocates packed MXFP4 weights (uint8, 2 values/byte) and E8M0 group scales
  - `process_weights_after_loading()`: Prepares weights for AITER Triton kernel (ROCm) or Marlin (CUDA)
  - `apply()`: Routes to `rocm_aiter_ops.triton_fp4_gemm_dynamic_qaunt` on ROCm or `apply_fp4_marlin_linear` on CUDA
- Update `Mxfp4Config.get_quant_method()` to return `Mxfp4LinearMethod()` on ROCm (with AITER) and CUDA, instead of always falling back to `UnquantizedLinearMethod()`
vllm/envs.py
- Change `VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS` default from `False` to `True`
Test plan
amd/DeepSeek-R1-0528-MXFP4)
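
For readers unfamiliar with the storage layout mentioned under Changes (packed uint8 with 2 values per byte, plus E8M0 group scales), the following is a rough pure-Python sketch of MXFP4 dequantization. It assumes the OCP Microscaling conventions: E2M1 element codes, a group of 32 elements per scale (consistent with the `K/32` in the review comment), and an E8M0 scale of `2 ** (e - 127)`. The low-nibble-first packing order and all function names are assumptions for illustration; this is not the code from the PR and it ignores the E8M0 NaN encoding.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 32   # elements that share one E8M0 scale
E8M0_BIAS = 127   # exponent bias of the 8-bit power-of-two scale


def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code into a float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_e8m0(exponent: int) -> float:
    """Decode an E8M0 scale byte into a power of two."""
    return 2.0 ** (exponent - E8M0_BIAS)


def pack_fp4(codes):
    """Pack pairs of 4-bit codes into bytes, first element in the
    low nibble (assumed order)."""
    assert len(codes) % 2 == 0
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def dequantize_row(packed: bytes, scales: bytes) -> list:
    """Expand one packed row into floats, applying one scale per group."""
    values = []
    for i, byte in enumerate(packed):
        # Both nibbles of a byte fall in the same group because GROUP_SIZE is even.
        scale = decode_e8m0(scales[(2 * i) // GROUP_SIZE])
        values.append(decode_fp4(byte & 0xF) * scale)
        values.append(decode_fp4(byte >> 4) * scale)
    return values


# Two groups of 32 elements: (+1.0, -1.0) pairs, then (6.0, 0.5) pairs.
codes = [0b0010, 0b1010] * 16 + [0b0111, 0b0001] * 16
packed = pack_fp4(codes)
scales = bytes([127, 128])  # scale 2**0 for group 0, 2**1 for group 1
row = dequantize_row(packed, scales)
```

A row of 64 elements therefore occupies 32 bytes of weights plus 2 bytes of scales, which is the memory saving that motivates running dense layers through an FP4 GEMM rather than BF16.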