[ROCm][Perf] Add MXFP4 linear method and enable shared expert fusion#37800

Draft
ChuanLi1101 wants to merge 2 commits into vllm-project:main from ChuanLi1101:feature/rocm-mxfp4-linear-perf
Conversation

@ChuanLi1101
Collaborator

@ChuanLi1101 ChuanLi1101 commented Mar 22, 2026

Summary

  • Implement Mxfp4LinearMethod to replace the UnquantizedLinearMethod fallback for MXFP4-quantized linear layers (addressing the existing TODO in mxfp4.py). On ROCm, this uses AITER's Triton FP4 GEMM (gemm_afp4wfp4) with dynamic activation quantization, matching the same kernel path used by ROCm/ATOM. On CUDA, it uses the Marlin FP4 kernel.
  • Enable VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS by default to match ATOM's optimized configuration for MoE models (DeepSeek, GPT-OSS). This fuses shared expert computation with routed experts for reduced kernel launch overhead.

Motivation

Comparing vLLM's ROCm/AITER integration with ROCm/ATOM revealed several performance gaps:

  1. MXFP4 Linear Layers: vLLM fell back to UnquantizedLinearMethod() for all linear layers under Mxfp4Config, meaning attention projections and other dense layers ran in full BF16 precision even when the model checkpoint contained MXFP4-quantized weights. ATOM uses aiter.gemm_a4w4 / gemm_afp4wfp4 for these layers.
  2. Shared Expert Fusion: vLLM had this disabled by default (VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=False) while ATOM enables it, reducing kernel launch overhead in MoE inference.

Changes

vllm/model_executor/layers/quantization/mxfp4.py

  • Added Mxfp4LinearMethod(LinearMethodBase) class with:
    • create_weights(): Allocates packed MXFP4 weights (uint8, 2 values/byte) and E8M0 group scales
    • process_weights_after_loading(): Prepares weights for AITER Triton kernel (ROCm) or Marlin (CUDA)
    • apply(): Routes to rocm_aiter_ops.triton_fp4_gemm_dynamic_qaunt on ROCm or apply_fp4_marlin_linear on CUDA
  • Updated Mxfp4Config.get_quant_method() to return Mxfp4LinearMethod() on ROCm (with AITER) and CUDA, instead of always falling back to UnquantizedLinearMethod()
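For readers unfamiliar with the storage layout that create_weights() allocates, here is a minimal NumPy sketch of how packed MXFP4 weights and E8M0 group scales decode to floats. It is a reference illustration only, not the AITER or Marlin kernel path. The function name `dequant_mxfp4` is hypothetical, and the nibble order (even-indexed element in the low nibble) is an assumption that may not match the layout this PR uses.

```python
import numpy as np

# E2M1 magnitudes for the 3 low bits of a 4-bit code; bit 3 is the sign.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
GROUP = 32  # MX block size: one shared scale per 32 elements


def dequant_mxfp4(packed: np.ndarray, scales_e8m0: np.ndarray) -> np.ndarray:
    """Hypothetical reference decoder.

    packed:      uint8 [N, K/2], two FP4 codes per byte
    scales_e8m0: uint8 [N, K/32], one biased exponent per group
    returns:     float32 [N, K]
    """
    lo = packed & 0x0F  # assumed: even-indexed element lives in the low nibble
    hi = packed >> 4    # assumed: odd-indexed element lives in the high nibble
    codes = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)  # [N, K]
    sign = np.where((codes & 0x8) != 0, -1.0, 1.0).astype(np.float32)
    vals = sign * FP4_E2M1[codes & 0x7]
    # E8M0 is a bare biased exponent: scale = 2 ** (e - 127).
    scale = np.exp2(scales_e8m0.astype(np.float32) - 127.0)
    return vals * np.repeat(scale, GROUP, axis=1)


# One row of K=32 elements: every byte holds codes 0x2 (+1.0) and 0xA (-1.0).
packed = np.full((1, 16), 0xA2, dtype=np.uint8)
scales = np.full((1, 1), 128, dtype=np.uint8)  # 2 ** (128 - 127) == 2
out = dequant_mxfp4(packed, scales)
```

A real kernel consumes the packed bytes and scales directly; decoding to float32 like this is only useful for checking a layout against a small known input.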

vllm/envs.py

  • Changed VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS default from False to True
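The practical effect of a default flip is that only users who leave the variable unset see different behaviour; an explicit setting still wins. The sketch below illustrates that with a hypothetical `env_flag` helper. It assumes a "1"/"true" style boolean parse and is not a copy of the parsing code in vllm/envs.py.

```python
import os

FLAG = "VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS"


def env_flag(name: str, default: bool) -> bool:
    """Hypothetical helper: unset -> default; otherwise '1'/'true' means on."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true")


os.environ.pop(FLAG, None)
unset_old = env_flag(FLAG, default=False)  # unset, old default: fusion off
unset_new = env_flag(FLAG, default=True)   # unset, new default: fusion on

os.environ[FLAG] = "0"                     # explicit opt-out
opted_out = env_flag(FLAG, default=True)   # still off under the new default
```

Under these assumptions, deployments that need the previous behaviour would set the variable to 0 explicitly before starting vLLM.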

Test plan

  • Verify MXFP4-quantized model loading with linear layers on ROCm (e.g., amd/DeepSeek-R1-0528-MXFP4)
  • Verify MXFP4-quantized model loading on CUDA with Marlin backend
  • Benchmark MoE models (DeepSeek, GPT-OSS) with shared expert fusion enabled vs disabled
  • Verify existing MXFP4 MoE path is unaffected
  • Run existing quantization unit tests

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces performance optimizations for ROCm by implementing a dedicated Mxfp4LinearMethod for MXFP4 quantized linear layers and enabling shared expert fusion by default for MoE models. The new linear method leverages AITER's Triton FP4 GEMM on ROCm and the Marlin FP4 kernel on CUDA, replacing the previous fallback to unquantized methods. The changes appear correct and well-aligned with the goal of improving performance on ROCm. I have one suggestion to improve the clarity of a comment in the new Mxfp4LinearMethod to enhance maintainability.

Comment on lines +183 to +184
# Transpose scale so that triton_fp4_gemm_dynamic_qaunt's
# internal .T produces the [N, K/32] layout the kernel expects.
Contributor

Priority: high

The comment is confusing. It refers to an "internal .T" in triton_fp4_gemm_dynamic_qaunt. This function is defined in vllm/_aiter_ops.py and explicitly transposes weight_scale before passing it to the gemm_afp4wfp4 kernel. The current implementation correctly pre-transposes the scale to cancel out this operation, but the comment is misleading and could cause confusion during future maintenance. A clearer comment would improve maintainability and prevent potential bugs.

Suggested change
- # Transpose scale so that triton_fp4_gemm_dynamic_qaunt's
- # internal .T produces the [N, K/32] layout the kernel expects.
+ # The `triton_fp4_gemm_dynamic_qaunt` function transposes `weight_scale`.
+ # We pre-transpose it here to cancel that out.
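The cancellation described in this review comment can be checked with a small shape experiment. This is a sketch in NumPy: `wrapper_like` is a hypothetical stand-in for the transpose that the comment attributes to the wrapper in vllm/_aiter_ops.py, and the sizes are arbitrary.

```python
import numpy as np

N, K, GROUP = 8, 64, 32


def wrapper_like(weight_scale: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: transposes the scale before the kernel call,
    as the review comment describes for the real wrapper."""
    return weight_scale.T


# Scales as loaded from the checkpoint: one entry per 32-element group.
scale_nk = np.arange(N * (K // GROUP), dtype=np.uint8).reshape(N, K // GROUP)

stored = scale_nk.T                    # pre-transpose once after loading: [K/32, N]
seen_by_kernel = wrapper_like(stored)  # the wrapper's .T restores [N, K/32]
```

The two transposes cancel, so the kernel sees the original [N, K/32] values. This sketch says nothing about memory strides or contiguity, which a real kernel may also care about.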

Implement Mxfp4LinearMethod to replace the UnquantizedLinearMethod fallback for MXFP4-quantized linear layers, addressing the TODO in the existing code. On ROCm, this uses AITER Triton FP4 GEMM (gemm_afp4wfp4) with dynamic activation quantization (matching the ATOM kernel path). On CUDA, it uses the Marlin FP4 kernel.

Also enable VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS by default to match ATOM optimized defaults for MoE model performance.

Made-with: Cursor
Signed-off-by: Li <chuali@amd.com>
@ChuanLi1101 ChuanLi1101 force-pushed the feature/rocm-mxfp4-linear-perf branch from 943f2f6 to 7018937 on March 22, 2026 10:52
@ChuanLi1101
Collaborator Author

cc @zejunchen-zejun Could you take a look at this PR? It implements MXFP4 linear method using AITER Triton FP4 GEMM on ROCm (matching the ATOM kernel path) and enables shared expert fusion by default.

@ChuanLi1101
Collaborator Author

cc @wuhuikx Could you also help review this PR? It leverages the same AITER FP4 GEMM kernel path as ATOM for MXFP4 linear layers and enables shared expert fusion by default on ROCm.

@ChuanLi1101
Collaborator Author

Hi reviewers, could someone please help add the ready label to this PR? The pre-run-check CI gate requires it (author has < 4 merged PRs). Thanks! cc @mgoin @robertgshaw2-redhat @tlrmchlsmth @zejunchen-zejun @wuhuikx

@geraldstanje

geraldstanje commented Mar 24, 2026

Hi @ChuanLi1101, will this help for the gpt-oss 20B model on an RTX 6000 Pro?

@mergify
Contributor

mergify Bot commented Apr 1, 2026

Hi @ChuanLi1101, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@robertgshaw2-redhat robertgshaw2-redhat added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 5, 2026
@robertgshaw2-redhat
Collaborator

robertgshaw2-redhat commented Apr 5, 2026

This PR does not make sense: quantization=mxfp4 is the quantization format for gpt-oss.

You should be updating the quark integration if you want to run models like amd/DeepSeek-R1-0528-MXFP4.

The gpt-oss model does not quantize the linear layers, so this PR will break.

Collaborator

@robertgshaw2-redhat robertgshaw2-redhat left a comment

see comment above

@BowenBao
Contributor

BowenBao commented Apr 7, 2026

@ChuanLi1101 echoing @robertgshaw2-redhat: quantization=quark supports MXFP4 w4a4 linear layers. You can prepare the model with quark so that its linear layers are quantized to MXFP4.

@mergify
Contributor

mergify Bot commented May 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ChuanLi1101.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 23, 2026
@tjtanaa tjtanaa marked this pull request as draft May 29, 2026 17:02

Labels

needs-rebase
ready (ONLY add when PR is ready to merge/full CI is needed)
rocm (Related to AMD ROCm)

Projects

Status: Todo

4 participants