[ROCm][Perf] Add MXFP4 linear method and enable shared expert fusion by ChuanLi1101 · Pull Request #37800 · vllm-project/vllm

ChuanLi1101 · 2026-03-22T10:46:54Z

Summary

Implement Mxfp4LinearMethod to replace the UnquantizedLinearMethod fallback for MXFP4-quantized linear layers (addressing the existing TODO in mxfp4.py). On ROCm, this uses AITER's Triton FP4 GEMM (gemm_afp4wfp4) with dynamic activation quantization, matching the same kernel path used by ROCm/ATOM. On CUDA, it uses the Marlin FP4 kernel.
Enable VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS by default to match ATOM's optimized configuration for MoE models (DeepSeek, GPT-OSS). This fuses shared expert computation with routed experts for reduced kernel launch overhead.
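
For readers unfamiliar with "dynamic activation quantization", the sketch below shows the general MX-style idea in plain Python: each group of 32 values gets one shared power-of-two scale computed at runtime, and each value is rounded to the nearest FP4 (E2M1) magnitude. This is an illustration only, not the AITER kernel; the helper names are hypothetical, and the real kernel's rounding and tie-breaking rules may differ.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # largest E2M1 exponent: 6.0 == 1.5 * 2**2


def quantize_group(values):
    """Hypothetical helper: return (shared exponent, E2M1 values) for one group."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # The shared scale is a pure power of two, since E8M0 stores only an exponent.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for v in values:
        # Saturate at the largest E2M1 magnitude, then round to the nearest entry.
        mag = min(abs(v) / scale, E2M1_MAGNITUDES[-1])
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - mag))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_group(shared_exp, quantized):
    """Multiply the E2M1 values back by the shared power-of-two scale."""
    return [q * 2.0 ** shared_exp for q in quantized]


group = [12.0, -6.0, 3.0, 0.0] + [2.0] * 28  # one group of 32 activations
shared_exp, q = quantize_group(group)
restored = dequantize_group(shared_exp, q)
```

Because the scale is recomputed from the activations on every call, no calibration data is needed, at the cost of an extra reduction per group.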

Motivation

Comparing vLLM's ROCm/AITER integration with ROCm/ATOM revealed several performance gaps:

MXFP4 Linear Layers: vLLM fell back to UnquantizedLinearMethod() for all linear layers under Mxfp4Config, meaning attention projections and other dense layers ran in full BF16 precision even when the model checkpoint contained MXFP4-quantized weights. ATOM uses aiter.gemm_a4w4 / gemm_afp4wfp4 for these layers.
Shared Expert Fusion: vLLM had this disabled by default (VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=False) while ATOM enables it, reducing kernel launch overhead in MoE inference.

Changes

`vllm/model_executor/layers/quantization/mxfp4.py`

Added Mxfp4LinearMethod(LinearMethodBase) class with:
- create_weights(): Allocates packed MXFP4 weights (uint8, 2 values/byte) and E8M0 group scales
- process_weights_after_loading(): Prepares weights for AITER Triton kernel (ROCm) or Marlin (CUDA)
- apply(): Routes to rocm_aiter_ops.triton_fp4_gemm_dynamic_qaunt on ROCm or apply_fp4_marlin_linear on CUDA
Updated Mxfp4Config.get_quant_method() to return Mxfp4LinearMethod() on ROCm (with AITER) and CUDA, instead of always falling back to UnquantizedLinearMethod()
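
To make the storage format in create_weights() concrete, here is a minimal pure-Python sketch of how packed MXFP4 weights (two FP4 values per uint8) and E8M0 group scales can be decoded back to real numbers. The function names are hypothetical, and the nibble order (low nibble first) is an assumption; the PR's actual tensor layout is not shown on this page.

```python
# E2M1 code -> magnitude; bit 3 is the sign, bits 0-2 index this table.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code into a float."""
    mag = E2M1_MAGNITUDES[code & 0x7]
    return -mag if code & 0x8 else mag


def unpack_fp4(packed):
    """Unpack bytes into FP4 values, two per byte (low nibble first: an assumption)."""
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0x0F))
        out.append(decode_e2m1(byte >> 4))
    return out


def decode_e8m0(e):
    """E8M0 stores only a biased exponent: value = 2 ** (e - 127)."""
    return 2.0 ** (e - 127)


def dequantize_row(packed, scales, group_size=32):
    """Apply one E8M0 scale to each consecutive group of `group_size` values."""
    values = unpack_fp4(packed)
    return [v * decode_e8m0(scales[i // group_size]) for i, v in enumerate(values)]


# 16 bytes hold 32 FP4 values, which share a single scale (128 -> 2.0).
row = dequantize_row(bytes([0x21, 0xF0] + [0] * 14), scales=[128])
```

A row of K weights therefore needs K/2 bytes of packed data plus K/32 scale bytes, which is where the [N, K/32] scale shape mentioned in the review comes from.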

`vllm/envs.py`

Changed VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS default from False to True
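
The effect of flipping a default can be sketched as follows. The helper `read_bool_flag` is hypothetical and only approximates how a boolean environment flag is typically parsed; the exact parsing in vllm/envs.py is not shown in this PR, so treat the accepted values as an assumption.

```python
import os

FLAG = "VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS"


def read_bool_flag(name, default):
    """Hypothetical helper: "1" or "true" (any case) enables; unset uses the default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true")


# With the variable unset, the result follows the default.
os.environ.pop(FLAG, None)
old_behavior = read_bool_flag(FLAG, default=False)  # before this PR
new_behavior = read_bool_flag(FLAG, default=True)   # after this PR

# Users who want the previous behavior would have to opt out explicitly.
os.environ[FLAG] = "0"
opted_out = read_bool_flag(FLAG, default=True)
```

The practical consequence is that fusion becomes opt-out rather than opt-in for every ROCm user who has not set the variable.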

Test plan

Verify MXFP4-quantized model loading with linear layers on ROCm (e.g., amd/DeepSeek-R1-0528-MXFP4)
Verify MXFP4-quantized model loading on CUDA with Marlin backend
Benchmark MoE models (DeepSeek, GPT-OSS) with shared expert fusion enabled vs disabled
Verify existing MXFP4 MoE path is unaffected
Run existing quantization unit tests

gemini-code-assist

Code Review

This pull request introduces performance optimizations for ROCm by implementing a dedicated Mxfp4LinearMethod for MXFP4 quantized linear layers and enabling shared expert fusion by default for MoE models. The new linear method leverages AITER's Triton FP4 GEMM on ROCm and the Marlin FP4 kernel on CUDA, replacing the previous fallback to unquantized methods. The changes appear correct and well-aligned with the goal of improving performance on ROCm. I have one suggestion to improve the clarity of a comment in the new Mxfp4LinearMethod to enhance maintainability.

gemini-code-assist · 2026-03-22T10:49:31Z

+            # Transpose scale so that triton_fp4_gemm_dynamic_qaunt's
+            # internal .T produces the [N, K/32] layout the kernel expects.


The comment is confusing. It refers to an "internal .T" in triton_fp4_gemm_dynamic_qaunt. This function is defined in vllm/_aiter_ops.py and explicitly transposes weight_scale before passing it to the gemm_afp4wfp4 kernel. The current implementation correctly pre-transposes the scale to cancel out this operation, but the comment is misleading and could cause confusion during future maintenance. A clearer comment would improve maintainability and prevent potential bugs.

Suggested change

-            # Transpose scale so that triton_fp4_gemm_dynamic_qaunt's
-            # internal .T produces the [N, K/32] layout the kernel expects.
+            # The `triton_fp4_gemm_dynamic_qaunt` function transposes `weight_scale`.
+            # We pre-transpose it here to cancel that out.
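
The cancellation described in this review comment can be checked with a small pure-Python sketch. `wrapper_scale_layout` is a hypothetical stand-in for the transpose that the review says happens inside the wrapper; it is not the real function, and the sizes are illustrative.

```python
def transpose(matrix):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def shape(matrix):
    return (len(matrix), len(matrix[0]))


def wrapper_scale_layout(weight_scale):
    """Hypothetical stand-in: the wrapper transposes the scale before the kernel call."""
    return transpose(weight_scale)


N, K = 4, 64
# Layout the kernel expects: one scale per 32-element group, shape [N, K/32].
scale_nk = [[n * 10 + g for g in range(K // 32)] for n in range(N)]

# Pre-transpose once after loading, so the stored layout is [K/32, N].
stored = transpose(scale_nk)

# The wrapper's own transpose then restores [N, K/32].
seen_by_kernel = wrapper_scale_layout(stored)
```

Doing the transpose once at load time keeps the per-call path unchanged, but it does mean the stored tensor's shape differs from what the kernel consumes, which is why the comment wording matters.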

Implement Mxfp4LinearMethod to replace the UnquantizedLinearMethod fallback for MXFP4-quantized linear layers, addressing the TODO in the existing code. On ROCm, this uses AITER Triton FP4 GEMM (gemm_afp4wfp4) with dynamic activation quantization (matching the ATOM kernel path). On CUDA, it uses the Marlin FP4 kernel. Also enable VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS by default to match ATOM optimized defaults for MoE model performance.

Made-with: Cursor
Signed-off-by: Li <chuali@amd.com>

ChuanLi1101 · 2026-03-22T10:58:03Z

cc @zejunchen-zejun Could you take a look at this PR? It implements MXFP4 linear method using AITER Triton FP4 GEMM on ROCm (matching the ATOM kernel path) and enables shared expert fusion by default.

ChuanLi1101 · 2026-03-22T11:02:33Z

cc @wuhuikx Could you also help review this PR? It leverages the same AITER FP4 GEMM kernel path as ATOM for MXFP4 linear layers and enables shared expert fusion by default on ROCm.

ChuanLi1101 · 2026-03-22T11:05:35Z

Hi reviewers, could someone please help add the ready label to this PR? The pre-run-check CI gate requires it (author has < 4 merged PRs). Thanks! cc @mgoin @robertgshaw2-redhat @tlrmchlsmth @zejunchen-zejun @wuhuikx

geraldstanje · 2026-03-24T01:49:29Z

Hi @ChuanLi1101, will this help with the gpt-oss 20B model on an RTX 6000 Pro?

mergify · 2026-04-01T07:16:40Z

Hi @ChuanLi1101, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

robertgshaw2-redhat · 2026-04-05T14:05:02Z

This PR does not make sense: quantization=mxfp4 is the quantization format for gpt-oss.

You should update the Quark integration if you want to run models like amd/DeepSeek-R1-0528-MXFP4.

The gpt-oss model does not quantize the linear layers, so this PR will break it.

robertgshaw2-redhat

see comment above

BowenBao · 2026-04-07T23:50:44Z

@ChuanLi1101 echoing @robertgshaw2-redhat: quantization=quark supports MXFP4 W4A4 linear layers. You can prepare the model with Quark so that its linear layers are quantized to MXFP4.

mergify · 2026-05-23T07:30:38Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ChuanLi1101.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

ChuanLi1101 requested review from mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth and yewentao256 as code owners March 22, 2026 10:46

mergify Bot added the rocm Related to AMD ROCm label Mar 22, 2026

github-project-automation Bot moved this to Todo in AMD Mar 22, 2026

github-project-automation Bot added this to AMD Mar 22, 2026

gemini-code-assist Bot reviewed Mar 22, 2026

ChuanLi1101 force-pushed the feature/rocm-mxfp4-linear-perf branch from 943f2f6 to 7018937 Compare March 22, 2026 10:52

Merge branch 'main' into feature/rocm-mxfp4-linear-perf

581f885

ChuanLi1101 mentioned this pull request Apr 3, 2026

Request for triage permission — active ROCm contributor #38954

Closed

robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 5, 2026

robertgshaw2-redhat requested changes Apr 5, 2026

This was referenced May 20, 2026

[Performance]: Triton fusion for Qwen2/3-MoE shared-expert gate (Qwen2MoeMLP/Qwen3MoeMLP) #43187

Open

[Kernel] Fuse Qwen2/3-MoE shared-expert sigmoid gate into a Triton kernel #43190

Open

peymanr mentioned this pull request May 20, 2026

[RFC] [Kernel] Fuse Qwen2/3-MoE shared-expert sigmoid gate into a Triton kernel peymanr/vllm#25

Open

mergify Bot added the needs-rebase label May 23, 2026

tjtanaa marked this pull request as draft May 29, 2026 17:02

ChuanLi1101 wants to merge 2 commits into vllm-project:main from ChuanLi1101:feature/rocm-mxfp4-linear-perf

4 participants
