[MoE Refactor] Convert mxfp4 moe quant method into oracle #34983
zyongye wants to merge 46 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a significant and well-designed refactoring by creating an "oracle" for MXFP4 MoE backend selection and configuration. This greatly improves the structure by centralizing backend-specific logic, which was previously scattered across multiple files. The changes to the testing framework to use configuration files are also a welcome improvement for maintainability and scalability.
However, I've found a few critical bugs that need to be addressed:
- An incorrect condition in the backend selection logic.
- Typos in weight loading that will cause runtime errors.
- A duplicated enum value in a conditional check.
Additionally, there's a recurring use of hardcoded magic numbers specific to one model, which should be made configurable to improve maintainability and support for other models.
After these issues are fixed, this PR will be a great step forward for the codebase.
```python
# n = hidden_size
# k = intermediate_size_per_partition_after_pad
intermediate_size = round_up(intermediate_size, 128)
if backend == current_platform.is_xpu():
```
The condition `if backend == current_platform.is_xpu():` is incorrect. `backend` is an enum member, while `current_platform.is_xpu()` returns a boolean, so the comparison will always evaluate to `False` and the padding logic for XPU devices will never run. The check should be `if current_platform.is_xpu():`.
```diff
- if backend == current_platform.is_xpu():
+ if current_platform.is_xpu():
```
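A minimal standalone sketch of why the buggy comparison is a dead branch (the `Mxfp4MoeBackend` members here are illustrative stand-ins, not the real enum):

```python
from enum import Enum

class Mxfp4MoeBackend(Enum):
    MARLIN = "MARLIN"
    TRITON = "TRITON"

backend = Mxfp4MoeBackend.MARLIN
is_xpu = True  # pretend current_platform.is_xpu() returned True

# Buggy form: an Enum member never compares equal to a bool,
# so this branch can never be taken, on any platform.
buggy_taken = backend == is_xpu

# Intended form: check the platform predicate directly.
fixed_taken = is_xpu
```

Because `Enum.__eq__` falls back to identity, `buggy_taken` is `False` even when the platform really is XPU.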
```python
w2_scale = layer.w2_weight
w13_bias = layer.w1_bias
w2_bias = layer.w2_bias
```
There are a few typos here that will likely cause an `AttributeError` at runtime:

- `w2_scale` is incorrectly assigned from `layer.w2_weight`. It should be `layer.w2_weight_scale`.
- `w13_bias` is assigned from `layer.w1_bias`, but the parameter is named `w13_bias`.
- `w2_bias` is assigned from `layer.w2_bias`, but the parameter is named `w2_bias`.
Since biases are optional, it's safer to use `getattr` with a default value of `None`.
```diff
- w2_scale = layer.w2_weight
- w13_bias = layer.w1_bias
- w2_bias = layer.w2_bias
+ w2_scale = layer.w2_weight_scale
+ w13_bias = getattr(layer, "w13_bias", None)
+ w2_bias = getattr(layer, "w2_bias", None)
```
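A small sketch of why `getattr` with a `None` default is the safer pattern for optional bias parameters (`SimpleNamespace` stands in for the MoE layer object; attribute names mirror the PR):

```python
from types import SimpleNamespace

# A layer whose model registered no bias parameters at all.
layer = SimpleNamespace(w2_weight_scale="scale_tensor")

# Direct attribute access raises when the bias is absent.
try:
    _ = layer.w13_bias
    raised = False
except AttributeError:
    raised = True

# getattr degrades gracefully to None, which downstream kernels
# can treat as "no bias".
w13_bias = getattr(layer, "w13_bias", None)
w2_bias = getattr(layer, "w2_bias", None)
```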
```python
self.gemm1_alpha = torch.tensor(
    [1.702] * self.num_experts, dtype=torch.float32, device=self.device
)
self.gemm1_beta = torch.tensor(
    [1.0] * self.num_experts, dtype=torch.float32, device=self.device
)
self.gemm1_clamp_limit = torch.tensor(
    [7.0] * self.num_experts, dtype=torch.float32, device=self.device
)
```
The values for `gemm1_alpha` (1.702) and `gemm1_clamp_limit` (7.0) are hardcoded. The comment indicates these are specific to gpt-oss. Hardcoding model-specific parameters makes the code less maintainable and harder to extend to other models. These values should be passed in through the model configuration rather than being hardcoded in the kernel implementation. This issue is also present in `trtllm_moe.py` and the new `oracle/mxfp4.py`.
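One way to lift these constants out of the kernel (illustrative only; `SwigluActivationParams` and `expand_per_expert` are hypothetical names, not part of the PR) is a small config object whose defaults match the current gpt-oss values, from which the kernel builds its per-expert tensors:

```python
from dataclasses import dataclass

@dataclass
class SwigluActivationParams:
    # Defaults match the values currently hardcoded for gpt-oss.
    alpha: float = 1.702
    beta: float = 1.0
    clamp_limit: float = 7.0

def expand_per_expert(params: SwigluActivationParams, num_experts: int):
    # The kernel would wrap these lists in tensors, e.g.
    # torch.tensor(alphas, dtype=torch.float32, device=self.device).
    return (
        [params.alpha] * num_experts,
        [params.beta] * num_experts,
        [params.clamp_limit] * num_experts,
    )

# A different model could simply pass its own params object.
alphas, betas, clamps = expand_per_expert(SwigluActivationParams(), num_experts=4)
```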
```python
Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_BF16,
Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_BF16,
```
There's a duplicate `Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_BF16` in the `elif` condition. This is likely a copy-paste error and should probably be `Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_MXFP8` to cover both CUTLASS backends.
```diff
  Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_BF16,
- Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_BF16,
+ Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_MXFP8,
```
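A quick sketch of how the duplicated member silently narrows the membership check (the enum values here are stand-ins, mirroring only the PR's member names):

```python
from enum import Enum

class Mxfp4MoeBackend(Enum):
    FLASHINFER_CUTLASS_MXFP4_BF16 = "BF16"
    FLASHINFER_CUTLASS_MXFP4_MXFP8 = "MXFP8"

buggy = (
    Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_BF16,
    Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_BF16,  # duplicate
)
fixed = (
    Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_BF16,
    Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_MXFP8,
)

mxfp8 = Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_MXFP8
in_buggy = mxfp8 in buggy   # the MXFP8 backend never matches the elif
in_fixed = mxfp8 in fixed
```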
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
392a502 to 6abf521
/gemini review
Code Review
This pull request is a significant and valuable refactoring that introduces an oracle for MXFP4 MoE quantization, centralizing backend selection and processing logic. This improves the architecture by making it more modular and extensible, similar to how FP8 and NVFP4 are handled. The addition of comprehensive tests for various backends and hardware is also a great enhancement. My review focuses on a few areas to improve the robustness and maintainability of the new oracle, including fixing a logic bug, correcting a typo in an enum, and removing hardcoded model-specific values from generic kernels.
```python
# n = hidden_size
# k = intermediate_size_per_partition_after_pad
intermediate_size = round_up(intermediate_size, 128)
if backend == current_platform.is_xpu():
```
The condition `if backend == current_platform.is_xpu():` is incorrect. `backend` is an enum member, while `current_platform.is_xpu()` returns a boolean. This comparison will always evaluate to `False`, leading to incorrect behavior on XPU platforms. The condition should likely be `if current_platform.is_xpu():` to check the platform type directly.
```diff
- if backend == current_platform.is_xpu():
+ if current_platform.is_xpu():
```

```python
self.gemm1_alpha = torch.tensor(
    [1.702] * self.num_experts, dtype=torch.float32, device=self.device
)
self.gemm1_beta = torch.tensor(
    [1.0] * self.num_experts, dtype=torch.float32, device=self.device
)
self.gemm1_clamp_limit = torch.tensor(
    [7.0] * self.num_experts, dtype=torch.float32, device=self.device
)
```
The values for `gemm1_alpha`, `gemm1_beta`, and `gemm1_clamp_limit` are hardcoded. The comment indicates these are specific to the gpt-oss model. Hardcoding model-specific parameters within a general-purpose kernel reduces its reusability and makes it harder to maintain. These values should be made configurable and passed in, for example, through the model's configuration, to make the kernel more generic.
```python
FLASHINFER_TRTLLM_MXFP4_MXFP8_MONOLITHIC = (
    "FLASHINFER_TRTLLM_MXFP4_MXFP8_MONOLITHIC"
)
FLASHINFER_CUTLASS_MXFP4_MXFP8 = "FLASHINFER_MXFP4_MXFP8_CUTLASS"
FLASHINFER_TRTLLM_MXFP4_BF16 = "FLASHINFER_MXFP4_BF16"
FLASHINFER_TRTLLM_MXFP4_BF16_MONOLOTHIC = "FLASHINFER_MXFP4_BF16_MONOLOTHIC"
```
There's a recurring typo `MONOLOTHIC` which should be `MONOLITHIC`. This appears in both the enum member names and their string values. While it's used consistently, it's best to correct it for clarity and to prevent future confusion. This typo is present on lines 55, 56, 60, and several other places in this file and others where these enum members are used.
```diff
  FLASHINFER_TRTLLM_MXFP4_MXFP8_MONOLITHIC = (
      "FLASHINFER_TRTLLM_MXFP4_MXFP8_MONOLITHIC"
  )
  FLASHINFER_CUTLASS_MXFP4_MXFP8 = "FLASHINFER_MXFP4_MXFP8_CUTLASS"
  FLASHINFER_TRTLLM_MXFP4_BF16 = "FLASHINFER_MXFP4_BF16"
- FLASHINFER_TRTLLM_MXFP4_BF16_MONOLOTHIC = "FLASHINFER_MXFP4_BF16_MONOLOTHIC"
+ FLASHINFER_TRTLLM_MXFP4_BF16_MONOLITHIC = "FLASHINFER_MXFP4_BF16_MONOLITHIC"
```
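A small sketch of why the fix has to update the member name and the string value together (the enum here is a stand-in that mirrors the PR's naming): the string values drive by-value lookups, so any caller still passing the misspelled string will start raising `ValueError` once the value is corrected.

```python
from enum import Enum

class Mxfp4MoeBackend(Enum):
    FLASHINFER_TRTLLM_MXFP4_BF16_MONOLITHIC = "FLASHINFER_MXFP4_BF16_MONOLITHIC"

# By-value lookup with the corrected spelling resolves to the member.
member = Mxfp4MoeBackend("FLASHINFER_MXFP4_BF16_MONOLITHIC")

# The old misspelled value no longer resolves once the value is fixed,
# so every call site must be updated in the same change.
try:
    Mxfp4MoeBackend("FLASHINFER_MXFP4_BF16_MONOLOTHIC")
    stale_lookup_ok = True
except ValueError:
    stale_lookup_ok = False
```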
```python
Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_BF16,
Mxfp4MoeBackend.FLASHINFER_CUTLASS_MXFP4_BF16,
):
```
```python
self.gemm1_alpha = torch.tensor(
    [1.702] * self.num_experts, dtype=torch.float32, device=self.device
)
self.gemm1_beta = torch.tensor(
    [1.0] * self.num_experts, dtype=torch.float32, device=self.device
)
self.gemm1_clamp_limit = torch.tensor(
    [7.0] * self.num_experts, dtype=torch.float32, device=self.device
)
```
The values for `gemm1_alpha`, `gemm1_beta`, and `gemm1_clamp_limit` are hardcoded here. As noted in a similar file, these values appear to be specific to a particular model (gpt-oss). To improve the reusability and generality of this kernel, these parameters should be made configurable and passed in, rather than being hardcoded.
981ea82 to 98cd346
close for #37128
Purpose
Ongoing MXFP4 MoE refactor
This PR can be greatly improved once #32564 is merged.
Test Plan
gpt-oss-20b GPQA score with medium reasoning effort.
Blackwell (gb200):
Hopper (h200):
All of them are tested with TP=2 and DP/EP=2.
Test Result
Blackwell:
Hopper
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.