[MoE Refactor][17/N] Apply Refactor to Bf16 #31827

Merged: vllm-bot merged 18 commits into vllm-project:main from zyongye:unquantized_moe_backend_selector on Jan 15, 2026

Conversation

@zyongye (Member) commented Jan 6, 2026

Purpose

Test Plan

GSM8K results for the Triton, FlashInfer CUTLASS, and AITER ROCm kernels on Qwen/Qwen3-30B-A3B, run in TP (Triton), TEP (FlashInfer CUTLASS), and ROCm configurations.

lm_eval \
    --model local-completions \
    --tasks gsm8k \
    --model_args "model=Qwen/Qwen3-30B-A3B,base_url=http://localhost:8000/v1/completions,num_concurrent=1000,tokenized_requests=False"

Test Result

Triton

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8643|±  |0.0094|
|     |       |strict-match    |     5|exact_match|↑  |0.9045|±  |0.0081|

Flashinfer

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8544|±  |0.0097|
|     |       |strict-match    |     5|exact_match|↑  |0.8984|±  |0.0083|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Note

Refactors unquantized MoE to centralize backend selection and kernel setup; adds BF16 GSM8K eval configs for MoE models and updates test config lists.

  • Introduces UnquantizedMoeBackend enum with select_unquantized_moe_backend and convert_to_unquantized_kernel_format, consolidating Aiter/FlashInfer/Triton selection, weight shuffling (swap_w13_to_w31), and kernel init via _setup_kernel in unquantized_fused_moe_method.py.
  • Simplifies process_weights_after_loading and maybe_make_prepare_finalize; adjusts DP/EP checks (uses dp_rank > 1) and logs; preserves inplace behavior differences per backend.
  • Adds BF16 GSM8K configs for Llama-4-Scout, Mixtral-8x7B, and Qwen3-30B-A3B (both fi-cutlass and triton) and registers them in config-b200.txt and config-h100.txt.

Written by Cursor Bugbot for commit f354ef2eabe320528a5815978bc8b87ee9fb505d.
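The backend-selection pattern described in the summary above can be sketched as follows. This is a minimal illustration, not the vLLM implementation: the availability flags are stand-in parameters, whereas the real `select_unquantized_moe_backend` in `oracle/unquantized.py` probes the platform and env flags such as `VLLM_USE_FLASHINFER_MOE_FP16`.

```python
from enum import Enum


class UnquantizedMoeBackend(Enum):
    """Illustrative subset of the backends named in the summary."""

    AITER = 1
    FLASHINFER_CUTLASS = 2
    TRITON = 3


def select_unquantized_moe_backend(
    use_dp: bool,
    use_ep: bool,
    aiter_available: bool = False,
    flashinfer_available: bool = False,
) -> UnquantizedMoeBackend:
    # AITER (ROCm) takes priority when available.
    if aiter_available:
        return UnquantizedMoeBackend.AITER
    # Per the review discussion below, FlashInfer CUTLASS is not
    # wired up for DP+EP yet, so fall back to Triton in that case.
    if flashinfer_available and not (use_dp and use_ep):
        return UnquantizedMoeBackend.FLASHINFER_CUTLASS
    # Triton is the portable default.
    return UnquantizedMoeBackend.TRITON
```

Centralizing the fallback rules in one selector is what lets process_weights_after_loading stay a thin dispatch over the chosen enum value.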


@robertgshaw2-redhat changed the title from "[MoE Refactor] Enable Unquantized backend selector and unified kernel setup interface" to "[MoE Refactor][17/N] Apply Refactor to Bf16" on Jan 6, 2026
@robertgshaw2-redhat (Collaborator):
Can you add to the CI jobs?

TRITON = 3


def get_unquantized_moe_backend(
@robertgshaw2-redhat (Collaborator) commented Jan 6, 2026:
get --> select is the new convention

@gemini-code-assist (bot) left a comment:

Code Review

This pull request refactors the unquantized MoE method to introduce a backend selector and a unified kernel setup interface. The changes are well-structured and improve code clarity by centralizing the backend selection logic and separating kernel setup and weight conversion. I've identified a few areas for improvement, including an incorrect type hint, a misleading error message, and unused function parameters, which should be addressed to enhance correctness and maintainability.


def get_unquantized_moe_backend(
moe_parallel_config: FusedMoEParallelConfig,
) -> UnquantizedMoeBackend | None:
Severity: high

The return type hint for get_unquantized_moe_backend is UnquantizedMoeBackend | None, but the function always returns a member of the UnquantizedMoeBackend enum and never None. The type hint should be corrected to UnquantizedMoeBackend to accurately reflect the function's behavior.

Suggested change
) -> UnquantizedMoeBackend | None:
) -> UnquantizedMoeBackend:

)
elif self.unquantized_backend == UnquantizedMoeBackend.NONE:
raise ValueError(
"Unable to select quantization backend, please check supported backend."
Severity: high

The error message is misleading. This method is for unquantized MoE, but the error refers to a 'quantization backend'. It should refer to an 'unquantized MoE backend' to be accurate and avoid confusion.

Suggested change
"Unable to select quantization backend, please check supported backend."
"Unable to select unquantized MoE backend, please check supported backends."

Comment on lines +269 to +273
layer: Module,
w13_weight: torch.Tensor | None = None,
w2_weight: torch.Tensor | None = None,
w13_weight_scale: torch.Tensor | None = None,
w2_weight_scale: torch.Tensor | None = None,
Severity: high

The parameters w13_weight, w2_weight, w13_weight_scale, and w2_weight_scale are defined in the _convert_weights_to_kernel_format method signature but are not used within the method. They should be removed to simplify the signature and avoid confusion.

        layer: Module


def __init__(self, moe: FusedMoEConfig):
super().__init__(moe)
self.unquantized_backend = get_unquantized_moe_backend(
Collaborator:
rather than passing the moe parallel config, let's just pass use_dp and use_ep

self.kernel = mk.FusedMoEModularKernel(
MoEPrepareAndFinalizeNoEP(),
AiterExperts(self.moe_quant_config),
shared_experts=None,
Collaborator:
drop the shared_experts, this is the default

TritonExperts(self.moe_quant_config),
shared_experts=None,
)
self._convert_weights_to_kernel_format(layer=layer)
Collaborator:
to keep things consistent, there should be a single function called _setup_kernel() which does these two steps
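The suggested shape can be sketched with a toy class (everything here except the `_setup_kernel` name is a stand-in, not vLLM code): a single entry point that first builds the kernel and then converts the weights, so callers can never perform the two steps separately or out of order.

```python
class UnquantizedMoeMethodSketch:
    """Toy stand-in for an unquantized fused-MoE method."""

    def __init__(self, backend: str):
        self.backend = backend
        self.kernel = None
        self.weights_converted = False

    def _make_kernel(self) -> str:
        # Real code builds mk.FusedMoEModularKernel with
        # backend-specific experts (AiterExperts, TritonExperts, ...).
        return f"{self.backend}-kernel"

    def _convert_weights_to_kernel_format(self) -> None:
        # Real code shuffles/swaps weight layouts (e.g. swap_w13_to_w31)
        # and replaces the layer's parameters in place.
        self.weights_converted = True

    def _setup_kernel(self) -> None:
        # The single entry point the reviewer asks for: build, then convert.
        self.kernel = self._make_kernel()
        self._convert_weights_to_kernel_format()
```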

"Unable to select quantization backend, please check supported backend."
)

def _convert_weights_to_kernel_format(
Collaborator:
this function should be in oracle (we will eventually make it a method of the Expert)

Just the "replace_parameter" should be part of this function

@mergify mergify bot added the nvidia label Jan 7, 2026
@github-project-automation github-project-automation bot moved this to Backlog in MoE Refactor Jan 7, 2026
@zyongye zyongye moved this from Backlog to In progress in MoE Refactor Jan 7, 2026
@zyongye force-pushed the unquantized_moe_backend_selector branch from 30d9ff4 to f354ef2 on January 9, 2026 00:54
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 8"
Collaborator:
you need to update this to tp=2

Member (Author):
Llama 4 Scout has 108B parameters and can't fit in 2 H200 GPUs.

Member (Author):
Move llama4 tests to b200.

" VLLM_USE_FLASHINFER_MOE_FP16=1 to enable it.",
scope="local",
)
elif use_dp:
@robertgshaw2-redhat (Collaborator) commented Jan 9, 2026:
Why does this kernel work with TP/EP but not DP/EP?

I don't see why there would be a distinction. I actually think this should work fine with the MK structure.

Member (Author):
It needs further investigation. The original code says it doesn't support DP/EP.

Member (Author):
Oh, maybe it's just not in select_gemm_impl yet.

@@ -11,3 +11,9 @@ Qwen3-30B-A3B-NvFp4-ModelOpt-marlin.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-trtllm.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass-dp-ep.yaml
Llama-4-Scout-BF16-fi-cutlass.yaml
Collaborator:
run half on the b200 and some on the h100 for CI time / budget

@robertgshaw2-redhat (Collaborator):
this PR is very well done and nicely structured. Just left some minor nits

backend = UnquantizedMoeBackend.CPU

logger.info_once(_make_log_backend(backend), scope="local")
return backend
Cursor Bugbot:
Backend variable may be uninitialized for unknown platforms

Medium Severity

The select_unquantized_moe_backend function uses separate if statements for platform checks (is_rocm(), is_cuda(), is_xpu(), is_cpu()). If none of these conditions are true (e.g., TPU or a future platform), the backend variable is never assigned, causing an UnboundLocalError when the function tries to log and return it. Using elif with a final else clause that raises an informative error or sets a default would prevent this.


@zyongye zyongye added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 10, 2026
@mergify bot commented Jan 10, 2026:

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 10, 2026
@zyongye force-pushed the unquantized_moe_backend_selector branch from 522cb1c to e075293 on January 10, 2026 01:40
@mergify mergify bot removed the needs-rebase label Jan 10, 2026
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye force-pushed the unquantized_moe_backend_selector branch from 4c46940 to 8d0e320 on January 13, 2026 22:53
logger = init_logger(__name__)


# --8<-- [start:unquantized_fused_moe]
Collaborator:
I think these weird comments need to be preserved for doc purposes.

@vllm-bot vllm-bot merged commit 31c2925 into vllm-project:main Jan 15, 2026
49 of 51 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in MoE Refactor Jan 15, 2026
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 15, 2026
@vanbasten23 mentioned this pull request Jan 16, 2026
sammysun0711 pushed a commit to sammysun0711/vllm that referenced this pull request Jan 16, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
@zyongye zyongye deleted the unquantized_moe_backend_selector branch March 12, 2026 21:14