[MoE Refactor][17/N] Apply Refactor to Bf16 #31827

Merged: vllm-bot merged 18 commits into vllm-project:main from zyongye:unquantized_moe_backend_selector on Jan 15, 2026

Conversation

@zyongye (Member) commented Jan 6, 2026

Purpose

Test Plan

GSM8K results for the Triton, FlashInfer CUTLASS, and AITER ROCm kernels on Qwen/Qwen3-30B-A3B, run in TP (Triton), TEP (FlashInfer CUTLASS), and ROCm configurations.

lm_eval \
    --model local-completions \
    --tasks gsm8k \
    --model_args "model=Qwen/Qwen3-30B-A3B,base_url=http://localhost:8000/v1/completions,num_concurrent=1000,tokenized_requests=False"

Test Result

Triton

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8643|±  |0.0094|
|     |       |strict-match    |     5|exact_match|↑  |0.9045|±  |0.0081|

Flashinfer

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8544|±  |0.0097|
|     |       |strict-match    |     5|exact_match|↑  |0.8984|±  |0.0083|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Note

Refactors unquantized MoE to centralize backend selection and kernel setup; adds BF16 GSM8K eval configs for MoE models and updates test config lists.

  • Introduces UnquantizedMoeBackend enum with select_unquantized_moe_backend and convert_to_unquantized_kernel_format, consolidating Aiter/FlashInfer/Triton selection, weight shuffling (swap_w13_to_w31), and kernel init via _setup_kernel in unquantized_fused_moe_method.py.
  • Simplifies process_weights_after_loading and maybe_make_prepare_finalize; adjusts DP/EP checks (uses dp_rank > 1) and logs; preserves inplace behavior differences per backend.
  • Adds BF16 GSM8K configs for Llama-4-Scout, Mixtral-8x7B, and Qwen3-30B-A3B (both fi-cutlass and triton) and registers them in config-b200.txt and config-h100.txt.

Written by Cursor Bugbot for commit f354ef2eabe320528a5815978bc8b87ee9fb505d.
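The backend-selection pattern described in the summary above can be sketched as follows. This is a minimal illustration, not the vLLM implementation: the availability flags are stand-in parameters, whereas the real `select_unquantized_moe_backend` in `oracle/unquantized.py` probes the platform and env flags such as `VLLM_USE_FLASHINFER_MOE_FP16`.

```python
from enum import Enum


class UnquantizedMoeBackend(Enum):
    """Illustrative subset of the backends named in the summary."""

    AITER = 1
    FLASHINFER_CUTLASS = 2
    TRITON = 3


def select_unquantized_moe_backend(
    use_dp: bool,
    use_ep: bool,
    aiter_available: bool = False,
    flashinfer_available: bool = False,
) -> UnquantizedMoeBackend:
    # AITER (ROCm) takes priority when available.
    if aiter_available:
        return UnquantizedMoeBackend.AITER
    # Per the review discussion below, FlashInfer CUTLASS is not
    # wired up for DP+EP yet, so fall back to Triton in that case.
    if flashinfer_available and not (use_dp and use_ep):
        return UnquantizedMoeBackend.FLASHINFER_CUTLASS
    # Triton is the portable default.
    return UnquantizedMoeBackend.TRITON
```

Centralizing the fallback rules in one selector is what lets process_weights_after_loading stay a thin dispatch over the chosen enum value.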


@robertgshaw2-redhat changed the title from "[MoE Refactor] Enable Unquantized backend selector and unified kernel setup interface" to "[MoE Refactor][17/N] Apply Refactor to Bf16" on Jan 6, 2026
@robertgshaw2-redhat (Collaborator):
Can you add to the CI jobs?

TRITON = 3


def get_unquantized_moe_backend(
@robertgshaw2-redhat (Collaborator) commented Jan 6, 2026:
get --> select is the new convention

@gemini-code-assist (bot) left a comment:

Code Review

This pull request refactors the unquantized MoE method to introduce a backend selector and a unified kernel setup interface. The changes are well-structured and improve code clarity by centralizing the backend selection logic and separating kernel setup and weight conversion. I've identified a few areas for improvement, including an incorrect type hint, a misleading error message, and unused function parameters, which should be addressed to enhance correctness and maintainability.


def get_unquantized_moe_backend(
moe_parallel_config: FusedMoEParallelConfig,
) -> UnquantizedMoeBackend | None:
Severity: high

The return type hint for get_unquantized_moe_backend is UnquantizedMoeBackend | None, but the function always returns a member of the UnquantizedMoeBackend enum and never None. The type hint should be corrected to UnquantizedMoeBackend to accurately reflect the function's behavior.

Suggested change
) -> UnquantizedMoeBackend | None:
) -> UnquantizedMoeBackend:

)
elif self.unquantized_backend == UnquantizedMoeBackend.NONE:
raise ValueError(
"Unable to select quantization backend, please check supported backend."
Severity: high

The error message is misleading. This method is for unquantized MoE, but the error refers to a 'quantization backend'. It should refer to an 'unquantized MoE backend' to be accurate and avoid confusion.

Suggested change
"Unable to select quantization backend, please check supported backend."
"Unable to select unquantized MoE backend, please check supported backends."

Comment on lines +269 to +273
layer: Module,
w13_weight: torch.Tensor | None = None,
w2_weight: torch.Tensor | None = None,
w13_weight_scale: torch.Tensor | None = None,
w2_weight_scale: torch.Tensor | None = None,
Severity: high

The parameters w13_weight, w2_weight, w13_weight_scale, and w2_weight_scale are defined in the _convert_weights_to_kernel_format method signature but are not used within the method. They should be removed to simplify the signature and avoid confusion.

        layer: Module


def __init__(self, moe: FusedMoEConfig):
super().__init__(moe)
self.unquantized_backend = get_unquantized_moe_backend(
Collaborator:
rather than passing the moe parallel config, let's just pass use_dp and use_ep

self.kernel = mk.FusedMoEModularKernel(
MoEPrepareAndFinalizeNoEP(),
AiterExperts(self.moe_quant_config),
shared_experts=None,
Collaborator:
drop the shared_experts, this is the default

TritonExperts(self.moe_quant_config),
shared_experts=None,
)
self._convert_weights_to_kernel_format(layer=layer)
Collaborator:
to keep things consistent, there should be a single function called _setup_kernel() which does these two steps
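The suggested shape can be sketched with a toy class (everything here except the `_setup_kernel` name is a stand-in, not vLLM code): a single entry point that first builds the kernel and then converts the weights, so callers can never perform the two steps separately or out of order.

```python
class UnquantizedMoeMethodSketch:
    """Toy stand-in for an unquantized fused-MoE method."""

    def __init__(self, backend: str):
        self.backend = backend
        self.kernel = None
        self.weights_converted = False

    def _make_kernel(self) -> str:
        # Real code builds mk.FusedMoEModularKernel with
        # backend-specific experts (AiterExperts, TritonExperts, ...).
        return f"{self.backend}-kernel"

    def _convert_weights_to_kernel_format(self) -> None:
        # Real code shuffles/swaps weight layouts (e.g. swap_w13_to_w31)
        # and replaces the layer's parameters in place.
        self.weights_converted = True

    def _setup_kernel(self) -> None:
        # The single entry point the reviewer asks for: build, then convert.
        self.kernel = self._make_kernel()
        self._convert_weights_to_kernel_format()
```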

"Unable to select quantization backend, please check supported backend."
)

def _convert_weights_to_kernel_format(
Collaborator:
this function should be in oracle (we will eventually make it a method of the Expert)

Just the "replace_parameter" should be part of this function

@mergify mergify bot added the nvidia label Jan 7, 2026
@github-project-automation github-project-automation bot moved this to Backlog in MoE Refactor Jan 7, 2026
@zyongye zyongye moved this from Backlog to In progress in MoE Refactor Jan 7, 2026
@zyongye force-pushed the unquantized_moe_backend_selector branch from 30d9ff4 to f354ef2 on January 9, 2026 00:54
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 8"
Collaborator:
you need to update this to tp=2

Member (Author):
Llama 4 Scout has 108B parameters and can't fit in 2 H200 GPUs.

Member (Author):
Move llama4 tests to b200.

" VLLM_USE_FLASHINFER_MOE_FP16=1 to enable it.",
scope="local",
)
elif use_dp:
@robertgshaw2-redhat (Collaborator) commented Jan 9, 2026:
Why does this kernel work with TP/EP but not DP/EP?

I don't see why there would be a distinction. I actually think this should work fine with the MK structure.

Member (Author):
It needs further investigation. The original code says it doesn't support DP/EP.

Member (Author):
Oh, maybe it's just not in select_gemm_impl yet.

@@ -11,3 +11,9 @@ Qwen3-30B-A3B-NvFp4-ModelOpt-marlin.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-trtllm.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass.yaml
Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass-dp-ep.yaml
Llama-4-Scout-BF16-fi-cutlass.yaml
Collaborator:
run half on the b200 and some on the h100 for CI time / budget

@robertgshaw2-redhat (Collaborator):
this PR is very well done and nicely structured. Just left some minor nits

backend = UnquantizedMoeBackend.CPU

logger.info_once(_make_log_backend(backend), scope="local")
return backend
Cursor Bugbot:
Backend variable may be uninitialized for unknown platforms

Medium Severity

The select_unquantized_moe_backend function uses separate if statements for platform checks (is_rocm(), is_cuda(), is_xpu(), is_cpu()). If none of these conditions are true (e.g., TPU or a future platform), the backend variable is never assigned, causing an UnboundLocalError when the function tries to log and return it. Using elif with a final else clause that raises an informative error or sets a default would prevent this.


@zyongye zyongye added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 10, 2026
@mergify bot commented Jan 10, 2026:

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zyongye.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 10, 2026
@zyongye force-pushed the unquantized_moe_backend_selector branch from 522cb1c to e075293 on January 10, 2026 01:40
@mergify mergify bot removed the needs-rebase label Jan 10, 2026
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@zyongye force-pushed the unquantized_moe_backend_selector branch from 4c46940 to 8d0e320 on January 13, 2026 22:53
logger = init_logger(__name__)


# --8<-- [start:unquantized_fused_moe]
Collaborator:
I think these weird comments need to be preserved for doc purposes.

@vllm-bot vllm-bot merged commit 31c2925 into vllm-project:main Jan 15, 2026
49 of 51 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in MoE Refactor Jan 15, 2026
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 15, 2026
@vanbasten23 mentioned this pull request Jan 16, 2026
sammysun0711 pushed a commit to sammysun0711/vllm that referenced this pull request Jan 16, 2026
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
@zyongye zyongye deleted the unquantized_moe_backend_selector branch March 12, 2026 21:14