[MoE Refactor] Migrate UnquantizedFusedMoEMethod and oracle to MK flow#36732
Closed
yzong-rh wants to merge 12 commits into vllm-project:main from
Conversation
Signed-off-by: Yifan Zong <yzong@redhat.com>
TPU and OOT platforms do not have in-tree unquantized MoE kernels and rely on OOT plugins to replace UnquantizedFusedMoEMethod via CustomOp.register_oot. Having dedicated enum values for them was misleading — they could never produce a working kernel and would silently fail at runtime with an opaque AssertionError. Replace them with a single UnquantizedMoeBackend.NONE value (mirroring Fp8MoeBackend.NONE) and add an assertion in UnquantizedFusedMoEMethod.__init__ to fail fast if no OOT plugin is registered. Also fix the if/elif chain in select_unquantized_moe_backend to prevent accidental overwrites across platform checks. Signed-off-by: Yifan Zong <yzong@redhat.com>
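The fail-fast behavior described in this commit can be sketched as follows. This is a minimal illustration of the idea, not vLLM's actual code: the enum, function name, and `oot_plugin_registered` flag are hypothetical stand-ins.

```python
import enum


class UnquantizedMoeBackend(enum.Enum):
    """Illustrative subset of the backend enum discussed above."""
    TRITON = "triton"
    NONE = "none"  # no in-tree kernel; an OOT plugin must take over


def fail_fast_on_none(backend: UnquantizedMoeBackend,
                      oot_plugin_registered: bool) -> None:
    # Mirrors the assertion described for UnquantizedFusedMoEMethod.__init__:
    # instead of an opaque AssertionError deep in the runtime, surface the
    # missing-plugin condition at construction time.
    if backend is UnquantizedMoeBackend.NONE:
        assert oot_plugin_registered, (
            "No in-tree unquantized MoE kernel for this platform; "
            "an OOT plugin must replace UnquantizedFusedMoEMethod "
            "via CustomOp.register_oot."
        )
```

With this shape, TPU/OOT platforms that forgot to register a plugin fail immediately with an actionable message rather than at first forward pass.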
…ithic, which do not support them Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan Zong <yzong@redhat.com>
…d oracle Signed-off-by: Yifan Zong <yzong@redhat.com> [MoE] convert_to_unquantized_kernel_format weight source consistency Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan Zong <yzong@redhat.com>
…rrect Enums Signed-off-by: Yifan Zong <yzong@redhat.com>
This reverts commit be21911. Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Yifan Zong <yzong@redhat.com>
1. override and throw in `select_gemm_impl`
2. remove `NONE` backend and throw early on TPU/OOT platforms
3. remove `_maybe_swap_to_batched_variant` so that `AVAILABLE_BACKENDS` is set in one location
4. Use `use_deepep_ll_kernels` instead of `use_all2all_kernels`

Signed-off-by: Yifan Zong <yzong@redhat.com>
Contributor
Code Review
This pull request introduces a significant and well-executed refactoring of the unquantized MoE backend selection and kernel initialization logic. The changes greatly improve modularity and maintainability by replacing complex conditional chains with a priority-based oracle and dedicated expert classes. The new structure is much cleaner and more extensible. I've found one critical typo that would prevent a backend from being selected correctly, which I've commented on. Otherwise, the changes look excellent.
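The "priority-based oracle" pattern the review praises can be sketched as below. This is a generic illustration of the selection style, not vLLM's actual implementation; the backend names, predicates, and the `platform` dict are hypothetical.

```python
from typing import Callable, Optional

# A priority-ordered table of (backend_name, predicate); the first
# predicate that passes wins.  This replaces nested if/elif chains with
# a single, easily extensible data structure.
BACKEND_PRIORITY: list[tuple[str, Callable[[dict], bool]]] = [
    ("flashinfer_trtllm",
     lambda p: p.get("has_flashinfer", False) and p.get("is_blackwell", False)),
    ("aiter", lambda p: p.get("is_rocm", False)),
    ("triton", lambda p: p.get("has_triton", True)),
]


def select_backend(platform: dict) -> Optional[str]:
    """Return the highest-priority backend supported on this platform."""
    for name, is_supported in BACKEND_PRIORITY:
        if is_supported(platform):
            return name
    return None
```

Adding a new backend then means appending one `(name, predicate)` row at the right priority instead of threading another branch through a conditional chain.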
…quantized.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: yzong-rh <yzong@redhat.com>
Contributor
Author
Further work ended up being done on the original branch.
Continuation of #36286
Purpose
Migrate the unquantized MoE (BF16) code path from the legacy kernel initialization pattern to the modern modular pattern already used by FP8 and NvFP4.
The CPU backend is not migrated and remains on the old path due to interface differences (see below).
Background
There are unquantized MoE kernels in both the old path and the new path.

In the old path:
- Monolithic backends used `UNSUPPORTED_BACKEND` and were implemented in `forward_monolithic_{cuda|cpu}`.
- Kernels were built with a `NoDPEP` prepare/finalize, which may be swapped for an appropriate prepare/finalize in `prepare_communication_buffer_for_model` (after weight loading and `process_weights_after_loading`).

This PR:
- Moves `vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py` to `vllm/model_executor/layers/fused_moe/experts/trtllm_bf16_moe.py`. Mirrors `TrtLlmFp8Experts` and [MoE Refactor] Create MK for TRTLLM Kernels #32564.

Lifecycle: Old Path vs New Path
Non-monolithic (Triton, AITER, FlashInfer CUTLASS, XPU)
Monolithic — GPU (FlashInfer TRTLLM)
Monolithic — CPU (not migrated, unchanged)
Changes
New file:
- `experts/trtllm_bf16_moe.py` — `TrtLlmBf16Experts`, a `FusedMoEExpertsMonolithic` subclass wrapping the `flashinfer.fused_moe.trtllm_bf16_moe` call.

`oracle/unquantized.py`:
- `select_unquantized_moe_backend` now returns `(backend, experts_cls)` instead of just `backend`, mirroring FP8. CPU returns `(CPU, None)`.
- Removed `UNSUPPORTED_BACKEND`. Added `BATCHED_TRITON` enum variant and `backend_to_kernel_cls` mapping.
- Candidate backends are checked with `is_supported_config`; log and skip unsupported ones.
- `make_unquantized_moe_kernel` now calls `maybe_make_prepare_finalize(allow_new_interface=True)` instead of hardcoding `NoDPEP`, and always returns a `FusedMoEKernel`.
- `convert_to_unquantized_kernel_format`.

`unquantized_fused_moe_method.py`:
- `__init__` stores `experts_cls` from the backend selector. Removed `self.kernel`, `_is_monolithic`, and `_select_monolithic`.
- `_setup_kernel` stores the kernel in `self.moe_kernel` (not `self.kernel`), making `supports_internal_mk=True` and causing `maybe_init_modular_kernel` to no-op.
- `is_monolithic` returns `True` for CPU, delegates to `super()` otherwise.
- `forward_native` and `forward_cuda` use `self.moe_kernel.apply()`.
- `apply_monolithic` dispatches CPU to `self.cpu_fused_moe`, all others to `self.moe_kernel.apply_monolithic()`.
- Removed `forward_monolithic_cuda`, `select_gemm_impl`, and the FlashInfer TRTLLM branch from `process_weights_after_loading`.

Other cleanups:
- `rocm_aiter_moe_enabled` condition.
- `NONE` (mirrors [Bugfix][TPU] Return a Default fp8 MoE Backend #32908).
- `shared_experts` passed to `FusedMoEExpertsMonolithic`.

The CPU backend (`CPUFusedMOE`/`SGLFusedMOE`) stays on the old monolithic path because it has three interface differences that make a clean migration non-trivial:
- `FusedMoEConfig` (`renormalize`, `scoring_func`, `custom_routing_function`).
- `apply(w1, w2, ...)` interface.
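The new oracle return shape described above — `(backend, experts_cls)`, with CPU returning `(CPU, None)` because it stays on the old path — can be sketched as follows. The names here are illustrative stand-ins, not vLLM's actual classes.

```python
from enum import Enum, auto
from typing import Optional


class Backend(Enum):
    TRITON = auto()
    CPU = auto()


class TritonExperts:
    """Stand-in for a modular-kernel experts class."""


def select_unquantized_moe_backend_sketch(
        is_cpu: bool) -> tuple[Backend, Optional[type]]:
    # The oracle returns both the backend enum and the experts class,
    # so the caller never has to map one to the other itself.  CPU
    # returns (CPU, None): it stays on the old monolithic path and has
    # no modular experts class to hand back.
    if is_cpu:
        return Backend.CPU, None
    return Backend.TRITON, TritonExperts
```

Returning the class alongside the enum keeps backend selection and kernel construction in one place, which is the same shape the FP8 oracle already uses.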
Integration tests:
Updated unit tests:
Other unit tests:
Test Result
All unit tests pass on a B200 machine.
cc @robertgshaw2-redhat @bnellnm
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.