[MoE Refactor][17/N] Apply Refactor to Bf16 #31827

vllm-bot merged 18 commits into vllm-project:main
Conversation
Can you add to the CI jobs?
| TRITON = 3 |

| def get_unquantized_moe_backend( |
get --> select is the new convention
Code Review
This pull request refactors the unquantized MoE method to introduce a backend selector and a unified kernel setup interface. The changes are well-structured and improve code clarity by centralizing the backend selection logic and separating kernel setup and weight conversion. I've identified a few areas for improvement, including an incorrect type hint, a misleading error message, and unused function parameters, which should be addressed to enhance correctness and maintainability.
| def get_unquantized_moe_backend( |
| moe_parallel_config: FusedMoEParallelConfig, |
| ) -> UnquantizedMoeBackend | None: |
The return type hint for get_unquantized_moe_backend is UnquantizedMoeBackend | None, but the function always returns a member of the UnquantizedMoeBackend enum and never None. The type hint should be corrected to UnquantizedMoeBackend to accurately reflect the function's behavior.
- ) -> UnquantizedMoeBackend | None:
+ ) -> UnquantizedMoeBackend:
| ) |
| elif self.unquantized_backend == UnquantizedMoeBackend.NONE: |
| raise ValueError( |
| "Unable to select quantization backend, please check supported backend." |
The error message is misleading. This method is for unquantized MoE, but the error refers to a 'quantization backend'. It should refer to an 'unquantized MoE backend' to be accurate and avoid confusion.
- "Unable to select quantization backend, please check supported backend."
+ "Unable to select unquantized MoE backend, please check supported backends."
| layer: Module, |
| w13_weight: torch.Tensor | None = None, |
| w2_weight: torch.Tensor | None = None, |
| w13_weight_scale: torch.Tensor | None = None, |
| w2_weight_scale: torch.Tensor | None = None, |
| def __init__(self, moe: FusedMoEConfig): |
| super().__init__(moe) |
| self.unquantized_backend = get_unquantized_moe_backend( |
rather than passing the moe parallel config, let's just pass use_dp and use_ep
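A minimal sketch of that narrower interface: the function takes the two booleans directly instead of the whole `FusedMoEParallelConfig`. The enum members and the selection branch below are illustrative assumptions, not the real vLLM logic.

```python
from enum import Enum


class UnquantizedMoeBackend(Enum):
    # Illustrative subset of the real enum in vLLM.
    FLASHINFER_CUTLASS = 2
    TRITON = 3


def select_unquantized_moe_backend(use_dp: bool, use_ep: bool) -> UnquantizedMoeBackend:
    # Take use_dp/use_ep directly rather than a parallel-config object.
    # The branch below is an assumption for illustration only.
    if use_dp or use_ep:
        return UnquantizedMoeBackend.FLASHINFER_CUTLASS
    return UnquantizedMoeBackend.TRITON
```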
| self.kernel = mk.FusedMoEModularKernel( |
| MoEPrepareAndFinalizeNoEP(), |
| AiterExperts(self.moe_quant_config), |
| shared_experts=None, |
drop the shared_experts, this is the default
| TritonExperts(self.moe_quant_config), |
| shared_experts=None, |
| ) |
| self._convert_weights_to_kernel_format(layer=layer) |
to keep things consistent, there should be a single function called _setup_kernel() which does these two steps
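The consolidation could look roughly like this sketch, where both steps are hidden behind one `_setup_kernel()` call. The method bodies are stand-ins for the real kernel construction and weight conversion quoted in the diff; `_make_kernel` is a hypothetical helper name.

```python
class UnquantizedFusedMoEMethod:
    """Sketch only: stand-in for the vLLM method class in this diff."""

    def _make_kernel(self):
        # Placeholder for mk.FusedMoEModularKernel(MoEPrepareAndFinalizeNoEP(),
        # TritonExperts(self.moe_quant_config)) construction.
        return object()

    def _convert_weights_to_kernel_format(self, layer):
        # Placeholder for the weight shuffling / layout conversion step.
        self.converted = True

    def _setup_kernel(self, layer):
        # Single entry point performing both steps, as suggested in review.
        self.kernel = self._make_kernel()
        self._convert_weights_to_kernel_format(layer=layer)
```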
| "Unable to select quantization backend, please check supported backend." |
| ) |

| def _convert_weights_to_kernel_format( |
this function should be in the oracle (we will eventually make it a method of the Expert).
Just the `replace_parameter` call should be part of this function.
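A rough shape of that split, under the assumption that the pure layout conversion moves into the oracle module while only the parameter replacement stays on the method side. The names `convert_to_unquantized_kernel_format` and `swap_w13_to_w31` appear elsewhere in this PR; the bodies here are stand-ins.

```python
from types import SimpleNamespace


# Oracle side (sketch): pure weight-layout conversion, no layer mutation.
def convert_to_unquantized_kernel_format(w13, w2):
    # Placeholder for transforms such as swap_w13_to_w31.
    return w13, w2


# Method side (sketch): only the parameter replacement remains here.
def _convert_weights_to_kernel_format(layer):
    new_w13, new_w2 = convert_to_unquantized_kernel_format(
        layer.w13_weight, layer.w2_weight
    )
    # In the real code this step would go through replace_parameter(...).
    layer.w13_weight, layer.w2_weight = new_w13, new_w2
    return layer


layer = _convert_weights_to_kernel_format(
    SimpleNamespace(w13_weight=[1], w2_weight=[2])
)
```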
Force-pushed 30d9ff4 to f354ef2
vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py
tests/evals/gsm8k/configs/moe-refactor/Llama-4-Scout-BF16-fi-cutlass.yaml
| accuracy_threshold: 0.92 |
| num_questions: 1319 |
| num_fewshot: 5 |
| server_args: "--enforce-eager --max-model-len 8192 --tensor-parallel-size 8" |
you need to update this to tp=2
Llama 4 Scout is a 108B-parameter model and can't fit in 2 H200 GPUs.
Move llama4 tests to b200.
| " VLLM_USE_FLASHINFER_MOE_FP16=1 to enable it.", |
| scope="local", |
| ) |
| elif use_dp: |
Why does this kernel work with TP/EP but not DP/EP?
I don't see why there would be a distinction. I actually think this should work fine with the MK structure.
It needs further investigation. The original code says it doesn't support DP/EP.
Oh, maybe it's just not in select_gemm_impl yet.
| @@ -11,3 +11,9 @@ Qwen3-30B-A3B-NvFp4-ModelOpt-marlin.yaml |
| Qwen3-30B-A3B-NvFp4-ModelOpt-fi-trtllm.yaml |
| Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass.yaml |
| Qwen3-30B-A3B-NvFp4-ModelOpt-fi-cutlass-dp-ep.yaml |
| Llama-4-Scout-BF16-fi-cutlass.yaml |
run half on the b200 and some on the h100 for CI time / budget
This PR is very well done and nicely structured. Just left some minor nits.
| backend = UnquantizedMoeBackend.CPU |

| logger.info_once(_make_log_backend(backend), scope="local") |
| return backend |
Backend variable may be uninitialized for unknown platforms
Medium Severity
The select_unquantized_moe_backend function uses separate if statements for platform checks (is_rocm(), is_cuda(), is_xpu(), is_cpu()). If none of these conditions are true (e.g., TPU or a future platform), the backend variable is never assigned, causing an UnboundLocalError when the function tries to log and return it. Using elif with a final else clause that raises an informative error or sets a default would prevent this.
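One way to address this, sketched with plain booleans standing in for the platform predicates. Apart from the CPU assignment quoted above, the platform-to-backend mapping shown here is an assumption for illustration.

```python
from enum import Enum


class UnquantizedMoeBackend(Enum):
    # Illustrative members only.
    AITER = 1
    TRITON = 3
    CPU = 4


def select_unquantized_moe_backend(
    is_rocm: bool, is_cuda: bool, is_xpu: bool, is_cpu: bool
) -> UnquantizedMoeBackend:
    # An elif chain with a final else guarantees `backend` is always bound,
    # so an unknown platform fails loudly instead of raising UnboundLocalError.
    if is_rocm:
        backend = UnquantizedMoeBackend.AITER
    elif is_cuda or is_xpu:
        backend = UnquantizedMoeBackend.TRITON
    elif is_cpu:
        backend = UnquantizedMoeBackend.CPU
    else:
        raise RuntimeError("No unquantized MoE backend for this platform")
    return backend
```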
tests/evals/gsm8k/configs/moe-refactor-dp-ep/Qwen3-30B-A3B-BF16-triton.yaml
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 522cb1c to e075293
tests/evals/gsm8k/configs/moe-refactor-dp-ep/Qwen3-30B-A3B-BF16-triton.yaml
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Force-pushed 4c46940 to 8d0e320
| logger = init_logger(__name__) |

| # --8<-- [start:unquantized_fused_moe] |
I think these weird comments need to be preserved for doc purposes.
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose

Test Plan

gsm8k results for the Triton kernel, FlashInfer CUTLASS kernel, and AITER ROCm kernel for Qwen/Qwen3-30B-A3B in TP (triton), TEP (flashinfer cutlass), and ROCm.

Test Result
Triton
Flashinfer
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Note

Refactors unquantized MoE to centralize backend selection and kernel setup; adds BF16 GSM8K eval configs for MoE models and updates test config lists.

- Adds an UnquantizedMoeBackend enum with select_unquantized_moe_backend and convert_to_unquantized_kernel_format, consolidating Aiter/FlashInfer/Triton selection, weight shuffling (swap_w13_to_w31), and kernel init via _setup_kernel in unquantized_fused_moe_method.py.
- Simplifies process_weights_after_loading and maybe_make_prepare_finalize; adjusts DP/EP checks (uses dp_rank > 1) and logs; preserves inplace behavior differences per backend.
- Adds BF16 GSM8K eval configs for Llama-4-Scout, Mixtral-8x7B, and Qwen3-30B-A3B (both fi-cutlass and triton) and registers them in config-b200.txt and config-h100.txt.

Written by Cursor Bugbot for commit f354ef2eabe320528a5815978bc8b87ee9fb505d. This will update automatically on new commits.