[MoE Refactor] Split invoke_fused_moe_kernel #31050

vllm-bot merged 10 commits into vllm-project:main

Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
        num_tokens_post_padded,
        mul_routed_weight,
        top_k,
        config["BLOCK_SIZE_M"],
        config["BLOCK_SIZE_N"],
        config["BLOCK_SIZE_K"],
        bit,
        config,
        block_shape,
```
Pass compute_type when calling GPTQ/AWQ kernel
In dispatch_fused_moe_kernel the W8A16/W4A16 CUDA path calls invoke_fused_moe_triton_kernel_gptq_awq without the required compute_type, use_int8_w8a16, and use_int4_w4a16 arguments (the last positional argument is block_shape). The callee’s signature defined above requires those parameters, so when should_moe_wna16_use_cuda(...) returns true (grouped quantization with block_shape set), this branch raises a TypeError before running the kernel. Any WNA16 call with block quantization will therefore crash rather than execute.
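A minimal, self-contained sketch of the failure mode described above. The signature below is an illustrative stand-in, not the actual vLLM code, but the mechanism is the same: a callee with required keyword-only arguments raises `TypeError` as soon as the dispatcher omits them.

```python
# Illustrative stand-in for invoke_fused_moe_triton_kernel_gptq_awq; the
# real vLLM signature differs, but the failure mechanism is identical.
def invoke_gptq_awq_kernel(a, b, *, compute_type, use_int8_w8a16,
                           use_int4_w4a16, block_shape=None):
    return "kernel ran"

# The buggy dispatch path omits the required keyword arguments, so the
# call fails before any kernel work happens.
try:
    invoke_gptq_awq_kernel(1, 2, block_shape=[0, 128])
    crashed = False
except TypeError:
    crashed = True

print(crashed)  # True: the block-quantized WNA16 path crashes up front
```

This is why every WNA16 call with block quantization crashes instead of executing: the error is raised at call time, before the kernel is ever launched.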
Hi @zyongye, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Code Review
This pull request refactors the fused MoE kernel invocation logic by introducing new specialized Triton kernel functions for different quantization types (WNA16, GPTQ/AWQ) and consolidating their dispatch through a new `dispatch_fused_moe_kernel` function. This dispatcher replaces the previous direct calls to `invoke_fused_moe_kernel` across the various `fused_experts_impl` and `apply` methods. Additionally, `unquantized_fused_moe_method.py` is updated to use a `FusedMoEModularKernel` abstraction, which now handles MoE preparation and expert execution. Review feedback indicates the need to remove a `pdb.set_trace()` debugging call and to move the instantiation of `FusedMoEModularKernel` out of the `forward_cuda` method for efficiency.
```diff
@@ -1680,6 +1832,7 @@ def fused_experts(
     allow_deep_gemm: bool = False,
     allow_cutlass_block_scaled_grouped_gemm: bool = False,
 ) -> torch.Tensor:
+    # import pdb; pdb.set_trace()
```

```python
self.kernel = mk.FusedMoEModularKernel(
    MoEPrepareAndFinalizeNoEP(),
    TritonExperts(self.moe_quant_config),
    shared_experts=None,
)
```
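The review feedback about hoisting this instantiation out of `forward_cuda` can be sketched as follows. The class and kernel names here are simplified stand-ins for the vLLM types above, not the actual implementation:

```python
class ModularKernel:
    """Stand-in for mk.FusedMoEModularKernel."""

class UnquantizedFusedMoEMethod:
    def __init__(self):
        # Build the modular kernel once at construction time...
        self.kernel = ModularKernel()

    def forward_cuda(self, x):
        # ...so every forward call reuses it instead of re-instantiating.
        return self.kernel

m = UnquantizedFusedMoEMethod()
print(m.forward_cuda(0) is m.forward_cuda(1))  # True: one instance reused
```

Constructing the kernel per forward call pays object-allocation cost on every token batch; caching it in `__init__` pays it once.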
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>

Force-pushed from 0c14373 to 4f69e85
```diff
-def invoke_fused_moe_kernel(
+def invoke_fused_moe_triton_kernel_wna16(
```
This function runs the CUDA kernel, not the Triton kernel. Maybe we can use `invoke_fused_moe_wna16_cuda_kernel` and `invoke_fused_moe_wna16_triton_kernel` for the two kernels of moe_wna16.
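Under the reviewer's proposed naming, the dispatch between the two moe_wna16 backends might look like this sketch. The function bodies are placeholders and the boolean selector is an assumption (in vLLM the choice would come from a capability check such as `should_moe_wna16_use_cuda(...)`):

```python
def invoke_fused_moe_wna16_cuda_kernel(*args, **kwargs):
    return "cuda"  # placeholder for the CUDA kernel launch

def invoke_fused_moe_wna16_triton_kernel(*args, **kwargs):
    return "triton"  # placeholder for the Triton kernel launch

def invoke_fused_moe_wna16(use_cuda, *args, **kwargs):
    # Route to whichever backend the capability check selected.
    if use_cuda:
        return invoke_fused_moe_wna16_cuda_kernel(*args, **kwargs)
    return invoke_fused_moe_wna16_triton_kernel(*args, **kwargs)

print(invoke_fused_moe_wna16(True), invoke_fused_moe_wna16(False))
# cuda triton
```

Naming both entry points by backend, as suggested, makes it obvious at the call site which kernel actually runs.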
The rest of the PR LGTM. BTW, I just found that moe_wna16 may be used on SM75+ in some cases, for example, when …
Force-pushed from 0f6a153 to ff4af42
Force-pushed from 986ce16 to d206c47
Thanks for reviewing @jinzhen-lin

Hi guys @zyongye @robertgshaw2-redhat, this PR causes inference to FAIL.

Thank you for the report! Could you paste your command here? Thank you!

@JartX is that model on huggingface? How can I run it on my end?

@zyongye jart25/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit https://huggingface.co/jart25/Qwen3-VL-30B-A3B-Instruct-AWQ-4bit

@zyongye It is Qwen3-VL 30B, compressed-tensors AWQ int4.

@JartX Yes, I am aware of the PR. But changing back defeats the purpose of the MoE refactoring. I am looking into the code path to see what I am missing. Thank you for the reminder.

Sorry, I didn't mean to bother you, just wanted to offer some advice. I can test any code you have and give you feedback :). Cheers!
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
@JartX one of your comments here included your huggingface token. I removed it from the comment, but I would also suggest revoking the token on huggingface if you haven't already.
@russellb Thanks for the heads-up; the token was edited before pasting the command to serve as an example. I also want to thank you for your dedication in personally caring for and supporting the community. I owe you two beers 🙂
@russellb Speaking of something private, how could we upload the tests, LoRA tests by model? If I publish a LoRA for Qwen3-VL 30B, could you tell me the correct way to do it? Thank you very much.
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Currently the `invoke_fused_moe_kernel` function includes three kernels. Given that the Marlin kernels widely support integer quantization starting from SM75 (Turing), we isolated the widely used Triton kernel from the wna16 kernels so that we can later remove them cleanly when vLLM drops support for SM70 GPUs.
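The split described above can be pictured with a toy dispatcher. The function names mirror the PR's, but the bodies are placeholders and the selection logic is an assumption about how the quantization scheme maps to each kernel:

```python
def invoke_fused_moe_triton_kernel(*a, **kw):
    return "triton"      # default unquantized / fp path

def invoke_fused_moe_triton_kernel_wna16(*a, **kw):
    return "wna16"       # weight-only INT8/INT4 (WNA16) path

def invoke_fused_moe_triton_kernel_gptq_awq(*a, **kw):
    return "gptq_awq"    # GPTQ/AWQ path

def dispatch_fused_moe_kernel(quant=None):
    # Simplified stand-in for the PR's dispatch_fused_moe_kernel: route
    # to one of the three split kernels based on the quantization scheme.
    if quant in ("int8_w8a16", "int4_w4a16"):
        return invoke_fused_moe_triton_kernel_wna16()
    if quant in ("gptq", "awq"):
        return invoke_fused_moe_triton_kernel_gptq_awq()
    return invoke_fused_moe_triton_kernel()

print(dispatch_fused_moe_kernel("awq"))  # gptq_awq
```

Splitting the monolithic function this way means the wna16/GPTQ/AWQ branches can be deleted in one place when SM70 support is dropped, without touching the default Triton path.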
invoke_fused_moe_kernelfunction includes three kernels.Given that the marlin kernels widely support integer quantization starting SM75 (Turing), we isolated the widely used triton kernel with wna16 kernels so that later we can remove it cleanly when vllm drop support for SM70 GPUs.