[Bugfix] Disable the dispatch_ffn_combine kernel in MTP path #4751
MengqingCao merged 2 commits into vllm-project:main from
Conversation
Code Review
This pull request introduces a bugfix that disables the fused-MoE kernel during the dummy_run of the MTP (Multi-Token Prediction) proposer. This is accomplished by checking whether the selected MoE communication method is FUSED_ALLTOALL and reverting to the standard ALLTOALL method if so. The change is localized and specifically targets the dummy_run, which is crucial for graph capturing. The modification correctly addresses a likely bug with the fused kernel in this context, and the implementation is sound. No issues were found in the proposed changes.
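The fallback described above can be sketched as follows. This is a minimal illustration of the check-and-revert logic, not the actual vllm-ascend code: the enum name `MoECommType`, its members, and the helper function are hypothetical stand-ins for whatever names the PR uses.

```python
from enum import Enum, auto


# Hypothetical enum modeled on the PR description; the real
# vllm-ascend communication-method enum may differ.
class MoECommType(Enum):
    ALLTOALL = auto()
    FUSED_ALLTOALL = auto()
    ALLGATHER = auto()


def comm_type_for_mtp_dummy_run(selected: MoECommType) -> MoECommType:
    """Route the MTP dummy_run away from the fused kernel.

    If the fused all-to-all (dispatch_ffn_combine) path was selected,
    fall back to the plain ALLTOALL implementation; any other
    selection is passed through unchanged.
    """
    if selected is MoECommType.FUSED_ALLTOALL:
        return MoECommType.ALLTOALL
    return selected
```

Because the check runs only in the dummy_run path, the fused kernel remains available for regular decoding; only graph capture for the MTP proposer avoids it.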
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
…oject#4751)

### What this PR does / why we need it?

This PR fixes a smoke-test failure. It adjusts mtp_proposer and model_runner_v1 to route MTP decoding through the non-fused MoE implementation while keeping the overall inference flow unchanged.

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: mojave2 <chenchen145@huawei.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
What this PR does / why we need it?
This PR fixes a smoke-test failure. It adjusts mtp_proposer and model_runner_v1 to route MTP decoding through the non-fused MoE implementation while keeping the overall inference flow unchanged.
Does this PR introduce any user-facing change?
How was this patch tested?
This PR will be tested in smoke tests.