[bugfix] Fix dummy-run and multi-node issues in MoE routing and MTP #4947

wangxiyuan merged 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces fixes for Mixture-of-Experts (MoE) functionality, specifically targeting dummy-run and multi-node scenarios. The main changes involve updating the MoE communication method for FUSED_ALLTOALL to use MC2 components, which likely resolves issues in multi-node setups. Additionally, a guard on the expert parallelism size has been removed, enabling this path for larger configurations. A minor cleanup in a C++ kernel is also included. The changes appear to correctly address the intended fixes. However, I've pointed out that a docstring in moe_comm_method.py is now outdated due to these changes, which could impact future maintainability.
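The guard removal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual vLLM Ascend code: the function name `select_moe_comm_method` and the parameters `ep_size` and `mc2_token_capacity` are hypothetical stand-ins for the real selection logic in `NPUModelRunner`.

```python
from enum import Enum


class MoECommType(Enum):
    ALLGATHER = "allgather"
    FUSED_ALLTOALL = "fused_alltoall"


def select_moe_comm_method(num_tokens: int, ep_size: int,
                           mc2_token_capacity: int = 512) -> MoECommType:
    """Hypothetical selection helper illustrating the change.

    Before this PR (sketch), FUSED_ALLTOALL was additionally gated on the
    expert-parallel size (e.g. something like ``ep_size <= 16``), which
    excluded larger multi-node configurations. After the change, only the
    token-capacity check that keeps MC2 routing bounded remains.
    """
    if num_tokens <= mc2_token_capacity:
        return MoECommType.FUSED_ALLTOALL
    return MoECommType.ALLGATHER


if __name__ == "__main__":
    # A large-EP setup now also takes the FUSED_ALLTOALL path,
    # as long as the batch stays under the MC2 token capacity.
    print(select_moe_comm_method(num_tokens=256, ep_size=64))
    print(select_moe_comm_method(num_tokens=4096, ep_size=64))
```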
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: mojave2 <chenchen145@huawei.com>
…llm-project#4947)

### What this PR does / why we need it?

- Fix a premature `return` in `moe_init_routing_quant_v2.cpp` so the routing kernel completes correctly instead of exiting early in certain paths.
- Switch `FusedAlltoAllCommImpl` to use the MC2-based token dispatcher and prepare/finalize routines, aligning MoE communication with the MC2 algorithm optimized for Ascend devices.
- Add a temporary override in `MtpProposer` to map `FUSED_ALLTOALL` back to `ALLTOALL` until the MoE communication type selection logic is fully finalized, avoiding incorrect behavior in dummy-run flows.
- Simplify the MoE communication selection for Ascend 910-93 in `NPUModelRunner` by removing the EP-size guard on `FUSED_ALLTOALL`, which fixes failures in multi-node / larger-EP configurations while keeping MC2 routing under the configured token capacity.

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: mojave2 <chenchen145@huawei.com>
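The temporary MTP override described in the commit message might look roughly like the sketch below. The class body and attribute names here are assumptions for illustration, not the actual `MtpProposer` implementation.

```python
from enum import Enum


class MoECommType(Enum):
    ALLTOALL = "alltoall"
    FUSED_ALLTOALL = "fused_alltoall"


class MtpProposer:
    """Sketch of an MTP proposer showing only the comm-type override."""

    def __init__(self, moe_comm_type: MoECommType):
        # Temporary override: until the MoE communication type selection
        # logic is fully finalized, map FUSED_ALLTOALL back to plain
        # ALLTOALL so dummy-run flows take a known-good path.
        if moe_comm_type is MoECommType.FUSED_ALLTOALL:
            moe_comm_type = MoECommType.ALLTOALL
        self.moe_comm_type = moe_comm_type
```

Keeping the remap inside the proposer's constructor confines the workaround to the MTP path, so it can be deleted in one place once the selection logic is settled.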
What this PR does / why we need it?
- Fix a premature `return` in `moe_init_routing_quant_v2.cpp` so the routing kernel completes correctly instead of exiting early in certain paths.
- Switch `FusedAlltoAllCommImpl` to use the MC2-based token dispatcher and prepare/finalize routines, aligning MoE communication with the MC2 algorithm optimized for Ascend devices.
- Add a temporary override in `MtpProposer` to map `FUSED_ALLTOALL` back to `ALLTOALL` until the MoE communication type selection logic is fully finalized, avoiding incorrect behavior in dummy-run flows.
- Simplify the MoE communication selection for Ascend 910-93 in `NPUModelRunner` by removing the EP-size guard on `FUSED_ALLTOALL`, which fixes failures in multi-node / larger-EP configurations while keeping MC2 routing under the configured token capacity.

Does this PR introduce any user-facing change?
How was this patch tested?