[Feature] Support fine-grained shared expert overlap #5482
jianzs merged 6 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for fine-grained shared expert overlap in the MC2 codepath, which is a significant feature for performance optimization. The changes involve refactoring the MoE communication path to use dataclasses for return types instead of dictionaries, which improves code clarity and structure. The core logic for overlapping computations using NPU streams and events seems correct. However, I've found a critical issue with duplicated dataclass definitions that will cause a runtime error and must be fixed.
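The dict-to-dataclass refactor can be illustrated with a minimal, torch-free sketch (field names follow the snippet below; `Any` stands in for `torch.Tensor` so the example runs without an NPU or torch install — this is an illustration, not the PR's actual code):

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


# Minimal sketch: `Any` stands in for torch.Tensor so this runs anywhere.
@dataclass
class TokenDispatchResult:
    hidden_states: Any
    group_list: Any
    group_list_type: int
    dynamic_scale: Any | None = field(default=None)
    topk_scales: Any | None = field(default=None)
    context_metadata: dict = field(default_factory=dict)


# Before the refactor, callers unpacked an untyped dict, where a typo in a
# key only fails at the lookup site. With the dataclass, construction checks
# the field set up front and attribute access is self-documenting:
result = TokenDispatchResult(
    hidden_states=[[0.0]], group_list=[1], group_list_type=1
)
print(result.group_list_type)  # 1
```

Defaulted fields (`dynamic_scale`, `topk_scales`, `context_metadata`) keep call sites short while still being part of the declared contract.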
```python
@dataclass
class TokenDispatchResult:
    hidden_states: torch.Tensor
    group_list: torch.Tensor
    group_list_type: int
    dynamic_scale: torch.Tensor | None = field(default=None)
    topk_scales: torch.Tensor | None = field(default=None)
    context_metadata: dict = field(default_factory=dict)


@dataclass
class TokenCombineResult:
    routed_out: torch.Tensor
```
The dataclasses TokenDispatchResult and TokenCombineResult are defined twice. The second definition of TokenCombineResult at line 69 overwrites the first one, and it's missing the shared_out field. This will cause a TypeError at runtime when TokenDispatcherWithMC2.token_combine tries to instantiate TokenCombineResult with the shared_out argument. Please remove the duplicate definitions from lines 58 to 70.
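The failure mode is easy to reproduce in isolation: a second `class` statement with the same name silently rebinds it, so the later, narrower dataclass wins. A minimal sketch (not the PR's actual code; `str` stands in for the tensor fields):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TokenCombineResult:  # first definition, with shared_out
    routed_out: str
    shared_out: str | None = None


@dataclass
class TokenCombineResult:  # duplicate: silently rebinds the name
    routed_out: str


# The second definition wins, so the shared_out field is gone and any
# caller still passing it gets a TypeError at construction time:
try:
    TokenCombineResult(routed_out="r", shared_out="s")
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```

Because the rebinding itself raises nothing, the bug only surfaces at the first call site that uses the removed field, which is why it is worth catching in review.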
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
yiz-liu left a comment:
Looking good, can we have an RFC about this?
Please take a look at #5708
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
…to FIA_rebase * 'main' of https://github.com/vllm-project/vllm-ascend: (110 commits)
[Performance] Remove index opetation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 (vllm-project#5936)
[main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (vllm-project#5960)
[Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)
[Refactor] Move AttentionSpec initialization to Attention module (vllm-project#5834)
[EPLB][Bugfix] policy_swift_balancer bugfix and renaming (vllm-project#5897)
[CI]fix for lint CI (vllm-project#5982)
[Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (vllm-project#5034)
[Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (vllm-project#5928)
[EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (vllm-project#5933)
[EPLB][Nightly][Bugfix] Get expert from moe layer only (vllm-project#5908)
[Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (vllm-project#5855)
[doc]Table split (vllm-project#5929)
[Doc] Upgrade outdated ut doc (vllm-project#5937)
[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#2) (vllm-project#5977)
Eagle3 mm support, enablement on qwen3vl (vllm-project#4848)
[Doc] Remove Chinese characters from the icons in the doc. (vllm-project#5959)
[P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (vllm-project#5968)
[Feature] Support fine-grained shared expert overlap (vllm-project#5482)
[Bugfix] fix cpu offload hang with tp=1 (vllm-project#5963)
[Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (vllm-project#5776)
...
### What this PR does / why we need it?
Same with #5482
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
…to qwen3next_rebase * 'main' of https://github.com/vllm-project/vllm-ascend: (637 commits)
Fine-grained control over shared expert overlap to prevent resource contention.
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@5326c89
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
What this PR does / why we need it?
Fine-grained control over shared expert overlap to prevent resource contention.
Depends on #5481
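As an illustration only (not the PR's implementation, which uses NPU streams and events for the overlap), running the shared-expert MLP concurrently with the routed-expert dispatch/compute/combine path can be sketched with threads standing in for streams; all functions here are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor
import time


def routed_expert_path(x):
    time.sleep(0.05)  # stands in for MC2 dispatch + grouped matmul + combine
    return [v * 2 for v in x]


def shared_expert(x):
    time.sleep(0.05)  # stands in for the shared-expert MLP
    return [v + 1 for v in x]


def moe_forward(x):
    # Two workers stand in for two NPU streams running concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        routed = pool.submit(routed_expert_path, x)
        shared = pool.submit(shared_expert, x)
        # .result() stands in for an event wait: both paths must finish
        # before the final elementwise add.
        return [r + s for r, s in zip(routed.result(), shared.result())]


print(moe_forward([1.0, 2.0]))  # [4.0, 7.0]
```

The fine-grained part of the feature is deciding exactly where the second stream starts and where the event wait happens, so the shared-expert work fills the communication bubble without contending for the same compute resources.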
Does this PR introduce any user-facing change?
No
How was this patch tested?