
[Feature] Support fine-grained shared expert overlap#5482

Merged

jianzs merged 6 commits into vllm-project:main from jianzs:feat/mc2-overlap-sharedexp on Jan 17, 2026

Conversation

@jianzs (Collaborator) commented Dec 29, 2025

What this PR does / why we need it?

Fine-grained control over shared expert overlap to prevent resource contention.

Depends on #5481
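The overlap idea can be caricatured with plain threads and a `threading.Event` standing in for stream events. This is a hypothetical, dependency-free sketch of the synchronization pattern, not the PR's NPU implementation; all names (`routed_path`, `shared_path`, `dispatch_done`) are illustrative.

```python
import threading
import time

# Illustrative only: the routed-expert path "records" an event once token
# dispatch has consumed its inputs; the shared-expert path "waits" on that
# event so its compute overlaps the rest of the routed path instead of
# contending with the dispatch communication.

def routed_path(dispatch_done: threading.Event, results: dict) -> None:
    time.sleep(0.01)             # stand-in for token dispatch (all-to-all)
    dispatch_done.set()          # "record" the event: dispatch finished
    time.sleep(0.01)             # stand-in for grouped matmul + combine
    results["routed_out"] = "routed"

def shared_path(dispatch_done: threading.Event, results: dict) -> None:
    dispatch_done.wait()         # "wait" on the event before starting
    results["shared_out"] = "shared"  # overlaps with the combine phase

results: dict = {}
dispatch_done = threading.Event()
t1 = threading.Thread(target=routed_path, args=(dispatch_done, results))
t2 = threading.Thread(target=shared_path, args=(dispatch_done, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(sorted(results))  # → ['routed_out', 'shared_out']
```

On real hardware the "threads" are device streams and the event is a stream event; the fine-grained part is choosing exactly which point in the routed path the shared-expert compute is released at.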

Does this PR introduce any user-facing change?

No

How was this patch tested?

@jianzs marked this pull request as draft on December 29, 2025 10:14
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for fine-grained shared expert overlap in the MC2 codepath, which is a significant feature for performance optimization. The changes involve refactoring the MoE communication path to use dataclasses for return types instead of dictionaries, which improves code clarity and structure. The core logic for overlapping computations using NPU streams and events seems correct. However, I've found a critical issue with duplicated dataclass definitions that will cause a runtime error and must be fixed.

Comment on lines +58 to +70
@dataclass
class TokenDispatchResult:
hidden_states: torch.Tensor
group_list: torch.Tensor
group_list_type: int
dynamic_scale: torch.Tensor | None = field(default=None)
topk_scales: torch.Tensor | None = field(default=None)
context_metadata: dict = field(default_factory=dict)


@dataclass
class TokenCombineResult:
routed_out: torch.Tensor
Contributor

critical

The dataclasses TokenDispatchResult and TokenCombineResult are defined twice. The second definition of TokenCombineResult at line 69 overwrites the first one, and it's missing the shared_out field. This will cause a TypeError at runtime when TokenDispatcherWithMC2.token_combine tries to instantiate TokenCombineResult with the shared_out argument. Please remove the duplicate definitions from lines 58 to 70.

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description, to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@jianzs force-pushed the feat/mc2-overlap-sharedexp branch from d2175a7 to b8ed98c on December 29, 2025 12:05
@jianzs added the ready (read for review) and ready-for-test (start test by label for PR) labels on Dec 29, 2025
@jianzs changed the title from "[Feature] Support fine-grained shared expert overlap in MC2 codepath" to "[Feature] Support fine-grained shared expert overlap" on Dec 29, 2025
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@jianzs force-pushed the feat/mc2-overlap-sharedexp branch from 8652cfe to 4f19344 on December 30, 2025 12:56
@jianzs force-pushed the feat/mc2-overlap-sharedexp branch from 4f19344 to b5660b4 on December 31, 2025 06:26
@jianzs marked this pull request as ready for review on December 31, 2025 06:28
Collaborator

@yiz-liu left a comment

Looking good. Can we have an RFC about this?

@github-actions
Contributor

github-actions bot commented Jan 6, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@jianzs force-pushed the feat/mc2-overlap-sharedexp branch from 3703ac5 to 6800efa on January 6, 2026 09:35
@jianzs force-pushed the feat/mc2-overlap-sharedexp branch 3 times, most recently from e653b76 to e9c48a1, on January 8, 2026 02:33
@jianzs added this to the v0.14.0rc1 milestone on Jan 8, 2026
@jianzs (Collaborator, Author) commented Jan 8, 2026

> Looking good. Can we have an RFC about this?

Please take a look at #5708.

Collaborator

@whx-sjtu left a comment

LGTM

@jianzs force-pushed the feat/mc2-overlap-sharedexp branch 3 times, most recently from 44e109c to 8b49f80, on January 9, 2026 05:35
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
@jianzs force-pushed the feat/mc2-overlap-sharedexp branch from 1f1a81c to 7b55a43 on January 16, 2026 09:52
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
@jianzs merged commit 22f2531 into vllm-project:main on Jan 17, 2026
20 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 19, 2026
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (110 commits)
  [Performance] Remove index opetation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 (vllm-project#5936)
  [main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (vllm-project#5960)
  [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)
  [Refactor] Move AttentionSpec initialization to Attention module (vllm-project#5834)
  [EPLB][Bugfix] policy_swift_balancer bugfix and renaming (vllm-project#5897)
  [CI]fix for lint CI (vllm-project#5982)
  [Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (vllm-project#5034)
  [Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (vllm-project#5928)
  [EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (vllm-project#5933)
  [EPLB][Nightly][Bugfix] Get expert from moe layer only (vllm-project#5908)
  [Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (vllm-project#5855)
  [doc]Table split  (vllm-project#5929)
  [Doc] Upgrade outdated ut doc (vllm-project#5937)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#2) (vllm-project#5977)
  Eagle3 mm support, enablement on qwen3vl (vllm-project#4848)
  [Doc] Remove Chinese characters from the icons in the doc. (vllm-project#5959)
  [P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (vllm-project#5968)
  [Feature] Support fine-grained shared expert overlap (vllm-project#5482)
  [Bugfix] fix cpu offload hang with tp=1 (vllm-project#5963)
  [Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (vllm-project#5776)
  ...
yiz-liu pushed a commit that referenced this pull request Jan 20, 2026
### What this PR does / why we need it?
Same with #5482 

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 21, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (637 commits)
  ...
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
Fine-grained control over shared expert overlap to prevent resource
contention.

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@5326c89

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026

Labels

module:ops, module:tests, ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC]: Building an Event Callback Mechanism to Achieve Fine-Grained Overlap of Shared Experts

6 participants