[BugFix] Fix aclgraph accu problem in A2. by whx-sjtu · Pull Request #3163 · vllm-project/vllm-ascend

whx-sjtu · 2025-09-24T13:35:49Z

This PR fixes an accuracy problem with aclgraph on A2. The problem was introduced by PR #2980, which exposed the all_reduce of shared_experts to torch dynamo. This PR moves all of the code into forward_impl to shield it from torch dynamo.

vLLM version: v0.10.2
vLLM main: vllm-project/vllm@17b4c66

Signed-off-by: whx-sjtu <2952154980@qq.com>
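The description above does not spell out why exposing the all_reduce to torch dynamo hurts accuracy. The sketch below assumes one possible mechanism purely for illustration: a tracer that steps into a conditional helper freezes the branch taken at capture time, while a call recorded as opaque re-evaluates it on every replay. `Proxy`, `capture`, `opaque`, `RUNTIME`, and `maybe_all_reduce` are hypothetical names for this toy only; none of this is torch dynamo's or vllm-ascend's actual code.

```python
from typing import Callable, List

# Hypothetical state that is only known at run time (assumption for this toy).
RUNTIME = {"need_all_reduce": False}


class Proxy:
    """Placeholder value that records every operation applied to it."""

    def __init__(self) -> None:
        self.ops: List[Callable[[float], float]] = []

    def __mul__(self, k: float) -> "Proxy":
        self.ops.append(lambda v: v * k)
        return self


def capture(fn: Callable) -> Callable[[float], float]:
    """Run fn once on a Proxy and return a function that replays the recorded ops."""
    proxy = Proxy()
    fn(proxy)
    ops = list(proxy.ops)

    def replay(value: float) -> float:
        for op in ops:
            value = op(value)
        return value

    return replay


def opaque(fn: Callable) -> Callable:
    """Mark fn so the toy tracer records the call itself instead of tracing into it."""

    def wrapper(x):
        if isinstance(x, Proxy):
            x.ops.append(fn)  # defer the whole call to replay time
            return x
        return fn(x)

    return wrapper


def maybe_all_reduce(x):
    # Stand-in for a conditional collective: doubling plays the role of the reduce.
    return x * 2 if RUNTIME["need_all_reduce"] else x


exposed = capture(maybe_all_reduce)           # branch decided once, during capture
shielded = capture(opaque(maybe_all_reduce))  # branch decided on every replay

RUNTIME["need_all_reduce"] = True
print(exposed(3.0), shielded(3.0))  # prints "3.0 6.0"
```

In this toy, the exposed capture silently skips the "reduce" after the runtime state changes, which is the kind of wrong-result failure an accuracy bug would look like; whether that is what happened here is not confirmed by the thread.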

github-actions · 2025-09-24T13:35:59Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist

Code Review

This pull request aims to fix an accuracy issue with aclgraph by refactoring the AscendSharedFusedMoE layer to shield some logic from torch.dynamo. The approach is to move the implementation from the forward method to forward_impl. However, the new forward method contains a critical bug that will cause a ValueError at runtime due to incorrect tuple unpacking. I've provided a comment with a suggested fix.

gemini-code-assist · 2025-09-24T13:38:26Z

+        shared_out, fused_out = AscendFusedMoE.forward(
+            self,
+            hidden_states=hidden_states,
+            router_logits=router_logits,
+        )
+        return shared_out, fused_out


The call to AscendFusedMoE.forward is incorrect. The MRO for AscendSharedFusedMoE will cause this to resolve to vllm.model_executor.layers.fused_moe.layer.FusedMoE.forward, which returns a single tensor. Attempting to unpack this single tensor into two variables, shared_out and fused_out, will raise a ValueError at runtime.

Given that the implementation logic has been moved into forward_impl, which correctly returns a tuple of two tensors, the forward method should likely just call self.forward_impl.

        return self.forward_impl(
            hidden_states=hidden_states,
            router_logits=router_logits,
        )
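The failure mode described in this review comment can be reproduced without torch. The sketch below uses hypothetical stand-in classes (`BaseMoE`, `SharedMoE`) and plain lists in place of tensors; it only illustrates the unpacking error and the suggested delegation to `forward_impl`, and is not the actual vllm-ascend code.

```python
class BaseMoE:
    """Hypothetical stand-in for a parent layer whose forward returns ONE value."""

    def forward(self, hidden_states, router_logits):
        return [h + r for h, r in zip(hidden_states, router_logits)]


class SharedMoE(BaseMoE):
    """Hypothetical stand-in for the subclass under review."""

    def forward_impl(self, hidden_states, router_logits):
        # Returns a tuple of two results, as the review says forward_impl does.
        shared_out = [2 * h for h in hidden_states]
        fused_out = BaseMoE.forward(self, hidden_states, router_logits)
        return shared_out, fused_out

    def forward_buggy(self, hidden_states, router_logits):
        # Mirrors the flagged pattern: unpacking a single result into two names.
        shared_out, fused_out = BaseMoE.forward(self, hidden_states, router_logits)
        return shared_out, fused_out

    def forward(self, hidden_states, router_logits):
        # Suggested shape of the fix: delegate to forward_impl.
        return self.forward_impl(
            hidden_states=hidden_states,
            router_logits=router_logits,
        )


layer = SharedMoE()
hidden = [1.0, 2.0, 3.0]
logits = [0.5, 0.5, 0.5]

try:
    layer.forward_buggy(hidden, logits)
except ValueError as exc:
    print("buggy forward failed:", exc)

shared_out, fused_out = layer.forward(hidden, logits)
print(shared_out, fused_out)
```

Note that unpacking an iterable only raises when it does not hold exactly two items; a result with exactly two entries along its first dimension would unpack without error and produce wrong values instead, which is harder to spot.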

shen-shanshan · 2025-09-25T01:40:33Z

LGTM.

yiz-liu · 2025-09-25T08:59:28Z

LGTM

Signed-off-by: whx-sjtu <2952154980@qq.com>

Yikun · 2025-09-28T03:20:35Z

This PR doesn't include any e2e tests; please describe how you tested it.
What are your plans for adding an e2e test to prevent this issue from breaking again? Please add the test plan as an issue.

wangxiyuan · 2025-09-28T13:33:17Z

@Yikun I talked with @whx-sjtu. The failing case is A2 + aclgraph + EP + max_num_tokens>512, which our CI doesn't cover. Let's add a test for it later.

This PR fixes an accuracy problem with aclgraph on A2. The problem was introduced by PR vllm-project#2980, which exposed the `all_reduce` of shared_experts to torch dynamo. This PR moves all of the code into forward_impl to shield it from torch dynamo.
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@17b4c66
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: luolun <luolun1995@cmbchina.com>

This PR fixes an accuracy problem with aclgraph on A2. The problem was introduced by PR vllm-project#2980, which exposed the `all_reduce` of shared_experts to torch dynamo. This PR moves all of the code into forward_impl to shield it from torch dynamo.
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@17b4c66
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: hwhaokun <haokun0405@163.com>

This PR fixes an accuracy problem with aclgraph on A2. The problem was introduced by PR vllm-project#2980, which exposed the `all_reduce` of shared_experts to torch dynamo. This PR moves all of the code into forward_impl to shield it from torch dynamo.
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@17b4c66
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: nsdie <yeyifan@huawei.com>

fix a2 accu problem

58c14f1

Signed-off-by: whx-sjtu <2952154980@qq.com>

whx-sjtu requested review from shen-shanshan and yiz-liu September 24, 2025 13:35

whx-sjtu requested a review from ApsarasX September 24, 2025 13:35

github-actions Bot added the module:ops label Sep 24, 2025

gemini-code-assist Bot reviewed Sep 24, 2025

whx-sjtu added the ready (read for review) and ready-for-test (start test by label for PR) labels Sep 25, 2025

wangxiyuan approved these changes Sep 25, 2025

yiz-liu approved these changes Sep 25, 2025

make maybe_all_reduce_tensor_model_parallel custom op

b256ff4

Signed-off-by: whx-sjtu <2952154980@qq.com>

whx-sjtu force-pushed the fix_a2_accu branch 2 times, most recently from 5fdaa50 to f8aa50b, September 25, 2025 12:15

fix lint

02da01a

Signed-off-by: whx-sjtu <2952154980@qq.com>

whx-sjtu force-pushed the fix_a2_accu branch from f8aa50b to 02da01a, September 25, 2025 12:30

realliujiaxu reviewed Sep 26, 2025

Comment thread vllm_ascend/ops/common_fused_moe.py Outdated

Comment thread vllm_ascend/ops/common_fused_moe.py Outdated

change register position

64d351d

Signed-off-by: whx-sjtu <2952154980@qq.com>

realliujiaxu approved these changes Sep 28, 2025

zzzzwwjj approved these changes Sep 28, 2025

wangxiyuan mentioned this pull request Sep 28, 2025

[Release]: Release checklist for v0.11.0rc1 #3141

Closed

42 tasks

wangxiyuan merged commit 14d4ed5 into vllm-project:main Sep 28, 2025
19 checks passed

Yikun added a commit to Yikun/vllm-ascend that referenced this pull request Sep 28, 2025

Revert "[BugFix] Fix aclgraph accu problem in A2. (vllm-project#3163)"

2c54880

This reverts commit 14d4ed5. Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

wangxiyuan mentioned this pull request Jan 26, 2026

[Community] Nominate whx-sjtu as maintainer #6268

Merged

wangxiyuan merged 4 commits into vllm-project:main from whx-sjtu:fix_a2_accu

7 participants
