Skip to content

[BugFix] Fix aclgraph accu problem in A2.#3163

Merged
wangxiyuan merged 4 commits intovllm-project:mainfrom
whx-sjtu:fix_a2_accu
Sep 28, 2025
Merged

[BugFix] Fix aclgraph accu problem in A2.#3163
wangxiyuan merged 4 commits intovllm-project:mainfrom
whx-sjtu:fix_a2_accu

Conversation

@whx-sjtu
Copy link
Copy Markdown
Collaborator

@whx-sjtu whx-sjtu commented Sep 24, 2025

This PR fixes accuracy problem of aclgraph on A2. The problem is introduced by PR #2980, which makes the all_reduce of shared_experts exposed to torch dynamo. This PR moves all the codes into forward_impl to shiled from torch dynamo.

Signed-off-by: whx-sjtu <2952154980@qq.com>
@github-actions
Copy link
Copy Markdown
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix an accuracy issue with aclgraph by refactoring the AscendSharedFusedMoE layer to shield some logic from torch.dynamo. The approach is to move the implementation from the forward method to forward_impl. However, the new forward method contains a critical bug that will cause a ValueError at runtime due to incorrect tuple unpacking. I've provided a comment with a suggested fix.

Comment on lines +336 to +341
shared_out, fused_out = AscendFusedMoE.forward(
self,
hidden_states=hidden_states,
router_logits=router_logits,
)
return shared_out, fused_out
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The call to AscendFusedMoE.forward is incorrect. The MRO for AscendSharedFusedMoE will cause this to resolve to vllm.model_executor.layers.fused_moe.layer.FusedMoE.forward, which returns a single tensor. Attempting to unpack this single tensor into two variables, shared_out and fused_out, will raise a ValueError at runtime.

Given that the implementation logic has been moved into forward_impl, which correctly returns a tuple of two tensors, the forward method should likely just call self.forward_impl.

        return self.forward_impl(
            hidden_states=hidden_states,
            router_logits=router_logits,
        )

@shen-shanshan
Copy link
Copy Markdown
Collaborator

shen-shanshan commented Sep 25, 2025

LGTM.

@whx-sjtu whx-sjtu added ready read for review ready-for-test start test by label for PR labels Sep 25, 2025
@yiz-liu
Copy link
Copy Markdown
Collaborator

yiz-liu commented Sep 25, 2025

LGTM

Signed-off-by: whx-sjtu <2952154980@qq.com>
@whx-sjtu whx-sjtu force-pushed the fix_a2_accu branch 2 times, most recently from 5fdaa50 to f8aa50b Compare September 25, 2025 12:15
Signed-off-by: whx-sjtu <2952154980@qq.com>
Comment thread vllm_ascend/ops/common_fused_moe.py Outdated
Comment thread vllm_ascend/ops/common_fused_moe.py Outdated
Signed-off-by: whx-sjtu <2952154980@qq.com>
@Yikun
Copy link
Copy Markdown
Member

Yikun commented Sep 28, 2025

  1. This PR doesn't include any e2e tests, please describe how you tested it.
  2. What are your plans for adding e2e test to prevent this issue from breaking again? Please add the test plan as issue.

@wangxiyuan wangxiyuan merged commit 14d4ed5 into vllm-project:main Sep 28, 2025
19 checks passed
@wangxiyuan
Copy link
Copy Markdown
Collaborator

@Yikun talked with @whx-sjtu the case is that A2 + aclgraph + EP + max_num_tokens>512 our CI doesn't cover this case. Let's add one later.

Yikun added a commit to Yikun/vllm-ascend that referenced this pull request Sep 28, 2025
This reverts commit 14d4ed5.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
This PR fixes accuracy problem of aclgraph on A2. The problem is
introduced by PR vllm-project#2980, which makes the `all_reduce` of shared_experts
exposed to torch dynamo. This PR moves all the codes into forward_impl
to shiled from torch dynamo.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@17b4c66

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
This PR fixes accuracy problem of aclgraph on A2. The problem is
introduced by PR vllm-project#2980, which makes the `all_reduce` of shared_experts
exposed to torch dynamo. This PR moves all the codes into forward_impl
to shiled from torch dynamo.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@17b4c66

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: luolun <luolun1995@cmbchina.com>
luolun pushed a commit to luolun/vllm-ascend that referenced this pull request Nov 19, 2025
This PR fixes accuracy problem of aclgraph on A2. The problem is
introduced by PR vllm-project#2980, which makes the `all_reduce` of shared_experts
exposed to torch dynamo. This PR moves all the codes into forward_impl
to shiled from torch dynamo.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@17b4c66

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: luolun <luolun1995@cmbchina.com>
hwhaokun pushed a commit to hwhaokun/vllm-ascend that referenced this pull request Nov 19, 2025
This PR fixes accuracy problem of aclgraph on A2. The problem is
introduced by PR vllm-project#2980, which makes the `all_reduce` of shared_experts
exposed to torch dynamo. This PR moves all the codes into forward_impl
to shiled from torch dynamo.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@17b4c66

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: hwhaokun <haokun0405@163.com>
NSDie pushed a commit to NSDie/vllm-ascend that referenced this pull request Nov 24, 2025
This PR fixes accuracy problem of aclgraph on A2. The problem is
introduced by PR vllm-project#2980, which makes the `all_reduce` of shared_experts
exposed to torch dynamo. This PR moves all the codes into forward_impl
to shiled from torch dynamo.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@17b4c66

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: nsdie <yeyifan@huawei.com>
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025
This PR fixes accuracy problem of aclgraph on A2. The problem is
introduced by PR vllm-project#2980, which makes the `all_reduce` of shared_experts
exposed to torch dynamo. This PR moves all the codes into forward_impl
to shiled from torch dynamo.

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@17b4c66

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:ops ready read for review ready-for-test start test by label for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants