[Bugfix] Synchronize only the current stream to avoid device sync (#6432)
wangxiyuan merged 1 commit into vllm-project:main
Conversation
Code Review
This pull request aims to improve performance by replacing a device-wide synchronization with a more granular stream-specific one. While this is a valid optimization strategy, the current implementation appears to introduce a critical race condition. The new synchronization mechanism fails to wait for necessary parameter updates that occur on a separate stream, potentially leading to the model executing with stale data and producing incorrect results. I have added a review comment detailing this critical issue and recommending the addition of explicit cross-stream synchronization to ensure correctness.
```diff
 # To ensure proper ordering, we must call synchronize here before replaying,
 # so that update_attn_params only executes after the previous graph replay has fully completed.
-torch.npu.synchronize()
+torch.npu.current_stream().synchronize()
```
This change from a device-wide `torch.npu.synchronize()` to a stream-specific `torch.npu.current_stream().synchronize()` may introduce a race condition, making it unsafe.
The `update_attn_params` function, which updates graph parameters for the current iteration, appears to run on a separate `update_stream`. The subsequent `entry.aclgraph.replay()` on the current stream depends on these updates.
While the new synchronization waits for the previous graph replay on the current stream, it no longer waits for the parameter updates on `update_stream`. This could lead to `replay()` executing with stale or partially updated parameters, causing correctness issues.
The original `torch.npu.synchronize()` would have prevented this race, assuming `update_attn_params` is called before this point.
To fix this, explicit synchronization between the streams is required before `replay()`, for example via `torch.npu.current_stream().wait_stream(update_stream)` or by waiting on an event recorded after the parameter updates. Without such synchronization, this change is incorrect.
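The ordering problem described above can be illustrated without NPU hardware. The sketch below models the two streams as threads and the recorded event as a `threading.Event`; `params`, `update_done`, and `update_attn_params_worker` are hypothetical names for illustration only. On an actual Ascend device, the equivalent fix would be `torch.npu.current_stream().wait_stream(update_stream)` or waiting on an NPU event recorded after the update.

```python
import threading

# Hedged sketch: the cross-stream race modeled with threads.
# The worker thread plays the role of update_stream (rewriting graph
# parameters); the main thread plays the current stream replaying the graph.
params = {"seq_len": 0}
update_done = threading.Event()  # analogous to an event recorded on update_stream

def update_attn_params_worker(new_len):
    params["seq_len"] = new_len   # parameter update on the side "stream"
    update_done.set()             # "record" the event after the update

def replay():
    # Without this wait, replay could observe a stale seq_len -- the race
    # the review describes. Waiting on the event restores the ordering the
    # old device-wide synchronize() provided as a side effect.
    update_done.wait()
    return params["seq_len"]

t = threading.Thread(target=update_attn_params_worker, args=(128,))
t.start()
assert replay() == 128  # the wait guarantees the update is visible
t.join()
```

The key property is a happens-before edge: the event is set only after the write, and the replay side waits on it before reading, exactly what `wait_stream` or an event would provide between real NPU streams.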
This pull request has conflicts; please resolve them before we can evaluate the pull request.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: For_YL <zhangtangwei@huawei.com>
LGTM, thanks for your contribution!
### What this PR does / why we need it?
Following PR #4233, a synchronization mechanism was introduced between steps in asynchronous scheduling with ACL Graph to address a hanging issue. However, full device-level synchronization is unnecessary—only the operations on the current stream need to be synchronized. Otherwise, if other background operations (such as send and recv) are running concurrently, they may negatively impact inference performance for the instance.
Hang problem (screenshot omitted).
Synchronizing only the current stream can also resolve the hang issue.
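To see why a stream-scoped wait is cheaper, consider a minimal thread-based model (all names here are illustrative, not from the PR): the current stream's compute finishes quickly, while a background send/recv "stream" keeps running. Waiting only on the compute event, analogous to `torch.npu.current_stream().synchronize()`, returns as soon as the current stream is done, whereas a device-wide `torch.npu.synchronize()` would also block on the background traffic.

```python
import threading
import time

# Hedged sketch: two mock "streams" -- compute (current) and background
# (e.g. concurrent send/recv traffic between instances).
compute_done = threading.Event()
background_done = threading.Event()

def compute_stream():
    time.sleep(0.05)          # short compute step on the current stream
    compute_done.set()

def background_stream():
    time.sleep(0.5)           # long-running background transfer
    background_done.set()

threading.Thread(target=compute_stream).start()
threading.Thread(target=background_stream).start()

# Stream-specific sync: wait only for the current stream's work.
compute_done.wait()

# At this point the background transfer is typically still in flight;
# a device-wide synchronize would have blocked until it finished too.
print("compute finished; background still running:", not background_done.is_set())
```

The hang fix only requires that the current stream's previous replay has completed before the next parameter update, so waiting on the whole device buys nothing except stalls behind unrelated traffic.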

### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- vLLM version: v0.14.1
- vLLM main: vllm-project/vllm@dc917cc