[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT #2788
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a valid performance optimization by replacing a blocking .tolist() call with a non-blocking D2H copy and an NPU event synchronization. This is a good approach to avoid device-wide stalls. However, there is a critical bug in the implementation: the pre-allocated pinned memory tensor is sized incorrectly and references an undefined attribute, which will cause a runtime error. I've provided a fix for this issue.
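The sizing fix the review asks for could look like the following minimal sketch. It is assumption-heavy: the class name, field names, and the `max_num_reqs`/`num_tokens_per_req` parameters are hypothetical stand-ins for whatever the model runner actually exposes, not the actual vLLM Ascend code.

```python
import torch

class SamplerOutputStager:
    """Hypothetical sketch: size the pinned staging buffer from known
    scheduler limits instead of an undefined attribute."""

    def __init__(self, max_num_reqs: int, num_tokens_per_req: int):
        # One row per request, pre-allocated once in pinned host memory so
        # repeated async D2H copies reuse the same buffer. Pinning requires
        # an accelerator, so fall back to pageable memory on CPU-only builds.
        self.pinned = torch.empty(
            (max_num_reqs, num_tokens_per_req),
            dtype=torch.int64,
            pin_memory=torch.cuda.is_available(),
        )

    def stage(self, sampled_token_ids: torch.Tensor) -> torch.Tensor:
        # Copy only the valid region into a view of the pinned buffer.
        n, k = sampled_token_ids.shape
        view = self.pinned[:n, :k]
        view.copy_(sampled_token_ids, non_blocking=True)
        return view
```

Because the buffer is allocated once at the scheduler's maximum batch shape, the per-step copy never needs to reallocate, which is what makes the non-blocking path cheap.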
Signed-off-by: jesse <szxfml@gmail.com>
Codecov Report ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2788 +/- ##
==========================================
+ Coverage 74.76% 75.36% +0.59%
==========================================
Files 150 155 +5
Lines 20891 21350 +459
==========================================
+ Hits 15620 16091 +471
+ Misses 5271 5259 -12
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Nice work! Can you post the benchmark results with and without this PR to make sure it works as expected?
Added to the beginning.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: jesse <szxfml@gmail.com>
Force-pushed from 6de8951 to 5be58d5
Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: jesse <szxfml@gmail.com>
        return False

    def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
        # This is a short term mitigation for issue mentioned in
Can you rewrite the comment for the Ascend case?
Signed-off-by: jesse <szxfml@gmail.com>
Revert "[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788)" (#3194)

### What this PR does / why we need it?
This reverts commit 6995a7b. We'll add it back once the issue is fixed. Related issue: #3195

### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@52d0cb8
[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (vllm-project#2788)

### What this PR does / why we need it?
When we copy the sampled valid token ids from device to host, avoid using tolist, which would trigger a device-wide stream sync if the source is on device. We change it to use a non-blocking copy followed by an explicit NPU event sync.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Bring up the vLLM server:
```bash
VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-14B-Instruct --disable-log-requests -tp 8 --max-num-seqs 64 --no-enable-prefix-caching --max_num_batched_tokens=8000
```
## Before:
(profiling screenshot)
## After:
(profiling screenshot)
As shown in the figures, the TTFT decreased.
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@9607d5e

Signed-off-by: jesse <szxfml@gmail.com>
This PR is based on top of vllm-project/vllm#22760
What this PR does / why we need it?
When we copy the sampled valid token ids from device to host, avoid using tolist, which would trigger a device-wide stream sync if the source is on device. We change it to use a non-blocking copy followed by an explicit NPU event sync.
Does this PR introduce any user-facing change?
How was this patch tested?
Bring up the vLLM server.
Before:
After:
As shown in the figures, the TTFT decreased.