
[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788)

Merged
wangxiyuan merged 16 commits into vllm-project:main from jesse996:event-sync
Sep 24, 2025

Conversation

@jesse996
Contributor

@jesse996 jesse996 commented Sep 5, 2025

This PR is based on top of vllm-project/vllm#22760

What this PR does / why we need it?

When copying the sampled valid token ids from device to host, avoid using `tolist`, which triggers a device-wide stream sync when the source tensor is on the device. Instead, use a non-blocking copy into pinned host memory followed by an explicit NPU event sync.
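The pattern can be sketched as follows. This is an illustrative, device-agnostic sketch, not the PR's exact code: class and attribute names here are hypothetical, and on Ascend the event would come from `torch.npu` (via `torch_npu`) rather than `torch.cuda`. The key idea is a pre-allocated pinned host buffer, a `non_blocking=True` device-to-host copy, and a per-copy event sync instead of a device-wide one.

```python
import torch

class TokenIdTransfer:
    """Sketch of non-blocking D2H copy + event sync (illustrative names)."""

    def __init__(self, max_num_reqs: int, max_tokens_per_req: int,
                 device: torch.device):
        self.device = device
        # Pinned host memory allows a truly asynchronous D2H copy;
        # pinning is only meaningful when an accelerator is present.
        self.pinned_cpu = torch.empty(
            (max_num_reqs, max_tokens_per_req),
            dtype=torch.int64,
            pin_memory=(device.type == "cuda"))
        # One reusable event for the transfer; None when running on CPU.
        # (On Ascend this would be a torch.npu Event instead.)
        self.transfer_event = (
            torch.cuda.Event() if device.type == "cuda" else None)

    def to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
        n, m = sampled_token_ids.shape
        dst = self.pinned_cpu[:n, :m]
        # Non-blocking copy: returns immediately on an accelerator.
        dst.copy_(sampled_token_ids, non_blocking=True)
        if self.transfer_event is not None:
            # Wait only for this copy to finish, not the whole device --
            # other streams' copy ops are not blocked.
            self.transfer_event.record()
            self.transfer_event.synchronize()
        # Safe now: the data has landed in host memory.
        return dst.tolist()
```

On CPU the copy is synchronous and the event is skipped, so the same code path also works without hardware.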

Does this PR introduce any user-facing change?

How was this patch tested?

Bring up vLLM server

```bash
VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-14B-Instruct --disable-log-requests -tp 8 --max-num-seqs 64 --no-enable-prefix-caching --max_num_batched_tokens=8000
```

Before:

![76218085a0cde9b2a73214e35fb7fc08](https://github.com/user-attachments/assets/38cbd02d-d380-47f8-a111-4bd859102eb1)

After:

![6c2111136673332244d3ce11060f4048](https://github.com/user-attachments/assets/957f9bf1-ec50-4f49-9318-f4876b3e3691)

As shown in the figures, TTFT decreased.

@github-actions
Contributor

github-actions Bot commented Sep 5, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message to match the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a valid performance optimization by replacing a blocking .tolist() call with a non-blocking D2H copy and an NPU event synchronization. This is a good approach to avoid device-wide stalls. However, there is a critical bug in the implementation where the pre-allocated pinned memory tensor is sized incorrectly and uses an undefined attribute, which will cause a runtime error. I've provided a fix for this issue.
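The sizing bug the review flags can be illustrated with a minimal sketch (hypothetical names, not the PR's exact code): the pinned host buffer must be allocated once in `__init__`, sized from a config value that actually exists at construction time (e.g. the maximum batch size), then sliced down to the live batch each step. Sizing it from an undefined attribute raises at runtime, and re-allocating pinned memory every step would defeat the purpose of pre-allocation.

```python
import torch

class ModelRunnerSketch:
    """Sketch of correct pinned-buffer pre-allocation (illustrative names)."""

    def __init__(self, max_num_reqs: int):
        # Correct: the maximum size is known at construction time,
        # so the buffer is allocated exactly once.
        self.sampled_token_ids_pinned_cpu = torch.empty(
            (max_num_reqs,), dtype=torch.int64)

    def valid_ids_view(self, num_reqs: int) -> torch.Tensor:
        # Per step: a cheap slice view into the same storage,
        # no new allocation and no new pinning.
        return self.sampled_token_ids_pinned_cpu[:num_reqs]
```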

Comment thread vllm_ascend/worker/model_runner_v1.py
@jesse996 jesse996 changed the title [Disagg][Perf] Use CUDA event sync instead of blocking tolist to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT [Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT Sep 5, 2025
@codecov

codecov Bot commented Sep 5, 2025

Codecov Report

❌ Patch coverage is 95.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.36%. Comparing base (1bbb20e) to head (5be58d5).
⚠️ Report is 21 commits behind head on main.

Files with missing lines Patch % Lines
vllm_ascend/worker/model_runner_v1.py 60.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2788      +/-   ##
==========================================
+ Coverage   74.76%   75.36%   +0.59%     
==========================================
  Files         150      155       +5     
  Lines       20891    21350     +459     
==========================================
+ Hits        15620    16091     +471     
+ Misses       5271     5259      -12     
Flag Coverage Δ
unittests 75.36% <95.00%> (+0.59%) ⬆️


@wangxiyuan
Collaborator

nice work, can you print the benchmark result with/without this PR to make sure it works as expected?

@jesse996
Contributor Author

jesse996 commented Sep 9, 2025

nice work, can you print the benchmark result with/without this PR to make sure it works as expected?

added to the beginning

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@wangxiyuan wangxiyuan added the ready (ready for review) and ready-for-test (start test by label for PR) labels Sep 18, 2025
```python
        return False

    def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
        # This is a short term mitigation for issue mentioned in
```
Collaborator


can you rewrite the comment to ascend case?

Contributor Author


updated

@wangxiyuan wangxiyuan merged commit 6995a7b into vllm-project:main Sep 24, 2025
19 checks passed
wangxiyuan added a commit to wangxiyuan/vllm-ascend that referenced this pull request Sep 25, 2025
…to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (vllm-project#2788)"

This reverts commit 6995a7b.
Yikun pushed a commit that referenced this pull request Sep 25, 2025
…3194)

…to avoid unintentional copy ops blocking across different NPU streams,
improving disagg TTIT/TTFT (#2788)"



### What this PR does / why we need it?
This reverts commit 6995a7b. We'll add
it back once the issue is fixed.

related issue: #3195

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
vllm-project/vllm@52d0cb8
(Identical copies of this commit and its revert were later mirrored in forks referencing this pull request, by huangdong2022, Angazenn, luolun, hwhaokun, NSDie, Clorist33, and yangzhe-2026.)
Labels: module:tests, ready (ready for review), ready-for-test (start test by label for PR)