[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT #2788
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a valid performance optimization by replacing a blocking .tolist() call with a non-blocking D2H copy and an NPU event synchronization. This is a good approach to avoid device-wide stalls. However, there is a critical bug in the implementation: the pre-allocated pinned memory tensor is sized incorrectly and references an undefined attribute, which will cause a runtime error. I've provided a fix for this issue.
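The sizing fix the review asks for could look like the following minimal sketch. It is assumption-heavy: the class name, field names, and the `max_num_reqs`/`num_tokens_per_req` parameters are hypothetical stand-ins for whatever the model runner actually exposes, not the actual vLLM Ascend code.

```python
import torch

class SamplerOutputStager:
    """Hypothetical sketch: size the pinned staging buffer from known
    scheduler limits instead of an undefined attribute."""

    def __init__(self, max_num_reqs: int, num_tokens_per_req: int):
        # One row per request, pre-allocated once in pinned host memory so
        # repeated async D2H copies reuse the same buffer. Pinning requires
        # an accelerator, so fall back to pageable memory on CPU-only builds.
        self.pinned = torch.empty(
            (max_num_reqs, num_tokens_per_req),
            dtype=torch.int64,
            pin_memory=torch.cuda.is_available(),
        )

    def stage(self, sampled_token_ids: torch.Tensor) -> torch.Tensor:
        # Copy only the valid region into a view of the pinned buffer.
        n, k = sampled_token_ids.shape
        view = self.pinned[:n, :k]
        view.copy_(sampled_token_ids, non_blocking=True)
        return view
```

Because the buffer is allocated once at the scheduler's maximum batch shape, the per-step copy never needs to reallocate, which is what makes the non-blocking path cheap.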
Signed-off-by: jesse <szxfml@gmail.com>
Codecov Report ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2788 +/- ##
==========================================
+ Coverage 74.76% 75.36% +0.59%
==========================================
Files 150 155 +5
Lines 20891 21350 +459
==========================================
+ Hits 15620 16091 +471
+ Misses 5271 5259 -12
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Nice work! Can you post the benchmark results with and without this PR to make sure it works as expected?
Added to the beginning.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: jesse <szxfml@gmail.com>
Force-pushed from 6de8951 to 5be58d5
Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: jesse <szxfml@gmail.com>
        return False

    def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
        # This is a short term mitigation for issue mentioned in
Can you rewrite the comment for the Ascend case?
Signed-off-by: jesse <szxfml@gmail.com>
Revert "[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788)" (#3194)

### What this PR does / why we need it?
This reverts commit 6995a7b. We'll add it back once the issue is fixed. Related issue: #3195

### How was this patch tested?
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@52d0cb8
[Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (vllm-project#2788)

### What this PR does / why we need it?
When we copy the sampled valid token ids from device to host, avoid using tolist, which would trigger a device-wide stream sync if the source is on device. We change it to use a non-blocking copy followed by an explicit NPU event sync.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Bring up the vLLM server:
```bash
VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-14B-Instruct --disable-log-requests -tp 8 --max-num-seqs 64 --no-enable-prefix-caching --max_num_batched_tokens=8000
```
## Before:
(profiling screenshot)
## After:
(profiling screenshot)
As shown in the figures, the TTFT decreased.
- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@9607d5e

Signed-off-by: jesse <szxfml@gmail.com>
This PR is based on top of vllm-project/vllm#22760
What this PR does / why we need it?
When we copy the sampled valid token ids from device to host, avoid using tolist, which would trigger a device-wide stream sync if the source is on device. We change it to use a non-blocking copy followed by an explicit NPU event sync.
Does this PR introduce any user-facing change?
How was this patch tested?
Bring up the vLLM server.
Before:
After:
As shown in the figures, the TTFT decreased.