Revert "[bugfix]limit graph replay sync (#5761)"#5965
Revert "[bugfix]limit graph replay sync (#5761)"#5965wangxiyuan merged 1 commit intovllm-project:mainfrom
Conversation
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Code Review
This pull request reverts a previous change, making the torch.npu.synchronize() call before graph replay unconditional. This is a necessary change to ensure correctness and prevent race conditions between model iterations, especially in asynchronous execution environments. However, the unconditional device-wide synchronization can be a performance bottleneck. I've added a suggestion to consider a more fine-grained, event-based synchronization mechanism to mitigate the performance impact while maintaining correctness.
| # so that update_attn_params only executes after the previous graph replay has fully completed. | ||
| if self.runtime_mode == CUDAGraphMode.FULL: | ||
| torch.npu.synchronize() | ||
| torch.npu.synchronize() |
There was a problem hiding this comment.
While making the synchronization unconditional correctly addresses a potential race condition, using torch.npu.synchronize() can introduce a significant performance bottleneck as it stalls the CPU and waits for all kernels on the device to complete. A more performant approach would be to use explicit event-based synchronization. For instance, you could record an event after the update_attn_params call in the previous iteration and have the current iteration's stream wait for that specific event before replaying the graph. This would avoid a full device-wide synchronization and improve overall throughput.
This reverts commit 4453c60. Signed-off-by: Angazenn <supperccell@163.com>
…oject#5965) ### What this PR does / why we need it? reverts vllm-project#5761 to fix accuracy issues when using piecewise graph mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2c24bc6 Signed-off-by: Angazenn <supperccell@163.com>
…oject#5965) ### What this PR does / why we need it? reverts vllm-project#5761 to fix accuracy issues when using piecewise graph mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2c24bc6 Signed-off-by: Angazenn <supperccell@163.com>
…oject#5965) ### What this PR does / why we need it? reverts vllm-project#5761 to fix accuracy issues when using piecewise graph mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2c24bc6 Signed-off-by: Angazenn <supperccell@163.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
…oject#5965) ### What this PR does / why we need it? reverts vllm-project#5761 to fix accuracy issues when using piecewise graph mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2c24bc6 Signed-off-by: Angazenn <supperccell@163.com>
…oject#5965) ### What this PR does / why we need it? reverts vllm-project#5761 to fix accuracy issues when using piecewise graph mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2c24bc6 Signed-off-by: Angazenn <supperccell@163.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
…oject#5965) ### What this PR does / why we need it? reverts vllm-project#5761 to fix accuracy issues when using piecewise graph mode. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: vllm-project/vllm@2c24bc6 Signed-off-by: Angazenn <supperccell@163.com>
|
@Angazenn @yiz-liu @wangxiyuan Hi , regarding #5761 and the revert in #5965, could you share what exact accuracy issue was observed after removing the PIECEWISE replay sync? Was it output corruption / gibberish, nondeterminism, or a measurable benchmark accuracy drop? Was the root cause identified? |
What this PR does / why we need it?
reverts #5761 to fix accuracy issues when using piecewise graph mode.
Does this PR introduce any user-facing change?
How was this patch tested?