5 changes: 2 additions & 3 deletions vllm_ascend/compilation/acl_graph.py
@@ -186,13 +186,12 @@ def __call__(self, *args, **kwargs):
)

logger.info_once("Replaying aclgraph")
# In async scheduling or multi-threaded (MT) scenarios when graph mode is FULL, it is possible that
# In async scheduling or multi-threaded (MT) scenarios, it is possible that
# the CPU's record event (from update_attn_params) for the iteration i completes
# before the graph replay of iteration i-1.
# To ensure proper ordering, we must call synchronize here before replaying,
# so that update_attn_params only executes after the previous graph replay has fully completed.
if self.runtime_mode == CUDAGraphMode.FULL:
torch.npu.synchronize()
torch.npu.synchronize()
Contributor

high

While making the synchronization unconditional correctly addresses a potential race condition, using torch.npu.synchronize() can introduce a significant performance bottleneck as it stalls the CPU and waits for all kernels on the device to complete. A more performant approach would be to use explicit event-based synchronization. For instance, you could record an event after the update_attn_params call in the previous iteration and have the current iteration's stream wait for that specific event before replaying the graph. This would avoid a full device-wide synchronization and improve overall throughput.
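The ordering the comment describes can be sketched on the CPU side. The snippet below simulates it with `threading.Event` standing in for device events (on the device, the equivalents would be something like `torch.npu.Event.record()` after the replay and `stream.wait_event(...)` before `update_attn_params`); all names here are illustrative and not taken from the PR. Each iteration's update/replay worker blocks on the previous iteration's "replay done" event, so even when workers are launched out of order, updates for iteration i never run before the replay of iteration i-1 completes.

```python
import threading

log = []
lock = threading.Lock()

def run(n_iters=3):
    # Iteration -1 is trivially "done", so iteration 0 can start at once.
    prev_replay_done = threading.Event()
    prev_replay_done.set()
    threads = []
    for i in range(n_iters):
        replay_done = threading.Event()

        def worker(i=i, wait_on=prev_replay_done, done=replay_done):
            # Analogous to stream.wait_event(prev_event): block until the
            # previous iteration's replay has fully completed.
            wait_on.wait()
            with lock:
                log.append(("update_attn_params", i))
                log.append(("replay", i))
            # Analogous to event.record() after the graph replay.
            done.set()

        threads.append(threading.Thread(target=worker))
        prev_replay_done = replay_done

    # Start workers in reverse order to exercise the ordering guarantee.
    for t in reversed(threads):
        t.start()
    for t in threads:
        t.join()

run()
# log is strictly interleaved per iteration:
# update_attn_params:0, replay:0, update_attn_params:1, replay:1, ...
```

Unlike `torch.npu.synchronize()`, which drains every outstanding kernel on the device, this pattern only orders the two operations that actually race, which is why the reviewer expects it to recover throughput.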

entry.aclgraph.replay()
return entry.output
