[Perf] Replace cudaMemsetAsync with in-kernel cleanup for persistent_topk #41748
LopezCastroRoberto wants to merge 3 commits into vllm-project:main
Conversation
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Hi @LopezCastroRoberto, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Code Review
This pull request moves the zero-initialization of the RadixRowState workspace from a host-side cudaMemsetAsync to an in-kernel cleanup at the end of the persistent TopK kernel. This change aims to improve performance and support CUDA graph replays by avoiding stream-ordered host calls. However, the current in-kernel implementation introduces a performance bottleneck due to excessive memory fences in a loop and a race condition where the arrival_counter could be reset before all histogram zeros are globally visible. I have provided feedback on how to parallelize the zeroing and use a proper synchronization barrier.
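The review's two concerns could be addressed roughly as follows. This is a hedged sketch, not the PR's actual code: the type `RadixRowState` and the field names `histogram` and `arrival_counter` are illustrative, taken from the review's description rather than the diff. It shows the suggested parallelized zeroing (a block-stride loop instead of a serial loop with a fence per iteration) and a single `__threadfence()` before the counter reset, so the zeros are globally visible before any block can observe the counter as reset.

```cuda
// Hypothetical workspace layout, for illustration only.
struct RadixRowState {
    unsigned int histogram[256];
    unsigned int arrival_counter;
};

__device__ void cleanup_row_state(RadixRowState* state, int num_bins) {
    // All threads in the block cooperate on the zeroing, rather than
    // one thread looping with a memory fence on every iteration.
    for (int i = threadIdx.x; i < num_bins; i += blockDim.x) {
        state->histogram[i] = 0;
    }
    __syncthreads();  // every zero has been issued by this block

    if (threadIdx.x == 0) {
        // Make the zeroed histogram globally visible *before* the
        // counter reset, so a block that sees arrival_counter == 0
        // cannot race ahead and read stale histogram bins.
        __threadfence();
        atomicExch(&state->arrival_counter, 0u);
    }
}
```

The ordering matters: resetting `arrival_counter` acts as the release signal for the next kernel iteration, so the fence must sit between the last histogram store and that reset.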
@claude review

cc: @zyongye
Motivation
Replaces the per-call `cudaMemsetAsync` (PR #41444 and #41665) with an in-kernel cleanup at the end of `persistent_topk_kernel`, eliminating a 3-5 us overhead per launch (decode step). This is particularly relevant in low-latency scenarios and when `max_seq_len <= RADIX_THRESHOLD`, where the workspace is not required. Despite this, the persistent kernel currently invokes `cudaMemsetAsync` unconditionally, incurring unnecessary overhead.
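The host-side effect of the change can be sketched as below. This is illustrative only: the launcher name, kernel signature, and launch shape are assumptions, not the PR's actual code. The point is that the stream-ordered `cudaMemsetAsync` call disappears from the per-step path, because the kernel's tail leaves the `RadixRowState` workspace zeroed for the next launch; removing the host call is also what makes the launch sequence friendlier to CUDA graph capture and replay.

```cuda
// Hypothetical launcher, for illustration only.
void launch_persistent_topk(RadixRowState* states, size_t workspace_bytes,
                            dim3 grid, dim3 block, cudaStream_t stream) {
    // Before this PR: ~3-5 us spent here on every decode step.
    // cudaMemsetAsync(states, 0, workspace_bytes, stream);

    // After: no host-side memset; the kernel zeroes its own workspace
    // on the way out, so the next launch finds it clean.
    persistent_topk_kernel<<<grid, block, 0, stream>>>(states);
}
```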