Skip to content

[Perf] Replace cudaMemsetAsync with in-kernel cleanup for persistent_topk#41748

Open
LopezCastroRoberto wants to merge 3 commits intovllm-project:mainfrom
LopezCastroRoberto:perf/inline_init_persistent_topK
Open

[Perf] Replace cudaMemsetAsync with in-kernel cleanup for persistent_topk#41748
LopezCastroRoberto wants to merge 3 commits intovllm-project:mainfrom
LopezCastroRoberto:perf/inline_init_persistent_topK

Conversation

@LopezCastroRoberto
Copy link
Copy Markdown
Contributor

@LopezCastroRoberto LopezCastroRoberto commented May 5, 2026

Motivation

Replaces the per-call cudaMemsetAsync (PR #41444 and #41665) with an in-kernel cleanup at the end of persistent_topk_kernel, eliminating a 3-5us overhead per launch (decode step).

This is particularly relevant in low-latency scenarios and when max_seq_len <= RADIX_THRESHOLD, where the workspace is not required. Despite this, the persistent kernel currently invokes cudaMemsetAsync unconditionally, incurring unnecessary overhead.

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 5, 2026

Hi @LopezCastroRoberto, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request moves the zero-initialization of the RadixRowState workspace from a host-side cudaMemsetAsync to an in-kernel cleanup at the end of the persistent TopK kernel. This change aims to improve performance and support CUDA graph replays by avoiding stream-ordered host calls. However, the current in-kernel implementation introduces a performance bottleneck due to excessive memory fences in a loop and a race condition where the arrival_counter could be reset before all histogram zeros are globally visible. I have provided feedback on how to parallelize the zeroing and use a proper synchronization barrier.

Comment thread csrc/persistent_topk.cuh
@LopezCastroRoberto LopezCastroRoberto marked this pull request as draft May 5, 2026 16:39
@LopezCastroRoberto LopezCastroRoberto marked this pull request as ready for review May 5, 2026 16:44
@LopezCastroRoberto LopezCastroRoberto marked this pull request as draft May 5, 2026 16:55
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@LopezCastroRoberto LopezCastroRoberto marked this pull request as ready for review May 5, 2026 17:56
@LopezCastroRoberto
Copy link
Copy Markdown
Contributor Author

@claude review

@LopezCastroRoberto
Copy link
Copy Markdown
Contributor Author

cc: @zyongye

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant