scheduler: Cache also the last block after KV recving#32168

Closed
orozery wants to merge 1 commit intovllm-project:mainfrom
orozery:sched-cache-last-async-loaded-block

Conversation

Collaborator

@orozery orozery commented Jan 12, 2026

This PR fixes the scheduler to commit the last full block of KV data that was received asynchronously.

@robertgshaw2-redhat this is modifying code you introduced in #17751.
I think it's safe to cache that last block as well, but I'm not sure.
cc @njhill

BTW, do we really have to re-compute the last token, or can we somehow re-use the KV data that we saved for it?


Note

Ensures KV blocks are fully cached after async KV receive while preserving correct sampling behavior.

  • In Scheduler._update_waiting_for_remote_kv, after caching the received blocks, sets num_computed_tokens to request.num_tokens - 1 when the two are equal, so the last token is recomputed in the next step
  • Previously the decrement happened before caching; now the last full block is cached as well, improving cache-commit behavior for completed blocks
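The ordering change above can be sketched with some simple arithmetic. This is a hypothetical illustration: BLOCK_SIZE, full_blocks, and the token counts are made-up names and numbers, not vLLM's actual code, which lives in Scheduler._update_waiting_for_remote_kv.

```python
# Hypothetical, simplified sketch of the ordering change; BLOCK_SIZE,
# full_blocks, and the token counts are illustrative, not vLLM's code.

BLOCK_SIZE = 16  # assumed KV-cache block size


def full_blocks(num_tokens: int) -> int:
    """Number of complete KV blocks covered by num_tokens tokens."""
    return num_tokens // BLOCK_SIZE


num_tokens = 32  # request whose async KV receive just finished

# Old order: decrement first, then cache -> 31 tokens -> 1 full block committed.
old_committed = full_blocks(num_tokens - 1)

# New order: cache at 32 tokens -> 2 full blocks committed, then set
# num_computed_tokens = num_tokens - 1 so the final token is recomputed
# next step to produce logits for sampling.
new_committed = full_blocks(num_tokens)
num_computed_tokens = num_tokens - 1

print(old_committed, new_committed, num_computed_tokens)  # 1 2 31
```

The decrement itself is unchanged; only its position relative to the caching call moves, so sampling behavior stays the same while one more full block lands in the prefix cache.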

Written by Cursor Bugbot for commit 0076065. This will update automatically on new commits. Configure here.

This commit fixes the scheduler to commit the last full block of KV data
that was async received.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes an issue in the scheduler where the last block of asynchronously received KV data was not being cached. By moving the decrement of num_computed_tokens to after the call to cache_blocks, you ensure that the complete KV cache for all received tokens is stored in the prefix cache. The subsequent decrement is still necessary to trigger the recomputation of the last token, which is required to generate logits for sampling the next token. This change is safe and improves caching behavior as intended.

Collaborator

@heheda12345 heheda12345 left a comment


For the general case without a KV connector, we need to recompute the last token to generate the logprobs to sample the first output token.

Any strong reason to cache the last block?

@orozery
Collaborator Author

orozery commented Jan 14, 2026

Any strong reason to cache the last block?

If you don't cache the last block, you will have to recompute the entire last block, not just the last token.
I think the question should be the opposite:
is there any reason not to cache the last block?
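To put rough numbers on that point: with a hypothetical block size of 16 and a 32-token prompt, leaving the last full block uncommitted means a whole block of KV must be recomputed instead of just the one token deliberately recomputed for sampling. This is illustrative arithmetic under assumed values, not vLLM code.

```python
# Illustrative arithmetic (hypothetical block size and prompt length,
# not vLLM code): KV can only be reused from committed full blocks, so
# every token past the last committed block must be recomputed.

BLOCK_SIZE = 16
prompt_tokens = 32


def tokens_to_recompute(committed_blocks: int) -> int:
    """Tokens whose KV must be recomputed given committed full blocks."""
    return prompt_tokens - committed_blocks * BLOCK_SIZE


# Last block not committed: the whole 16-token block is recomputed.
print(tokens_to_recompute(1))  # 16
# Last block committed: nothing beyond the single token that is
# intentionally recomputed for sampling.
print(tokens_to_recompute(2))  # 0
```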

@orozery
Collaborator Author

orozery commented Mar 6, 2026

Superseded by #34616.

@orozery orozery closed this Mar 6, 2026
