scheduler: Cache also the last block after KV recving#32168
scheduler: Cache also the last block after KV recving#32168orozery wants to merge 1 commit intovllm-project:mainfrom
Conversation
This commit fixes the scheduler to commit the last full block of KV data that was async received. Signed-off-by: Or Ozeri <oro@il.ibm.com>
There was a problem hiding this comment.
Code Review
This pull request correctly fixes an issue in the scheduler where the last block of asynchronously received KV data was not being cached. By moving the decrement of num_computed_tokens to after the call to cache_blocks, you ensure that the complete KV cache for all received tokens is stored in the prefix cache. The subsequent decrement is still necessary to trigger the recomputation of the last token, which is required to generate logits for sampling the next token. This change is safe and improves caching behavior as intended.
heheda12345
left a comment
There was a problem hiding this comment.
For general case without kv connector, we need to recompute the last token to generate the logprobs to sample the first output token.
Any strong reason to cache the last block?
If you don't cache the last block you will have to recompute the entire last block, not just the last token. |
|
Superseded by #34616. |
This PR fixes the scheduler to commit the last full block of KV data that was async received.
@robertgshaw2-redhat this is modifying code you introduced in #17751.
I think it's safe to cache that last block as well, but not sure.
cc @njhill
BTW, do we really have to re-compute the last token, or can we somehow re-use the KV data that we saved for it?
Note
Ensures KV blocks are fully cached after async KV receive while preserving correct sampling behavior.
Scheduler._update_waiting_for_remote_kv, after caching received blocks, setsnum_computed_tokenstorequest.num_tokens - 1when equal, so the last token is recomputed next stepWritten by Cursor Bugbot for commit 0076065. This will update automatically on new commits. Configure here.