[Core][KV Connector] Remove use of num_cached_tokens in error handling#38096
orozery merged 1 commit into vllm-project:main
Code Review
This pull request refactors the handling of invalid KV cache blocks within the scheduler. It introduces a num_scheduled_tokens parameter to _handle_invalid_blocks and _update_requests_with_invalid_blocks to refine how computed tokens are tracked, especially for newly scheduled tokens. The logic for truncating computed tokens and identifying blocks for eviction has been updated, incorporating a new cdiv utility. A review comment highlights a potential performance improvement: newly allocated, empty blocks that become invalid should be directly freed using kv_cache_manager.free() instead of being added to blocks_to_evict, as eviction might incur unnecessary overhead for empty blocks.
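The truncation described in this review can be pictured with a small sketch. It is illustrative only and is not the scheduler's actual code: it assumes a fixed block size and zero-based block indices, `cdiv` is taken to be plain ceiling division, and `truncate_to_first_invalid_block` is a hypothetical helper name.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: how many size-b blocks are needed to hold a tokens."""
    return -(a // -b)


def truncate_to_first_invalid_block(
    num_computed_tokens: int,
    block_size: int,
    invalid_block_indices: set[int],
) -> int:
    """Return how many computed tokens remain usable after invalidation.

    Hypothetical helper: only blocks that hold computed tokens are inspected,
    and everything from the first invalid block onward is discarded.
    """
    num_blocks_in_use = cdiv(num_computed_tokens, block_size)
    for block_idx in range(num_blocks_in_use):
        if block_idx in invalid_block_indices:
            # Keep only the tokens stored in the blocks before the bad one.
            return block_idx * block_size
    # No invalid block overlaps the computed range, so nothing is lost.
    return num_computed_tokens
```

For example, with 40 computed tokens and 16-token blocks, three blocks are in use; an invalid block at index 1 leaves 16 usable tokens, while an invalid block at index 3 lies past the computed range and changes nothing.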
This change looks bigger than what it can be. with: and replace the rest of
Refactor _update_requests_with_invalid_blocks() to avoid recomputation logic based on `num_cached_tokens`, simplifying the logic, and making the sync-shared-blocks case less special.

This also paves the way for refactoring prefill cache statistics in vllm-project#37460 in an effort to stamp out reports of `Counters can only be incremented by non-negative amounts`.

Based on Or's work in commit f02a5c80 of vllm-project#35223

Co-authored-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
a16e979 to 317d3df
Ok, fair. I took the refactoring a bit further, but that can come later if it is valuable

/gemini review
Code Review
This pull request refactors the KV cache invalidation logic within the scheduler by introducing num_scheduled_tokens to _handle_invalid_blocks and _update_requests_with_invalid_blocks. This new parameter allows for a more precise calculation of req_num_computed_tokens by subtracting scheduled but not yet computed tokens from the total, ensuring accurate token counts when handling invalid KV cache blocks. This change streamlines the logic for both asynchronous and synchronous loading scenarios. I have no feedback to provide.
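The subtraction mentioned in this review might look roughly like the following. This is a sketch under stated assumptions, not the code from the PR: it assumes the scheduler has already advanced a request's computed-token count by the tokens scheduled in the current step, and that a request with nothing scheduled (for instance one still waiting on an asynchronous load) is simply absent from the per-step mapping. The function name and the dictionary shape are hypothetical.

```python
def computed_tokens_before_step(
    request_id: str,
    num_computed_tokens: int,
    num_scheduled_tokens: dict[str, int],
) -> int:
    """Return the tokens a request had computed before the current step.

    Assumption: num_computed_tokens already includes the tokens scheduled in
    this step, so they are subtracted to recover the pre-step value.
    """
    # Requests with nothing scheduled in this step contribute zero.
    scheduled = num_scheduled_tokens.get(request_id, 0)
    if scheduled < 0 or scheduled > num_computed_tokens:
        raise ValueError("scheduled tokens must lie in [0, num_computed_tokens]")
    return num_computed_tokens - scheduled
```

Under these assumptions, both loading modes go through the same arithmetic; the only difference is whether the request appears in the mapping.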
vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com>
vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
…s with PrefillStats

In OutputProcessor, we take the first EngineCoreOutput as a signal that prefill has completed, and record certain statistics about it.

On the scheduler side, because of preemption, we might have prefills that are scheduled but never completed, or we might need to recompute an already completed prefill.

To add clarity, we use PrefillStats to track the first scheduled prefill so that the stats can be returned to the frontend via EngineCoreOutput.

num_cached_tokens was previously used for KV transfer failure recovery, but this is no longer true as of vllm-project#38096.

We also no longer attempt to correct these prefill metrics if KV transfers failed, since this introduced unjustified brittleness to an already brittle code path.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Refactor `_update_requests_with_invalid_blocks()` to avoid recomputation logic based on `num_cached_tokens`, simplifying the logic, and making the sync-shared-blocks case less special.

This also paves the way for refactoring prefill cache statistics in #37460 in an effort to stamp out reports of `Counters can only be incremented by non-negative amounts`.

Based on Or's work in commit f02a5c80 of #35223