[Core][KV Connector] Remove use of num_cached_tokens in error handling by markmc · Pull Request #38096 · vllm-project/vllm

markmc · 2026-03-25T10:28:17Z

Refactor _update_requests_with_invalid_blocks() to avoid recomputation logic based on num_cached_tokens, simplifying the logic, and making the sync-shared-blocks case less special.

This also paves the way for refactoring prefill cache statistics in #37460 in an effort to stamp out reports of
Counters can only be incremented by non-negative amounts.

Based on Or's work in commit f02a5c80 of #35223

gemini-code-assist

Code Review

This pull request refactors the handling of invalid KV cache blocks within the scheduler. It introduces a num_scheduled_tokens parameter to _handle_invalid_blocks and _update_requests_with_invalid_blocks to refine how computed tokens are tracked, especially for newly scheduled tokens. The logic for truncating computed tokens and identifying blocks for eviction has been updated, incorporating a new cdiv utility. A review comment highlights a potential performance improvement: newly allocated, empty blocks that become invalid should be directly freed using kv_cache_manager.free() instead of being added to blocks_to_evict, as eviction might incur unnecessary overhead for empty blocks.

orozery · 2026-03-25T11:36:25Z

This change looks bigger than what it can be.
Can't we just replace:

                # Sync loading. num_computed_tokens includes new tokens
                req_num_computed_tokens = request.num_cached_tokens

with:

                # Sync loading. num_computed_tokens includes new tokens
                req_num_computed_tokens = (
                    request.num_computed_tokens - num_scheduled_tokens.get(req_id, 0)
                )

and replace the rest of request.num_cached_tokens with req_num_computed_tokens ?

Refactor _update_requests_with_invalid_blocks() to avoid recomputation logic based on `num_cached_tokens`, simplifying the logic, and making the sync-shared-blocks case less special. This also paves the way for refactoring prefill cache statistics in vllm-project#37460 in an effort to stamp out reports of `Counters can only be incremented by non-negative amounts`. Based on Or's work in commit f02a5c80 of vllm-project#35223 Co-authored-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>

markmc · 2026-03-25T16:25:30Z

This change looks bigger than what it can be.

Ok, fair. I took the refactoring a bit further, but that can come later if it is valuable

markmc · 2026-03-25T16:59:08Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors the KV cache invalidation logic within the scheduler by introducing num_scheduled_tokens to _handle_invalid_blocks and _update_requests_with_invalid_blocks. This new parameter allows for a more precise calculation of req_num_computed_tokens by subtracting scheduled but not yet computed tokens from the total, ensuring accurate token counts when handling invalid KV cache blocks. This change streamlines the logic for both asynchronous and synchronous loading scenarios. I have no feedback to provide.

orozery

LGTM

vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: johnnynunez <johnnynuca14@gmail.com>

vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com>

vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>

vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com>

vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Rishi Puri <riship@nvidia.com>

vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com>

…s with PrefillStats In OutputProcessor, we take the first EngineCoreOutput as a signal that prefill has completed, and record certain statistics about it. On the scheduler side, because of preemption, we might have prefills that are scheduled but never completed, or we might need to recompute an already completed prefill. To add clarity, we use PrefillStats to track the first scheduled prefill so that the stats can be returned to the frontend via EngineCoreOutput. num_cached_tokens was previously used for KV transfer failure recovery, but this is no longer true as of vllm-project#38096. We also no longer attempt to correct these prefill metrics if KV transfers failed, since this introduced unjustified brittleness to an already brittle code path. Signed-off-by: Mark McLoughlin <markmc@redhat.com>

markmc requested review from ApostaC, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat and ywang96 as code owners March 25, 2026 10:28

mergify Bot added the v1 label Mar 25, 2026

gemini-code-assist Bot reviewed Mar 25, 2026

View reviewed changes

Comment thread vllm/v1/core/sched/scheduler.py Outdated

markmc mentioned this pull request Mar 25, 2026

[Core][Metrics][BugFix] Replace num_cached_tokens/num_external_computed_tokens with PrefillStats #37460

Merged

markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 25, 2026

markmc force-pushed the scheduler-recovery-rework branch from a16e979 to 317d3df Compare March 25, 2026 16:24

gemini-code-assist Bot reviewed Mar 25, 2026

View reviewed changes

orozery approved these changes Mar 25, 2026

View reviewed changes

orozery enabled auto-merge (squash) March 25, 2026 18:00

orozery merged commit e38817f into vllm-project:main Mar 25, 2026
52 checks passed

RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026

[Core][KV Connector] Remove use of num_cached_tokens in error handling (

f8e2a6c

vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com>

JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026

[Core][KV Connector] Remove use of num_cached_tokens in error handling (

7330dd7

vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com>

puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026

[Core][KV Connector] Remove use of num_cached_tokens in error handling (

fe3fff2

vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Rishi Puri <riship@nvidia.com>

mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

[Core][KV Connector] Remove use of num_cached_tokens in error handling (

8f1913f

vllm-project#38096) Signed-off-by: Mark McLoughlin <markmc@redhat.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Core][KV Connector] Remove use of num_cached_tokens in error handling#38096

[Core][KV Connector] Remove use of num_cached_tokens in error handling#38096
orozery merged 1 commit intovllm-project:mainfrom
markmc:scheduler-recovery-rework

markmc commented Mar 25, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

orozery commented Mar 25, 2026

Uh oh!

markmc commented Mar 25, 2026

Uh oh!

markmc commented Mar 25, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

orozery left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

markmc commented Mar 25, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

orozery commented Mar 25, 2026

Uh oh!

markmc commented Mar 25, 2026

Uh oh!

markmc commented Mar 25, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

orozery left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants