
[Core][KV Connector] Remove use of num_cached_tokens in error handling #38096

Merged
orozery merged 1 commit into vllm-project:main from markmc:scheduler-recovery-rework
Mar 25, 2026

Conversation

markmc (Member) commented Mar 25, 2026

Refactor _update_requests_with_invalid_blocks() to avoid the recomputation logic based on num_cached_tokens, simplifying the logic and making the sync-shared-blocks case less special.

This also paves the way for refactoring prefill cache statistics in #37460, in an effort to stamp out reports of `Counters can only be incremented by non-negative amounts`.

Based on Or's work in commit f02a5c80 of #35223
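
For context, the error message quoted above is raised by prometheus_client whenever a counter is incremented by a negative amount. A minimal reproduction (illustration only, not vLLM code; the metric name here is made up):

```python
# prometheus_client rejects negative increments outright; any "correction"
# logic that ends up subtracting tokens from a counter can trip this.
from prometheus_client import Counter

c = Counter("prefix_cache_hits_demo", "illustration only")
try:
    c.inc(-1)
except ValueError as e:
    print(e)  # -> Counters can only be incremented by non-negative amounts
```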

gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request refactors the handling of invalid KV cache blocks within the scheduler. It introduces a num_scheduled_tokens parameter to _handle_invalid_blocks and _update_requests_with_invalid_blocks to refine how computed tokens are tracked, especially for newly scheduled tokens. The logic for truncating computed tokens and identifying blocks for eviction has been updated, incorporating a new cdiv utility. A review comment highlights a potential performance improvement: newly allocated, empty blocks that become invalid should be directly freed using kv_cache_manager.free() instead of being added to blocks_to_evict, as eviction might incur unnecessary overhead for empty blocks.
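
As a rough illustration of the truncation arithmetic described above (a sketch only, assuming 16-token blocks; `cdiv` is redefined here so the snippet is self-contained, matching the ceiling-division helper in vllm.utils):

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, as in vllm.utils.cdiv."""
    return -(a // -b)

def truncate_computed_tokens(
    num_computed_tokens: int, first_invalid_block: int, block_size: int
) -> int:
    # Tokens living in blocks before the first invalid one stay usable;
    # everything from the invalid block onward must be recomputed.
    return min(num_computed_tokens, first_invalid_block * block_size)

# With 16-token blocks, 70 computed tokens span cdiv(70, 16) == 5 blocks.
# If block index 3 is invalid, only blocks 0-2 (48 tokens) survive:
assert cdiv(70, 16) == 5
assert truncate_computed_tokens(70, 3, 16) == 48
```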

(Outdated comment thread on vllm/v1/core/sched/scheduler.py)
markmc added the `ready` (ONLY add when PR is ready to merge/full CI is needed) label on Mar 25, 2026
orozery (Collaborator) commented Mar 25, 2026

This change looks bigger than it needs to be.
Can't we just replace:

                # Sync loading. num_computed_tokens includes new tokens
                req_num_computed_tokens = request.num_cached_tokens

with:

                # Sync loading. num_computed_tokens includes new tokens
                req_num_computed_tokens = (
                    request.num_computed_tokens - num_scheduled_tokens.get(req_id, 0)
                )

and replace the remaining uses of request.num_cached_tokens with req_num_computed_tokens?
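
A self-contained sketch of what that suggestion amounts to (stubbed types; the function name is hypothetical, and `num_scheduled_tokens` maps req_id to tokens scheduled this step, as in the snippet above):

```python
from dataclasses import dataclass

@dataclass
class Request:  # minimal stand-in for vLLM's Request, illustration only
    request_id: str
    num_computed_tokens: int

def computed_tokens_before_this_step(
    request: Request, num_scheduled_tokens: dict[str, int], sync_loading: bool
) -> int:
    if not sync_loading:
        # Async loading: num_computed_tokens excludes this step's tokens.
        return request.num_computed_tokens
    # Sync loading: num_computed_tokens already includes this step's
    # scheduled tokens, so subtract them out instead of reading
    # request.num_cached_tokens.
    return request.num_computed_tokens - num_scheduled_tokens.get(
        request.request_id, 0
    )

req = Request("r0", num_computed_tokens=96)
assert computed_tokens_before_this_step(req, {"r0": 32}, sync_loading=True) == 64
assert computed_tokens_before_this_step(req, {"r0": 32}, sync_loading=False) == 96
```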

Refactor _update_requests_with_invalid_blocks() to avoid
the recomputation logic based on `num_cached_tokens`, simplifying
the logic and making the sync-shared-blocks case less special.

This also paves the way for refactoring prefill cache statistics
in vllm-project#37460 in an effort to stamp out reports of
`Counters can only be incremented by non-negative amounts`.

Based on Or's work in commit f02a5c80 of vllm-project#35223

Co-authored-by: Or Ozeri <oro@il.ibm.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc force-pushed the scheduler-recovery-rework branch from a16e979 to 317d3df on March 25, 2026 at 16:24
markmc (Member, Author) commented Mar 25, 2026

> This change looks bigger than it needs to be.

Ok, fair. I took the refactoring a bit further, but that can come later if it is valuable

markmc (Member, Author) commented Mar 25, 2026

/gemini review

gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request refactors the KV cache invalidation logic within the scheduler by introducing num_scheduled_tokens to _handle_invalid_blocks and _update_requests_with_invalid_blocks. This new parameter allows for a more precise calculation of req_num_computed_tokens by subtracting scheduled but not yet computed tokens from the total, ensuring accurate token counts when handling invalid KV cache blocks. This change streamlines the logic for both asynchronous and synchronous loading scenarios. I have no feedback to provide.

orozery (Collaborator) left a comment


LGTM

orozery enabled auto-merge (squash) on March 25, 2026 at 18:00
orozery merged commit e38817f into vllm-project:main on Mar 25, 2026
52 checks passed
johnnynunez pushed a commit to johnnynunez/vllm that referenced this pull request Mar 25, 2026
RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
markmc added a commit to markmc/vllm that referenced this pull request Apr 13, 2026
…s with PrefillStats

In OutputProcessor, we take the first EngineCoreOutput as a signal
that prefill has completed, and record certain statistics about it.

On the scheduler side, because of preemption, we might have prefills
that are scheduled but never completed, or we might need to recompute
an already completed prefill. To add clarity, we use PrefillStats
to track the first scheduled prefill so that the stats can be returned
to the frontend via EngineCoreOutput.

num_cached_tokens was previously used for KV transfer failure recovery,
but this is no longer the case as of vllm-project#38096. We also no longer attempt to
correct these prefill metrics if KV transfers failed, since this
introduced unjustified brittleness to an already brittle code path.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
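
A hedged sketch of the record-once idea that commit message describes (PrefillStats here is an illustrative stand-in, not the actual class from vllm-project#37460):

```python
from dataclasses import dataclass

@dataclass
class PrefillStats:  # illustrative stand-in only
    num_prompt_tokens: int = 0
    num_cached_tokens: int = 0
    recorded: bool = False

    def record_first_schedule(self, prompt_tokens: int, cached_tokens: int) -> None:
        # Only the first scheduled prefill counts; a prefill recomputed
        # after preemption (or a KV transfer failure) is not recorded again,
        # so downstream counters never need a negative "correction".
        if self.recorded:
            return
        self.num_prompt_tokens = prompt_tokens
        self.num_cached_tokens = cached_tokens
        self.recorded = True

stats = PrefillStats()
stats.record_first_schedule(prompt_tokens=512, cached_tokens=128)
stats.record_first_schedule(prompt_tokens=512, cached_tokens=0)  # ignored
assert stats.num_cached_tokens == 128
```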

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1
