[Feature]: Async scheduling support with KVConnector #27644
KevinCheung2259 wants to merge 2 commits into vllm-project:main from
Conversation
- Add `num_output_placeholders` field to `SchedulerOutput`
- Implement `_get_num_output_placeholders()` method for collecting placeholder info
- `AsyncScheduler` overrides to populate actual placeholder counts
- `OffloadingConnector` updated to handle placeholders correctly
- Add comprehensive design documentation

This enables external KV cache systems (e.g., LMCache) to correctly determine the actual computed token boundary in async scheduling mode by using: `real_computed = num_computed_tokens - num_output_placeholders`

Changes are backward compatible with sync scheduling.
Documentation preview: https://vllm--27644.org.readthedocs.build/en/27644/
Code Review
This pull request adds support for asynchronous scheduling with KVConnectors, a critical feature for improving concurrency and performance. The approach taken is sound, extending SchedulerOutput to carry placeholder information and using a new overridable method in the Scheduler class to populate it. My review includes a key suggestion to improve the type consistency of the new field in SchedulerOutput, which simplifies consumer logic and enhances maintainability. Overall, the changes are well-implemented and align with the goal of enabling async scheduling for KV cache connectors.
```python
# generated. KV connectors (e.g., LMCache) use this to determine the actual
# computed token boundary for caching.
# For sync scheduling: this dict is empty (all placeholders are 0).
num_output_placeholders: dict[str, int] | None = None
```
The type hint dict[str, int] | None and default value None are inconsistent with the comment on line 179 ("this dict is empty") and the implementation in scheduler.py, which always provides a dict. This ambiguity can lead to confusion and potential bugs. To improve clarity and type safety, this should be defined as a non-optional dictionary. This also allows for cleaner access patterns in consumers like OffloadingConnector.
```diff
-num_output_placeholders: dict[str, int] | None = None
+num_output_placeholders: dict[str, int]
```
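A minimal sketch of the reviewer's suggested non-optional field. The `SchedulerOutput` here is a hypothetical stand-in (the real vLLM dataclass carries many more fields); `field(default_factory=dict)` is one way the non-optional default could be expressed, since mutable defaults cannot be assigned directly in a dataclass.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerOutput:
    # Non-optional: sync scheduling leaves the dict empty, async
    # scheduling populates per-request placeholder counts.
    num_output_placeholders: dict[str, int] = field(default_factory=dict)


out = SchedulerOutput()
# .get() on an empty dict safely returns the default; no None check needed.
print(out.num_output_placeholders.get("req-1", 0))  # → 0
```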
```python
num_placeholders = 0
if scheduler_output.num_output_placeholders:
    num_placeholders = scheduler_output.num_output_placeholders.get(
        req_id, 0
    )
```
Contingent on the suggested change in vllm/v1/core/sched/output.py to make num_output_placeholders a non-optional dict, this logic can be simplified. Since num_output_placeholders would always be a dictionary, you can directly use .get() which safely handles cases where a req_id is not present or the dictionary is empty. The if check becomes redundant because an empty dictionary is falsy, and .get() on an empty dictionary will correctly return the default value.
```python
num_placeholders = scheduler_output.num_output_placeholders.get(
    req_id, 0
)
```
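The simplified access pattern can be exercised in isolation. Here `scheduler_output` is a `SimpleNamespace` stand-in for the real object, just to show that an empty dict or a missing key both fall through to the default without a separate truthiness check.

```python
from types import SimpleNamespace

# Stand-in for the real SchedulerOutput; "req-a"/"req-b" are illustrative ids.
scheduler_output = SimpleNamespace(num_output_placeholders={"req-a": 2})

for req_id in ("req-a", "req-b"):
    # Missing key (or an entirely empty dict) returns the default 0,
    # so the surrounding `if` guard is redundant.
    num_placeholders = scheduler_output.num_output_placeholders.get(req_id, 0)
    print(req_id, num_placeholders)
```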
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
self._update_after_schedule(scheduler_output)
return scheduler_output
```
Async placeholder counts captured before increment
The new num_output_placeholders snapshot is taken in Scheduler.schedule() before _update_after_schedule() runs, yet _update_after_schedule() immediately increments request.num_output_placeholders for every request that is about to emit a token. OffloadingConnectorScheduler._get_reqs_to_store() then combines the snapshot (scheduler_output.num_output_placeholders) with the pre-schedule req.num_computed_tokens and adds new_tokens. In async decoding steps where a placeholder is added this cycle, the snapshot is missing that placeholder, so the connector subtracts only the old placeholders while still adding new_tokens, effectively planning to offload one token that has not been generated yet. This causes LMCache/offloading to store blocks that contain placeholder data instead of computed KV values, breaking cache correctness whenever async decoding is active.
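The ordering issue described above can be reproduced with a toy model. All names and numbers here are illustrative, not the real vLLM objects: the snapshot is captured before the per-request counter is incremented, so during an async decode step it lags the true placeholder count by one.

```python
class Request:
    """Toy request: one placeholder already outstanding from a prior step."""
    def __init__(self) -> None:
        self.num_output_placeholders = 1
        self.num_computed_tokens = 10  # includes that placeholder


req = Request()

# 1. schedule(): snapshot captured first
snapshot = {"req-0": req.num_output_placeholders}  # {"req-0": 1}

# 2. _update_after_schedule(): a new placeholder is added this cycle
req.num_output_placeholders += 1  # now 2
req.num_computed_tokens += 1      # now 11

# 3. A connector computing the boundary from the stale snapshot over-counts:
stale_boundary = req.num_computed_tokens - snapshot["req-0"]          # 11 - 1 = 10
true_boundary = req.num_computed_tokens - req.num_output_placeholders  # 11 - 2 = 9

print(stale_boundary, true_boundary)  # → 10 9 (off by one)
```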
Purpose
As demonstrated in our tests and discussed in issue #19970, async scheduling is currently not supported with the KV connector. Both features are critical for LLM inference.
This PR supports async scheduling mode when a KVConnector is configured. This unlocks better concurrency, reduces tail latency, and improves throughput for prefix-reuse and offloading scenarios.
Core Changes
Extended `SchedulerOutput` dataclass (`vllm/v1/core/sched/output.py`)
- Added `num_output_placeholders: dict[str, int] | None` field to track tokens scheduled but not yet generated
- Defaults to `None`; populated only in async scheduling mode

Added `_get_num_output_placeholders()` method (`vllm/v1/core/sched/scheduler.py`)
- Base `Scheduler` returns an empty dict (sync mode: no placeholders)
- Called during `SchedulerOutput` construction to collect placeholder information

`AsyncScheduler` override (`vllm/v1/core/sched/async_scheduler.py`)
- Overrides `_get_num_output_placeholders()` to collect actual placeholder counts from each request
- Lets consumers compute `real_computed_tokens = num_computed_tokens - num_output_placeholders`
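The base-class/override structure described above can be sketched as follows. This is a hedged sketch, not the real vLLM code: the actual `Scheduler` and `AsyncScheduler` classes carry far richer state, and the `Request` stand-in here is hypothetical.

```python
class Request:
    """Hypothetical stand-in for a scheduled request."""
    def __init__(self, request_id: str, num_output_placeholders: int = 0) -> None:
        self.request_id = request_id
        self.num_output_placeholders = num_output_placeholders


class Scheduler:
    def __init__(self) -> None:
        self.running: list[Request] = []

    def _get_num_output_placeholders(self) -> dict[str, int]:
        # Sync mode: tokens are never scheduled ahead of generation,
        # so there are no placeholders to report.
        return {}


class AsyncScheduler(Scheduler):
    def _get_num_output_placeholders(self) -> dict[str, int]:
        # Async mode: snapshot each running request's placeholder count so
        # connectors can subtract it from num_computed_tokens.
        return {
            req.request_id: req.num_output_placeholders
            for req in self.running
        }


sched = AsyncScheduler()
sched.running = [Request("req-0", 1), Request("req-1", 0)]
print(sched._get_num_output_placeholders())  # → {'req-0': 1, 'req-1': 0}
```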