Add FULL CUDA-Graph support for KV Connector path #27026
sdavidbd wants to merge 2 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--27026.org.readthedocs.build/en/27026/
Code Review
This pull request effectively resolves an assertion failure during dummy runs with KV connectors by introducing a dedicated dummy metadata instance. The approach of using a sentinel DUMMY_CONNECTOR_METADATA object and an _is_dummy_run() helper method is a clean solution that ensures the connector's code path is consistent for both normal and dummy runs. The changes are correctly propagated through various connector implementations, and the core fix in kv_connector_model_runner_mixin.py is well-implemented. I've identified one critical issue in an example connector where a method is accessed as a property, which will lead to incorrect behavior.
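As an illustration of the pattern the review describes, it could be reduced to something like the following sketch. The names `DUMMY_CONNECTOR_METADATA`, `_is_dummy_run()`, and `bind_connector_metadata()` follow the PR discussion, but the surrounding scaffolding is invented here and is not vLLM's actual implementation:

```python
from typing import Final


class KVConnectorMetadata:
    """Base class for metadata a connector needs during model execution."""


class _DummyKVConnectorMetadata(KVConnectorMetadata):
    """Sentinel metadata bound during dummy (warm-up / graph-capture) runs."""


# Bound to the connector before a dummy run so the "metadata is always set
# before model execution" invariant holds even without a real scheduler step.
DUMMY_CONNECTOR_METADATA: Final = _DummyKVConnectorMetadata()


class Connector:
    """Hypothetical minimal connector, for illustration only."""

    def __init__(self) -> None:
        self._metadata = None

    def bind_connector_metadata(self, metadata: KVConnectorMetadata) -> None:
        self._metadata = metadata

    def _is_dummy_run(self) -> bool:
        # A dummy run is identified by the sentinel instance itself.
        return self._metadata is DUMMY_CONNECTOR_METADATA

    def start_load_kv(self) -> None:
        assert self._metadata is not None, "metadata must be bound first"
        if self._is_dummy_run():
            return  # nothing to load during warm-up / graph capture
        ...  # real KV-load logic would go here
```

Because the sentinel is bound through the same `bind_connector_metadata()` path as real metadata, the assertion at the top of `start_load_kv()` holds in both normal and dummy runs.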
examples/offline_inference/kv_load_failure_recovery/rogue_shared_storage_connector.py
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed from 9e4c041 to 8446d7c
markmc left a comment:
If I'm understanding this correctly, I think we should easily be able to find a way to not call these KV connector APIs in the context of a dummy run ... and that way we don't need to change connectors
e.g. can we make maybe_save_kv_layer_to_connector() just return if there is no metadata bound to the connector?
Like ... why would we want the connector to be called at all for a warm-up run?
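markmc's suggestion, sketched with a hypothetical stub connector. `maybe_save_kv_layer_to_connector` is the name from the comment above; everything else here is invented for illustration and is not the merged code:

```python
class StubConnector:
    """Minimal stand-in for a KV connector (not vLLM's actual class)."""

    def __init__(self) -> None:
        self._metadata = None
        self.saved_layers = []

    def bind_connector_metadata(self, metadata) -> None:
        self._metadata = metadata

    def get_metadata(self):
        return self._metadata

    def save_kv_layer(self, layer_name: str) -> None:
        self.saved_layers.append(layer_name)


def maybe_save_kv_layer_to_connector(connector: StubConnector,
                                     layer_name: str) -> None:
    # No metadata bound means we are in a dummy run (warm-up / graph
    # capture): skip the connector entirely instead of tripping its
    # metadata assertion.
    if connector.get_metadata() is None:
        return
    connector.save_kv_layer(layer_name)
```

Under this contract, out-of-tree connectors need no changes: the guard lives in the runner-side wrapper, not in each connector.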
# In a dummy run, the connector has no real metadata; this sentinel is bound to the
# connector to preserve the invariant that the connector always has metadata bound
# before model execution.
DUMMY_CONNECTOR_METADATA: Final = _DummyKVConnectorMetadata()
I'm pretty dubious about this - it would seem an obvious invariant that a connector should expect to only ever see metadata returned from its build_connector_meta() method
I agree — a cleaner approach would be to have build_connector_meta() construct a concrete metadata object for the dummy run, for example by invoking it with an empty SchedulerOutput.
I'll work on such an alternative. The only concern is that, in the dummy run, build_connector_meta() would be invoked from the worker process, while this API is currently expected to be called only from the Scheduler process; that causes assertions in some connectors, such as NixlConnector.
@@ -256,6 +256,9 @@ def build_kv_connector_stats(
    )

def start_load_kv(self, forward_context: "ForwardContext", **kwargs) -> None:
    if self._is_dummy_run():
        return
What happens to existing out-of-tree connectors that don't have this code?
Currently, such connectors would hit an assertion if they attempt to access metadata during a dummy run. Out-of-tree connectors may therefore need minor adjustments to use _is_dummy_run() and handle this case explicitly. I agree that the more robust alternative (invoking build_connector_meta() to create a concrete metadata object for the dummy run instead of using DUMMY_CONNECTOR_METADATA) would make this seamless for concrete connectors.
Maybe we want to capture the dump and load behavior of the connector into the full cuda graph? Because here it is not simply a warm-up, but a dummy run called by cuda graph capturing.
You may well be correct, but it's not obvious to me 🤷 Care to explain in more detail with examples of what connector behavior should be captured in the cuda graph? I feel like we should start with a simple fix for #26675, however. So to re-state my proposal in more detail:
Signed-off-by: David Ben-David <davidb@pliops.com>
…R_METADATA Signed-off-by: David Ben-David <davidb@pliops.com>
Force-pushed from 8446d7c to 350d3f1
Thank you for your work @sdavidbd!
After some thinking: while I am not a huge fan of the implementation, due to it affecting OOT connectors, I think this approach is needed if we want to guarantee cuda-graphable save/load ops in the connector contract.
Afaik, tracing these methods could save us the overhead of issuing a bunch of memcpys from cache tensors to a buffer (when present), up to recording a direct nccl collective for synchronous scenarios (best case?).
That said, these might not be primary use-cases, in terms of transfer patterns, from what I've seen so far. Happy to look at a reference of a useful cuda-graphed save/load if you have any, @sdavidbd, to make up my mind.
The only way I see we could side-step the issue, as @markmc was suggesting, is to change the contract of these methods to be non-graphable (i.e., just skip the connector call during tracing).
I feel like a brief comment from @LucasWilkinson @ProExpertProg on the need for recording these methods into the graph could shed some light.
PS: requested changes, just to be sure these points are discussed.
_EMPTY_SCHEDULER_OUTPUT: Final[SchedulerOutput] = SchedulerOutput(
    scheduled_new_reqs=[],
    scheduled_cached_reqs=CachedRequestData.make_empty(),
    num_scheduled_tokens={},
    total_num_scheduled_tokens=0,
    scheduled_spec_decode_tokens={},
    scheduled_encoder_inputs={},
    num_common_prefix_blocks=[],
    finished_req_ids=set(),
Not a fan of having to do this; also, we probably should not maintain this structure here, as it belongs in the scheduler "namespace".
Furthermore, I am not sure this pre-init has any benefit, given it should only be called once during the dummy run.
Given that SchedulerOutput is already an input to GPUModelRunner, I think it’s acceptable for the runner to construct a dummy instance for its own internal use.
Also, the dummy run may be invoked multiple times (e.g., for model warm-up or for capturing different CUDA graphs), so having a pre-initialized SchedulerOutput keeps this path simple and consistent.
if has_kv_transfer_group():
    kv_connector = get_kv_transfer_group()
    meta = kv_connector.build_connector_meta(_EMPTY_SCHEDULER_OUTPUT)
    _EMPTY_SCHEDULER_OUTPUT.kv_connector_metadata = meta
nit: the behavior (rightfully) does not match the other EMPTY_* objects, which are copied to guarantee they remain empty; that might be misleading to some degree.
True — unlike other EMPTY_* objects, this one is only used internally for this specific path, and the leading _ indicates it’s a module-level private constant.
Thanks, @NickLucche, for taking the time to dive into this and for sharing your thoughts. I agree — it makes sense to first clarify whether graph capture for the KV connector is truly valuable before we commit to supporting it in the connector contract. To move things forward, I'll open a separate PR to fix #26675 by following @markmc's suggestion — temporarily disabling graph capture for the KV connector path.
Thanks a lot for your work @sdavidbd!
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Fix #26675
During a dummy run, KV connector APIs such as wait_for_layer_load() and save_kv_layer() are currently invoked outside the usual connector context manager. This causes the connector metadata to remain unset, breaking the invariant that KVConnectorMetadata must always be initialized before model execution, and leading to assertion failures.
This PR fixes the KV connector path for dummy runs, ensuring that connector APIs are invoked through the same code path as in normal runs and that connector invariants (such as setting metadata) are preserved.
Test Plan
Reproduce the scenario described in #26675 (full-graph capture with KV connector).
Run with and without the fix applied.
Test Result
With the fix applied, the assertion reported in #26675 no longer reproduces.
Full-graph capture completes successfully with the KV connector enabled.