
Add FULL CUDA-Graph support for KV Connector path #27026

Draft

sdavidbd wants to merge 2 commits into vllm-project:main from sdavidbd:fix/dummy_run_kv_connector_path

Conversation

@sdavidbd
Contributor

@sdavidbd sdavidbd commented Oct 16, 2025

Purpose

Fix #26675

During a dummy run, KV connector APIs such as wait_for_layer_load() and save_kv_layer() are currently invoked outside the usual connector context manager.
This causes the connector metadata to remain unset, breaking the invariant that KVConnectorMetadata must always be initialized before model execution, and leading to assertion failures.

This PR fixes the KV connector path for dummy runs, ensuring that connector APIs are invoked through the same code path as in normal runs and that connector invariants (such as setting metadata) are preserved.
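The failure mode can be illustrated with a minimal sketch. The class and method names below are simplified stand-ins (not vLLM's actual implementation); only the invariant itself — metadata must be bound before connector APIs run — follows the PR description.

```python
from contextlib import contextmanager


class KVConnectorMetadata:
    """Stand-in for the per-step connector metadata."""


class SketchKVConnector:
    """Hypothetical, simplified connector; not vLLM's actual class."""

    def __init__(self):
        self._connector_metadata = None

    def bind_connector_metadata(self, metadata):
        self._connector_metadata = metadata

    def clear_connector_metadata(self):
        self._connector_metadata = None

    def wait_for_layer_load(self, layer_name):
        # The invariant described above: metadata must be bound before
        # any model execution touches the connector.
        assert self._connector_metadata is not None
        # ... a real connector would block here until the layer's KV arrived


@contextmanager
def connector_context(connector, metadata):
    # Sketch of the context manager a normal run goes through.
    connector.bind_connector_metadata(metadata)
    try:
        yield
    finally:
        connector.clear_connector_metadata()


connector = SketchKVConnector()

# Normal run: metadata is bound, so the call succeeds.
with connector_context(connector, KVConnectorMetadata()):
    connector.wait_for_layer_load("layers.0")

# Dummy run before this fix: the context manager is skipped, so the
# assertion fires.
try:
    connector.wait_for_layer_load("layers.0")
    failed = False
except AssertionError:
    failed = True
print(failed)  # True
```

In this sketch the dummy-run path hits exactly the kind of `assert self._connector_metadata is not None` failure reported in #26675.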

Test Plan

Reproduce the scenario described in #26675 (full-graph capture with KV connector).
Run with and without the fix applied.

Test Result

With the fix applied, the assertion reported in #26675 no longer reproduces.
Full-graph capture completes successfully with the KV connector enabled.

@mergify

mergify bot commented Oct 16, 2025

Documentation preview: https://vllm--27026.org.readthedocs.build/en/27026/

@mergify mergify bot added documentation Improvements or additions to documentation v1 kv-connector labels Oct 16, 2025
@gemini-code-assist gemini-code-assist bot left a comment (Contributor)

Code Review

This pull request effectively resolves an assertion failure during dummy runs with KV connectors by introducing a dedicated dummy metadata instance. The approach of using a sentinel DUMMY_CONNECTOR_METADATA object and an _is_dummy_run() helper method is a clean solution that ensures the connector's code path is consistent for both normal and dummy runs. The changes are correctly propagated through various connector implementations, and the core fix in kv_connector_model_runner_mixin.py is well-implemented. I've identified one critical issue in an example connector where a method is accessed as a property, which will lead to incorrect behavior.
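The "method accessed as a property" bug class the review flags is easy to illustrate; the names below are hypothetical, not the actual connector code.

```python
# A bound method object is always truthy, so forgetting the call
# parentheses silently turns a False-returning check into True.


class Connector:
    def _is_dummy_run(self) -> bool:
        return False


c = Connector()

# Bug: missing parentheses yields the bound method, which is truthy,
# so a guard like `if c._is_dummy_run:` is always taken.
print(bool(c._is_dummy_run))    # True

# Fix: actually call the method.
print(bool(c._is_dummy_run()))  # False
```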

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@sdavidbd sdavidbd force-pushed the fix/dummy_run_kv_connector_path branch from 9e4c041 to 8446d7c Compare October 16, 2025 14:09
@markmc markmc left a comment (Member)

If I'm understanding this correctly, I think we should easily be able to find a way to not call these KV connector APIs in the context of a dummy run ... and that way we don't need to change connectors

e.g. can we make maybe_save_kv_layer_to_connector() just return if there is no metadata bound to the connector?

Like ... why would we want the connector to be called at all for a warm-up run?

```python
# In a dummy run, the connector has no real metadata; this sentinel is bound to the
# connector to preserve the invariant that the connector always has metadata bound
# before model execution.
DUMMY_CONNECTOR_METADATA: Final = _DummyKVConnectorMetadata()
```
Member
I'm pretty dubious about this - it would seem an obvious invariant that a connector should expect to only ever see metadata returned from its build_connector_meta() method

Contributor Author

I agree — a cleaner approach would be to have build_connector_meta() construct a concrete metadata object for the dummy run, for example by invoking it with an empty SchedulerOutput.

I'll work on such an alternative. The only concern is that, in the dummy run, build_connector_meta() would be invoked from the worker process, while this API is currently expected to be called only from the scheduler process; that causes assertions in some connectors, such as NixlConnector.

```diff
@@ -256,6 +256,9 @@ def build_kv_connector_stats(
         )

     def start_load_kv(self, forward_context: "ForwardContext", **kwargs) -> None:
+        if self._is_dummy_run():
+            return
```
Member

What happens to existing out-of-tree connectors that don't have this code?

Contributor Author

Currently, such connectors would hit an assertion if they attempt to access metadata during a dummy run. Out-of-tree connectors may therefore need minor adjustments to use _is_dummy_run() and handle this case explicitly. I agree that the more robust alternative, invoking build_connector_meta() to create a concrete metadata object for the dummy run instead of using DUMMY_CONNECTOR_METADATA, would make this seamless for concrete connectors.
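To make the mechanism under discussion concrete, here is a minimal sketch of the sentinel approach. DUMMY_CONNECTOR_METADATA and _is_dummy_run() follow the names in the diff above; the surrounding class is a simplified illustration, not vLLM's code.

```python
class KVConnectorMetadata:
    pass


class _DummyKVConnectorMetadata(KVConnectorMetadata):
    """Marker type bound only during dummy runs."""


DUMMY_CONNECTOR_METADATA = _DummyKVConnectorMetadata()


class SketchConnector:
    def __init__(self):
        self._connector_metadata = None

    def bind_connector_metadata(self, metadata):
        self._connector_metadata = metadata

    def _is_dummy_run(self):
        # Identity check against the module-level sentinel. A connector
        # that is never updated to bind the sentinel would instead hit
        # the assertion below during a dummy run.
        return self._connector_metadata is DUMMY_CONNECTOR_METADATA

    def start_load_kv(self):
        if self._is_dummy_run():
            return  # nothing to load during warm-up / graph capture
        assert self._connector_metadata is not None
        # ... a real connector would issue the actual KV load here


connector = SketchConnector()
connector.bind_connector_metadata(DUMMY_CONNECTOR_METADATA)
print(connector._is_dummy_run())  # True
connector.bind_connector_metadata(KVConnectorMetadata())
print(connector._is_dummy_run())  # False
```

This is why out-of-tree connectors need changes: the sentinel is only harmless if the connector's own methods check for it before touching metadata.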

@harrisonyhq

> If I'm understanding this correctly, I think we should easily be able to find a way to not call these KV connector APIs in the context of a dummy run ... and that way we don't need to change connectors
>
> e.g. can we make maybe_save_kv_layer_to_connector() just return if there is no metadata bound to the connector?
>
> Like ... why would we want the connector to be called at all for a warm-up run?

Maybe we want to capture the dump and load behavior of the connector into the full CUDA graph? This is not just a simple warm-up: it is a dummy run invoked during CUDA graph capture.

@markmc
Member

markmc commented Oct 28, 2025

> Maybe we want to capture the dump and load behavior of the connector into the full CUDA graph? This is not just a simple warm-up: it is a dummy run invoked during CUDA graph capture.

You may well be correct, but it's not obvious to me 🤷. Care to explain in more detail, with examples of what connector behavior should be captured in the CUDA graph?

I feel like we should start with a simple fix for #26675, however. So to re-state my proposal in more detail:

  1. It should be valid to call self.model() in GPUModelRunner with no connector metadata bound - we should simply skip any calls into the connector in that case
  2. A KVConnector.has_connector_metadata() method could be added to check whether there is bound metadata
  3. We can skip connector.save_kv_layer() in maybe_save_kv_layer_to_connector() if there is no connector metadata bound
  4. Same for connector.wait_for_layer_load in wait_for_kv_layer_from_connector()
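The four-point proposal above can be sketched roughly as follows. Everything here is a simplified, hypothetical shape (not the actual vLLM classes); `has_connector_metadata()` is the method proposed in point 2.

```python
class SketchConnector:
    def __init__(self):
        self._connector_metadata = None
        self.saved_layers = []

    def bind_connector_metadata(self, metadata):
        self._connector_metadata = metadata

    def has_connector_metadata(self):
        # Point 2: cheap check for whether metadata is currently bound.
        return self._connector_metadata is not None

    def save_kv_layer(self, layer_name):
        self.saved_layers.append(layer_name)


def maybe_save_kv_layer_to_connector(connector, layer_name):
    # Point 3: a dummy run never binds metadata, so the connector is
    # simply never called and connector implementations need no changes.
    if not connector.has_connector_metadata():
        return
    connector.save_kv_layer(layer_name)


connector = SketchConnector()
maybe_save_kv_layer_to_connector(connector, "layers.0")  # dummy run: no-op
connector.bind_connector_metadata(object())              # normal run
maybe_save_kv_layer_to_connector(connector, "layers.0")
print(connector.saved_layers)  # ['layers.0']
```

The trade-off raised later in the thread still applies: with this guard the connector calls are skipped entirely during capture, so save/load work is never recorded into the graph.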

David Ben-David added 2 commits October 29, 2025 14:03
Signed-off-by: David Ben-David <davidb@pliops.com>
…R_METADATA

Signed-off-by: David Ben-David <davidb@pliops.com>
@NickLucche NickLucche left a comment (Collaborator)

Thank you for your work @sdavidbd!

After some thinking: while I am not a huge fan of the implementation, since it affects OOT connectors, I think this approach is needed if we want to guarantee cuda-graphable save/load ops in the connector contract.

Afaik, tracing these methods could save the overhead of issuing a series of memcpys from cache tensors to a buffer (when present), and in the best case could even record a direct NCCL collective for synchronous scenarios.
That said, these might not be primary use cases in terms of the transfer patterns I've seen so far. Happy to look at a reference of a useful cuda-graphed save/load if you have one, @sdavidbd, to make up my mind.

The only way I see to side-step the issue, as @markmc was suggesting, is to change the contract of these methods to be non-graphable (i.e. we just skip the connector call during tracing).

I feel like a brief comment from @LucasWilkinson or @ProExpertProg on the need for recording these methods into the graph could shed some light.

PS: I requested changes just to be sure these points are discussed.

Comment on lines +31 to +39
```python
_EMPTY_SCHEDULER_OUTPUT: Final[SchedulerOutput] = SchedulerOutput(
    scheduled_new_reqs=[],
    scheduled_cached_reqs=CachedRequestData.make_empty(),
    num_scheduled_tokens={},
    total_num_scheduled_tokens=0,
    scheduled_spec_decode_tokens={},
    scheduled_encoder_inputs={},
    num_common_prefix_blocks=[],
    finished_req_ids=set(),
```
Collaborator

not a fan of having to do this; also, we probably should not maintain this structure here, as it belongs in the scheduler "namespace".
Furthermore, I am not sure this pre-init has any benefit, given it should only be needed once during the dummy run.

Contributor Author

Given that SchedulerOutput is already an input to GPUModelRunner, I think it’s acceptable for the runner to construct a dummy instance for its own internal use.
Also, the dummy run may be invoked multiple times (e.g., for model warm-up or for capturing different CUDA graphs), so having a pre-initialized SchedulerOutput keeps this path simple and consistent.

```python
if has_kv_transfer_group():
    kv_connector = get_kv_transfer_group()
    meta = kv_connector.build_connector_meta(_EMPTY_SCHEDULER_OUTPUT)
    _EMPTY_SCHEDULER_OUTPUT.kv_connector_metadata = meta
```
Collaborator

nit: the behavior here does not match the other EMPTY_* objects, which are (rightfully) copied to guarantee they remain empty; this might be misleading to some readers.

Contributor Author

True — unlike other EMPTY_* objects, this one is only used internally for this specific path, and the leading _ indicates it’s a module-level private constant.

@sdavidbd
Contributor Author

sdavidbd commented Nov 6, 2025

Thanks, @NickLucche, for taking the time to dive into this and for sharing your thoughts.

I agree — it makes sense to first clarify whether graph capture for the KV connector is truly valuable before we commit to supporting it in the connector contract.

To move things forward, I’ll open a separate PR to fix #26675 by following @markmc’s suggestion — temporarily disabling graph capture for the KV connector path.
We can then keep this PR focused on enabling CUDA graph support and validating its benefits. I’ll also prepare a small proof-of-concept example to demonstrate cases where graphed save/load operations could be useful.

@NickLucche
Collaborator

Thanks a lot for your work @sdavidbd !

@NickLucche
Collaborator

NickLucche commented Nov 12, 2025

Closing for now as you landed #28253. Feel free to re-open at any time for more discussion @sdavidbd.
Nvm, let's keep this open.

@NickLucche NickLucche closed this Nov 12, 2025
@NickLucche NickLucche reopened this Nov 12, 2025
@sdavidbd sdavidbd marked this pull request as draft November 17, 2025 15:24
@sdavidbd sdavidbd changed the title [Bugfix] Fix KV connector path for dummy runs Add FULL CUDA-Graph support for KV Connector path Nov 17, 2025
@mergify mergify bot added the nvidia label Nov 17, 2025
@mergify

mergify bot commented Nov 24, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sdavidbd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

documentation (Improvements or additions to documentation) · kv-connector · needs-rebase · nvidia · v1


Development

Successfully merging this pull request may close these issues.

[Bug]: AssertionError: assert self._connector_metadata is not None when using graph mode

4 participants