Add FULL CUDA-Graph support for KV Connector path #27026
sdavidbd wants to merge 2 commits into vllm-project:main
Conversation
Documentation preview: https://vllm--27026.org.readthedocs.build/en/27026/
Code Review
This pull request effectively resolves an assertion failure during dummy runs with KV connectors by introducing a dedicated dummy metadata instance. The approach of using a sentinel DUMMY_CONNECTOR_METADATA object and an _is_dummy_run() helper method is a clean solution that ensures the connector's code path is consistent for both normal and dummy runs. The changes are correctly propagated through various connector implementations, and the core fix in kv_connector_model_runner_mixin.py is well-implemented. I've identified one critical issue in an example connector where a method is accessed as a property, which will lead to incorrect behavior.
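As an illustration of the pattern the review describes, it could be reduced to something like the following sketch. The names `DUMMY_CONNECTOR_METADATA`, `_is_dummy_run()`, and `bind_connector_metadata()` follow the PR discussion, but the surrounding scaffolding is invented here and is not vLLM's actual implementation:

```python
from typing import Final


class KVConnectorMetadata:
    """Base class for metadata a connector needs during model execution."""


class _DummyKVConnectorMetadata(KVConnectorMetadata):
    """Sentinel metadata bound during dummy (warm-up / graph-capture) runs."""


# Bound to the connector before a dummy run so the "metadata is always set
# before model execution" invariant holds even without a real scheduler step.
DUMMY_CONNECTOR_METADATA: Final = _DummyKVConnectorMetadata()


class Connector:
    """Hypothetical minimal connector, for illustration only."""

    def __init__(self) -> None:
        self._metadata = None

    def bind_connector_metadata(self, metadata: KVConnectorMetadata) -> None:
        self._metadata = metadata

    def _is_dummy_run(self) -> bool:
        # A dummy run is identified by the sentinel instance itself.
        return self._metadata is DUMMY_CONNECTOR_METADATA

    def start_load_kv(self) -> None:
        assert self._metadata is not None, "metadata must be bound first"
        if self._is_dummy_run():
            return  # nothing to load during warm-up / graph capture
        ...  # real KV-load logic would go here
```

Because the sentinel is bound through the same `bind_connector_metadata()` path as real metadata, the assertion at the top of `start_load_kv()` holds in both normal and dummy runs.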
examples/offline_inference/kv_load_failure_recovery/rogue_shared_storage_connector.py
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed from 9e4c041 to 8446d7c
markmc left a comment:
If I'm understanding this correctly, I think we should easily be able to find a way to not call these KV connector APIs in the context of a dummy run ... and that way we don't need to change connectors
e.g. can we make maybe_save_kv_layer_to_connector() just return if there is no metadata bound to the connector?
Like ... why would we want the connector to be called at all for a warm-up run?
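markmc's suggestion, sketched with a hypothetical stub connector. `maybe_save_kv_layer_to_connector` is the name from the comment above; everything else here is invented for illustration and is not the merged code:

```python
class StubConnector:
    """Minimal stand-in for a KV connector (not vLLM's actual class)."""

    def __init__(self) -> None:
        self._metadata = None
        self.saved_layers = []

    def bind_connector_metadata(self, metadata) -> None:
        self._metadata = metadata

    def get_metadata(self):
        return self._metadata

    def save_kv_layer(self, layer_name: str) -> None:
        self.saved_layers.append(layer_name)


def maybe_save_kv_layer_to_connector(connector: StubConnector,
                                     layer_name: str) -> None:
    # No metadata bound means we are in a dummy run (warm-up / graph
    # capture): skip the connector entirely instead of tripping its
    # metadata assertion.
    if connector.get_metadata() is None:
        return
    connector.save_kv_layer(layer_name)
```

Under this contract, out-of-tree connectors need no changes: the guard lives in the runner-side wrapper, not in each connector.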
# In a dummy run, the connector has no real metadata; this sentinel is bound to the
# connector to preserve the invariant that the connector always has metadata bound
# before model execution.
DUMMY_CONNECTOR_METADATA: Final = _DummyKVConnectorMetadata()
I'm pretty dubious about this - it would seem an obvious invariant that a connector should expect to only ever see metadata returned from its build_connector_meta() method
I agree — a cleaner approach would be to have build_connector_meta() construct a concrete metadata object for the dummy run, for example by invoking it with an empty SchedulerOutput.
I'll work on such an alternative. The only concern is that, in the dummy run, build_connector_meta() would be invoked from the worker process, while this API is currently expected to be called only from the Scheduler process; that causes assertions in some connectors, such as NixlConnector.
@@ -256,6 +256,9 @@ def build_kv_connector_stats(
    )

def start_load_kv(self, forward_context: "ForwardContext", **kwargs) -> None:
    if self._is_dummy_run():
        return
What happens to existing out-of-tree connectors that don't have this code?
Currently, such connectors would hit an assertion if they attempt to access metadata during a dummy run. Out-of-tree connectors may therefore need minor adjustments to use _is_dummy_run() and handle this case explicitly. I agree that the more robust alternative (invoking build_connector_meta() to create a concrete metadata object for the dummy run instead of using DUMMY_CONNECTOR_METADATA) would make this seamless for concrete connectors.
Maybe we want to capture the dump and load behavior of the connector into the full cuda graph? Because here it is not simply a warm-up, but a dummy run called by cuda graph capturing.
You may well be correct, but it's not obvious to me 🤷 Care to explain in more detail with examples of what connector behavior should be captured in the cuda graph? I feel like we should start with a simple fix for #26675, however. So to re-state my proposal in more detail:
Signed-off-by: David Ben-David <davidb@pliops.com>
…R_METADATA Signed-off-by: David Ben-David <davidb@pliops.com>
Force-pushed from 8446d7c to 350d3f1
Thank you for your work @sdavidbd!
After some thinking: while I am not a huge fan of the implementation, due to it affecting OOT connectors, I think this approach is needed if we want to guarantee cuda-graphable save/load ops in the connector contract.
Afaik, tracing these methods could save us the overhead of issuing a bunch of memcpys from cache tensors to a buffer (when present), up to recording a direct nccl collective for synchronous scenarios (best case?).
That said, these might not be primary use-cases, in terms of transfer patterns, from what I've seen so far. Happy to look at a reference of a useful cuda-graphed save/load if you have any, @sdavidbd, to make up my mind.
The only way I see we could side-step the issue, as @markmc was suggesting, is to change the contract of these methods to be non-graphable (i.e., just skip the connector call during tracing).
I feel like a brief comment from @LucasWilkinson @ProExpertProg on the need for recording these methods into the graph could shed some light.
PS: requested changes, just to be sure these points are discussed.
_EMPTY_SCHEDULER_OUTPUT: Final[SchedulerOutput] = SchedulerOutput(
    scheduled_new_reqs=[],
    scheduled_cached_reqs=CachedRequestData.make_empty(),
    num_scheduled_tokens={},
    total_num_scheduled_tokens=0,
    scheduled_spec_decode_tokens={},
    scheduled_encoder_inputs={},
    num_common_prefix_blocks=[],
    finished_req_ids=set(),
Not a fan of having to do this; also, we probably should not maintain this structure here, as it belongs in the scheduler "namespace".
Furthermore, I am not sure this pre-init has any benefit, given it should only be called once during the dummy run.
Given that SchedulerOutput is already an input to GPUModelRunner, I think it’s acceptable for the runner to construct a dummy instance for its own internal use.
Also, the dummy run may be invoked multiple times (e.g., for model warm-up or for capturing different CUDA graphs), so having a pre-initialized SchedulerOutput keeps this path simple and consistent.
if has_kv_transfer_group():
    kv_connector = get_kv_transfer_group()
    meta = kv_connector.build_connector_meta(_EMPTY_SCHEDULER_OUTPUT)
    _EMPTY_SCHEDULER_OUTPUT.kv_connector_metadata = meta
nit: the behavior (rightfully) does not match the other EMPTY_* objects, which are copied to guarantee they remain empty; that might be misleading to some degree.
True — unlike other EMPTY_* objects, this one is only used internally for this specific path, and the leading _ indicates it’s a module-level private constant.
Thanks, @NickLucche, for taking the time to dive into this and for sharing your thoughts. I agree — it makes sense to first clarify whether graph capture for the KV connector is truly valuable before we commit to supporting it in the connector contract. To move things forward, I'll open a separate PR to fix #26675 by following @markmc's suggestion — temporarily disabling graph capture for the KV connector path.
Thanks a lot for your work @sdavidbd!
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Fix #26675
During a dummy run, KV connector APIs such as wait_for_layer_load() and save_kv_layer() are currently invoked outside the usual connector context manager. This causes the connector metadata to remain unset, breaking the invariant that KVConnectorMetadata must always be initialized before model execution, and leading to assertion failures.
This PR fixes the KV connector path for dummy runs, ensuring that connector APIs are invoked through the same code path as in normal runs and that connector invariants (such as setting metadata) are preserved.
Test Plan
Reproduce the scenario described in #26675 (full-graph capture with KV connector).
Run with and without the fix applied.
Test Result
With the fix applied, the assertion reported in #26675 no longer reproduces.
Full-graph capture completes successfully with the KV connector enabled.