[V1] [P/D] Refactor KV Connector Path #21980
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default; only a small subset of checks runs initially. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request introduces a valuable refactoring by encapsulating the KV connector lifecycle within a context manager in GPUModelRunner. This significantly improves code clarity and maintainability. The consolidation of KV-related fields into a single kv_connector_output in ModelRunnerOutput and IntermediateTensors is also a welcome change that enhances readability.
However, a critical issue has been introduced in the TPUModelRunner. The refactoring was not applied to it, and it now calls methods that have been removed from KVConnectorModelRunnerMixin, which will cause runtime failures. This needs to be addressed before merging.
NickLucche
left a comment
Hey thanks for your work!
I am just wondering why can't we have KVConnector.get_finished return a KVConnectorOutput? That would make for easier extensibility as we need to move more stuff from workers to executor.
Thanks @NickLucche! That's a great question. Shaping the output returned from the connector into a general structure is still a work in progress. To avoid locking in a premature design, I believe it's best to construct
Thanks @sdavidbd, this looks great to me.
I also feel it would make sense to return KVConnectorOutput from get_finished(), but since there's not yet agreement on that we could get this merged asap and handle that as a follow-on.
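To illustrate the idea being discussed, here is a minimal sketch of what a consolidated output type could look like and how a raw `get_finished()` tuple could be adapted into it. The field and function names are assumptions for illustration, not vLLM's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class KVConnectorOutput:
    # Hypothetical consolidated output; the field names mirror the
    # kind of ad-hoc KV-transfer fields this PR replaces (assumption).
    finished_sending: set = field(default_factory=set)
    finished_recving: set = field(default_factory=set)

    def is_empty(self) -> bool:
        return not self.finished_sending and not self.finished_recving


# A connector whose get_finished() returns a raw tuple today...
def get_finished():
    return {"req-1"}, set()


# ...can be adapted to the structured form at the call site:
sending, recving = get_finished()
out = KVConnectorOutput(finished_sending=sending, finished_recving=recving)
```

Having `get_finished()` itself return such an object (as suggested above) would push this wrapping into the connector, at the cost of committing to the structure now.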
I think calling an additional API brings the complexity of keeping atomicity. Regarding "third-party impl.", if this refers to the implementations inside the vllm/ repo, I guess we could just change all of them at once? Nevertheless, I'm OK merging this PR. At least in the follow-up PR, we won't need to touch
We have to fix the tests.
There are more failing tests in the V1 test suite; please fix them. The rest should be fixed if you merge from main.
Force-pushed from 5d3ceee to 373df31
@DarkLight1337 Failed checks appear to be caused by known issues unrelated to this PR:
… connector path Signed-off-by: David Ben-David <davidb@pliops.com>
Force-pushed from 373df31 to 25f2873
Thanks, @lk-chen! A significant part of this PR is the introduction of the KV connector context manager, which manages the connector lifecycle over a single model execution (i.e., a scheduling step). This provides a natural boundary for atomicity. Regarding third-party implementations, I was referring to out-of-tree connectors that are dynamically loaded via
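The lifecycle boundary described above can be sketched as a context manager that binds connector metadata on entry and collects results and clears state on exit, so one scheduling step is handled as an atomic unit. This is a minimal illustration; the function, class, and method names here are assumptions, not vLLM's actual implementation.

```python
from contextlib import contextmanager


@contextmanager
def kv_connector_step(connector, scheduler_output):
    # Bind metadata on entry; collect output and clear on exit,
    # so setup/teardown bracket exactly one scheduling step.
    connector.bind_connector_metadata(scheduler_output)
    result = {}
    try:
        yield result  # the model forward(s) run inside this scope
    finally:
        result["finished"] = connector.get_finished()
        connector.clear_connector_metadata()


class DemoConnector:
    # Stand-in connector for demonstration only.
    def __init__(self):
        self.bound = None
        self.cleared = False

    def bind_connector_metadata(self, meta):
        self.bound = meta

    def get_finished(self):
        return {"req-0"}

    def clear_connector_metadata(self):
        self.bound = None
        self.cleared = True


conn = DemoConnector()
with kv_connector_step(conn, scheduler_output="step-1") as out:
    pass  # the execute_model body would run here
```

After the `with` block exits, the connector metadata has been cleared and `out["finished"]` holds the finished request IDs, without the caller having to remember the setup/teardown pairing.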
Signed-off-by: David Ben-David <davidb@pliops.com> Co-authored-by: David Ben-David <davidb@pliops.com>
):
        self.maybe_setup_kv_connector(scheduler_output)
), self.maybe_get_kv_connector_output(
        scheduler_output) as kv_connector_output:
With "Always call connector clear_metadata() at end of step", clear_connector_meta was invoked at the end of the step, i.e., after the draft model forward.
With the current change, it is invoked before the draft model forward, which leaves the connector unable to send the draft layers' KV cache.
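The ordering concern raised here can be sketched as follows: when speculative decoding runs a draft model, the connector metadata must be cleared only after all forwards of the step, including the draft model's. The event names below are illustrative placeholders, not vLLM code.

```python
# Sketch of the required per-step ordering with a draft model.
events = []


def run_step(include_draft: bool):
    events.clear()
    events.append("target_forward")
    if include_draft:
        # The draft forward still needs the connector metadata so the
        # draft layers' KV cache can be sent.
        events.append("draft_forward")
    # Clearing here, at the true end of the step, is safe; clearing
    # between the two forwards (as the new code does) is too early.
    events.append("clear_connector_meta")


run_step(include_draft=True)
```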


Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
This PR refactors the KV connector integration in GPUModelRunner.execute_model by introducing a context manager that encapsulates the lifecycle of the KV connector. This clarifies the execution flow and improves modularity.

Additionally, this PR simplifies IntermediateTensors and ModelRunnerOutput by consolidating multiple ad-hoc KV-related fields into a single kv_connector_output field of type KVConnectorOutput, which improves readability and maintainability.

Test Plan
Run all existing tests.
Test Result
All tests pass.
(Optional) Documentation Update