[Core] Add register_model() to KVConnectorBase_V1 for CacheBlend#37339
zbennett10 wants to merge 9 commits into vllm-project:main from
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically. You can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
Code Review
This pull request effectively adds a register_model() method to the KVConnector interface, enabling features like LMCache's CacheBlend to access model weights. The changes are well-implemented, maintaining backward compatibility by providing a no-op default in the base class and correctly propagating the model reference through the call stack. The inclusion of comprehensive tests is commendable. I have one suggestion to improve consistency and reduce code duplication in the LMCacheConnectorV1 implementation.
```python
def register_model(self, model: "torch.nn.Module") -> None:
    """Register model with LMCache's VLLMModelTracker for CacheBlend.

    CacheBlend's blender needs access to model weights for selective
    layer recomputation. This method is called automatically by vLLM
    after model loading.
    """
    try:
        from lmcache.v1.compute.models.utils import VLLMModelTracker

        from vllm.distributed.kv_transfer.kv_connector.v1.\
            lmcache_integration.utils import ENGINE_NAME
        VLLMModelTracker.register_model(ENGINE_NAME, model)
        logger.info("Registered model with LMCache VLLMModelTracker")
    except ImportError:
        logger.debug(
            "LMCache CacheBlend model registration not available"
        )
    except Exception:
        logger.warning(
            "Failed to register model with VLLMModelTracker",
            exc_info=True,
        )
```
The register_model method is implemented directly in LMCacheConnectorV1, but other methods in this class (e.g., register_kv_caches) delegate to self._lmcache_engine. This is inconsistent with the class's design and leads to code duplication, as the same register_model logic is also added to LMCacheConnectorV1Impl in vllm_v1_adapter.py.
To improve consistency and maintainability, register_model should delegate the call to self._lmcache_engine. This also makes the implementation in LMCacheConnectorV1Impl effective and avoids having the logic in two places.
```python
def register_model(self, model: "torch.nn.Module") -> None:
    """Register model with LMCache's VLLMModelTracker for CacheBlend.

    Delegates to the underlying LMCache engine implementation.
    """
    if hasattr(self._lmcache_engine, "register_model"):
        self._lmcache_engine.register_model(model)
```
Good catch. Updated in a26cd78: LMCacheConnectorV1.register_model now delegates to self._lmcache_engine.register_model() with a hasattr guard (same pattern as register_kv_caches). The VLLMModelTracker logic lives solely in LMCacheConnectorV1Impl. Tests updated accordingly.
|
Hi @zbennett10, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
|
vLLM never passes the model reference to KV connectors, so LMCache's CacheBlend blender cannot access model weights for selective layer recomputation. `VLLMModelTracker.register_model()` is never called.

Add `register_model(model)` to `KVConnectorBase_V1` as a separate method (no-op by default). `ActiveKVConnector` calls it after `register_kv_caches` when model is available. LMCache connectors override to call `VLLMModelTracker.register_model(ENGINE_NAME, model)`. All other connectors inherit the no-op default — zero behavior change.

Signed-off-by: Zach Bennett <zach@worldflow.ai>
Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
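The hook described here can be sketched minimally as follows. These are illustrative stand-ins, not vLLM's actual classes, and `Any` replaces `torch.nn.Module` to keep the sketch dependency-free:

```python
from typing import Any


class KVConnectorBase_V1:
    """Illustrative stand-in for vLLM's connector base class."""

    def register_model(self, model: Any) -> None:
        # No-op default: connectors that don't need the model
        # weights inherit this and see zero behavior change.
        return None


class ModelAwareConnector(KVConnectorBase_V1):
    """A connector (like LMCache's) that wants the model reference."""

    def __init__(self) -> None:
        self.registered_model: Any = None

    def register_model(self, model: Any) -> None:
        # A real override would hand the model to e.g.
        # VLLMModelTracker.register_model(ENGINE_NAME, model).
        self.registered_model = model


model = object()  # stands in for the loaded torch.nn.Module
KVConnectorBase_V1().register_model(model)  # silently ignored
conn = ModelAwareConnector()
conn.register_model(model)
assert conn.registered_model is model
```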
…torV1

- Remove duplicated VLLMModelTracker logic from LMCacheConnectorV1.register_model
- Delegate to self._lmcache_engine.register_model() using hasattr guard (consistent with register_kv_caches pattern)
- The implementation lives in LMCacheConnectorV1Impl (vllm_v1_adapter.py)
- Update tests: LMCacheConnectorRegisterModel tests now verify delegation; add TestLMCacheConnectorV1ImplRegisterModel for VLLMModelTracker logic

Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
Force-pushed e1d79f1 to 86d9f44
|
@zbennett10 Can you somehow get the weights from the vllm config? |
@orozery Thank you for the quick feedback. With the amount of PRs going into this repo.. I don't know how you all keep up :)
The `Attention` objects stored in … I believe the access pattern in LMCache is: … Since …, two options if you'd prefer to avoid threading `model` through:
Happy to go with whichever approach you prefer. If there's a way to reach the parent … |
|
I'm actually in favor of passing information through the front door, instead of having connectors hack their way to the information they need :) This API can later be extended both in information flowing from the connector and to the connector, without breaking existing connectors. |
…ta pattern

Replace the specific `register_model(model: nn.Module)` method on KVConnectorBase_V1 with a generic dataclass-based initialization API:

```python
@dataclass
class WorkerConnectorInitializationData:
    model: torch.nn.Module | None = None

@dataclass
class WorkerConnectorInitializationResponse:
    pass

def initialize_worker_connector(
    self,
    initialization_data: WorkerConnectorInitializationData,
) -> WorkerConnectorInitializationResponse:
```

This addresses reviewer feedback (orozery) on PR vllm-project#37339. The dataclass pattern is extensible without breaking existing connectors: new optional fields can be added to WorkerConnectorInitializationData in the future (e.g. vllm_config, attn_backend) without changing any connector signatures.

Changes:

- base.py: add dataclasses, replace register_model no-op with initialize_worker_connector returning WorkerConnectorInitializationResponse
- kv_connector.py: always call initialize_worker_connector(data), removing the conditional model-is-not-None guard (each connector decides)
- lmcache_connector.py: override, extract data.model, guard before forwarding to _lmcache_engine.register_model
- lmcache_mp_connector.py: delegate to worker_adapter
- multi_connector.py: fan out to all sub-connectors with same data object
- vllm_v1_adapter.py: override, extract data.model, call VLLMModelTracker
- test_register_model.py: fully updated for new API (16 tests)

Signed-off-by: Zach Bennett <zach@worldflowai.com>
Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
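A minimal sketch of why the dataclass pattern is non-breaking: a new optional field can be appended and existing connector overrides keep working unchanged. The classes below are illustrative stand-ins (and `vllm_config` here is only a hypothetical extension field), not vLLM's actual implementation:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class WorkerConnectorInitializationData:
    model: Optional[Any] = None
    # Hypothetical later addition; because it has a default, no
    # existing connector signature has to change when it appears.
    vllm_config: Optional[Any] = None


class BaseConnector:
    def initialize_worker_connector(
        self, initialization_data: WorkerConnectorInitializationData
    ) -> None:
        return None  # no-op default


class ModelOnlyConnector(BaseConnector):
    # Written before vllm_config existed; still works unchanged.
    def initialize_worker_connector(self, initialization_data):
        self.model = initialization_data.model


conn = ModelOnlyConnector()
conn.initialize_worker_connector(
    WorkerConnectorInitializationData(model="weights")
)
assert conn.model == "weights"
```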
Great feedback.. check out this new interface. I think it maps much more cleanly - let me know what you think. |
```python
def initialize_worker_connector(
    self,
    initialization_data: WorkerConnectorInitializationData,
) -> WorkerConnectorInitializationResponse:
    """Register model with LMCache's VLLMModelTracker for CacheBlend.

    CacheBlend's blender needs access to model weights for selective
    layer recomputation. Called automatically by vLLM after model
    loading.
    """
    model = initialization_data.model
    if model is not None:
        try:
            from lmcache.v1.compute.models.utils import VLLMModelTracker

            from vllm.distributed.kv_transfer.kv_connector.v1.lmcache_integration.utils import (  # noqa: E501
                ENGINE_NAME,
            )

            VLLMModelTracker.register_model(ENGINE_NAME, model)
            logger.info("Registered model with VLLMModelTracker")
        except ImportError:
            logger.debug("LMCache CacheBlend model registration not available")
        except Exception:
            logger.warning(
                "Failed to register model with VLLMModelTracker",
                exc_info=True,
            )
    return WorkerConnectorInitializationResponse()
```
Thanks! @NickLucche @njhill your thoughts about this API? |
Taking that even further, we can add … @zbennett10 BTW your current implementation only applies this new API to the v2 model runner. I think we also want to add it to v1 (…). |
Updated - thanks |
Force-pushed ecf9a82 to be4243c
|
Force-pushed be4243c to b638f98
|
@orozery We're working across the inference ecosystem to define a clean interface for this:
This vLLM PR is a foundational piece — CacheBlend's selective layer recomputation (which SemBlend uses to maintain output quality after KV injection) needs model weights. I think that the initialize_worker_connector pattern you suggested is exactly right for this and will generalize well as connectors need more initialization context. We're also collaborating with the NVIDIA cuVS team on GPU-accelerated vector search for semantic matching at fleet scale - they have agreed to help us co-author the "SemBlend" paper we are working on. Happy to share more details if helpful! This is just the beginning of semantic kv-cache reuse... :) |
|
@orozery Don't mean to pester you - need anything else from me with this one? Thanks :D |
I think the LMCache PRs still need a lot of review/design. Which I expect :) |
NickLucche
left a comment
Thanks for your work @zbennett10 .
I think this new API can be arranged.
However, I am not super comfortable adding changes to the ModelRunnerv2 side, as I am not 100% sure the direction we want to go there is to copy-paste all changes on both sides.
Would it be an issue to scope this PR for v1 only?
```python
from vllm.distributed.kv_transfer.kv_connector.v1.base import (
    WorkerConnectorInitializationData,
)

kv_transfer_group.initialize_worker_connector(
    WorkerConnectorInitializationData(model=self.model)
)
```
could you move this into the KVConnector mixin?
Ideally I'd like to have everything that is behind a `has_kv_transfer_group()` check in there, but we can keep the scope narrow for this PR
torch is imported at runtime, not under TYPE_CHECKING, so the string annotation is unnecessary. Signed-off-by: Zach Bennett <zach@worldflowai.com> Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
Force-pushed f624633 to 531cfa2
That's fair. Handled! @NickLucche |
orozery
left a comment
@zbennett10 Let's revert the changes to model runner v2 (v1/worker/gpu) and restore the changes to v1 (gpu_model_runner.py).
The import for WorkerConnectorInitializationData should be at the top together with the other imports.
Also, we got no response from the LMCache maintainers on this.
I don't want us to make any changes to their connector without their approval.
So either reach out to them, or let's move the LMCache changes to a follow-up PR.
|
@ApostaC @chunxiaozheng @ziruiliu — This PR adds an `initialize_worker_connector()` API to `KVConnectorBase_V1`. The LMCache connector changes are:
Could you review the LMCache-specific changes? The approach uses the same guard patterns already established in the codebase. |
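The guard pattern referenced above can be sketched like this. The engine classes are hypothetical; only the `hasattr` delegation style mirrors the PR:

```python
from typing import Any


class LMCacheConnectorSketch:
    """Illustrative connector that forwards to a bundled engine."""

    def __init__(self, engine: Any) -> None:
        self._lmcache_engine = engine

    def initialize_worker_connector(self, data: Any) -> None:
        # Same guard style as register_kv_caches: engines built
        # before this API existed are skipped without error.
        if hasattr(self._lmcache_engine, "initialize_worker_connector"):
            self._lmcache_engine.initialize_worker_connector(data)


class NewEngine:
    def __init__(self) -> None:
        self.seen: Any = None

    def initialize_worker_connector(self, data: Any) -> None:
        self.seen = data


class OldEngine:
    pass  # predates the new API


engine = NewEngine()
LMCacheConnectorSketch(engine).initialize_worker_connector("payload")
assert engine.seen == "payload"
# No AttributeError for old engines; the call is silently skipped.
LMCacheConnectorSketch(OldEngine()).initialize_worker_connector("payload")
```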
- Revert v2 model runner changes (v1/worker/gpu/), restore hook to v1 gpu_model_runner.py as requested by reviewer
- Drop WorkerConnectorInitializationResponse; return None instead
- Move tests from standalone test_register_model.py into test_multi_connector.py
- Ensure all imports are at top level

Signed-off-by: Zach Bennett <zach@worldflowai.com>
Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
Add LMCache-specific tests for initialize_worker_connector to test_multi_connector.py: delegation to engine, guard clauses, VLLMModelTracker integration, and graceful import error handling. Signed-off-by: Zach Bennett <zach@worldflowai.com> Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
…nnector LMCache initialize_worker_connector tests belong alongside other LMCacheConnectorV1 tests, not in the MultiConnector test file. Signed-off-by: Zach Bennett <zach@worldflowai.com> Signed-off-by: Zachary Bennett <bennett.zachary@outlook.com>
Purpose
LMCache's CacheBlend feature requires access to the loaded model weights for selective layer recomputation via `VLLMModelTracker.register_model()`. However, vLLM never passes the model reference to KV connectors — `register_kv_caches()` only receives `kv_caches: dict[str, torch.Tensor]`. This means CacheBlend's `LMCBlenderBuilder.get_or_create()` fails at runtime when it calls `VLLMModelTracker.get_model()` because the model was never registered.

This PR adds a new `register_model(model)` method to `KVConnectorBase_V1` (no-op by default). This avoids changing the `register_kv_caches` signature, which would require updating 16+ connector implementations.

Changes:

- `KVConnectorBase_V1.register_model()` — no-op default
- `ActiveKVConnector.__init__()` — calls `connector.register_model(model)` after `register_kv_caches()`
- `get_kv_connector()` — accepts optional `model` param, passes to `ActiveKVConnector`
- `GPUModelRunner.init_kv_caches()` — passes `self.model` to `get_kv_connector()`
- `LMCacheConnectorV1.register_model()` — calls `VLLMModelTracker.register_model(ENGINE_NAME, model)`
- `LMCacheConnectorV1Impl.register_model()` — same (bundled adapter)
- `MultiConnector.register_model()` — delegates to sub-connectors
- `LMCacheMPConnector.register_model()` — delegates to worker adapter

All other connectors (NIXL, Mooncake, FlexKV, offloading, etc.) inherit the no-op default — zero behavior change.
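The wiring in the change list can be sketched as follows: `register_model` runs after `register_kv_caches`, and only when a model is supplied. The classes are illustrative stand-ins, not vLLM's real `ActiveKVConnector`:

```python
from typing import Any, Optional


class RecordingConnector:
    """Records call order so the wiring can be checked."""

    def __init__(self) -> None:
        self.calls: list = []

    def register_kv_caches(self, kv_caches: dict) -> None:
        self.calls.append("register_kv_caches")

    def register_model(self, model: Any) -> None:
        self.calls.append("register_model")


class ActiveKVConnectorSketch:
    def __init__(
        self,
        connector: RecordingConnector,
        kv_caches: dict,
        model: Optional[Any] = None,
    ) -> None:
        connector.register_kv_caches(kv_caches)
        if model is not None:  # skip when no model is available
            connector.register_model(model)


with_model = RecordingConnector()
ActiveKVConnectorSketch(with_model, {}, model=object())
assert with_model.calls == ["register_kv_caches", "register_model"]

without_model = RecordingConnector()
ActiveKVConnectorSketch(without_model, {})
assert without_model.calls == ["register_kv_caches"]
```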
Test Plan
Tests cover:
- Base-class `register_model` is defined and it's a no-op
- `ActiveKVConnector` calls `register_model` when model is provided
- `ActiveKVConnector` skips `register_model` when model is None
- `get_kv_connector` passes model through to `ActiveKVConnector`
- `get_kv_connector` returns NoOp when no transfer group
- `LMCacheConnectorV1.register_model` calls `VLLMModelTracker`
- `LMCacheConnectorV1.register_model` handles ImportError gracefully
- `MultiConnector.register_model` delegates to all sub-connectors

Test Result
All 9 tests pass. Ruff lint passes on all 8 changed files.