[KVConnector] Add KV events to KV Connectors#28309
NickLucche merged 16 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a mechanism to send KV cache events from the worker side to the scheduler side, which is a crucial feature for connectors that generate these events. The changes are logical, but I've identified a critical bug in the lmcache_connector.py that could lead to event loss, and an incorrect type hint in kv_connector_model_runner_mixin.py. Addressing these issues will improve the correctness and maintainability of the new functionality.
vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py
This commit follows recommendation from @markmc to use a specific event property instead of piggybacking on stats. PR vllm-project#28309 adds the events property to KVConnectorOutput and this commit picks up the new property and uses it to pass the events from worker side to scheduler side. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
@hickeyma I think you're missing aggregation code in

@njhill Let's think about how this change fits into the longer-term future.
At the time when I introduced worker->scheduler output aggregation (#19555), my initial suggestion was to allow abstract metadata to flow from worker connectors to the scheduler connector.
My view is that we should rename

cc @sdavidbd
I agree with this in general - e.g. in an earlier iteration of #26172 I found myself adding NIXL-specific semantics to the contents of
I'm not sure I agree with this, though:
But I do think it could be positive to add
Thanks for the great breakdown @orozery.
I understand your point here, but I believe that adding more fields to
I agree with the usefulness of

My main point is actually the other way around: for fields that are useful only for a single connector (and I think this may be the case here?), we should have an abstract field that each connector can use freely (similar to
Agree 👍
Honestly, I don't yet fully understand the motivation for the lmcache connector to emit its own events (as per #28252 and LMCache/LMCache#1846) in addition to the KV events already emitted by vLLM ... so I don't have a strong sense either way whether this need is highly specific to lmcache
Thanks @njhill @orozery @markmc @NickLucche for the feedback. I will get to it in due course; I'm a bit delayed this week because I'm at KubeCon.
@markmc The reason a connector like LMCache would need to generate its own events comes down to specific details. How LMCache stores the tokens/hashes (e.g. LMCache stores tokens in blocks with a granularity of 256, while vLLM's block granularity is 16) and where it stores them (e.g. LMCache might store them to disk) will differ from vLLM. Therefore, a consumer of the events would be missing the full context of the cache if it only consumed the vLLM events.
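The granularity mismatch above can be made concrete with a little block arithmetic. This is an illustrative sketch only (the function and constant names are hypothetical, not vLLM or LMCache APIs): the same token prefix maps to different cache-block identities under each granularity, so events keyed on one block size cannot describe the other store.

```python
def num_full_blocks(num_tokens: int, block_size: int) -> int:
    """Number of complete cache blocks covering a token prefix."""
    return num_tokens // block_size

VLLM_BLOCK_SIZE = 16       # vLLM GPU KV-cache block granularity
LMCACHE_CHUNK_SIZE = 256   # LMCache storage chunk granularity

prefix_tokens = 512
vllm_blocks = num_full_blocks(prefix_tokens, VLLM_BLOCK_SIZE)        # 32 blocks
lmcache_chunks = num_full_blocks(prefix_tokens, LMCACHE_CHUNK_SIZE)  # 2 chunks
```

A router consuming only the 16-token-block vLLM events would see 32 cached blocks, but would have no way to know whether LMCache holds the same prefix as 2 chunks on disk, hence the need for LMCache to emit its own events.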
I agree with @NickLucche about having a separate field. @orozery I understand your reservation about only one connector possibly using it, but I still think the separation is best to clearly identify it.
On the LMCache side, the motivation for LMCache to emit
@ApostaC Can you also take a look?
I like the idea of separating the KVEvents. My understanding is that KVEvents is becoming a standard format for fine-grained KV cache monitoring, and there will be other components using it in the future. Therefore, leaving a standard interface for it makes a lot of sense to me. Regarding
Quick follow-up regarding the KV event aggregator: @hickeyma There is a class that combines the connector outputs from the worker processes into a single output and sends it back to the scheduler: vllm/vllm/distributed/kv_transfer/kv_connector/utils.py, lines 123 to 125 in d8874c6. IIUC, since KV events should work for all the connectors, it's better to add such aggregation logic for KV events in

cc @orozery @NickLucche @njhill (please correct me if my understanding is wrong).
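The aggregation step being discussed can be sketched in a few lines. This is a minimal illustration of the idea only; the class and field names here are stand-ins, not the actual vLLM aggregator or `KVConnectorOutput` API.

```python
from dataclasses import dataclass, field

@dataclass
class WorkerConnectorOutput:
    # Hypothetical stand-in for a per-worker connector output.
    finished_sending: set[str] = field(default_factory=set)
    kv_cache_events: list[str] = field(default_factory=list)

def aggregate(outputs: list[WorkerConnectorOutput]) -> WorkerConnectorOutput:
    """Combine per-worker connector outputs into one output for the scheduler."""
    combined = WorkerConnectorOutput()
    seen: set[str] = set()
    for out in outputs:
        combined.finished_sending |= out.finished_sending
        for event in out.kv_cache_events:
            if event not in seen:  # drop duplicates emitted by multiple workers
                seen.add(event)
                combined.kv_cache_events.append(event)
    return combined
```

The point of the comment above is that this combining logic should live in the shared aggregator so every connector's events reach the scheduler the same way.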
Thanks for the review @NickLucche. Unit tests now added.
Thanks @NickLucche for the review!
Hey @hickeyma, can you do "update branch" to include the fix for the LMCache unit tests? Otherwise, this looks good to me!
ApostaC
left a comment
Thanks for the contribution and thanks @KuntaiDu @NickLucche for reviewing!
I think the only missing part here is a generic implementation of get_kv_connector_kv_cache_events for the MultiConnector, such that it handles getting these events from multiple connectors, including the case where only a subset of the requested connectors implement the get_kv_connector_kv_cache_events method (+ quick test case).
I believe the changes to multi_connector.py in #22188 could be a good reference.
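The MultiConnector behavior requested above could look roughly like the following. This is a hedged sketch, not the real vLLM `MultiConnector`: the sub-connector classes are hypothetical, and the only detail taken from the discussion is the `get_kv_connector_kv_cache_events` method name and the requirement to tolerate sub-connectors that don't implement it.

```python
class SubConnectorA:
    """Hypothetical sub-connector that emits KV cache events."""
    def get_kv_connector_kv_cache_events(self):
        return ["store:block-a"]

class SubConnectorB:
    """Hypothetical sub-connector with no KV event support."""
    pass

class MultiConnectorSketch:
    def __init__(self, connectors):
        self._connectors = connectors

    def get_kv_connector_kv_cache_events(self):
        events = []
        for connector in self._connectors:
            getter = getattr(connector, "get_kv_connector_kv_cache_events", None)
            if getter is not None:  # skip sub-connectors without event support
                events.extend(getter())
        return events
```

The `getattr` check is what handles the "only a subset of connectors implement the method" case without raising `AttributeError`.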
@hickeyma
This is required for when worker-side operations like CPU offloading generate KV cache events. This commit enables these events to be passed to the scheduler side so that they can be published by the engine. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Update comments: - vllm-project#28309 (review) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
The changes to the connector are for a separate PR, and this PR is independent of it for now. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
LMCache connector changes are not part of this PR, as this PR deals with adding the ability to pass KV events between the worker and scheduler sides, not a specific connector's usage of it. The connector implementation is in a separate PR. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
It is part of the aggregation of kv_connector_output from all workers. For KV cache events, this means combining events from all workers, removing any duplicates. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Clarify that it is the worker side that calls the method. Review comment: - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Equality methods are unnecessary as they are provided by the parent class 'msgspec.Struct' which they extend. Review comment: https://github.com/vllm-project/vllm/pull/28309/files#r2539044714 Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Review comments: - vllm-project#28309 (comment) - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Workers will generate duplicate KV events in LMCache. This commit adds the capability to aggregate events from workers by returning only those that were emitted by all workers. It also provides an abstract class KVConnectorKVEvents that is implemented by connectors to handle how they emit events. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
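The "only events emitted by all workers" rule from this commit message can be sketched as a simple intersection. This is an illustration under the assumption that events compare equal across workers (plain strings here); it is not the actual vLLM implementation.

```python
def events_emitted_by_all(per_worker_events: list[list[str]]) -> list[str]:
    """Keep only events that appear in every worker's event list,
    preserving the order seen by the first worker."""
    if not per_worker_events:
        return []
    common = set(per_worker_events[0])
    for events in per_worker_events[1:]:
        common &= set(events)  # intersect with each subsequent worker
    return [e for e in per_worker_events[0] if e in common]
```

An event seen by only some workers (e.g. a partially completed store) is dropped, so the scheduler publishes each duplicated event exactly once and only when every worker agrees it happened.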
Review comments: - vllm-project#28309 (comment) - vllm-project#28309 (comment) - vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
All workers may not be processed by the same instance running `aggregate()` for the model runner. Therefore, events need to be persisted across KVConnector output processing until they have been retrieved in `take_events()`. This commit persists the events until all workers have completed. Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
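The buffering described in this commit message follows a common accumulate-then-drain pattern. The sketch below is hypothetical (the class name and methods are illustrative, not vLLM's API, apart from the `take_events` name taken from the message above).

```python
class EventBuffer:
    """Persist events across aggregation rounds until the scheduler drains them."""

    def __init__(self):
        self._pending = []

    def add_worker_events(self, events):
        """Called as each worker's connector output is processed."""
        self._pending.extend(events)

    def take_events(self):
        """Drain all buffered events; a second call returns nothing new."""
        events, self._pending = self._pending, []
        return events
```

Because `take_events()` resets the buffer atomically, events accumulated across several `aggregate()` calls are delivered exactly once, even when the calls are interleaved with new worker outputs.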
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Review comment: vllm-project#28309 (comment) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Comment: vllm-project#28309 (review) Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Discussed this with @NickLucche on Slack, and it's agreed that the MultiConnector implementation can be pushed out to another PR. I have added a TODO comment to MultiConnector as a placeholder, as suggested. I'll push a separate PR shortly for this.
NickLucche
left a comment
LGTM as per @hickeyma's comment on follow-up work
Purpose
Enable KV connectors to pass KV Events to vLLM. vLLM can then publish the events using its existing KV Events publisher in the engine.
The motivation for connectors to emit KV Events is to help KV-cache-aware routing make more efficient decisions, based not only on KV caches in GPU memory, but also on KV caches in CPU memory and on local disk.
This PR includes updates to the LMCache connector to implement the `KVConnectorKVEvents` interface, which enables KV events to be passed to vLLM from LMCache. This is part of the solution for:
Note: This is related to #28252 and from feedback by @markmc in #28252 (comment)
Test Plan
pytest tests/v1/kv_connector/unit/test_lmcache_connector.py

Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.