[Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector #30166
Conversation
Code Review
This pull request introduces support for using the hybrid KV cache allocator with a KV cache connector, which is a significant enhancement for models with sliding window attention. The goal is to reduce memory pressure and prevent data contention by allocating KV cache blocks more precisely. The changes are extensive, modifying the core allocation logic in SingleTypeKVCacheManager and propagating these changes up to the KVCacheCoordinator and Scheduler. While the overall approach is sound, the implementation contains several temporary workarounds and comments marked as "REMOVE BEFORE MERGE", which are critical to address. I've identified issues in the KV connector factory, the LMCache connector implementation, and potential bugs or data correctness concerns in single_type_kv_cache_manager.py and block_pool.py. These must be resolved to ensure the stability and correctness of the new functionality.
```python
# REMOVE BEFORE MERGE (YIFAN): Revert this warning back to raising
# a ValueError.
logger.warning(
    "Connector %s does not support HMA but HMA is enabled. Please set "
    "--disable-hybrid-kv-cache-manager to disable HMA.",
    connector_cls.__name__,
)
```
This change from raising a ValueError to a logger.warning is marked with a "REMOVE BEFORE MERGE" comment. Using a connector that does not support Hybrid Memory Allocation (HMA) when HMA is enabled can lead to incorrect behavior or hard-to-debug runtime errors. It is much safer to fail fast with an exception. This change should be reverted to raise ValueError before merging to prevent potential issues in production.
```python
raise ValueError(
    f"Connector {connector_cls.__name__} does not support HMA but "
    f"HMA is enabled. Please set `--disable-hybrid-kv-cache-manager`."
)
```

```python
# REMOVE BEFORE MERGE (YIFAN): this is temporary workaround to work with
# LMCache. Remove this once having LMCache-side support for new interfaces.
vllm_config.kv_cache_config = kv_cache_config
```
This block contains a "REMOVE BEFORE MERGE" comment, indicating a temporary workaround. Directly modifying vllm_config by assigning to kv_cache_config is a side effect that can lead to unexpected behavior elsewhere in the system. This workaround should be removed, and a proper solution that avoids mutating the config object should be implemented as noted in the comment.
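One direction that avoids the side effect, sketched only at a high level, is to hand `kv_cache_config` to the component that needs it instead of stashing it on the shared `vllm_config`. The class and parameter names below are illustrative, not the actual vLLM/LMCache interfaces:

```python
# Sketch of the idea only -- names are illustrative, not the real interfaces.

class LMCacheConnectorAdapter:
    def __init__(self, vllm_config, kv_cache_config):
        # Keep a private reference instead of mutating the shared config:
        #     vllm_config.kv_cache_config = kv_cache_config  # side effect
        self._vllm_config = vllm_config
        self._kv_cache_config = kv_cache_config

    @property
    def kv_cache_config(self):
        return self._kv_cache_config
```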
```python
# REMOVE BEFORE MERGE (YIFAN): this is temporary workaround to work with
# LMCache. Remove this once having LMCache-side support for new interfaces.
def request_finished_all_groups(
    self,
    request: "Request",
    block_ids: tuple[list[int], ...],
) -> tuple[bool, dict[str, Any] | None]:
    # NOTE: LMCache overloads request_finished so `block_ids` here can be
    # either list[int] or tuple[list[int], ...]. This could be changed in
    # the future to separate these two methods.
    return self._lmcache_engine.request_finished(request, block_ids)
```
The request_finished_all_groups method is marked as a temporary workaround with a "REMOVE BEFORE MERGE" comment. It appears to be a shim for a new interface required by the hybrid allocator. This temporary implementation should be replaced with a proper solution, and the dependency on this fix in LMCache should be resolved before this pull request is merged.
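For readers unfamiliar with the overloading the NOTE describes, a minimal sketch of how the two call shapes could eventually be separated on the LMCache side might look like the following. This is illustrative only; `_finish_request_for_groups` is a hypothetical helper, not an existing LMCache method:

```python
from typing import Any


def request_finished(
    self,
    request: "Request",
    block_ids: "list[int] | tuple[list[int], ...]",
) -> tuple[bool, dict[str, Any] | None]:
    # Illustrative dispatch only. With the hybrid allocator, block IDs arrive
    # as one list per KV cache group; the legacy path passes a single flat list.
    if isinstance(block_ids, tuple):
        per_group_ids = block_ids
    else:
        per_group_ids = (block_ids,)
    # Hypothetical helper that always operates on per-group block IDs.
    return self._finish_request_for_groups(request, per_group_ids)
```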
Hi @ivanium, the pre-commit checks have failed. Please run:

```bash
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, pre-commit will run automatically.
Good work! In terms of landing this PR, @heheda12345 previously suggested that I split my work into smaller PRs, and I would prefer the same for this PR. Example:
Purpose
This is a continuation of PR #23624 to support the hybrid KV cache manager together with a KV cache connector.
Design doc with details drafted by @KuntaiDu: link
In short, the current hybrid KV cache manager tries to allocate all tokens for sliding-window layers, the same as for full-attention layers, and then in the next scheduling step frees the unneeded tokens (those outside the sliding window) and turns them into prefix cache in GPU memory. This PR instead aims to allocate KV cache only for the tokens inside the sliding window for sliding-window layers (a rough sketch of the difference is shown after this list). This addresses two issues:

1. Allocating all tokens for sliding-window layers puts unnecessary memory pressure on the KV cache pool.
2. The `allocate-then-free` pattern can cause data contention, where the connector might still be copying some KV cache blocks for one request in the background while the manager frees and reuses them for another request.

This PR currently supports only the LMCache connector. Support for the other connectors will be added in follow-up PRs.
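To make the allocation difference concrete, here is a back-of-the-envelope sketch. It is not vLLM's actual `SingleTypeKVCacheManager` logic and ignores block alignment, lookahead/draft tokens, and prefix-cache reuse:

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division."""
    return -(-a // b)


def blocks_needed(num_tokens: int, block_size: int, sliding_window: int | None) -> int:
    # Full-attention layers keep KV for every token; sliding-window layers only
    # need KV for (roughly) the last `sliding_window` tokens.
    if sliding_window is None:
        return cdiv(num_tokens, block_size)
    return cdiv(min(num_tokens, sliding_window), block_size)


# Example: a 32k-token prompt with 16-token blocks and a 512-token window.
print(blocks_needed(32_768, 16, None))  # 2048 blocks for a full-attention layer
print(blocks_needed(32_768, 16, 512))   # 32 blocks for a sliding-window layer
```

Before this PR, the sliding-window layers would transiently allocate the full block count and free the excess in the next scheduling step, which is exactly the window where the connector may still be copying those blocks.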
cc @KuntaiDu @heheda12345
Test Plan
The test script is a modification of the one in PR #25712.
The script should be run with LMCache-side support: LMCache/LMCache#1436.
Caution
Please apply the following patch to LMCache if you get import errors for `cdiv`:

vLLM patch

lmcache-patch

```diff
diff --git a/lmcache/integration/vllm/vllm_v1_adapter.py b/lmcache/integration/vllm/vllm_v1_adapter.py
index a849097..4db64df 100644
--- a/lmcache/integration/vllm/vllm_v1_adapter.py
+++ b/lmcache/integration/vllm/vllm_v1_adapter.py
@@ -18,7 +18,10 @@ from vllm.distributed.parallel_state import (
     get_tp_group,
 )
 from vllm.sampling_params import SamplingParams
-from vllm.utils import cdiv
+try:
+    from vllm.utils import cdiv
+except ImportError:
+    from vllm.utils.math_utils import cdiv
```

To run this script on H100, please save the following code into `test_connector_w_hybrid_kv_allocator.py`, and run `python test_connector_w_hybrid_kv_allocator.py`.

`test_connector_w_hybrid_kv_allocator.py`
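The actual script is in the collapsed section above; for orientation, a minimal sketch of what such a run can look like is below. The model name, LMCache environment variables, and prompt sizes are assumptions, not the script the PR uses:

```python
# Illustrative sketch only -- not the PR's attached test script.
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache settings (assumed: local CPU offloading backend).
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "20"  # GB

llm = LLM(
    model="google/gemma-3-4b-it",  # interleaved sliding-window + full attention
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1",
        kv_role="kv_both",
    ),
    # The hybrid KV cache manager stays enabled (the feature under test).
    max_model_len=32768,
    gpu_memory_utilization=0.8,
)

long_prompt = "word " * 20000  # long enough to exceed the sliding window
outputs = llm.generate([long_prompt], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```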
Test Result
Previously, we could not allocate KV cache for the 3rd request, which tries to allocate long prefixes and load external KV cache even for sliding-window layers. With this PR, the 3rd request allocates only the KV cache needed for sliding-window layers, and can be scheduled and finish with correct results.
Detailed output
I also benchmarked the performance impact of this PR with the Gemma 3 series. There is barely any latency or throughput difference.
This branch
main