[Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator#31707
Conversation
Code Review
This pull request introduces a significant improvement to the HybridKVCacheCoordinator by enabling support for multiple, interleaved KV cache groups, moving beyond the previous two-type limitation. The core logic in find_longest_cache_hit has been thoughtfully re-implemented with an iterative approach to correctly identify common cache prefixes across various attention mechanisms, including those with non-monotonic hit properties like sliding window and Mamba. The new tests are thorough, covering configurations with three attention types and interleaved group IDs, which validates the increased flexibility. The changes are well-structured and appear robust. Overall, this is a solid enhancement to support more complex model architectures.
heheda12345
left a comment
Per offline discussion, we'll also consider models without full attention in this PR.
Hi @ivanium, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Force-pushed from 6139aad to 2bc630a
Force-pushed from 7045a1d to 2f2d05b
is_full_attn = isinstance(spec, FullAttentionSpec)

# Full attention: reuse cached blocks (downward-closed property)
cached_blocks = hit_blocks_by_group[group_ids[0]]
use kv_cache_spec as key?
There was a problem hiding this comment.
But we need hit_blocks_by_group as a list for the return value anyway.
vllm/v1/core/kv_cache_coordinator.py
Outdated
for group_id in group_ids:
    group_blocks = hit_blocks_by_group[group_id]
    if group_blocks is not None:
        del group_blocks[num_blocks:]
I think trimming full attention in every iteration is clearer and should have similar efficiency.
vllm/v1/core/kv_cache_coordinator.py
Outdated
if is_full_attn and cached_blocks is not None:
    # Full attention is downward-closed; if the candidate
    # `hit_length` was reduced by other groups, trim cached blocks
    # so subsequent reuse reflects the current candidate length.
What about this flow?
We only need to compute the cache hit length for full attention once. Starting from the second iteration, we can simply keep the first hit_length // block_size blocks from the previous iteration, since hit_length only decreases at each step.
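A minimal sketch of this suggested flow, relying on the downward-closed property of full attention: because hit_length only shrinks, later iterations can slice the previous result instead of recomputing it. The function name and signature are illustrative, not vLLM's API.

```python
def reuse_full_attn_hit(
    prev_blocks: list[int], hit_length: int, block_size: int
) -> list[int]:
    """Keep the first hit_length // block_size blocks of the previous
    full-attention hit; valid because full attention is downward-closed
    and hit_length is non-increasing across iterations."""
    return prev_blocks[: hit_length // block_size]
```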
vllm/v1/core/kv_cache_coordinator.py
Outdated
if curr_hit_length < hit_length:
    hit_length = curr_hit_length
    reduced = True
    break
Should we add a break here? IMO it makes sense to iterate over all groups to get the minimum hit length in every while-loop iteration.
Head branch was pushed to by a user without write access
Force-pushed from 7134ce4 to 3e09884
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 3e09884 to d1a842f
Force-pushed from 231ae3b to 35de555
Force-pushed from 35de555 to be88dbe
find_longest_cache_hit Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Force-pushed from be88dbe to 34ca454
vllm-project#31707) Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Purpose
The current hybrid KV cache coordinator supports at most two attention types (full attention plus one other sliding-window or Mamba attention). However, emerging models need more flexible support, for example full attention plus sliding-window attention with various window sizes, as required in #31592 and #30263.
Since prefix caching for sliding window and Mamba does not have the monotonic prefix cache hit property (viz., a cache hit at position i does not imply a cache hit at position j where j < i), find_longest_cache_hit needs to check all attention groups until it finds a prefix that gets cache hits from all of them. This is what this PR implements.
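The check-all-groups-until-agreement idea can be sketched as a fixed-point loop: shrink the candidate hit length until every attention group reports a hit at that length. The helper find_hit_length_for_group stands in for the per-group managers' lookup and is an assumption for illustration, not vLLM's actual API.

```python
from collections.abc import Callable

def find_longest_common_hit(
    groups: list,
    max_length: int,
    find_hit_length_for_group: Callable[[object, int], int],
) -> int:
    """Shrink the candidate hit length until all groups hit at it."""
    hit_length = max_length
    while hit_length > 0:
        reduced = False
        for group in groups:
            # Each group may report a shorter hit than the candidate;
            # non-monotonic groups (sliding window, Mamba) can shrink it
            # arbitrarily, which is why we must re-check every group.
            curr = find_hit_length_for_group(group, hit_length)
            if curr < hit_length:
                hit_length = curr
                reduced = True
        if not reduced:
            break  # every group hits at hit_length: fixed point reached
    return hit_length
```

With monotonic groups the loop converges in one or two passes; with non-monotonic groups it keeps shrinking until all groups agree or the length reaches zero.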
Test Plan

pytest tests/v1/core/test_prefix_caching.py -q

Test Result

Passed
Note
Expands hybrid KV caching beyond the prior 2-type (full+other) limit and introduces a unified cache-hit algorithm across arbitrary attention groups.

Refactors HybridKVCacheCoordinator to group KV cache groups by identical KVCacheSpec, prioritize FullAttentionSpec, and compute the LCM across all block sizes for alignment.

Implements an iterative fixed-point find_longest_cache_hit that iteratively constrains hit length across attention types and reuses cached full-attention hits when possible.

Updates imports to use SingleTypeKVCacheManager and removes the hardcoded 2-group/full-attn assumptions. Keeps divisibility validation and DCP/PCP constraints; returns per-group hit blocks as a tuple aligned to group indices.

Tests: adds helpers to build mixed-spec configs and a parameterized test_prefill_hybrid_model_combinations covering 2–4 groups, interleaving, sliding-window variants, and Mamba; updates existing hybrid tests and fixtures accordingly.

Written by Cursor Bugbot for commit 35de55507998323ec4bf15eac3f9cca8f5ff504a. This will update automatically on new commits. Configure here.
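The LCM-based alignment mentioned in the summary can be sketched as follows: with heterogeneous block sizes, a hit length common to all groups must be a multiple of the least common multiple of the block sizes. The helper below is a hedged illustration of that idea, not the coordinator's actual code.

```python
import math
from functools import reduce

def aligned_hit_length(hit_length: int, block_sizes: list[int]) -> int:
    """Round hit_length down to a multiple of the LCM of all block sizes,
    so the hit can be expressed in whole blocks for every group."""
    lcm = reduce(math.lcm, block_sizes)
    return (hit_length // lcm) * lcm
```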