
[Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator#31707

Merged
simon-mo merged 9 commits into vllm-project:main from ivanium:feat/general-hybrid-prefix-caching
Jan 9, 2026
Conversation

@ivanium
Contributor

@ivanium ivanium commented Jan 5, 2026

Purpose

The current hybrid KV cache coordinator supports at most two attention types (full attention plus one other type, such as sliding-window or mamba attention). However, emerging models need more flexible support, e.g., full attention combined with sliding-window attention at several window sizes, as required in #31592 and #30263.

Since prefix caching for sliding-window and mamba attention lacks the monotonic prefix-cache-hit property (i.e., a cache hit at position i does not imply a cache hit at position j < i), find_longest_cache_hit must check all attention groups until it finds a prefix that gets cache hits from all of them. This is what this PR implements.
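A minimal Python sketch of that fixed-point search (using a hypothetical `FakeGroup` stand-in, not vLLM's actual SingleTypeKVCacheManager API): each group reports its longest hit no longer than the current candidate, and the candidate shrinks until every group agrees.

```python
class FakeGroup:
    """Toy stand-in for one attention group's cache-hit lookup.

    `valid_lengths` lists the token lengths at which this group has a
    cache hit; for sliding-window/mamba this set need not be
    downward-closed (a hit at 64 without a hit at 48 is possible).
    """

    def __init__(self, valid_lengths):
        self.valid = sorted(valid_lengths)

    def longest_hit(self, max_len):
        best = 0
        for length in self.valid:
            if length <= max_len:
                best = length
        return best


def find_longest_cache_hit(groups, num_tokens):
    """Iterate to a fixed point: shrink the candidate hit length until
    every group reports a hit at exactly that length."""
    hit_length = num_tokens
    changed = True
    while changed and hit_length > 0:
        changed = False
        for group in groups:
            curr = group.longest_hit(hit_length)
            if curr < hit_length:
                hit_length = curr
                changed = True
    return hit_length
```

With a full-attention group cached up to 48 tokens and a sliding-window group that hits only at lengths 32 and 64, the loop settles on 32: the candidate goes 70 → 48 (full attention) → 32 (sliding window), then a full pass makes no change.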

Test Plan

pytest tests/v1/core/test_prefix_caching.py -q

Test Result

Passed


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Note

Expands hybrid KV caching beyond the prior 2-type (full+other) limit and introduces a unified cache-hit algorithm across arbitrary attention groups.

  • Refactors HybridKVCacheCoordinator to group KV cache groups by identical KVCacheSpec, prioritize FullAttentionSpec, and compute LCM across all block sizes for alignment

  • Implements an iterative fixed-point find_longest_cache_hit that iteratively constrains hit length across attention types and reuses cached full-attention hits when possible

  • Updates imports to use SingleTypeKVCacheManager and removes the hardcoded 2-group/full-attn assumptions

  • Keeps divisibility validation and DCP/PCP constraints; returns per-group hit blocks as a tuple aligned to group indices

  • Tests: adds helpers to build mixed-spec configs and a parameterized test_prefill_hybrid_model_combinations covering 2–4 groups, interleaving, sliding-window variants, and Mamba; updates existing hybrid tests and fixtures accordingly

Written by Cursor Bugbot for commit 35de55507998323ec4bf15eac3f9cca8f5ff504a. This will update automatically on new commits. Configure here.
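The grouping-and-alignment step in the first bullet can be sketched as follows (a hypothetical tuple representation; the real coordinator buckets KVCacheSpec objects):

```python
from collections import defaultdict
from functools import reduce
from math import lcm


def group_by_spec(kv_cache_groups):
    """Bucket group indices by identical spec and compute the LCM of all
    block sizes, so candidate hit lengths can be aligned to a boundary
    that every group's block size divides.

    Each entry is a (spec_key, block_size) tuple standing in for a real
    KVCacheSpec here.
    """
    by_spec = defaultdict(list)
    for idx, (spec_key, _block_size) in enumerate(kv_cache_groups):
        by_spec[spec_key].append(idx)
    alignment = reduce(lcm, (bs for _, bs in kv_cache_groups), 1)
    return dict(by_spec), alignment
```

For groups `[("full", 16), ("sw1024", 16), ("full", 32)]` this yields the spec buckets `{"full": [0, 2], "sw1024": [1]}` and an alignment of 32.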


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant improvement to the HybridKVCacheCoordinator by enabling support for multiple, interleaved KV cache groups, moving beyond the previous two-type limitation. The core logic in find_longest_cache_hit has been thoughtfully re-implemented with an iterative approach to correctly identify common cache prefixes across various attention mechanisms, including those with non-monotonic hit properties like sliding window and Mamba. The new tests are thorough, covering configurations with three attention types and interleaved group IDs, which validates the increased flexibility. The changes are well-structured and appear robust. Overall, this is a solid enhancement to support more complex model architectures.

Collaborator

@heheda12345 heheda12345 left a comment


Per offline discussion, we'll also consider models without full attention in this PR.

@mergify

mergify bot commented Jan 5, 2026

Hi @ivanium, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@ivanium ivanium force-pushed the feat/general-hybrid-prefix-caching branch from 6139aad to 2bc630a on January 5, 2026 20:27

@ivanium ivanium force-pushed the feat/general-hybrid-prefix-caching branch from 7045a1d to 2f2d05b on January 5, 2026 22:05
is_full_attn = isinstance(spec, FullAttentionSpec)

# Full attention: reuse cached blocks (downward-closed property)
cached_blocks = hit_blocks_by_group[group_ids[0]]
Collaborator


use kv_cache_spec as key?

Contributor Author


But we need hit_blocks_by_group as a list for return values anyway

for group_id in group_ids:
    group_blocks = hit_blocks_by_group[group_id]
    if group_blocks is not None:
        del group_blocks[num_blocks:]
Collaborator


I think trimming full attention in every iteration is clearer and should have similar efficiency.

if is_full_attn and cached_blocks is not None:
    # Full attention is downward-closed; if the candidate
    # `hit_length` was reduced by other groups, trim cached blocks
    # so subsequent reuse reflects the current candidate length.
Collaborator


What about this flow?

We only need to compute the cache hit length for full attention once. Starting from the second iteration, we can simply keep the first hit_length // block_size blocks from the last iteration, since hit_length only decreases at each step.
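That flow relies on full attention being downward-closed, so the cached hit can simply be truncated as the candidate shrinks; a sketch with hypothetical names:

```python
def trim_full_attn_hit(cached_blocks, hit_length, block_size):
    """Truncate a full-attention hit in place to the current candidate
    length. Safe only because full-attention hits are downward-closed:
    every prefix of a cached prefix is itself a cache hit."""
    del cached_blocks[hit_length // block_size:]
    return cached_blocks
```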

if curr_hit_length < hit_length:
    hit_length = curr_hit_length
    reduced = True
    break
Collaborator


Should we add a break here? IMO it makes sense to iterate over all groups to get the minimum hit length in every while-loop iteration.

Collaborator

@heheda12345 heheda12345 left a comment


LGTM!

@heheda12345 heheda12345 enabled auto-merge (squash) January 7, 2026 07:25
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 7, 2026
auto-merge was automatically disabled January 7, 2026 08:43

Head branch was pushed to by a user without write access

@ivanium ivanium force-pushed the feat/general-hybrid-prefix-caching branch 2 times, most recently from 7134ce4 to 3e09884 on January 7, 2026 18:43
@mergify

mergify bot commented Jan 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ivanium.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 8, 2026
@ivanium ivanium force-pushed the feat/general-hybrid-prefix-caching branch from 3e09884 to d1a842f on January 8, 2026 19:16
@mergify mergify bot removed the needs-rebase label Jan 8, 2026
@ivanium ivanium force-pushed the feat/general-hybrid-prefix-caching branch from 231ae3b to 35de555 on January 9, 2026 01:58
@ivanium ivanium force-pushed the feat/general-hybrid-prefix-caching branch from 35de555 to be88dbe on January 9, 2026 05:15
find_longest_cache_hit

Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
@ivanium ivanium force-pushed the feat/general-hybrid-prefix-caching branch from be88dbe to 34ca454 on January 9, 2026 05:15
@simon-mo simon-mo merged commit cd4a95e into vllm-project:main Jan 9, 2026
47 checks passed
@ivanium ivanium deleted the feat/general-hybrid-prefix-caching branch January 9, 2026 19:27
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants