[Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator#31707
Conversation
Code Review
This pull request introduces a significant improvement to the HybridKVCacheCoordinator by enabling support for multiple, interleaved KV cache groups, moving beyond the previous two-type limitation. The core logic in find_longest_cache_hit has been thoughtfully re-implemented with an iterative approach to correctly identify common cache prefixes across various attention mechanisms, including those with non-monotonic hit properties like sliding window and Mamba. The new tests are thorough, covering configurations with three attention types and interleaved group IDs, which validates the increased flexibility. The changes are well-structured and appear robust. Overall, this is a solid enhancement to support more complex model architectures.
heheda12345
left a comment
Per offline discussion, we'll also consider models without full attention in this PR.
Hi @ivanium, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Force-pushed from 6139aad to 2bc630a
Force-pushed from 7045a1d to 2f2d05b
is_full_attn = isinstance(spec, FullAttentionSpec)

# Full attention: reuse cached blocks (downward-closed property)
cached_blocks = hit_blocks_by_group[group_ids[0]]
use kv_cache_spec as key?
There was a problem hiding this comment.
But we need hit_blocks_by_group as a list for the return value anyway.
vllm/v1/core/kv_cache_coordinator.py
Outdated
for group_id in group_ids:
    group_blocks = hit_blocks_by_group[group_id]
    if group_blocks is not None:
        del group_blocks[num_blocks:]
I think trimming full attention in every iteration is clearer and should have similar efficiency.
vllm/v1/core/kv_cache_coordinator.py
Outdated
if is_full_attn and cached_blocks is not None:
    # Full attention is downward-closed; if the candidate
    # `hit_length` was reduced by other groups, trim cached blocks
    # so subsequent reuse reflects the current candidate length.
What about this flow?
We only need to compute the cache hit length for full attention once. Starting from the second iteration, we can simply keep the first hit_length // block_size blocks from the previous iteration, since hit_length only decreases at each step.
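A minimal sketch of this suggested flow, relying on the downward-closed property of full attention: because hit_length only shrinks, later iterations can slice the previous result instead of recomputing it. The function name and signature are illustrative, not vLLM's API.

```python
def reuse_full_attn_hit(
    prev_blocks: list[int], hit_length: int, block_size: int
) -> list[int]:
    """Keep the first hit_length // block_size blocks of the previous
    full-attention hit; valid because full attention is downward-closed
    and hit_length is non-increasing across iterations."""
    return prev_blocks[: hit_length // block_size]
```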
vllm/v1/core/kv_cache_coordinator.py
Outdated
if curr_hit_length < hit_length:
    hit_length = curr_hit_length
    reduced = True
    break
Should we add a break here? IMO it makes sense to iterate over all groups to get the minimum hit length in every while-loop iteration.
Head branch was pushed to by a user without write access
Force-pushed from 7134ce4 to 3e09884
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 3e09884 to d1a842f
Force-pushed from 231ae3b to 35de555
Force-pushed from 35de555 to be88dbe
find_longest_cache_hit Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Force-pushed from be88dbe to 34ca454
vllm-project#31707) Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Purpose
The current hybrid KV cache coordinator supports at most two attention types (full attention plus one other sliding-window or Mamba attention). However, emerging models need more flexible support, for example full attention plus sliding-window attention with various window sizes, as required in #31592 and #30263.
Since prefix caching for sliding window and Mamba does not have the monotonic prefix cache hit property (viz., a cache hit at position i does not imply a cache hit at position j where j < i), find_longest_cache_hit needs to check all attention groups until it finds a prefix that gets cache hits from all of them. This is what this PR implements.
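The check-all-groups-until-agreement idea can be sketched as a fixed-point loop: shrink the candidate hit length until every attention group reports a hit at that length. The helper find_hit_length_for_group stands in for the per-group managers' lookup and is an assumption for illustration, not vLLM's actual API.

```python
from collections.abc import Callable

def find_longest_common_hit(
    groups: list,
    max_length: int,
    find_hit_length_for_group: Callable[[object, int], int],
) -> int:
    """Shrink the candidate hit length until all groups hit at it."""
    hit_length = max_length
    while hit_length > 0:
        reduced = False
        for group in groups:
            # Each group may report a shorter hit than the candidate;
            # non-monotonic groups (sliding window, Mamba) can shrink it
            # arbitrarily, which is why we must re-check every group.
            curr = find_hit_length_for_group(group, hit_length)
            if curr < hit_length:
                hit_length = curr
                reduced = True
        if not reduced:
            break  # every group hits at hit_length: fixed point reached
    return hit_length
```

With monotonic groups the loop converges in one or two passes; with non-monotonic groups it keeps shrinking until all groups agree or the length reaches zero.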
Test Plan

pytest tests/v1/core/test_prefix_caching.py -q

Test Result

Passed
Note
Expands hybrid KV caching beyond the prior 2-type (full+other) limit and introduces a unified cache-hit algorithm across arbitrary attention groups.

Refactors HybridKVCacheCoordinator to group KV cache groups by identical KVCacheSpec, prioritize FullAttentionSpec, and compute the LCM across all block sizes for alignment.

Implements an iterative fixed-point find_longest_cache_hit that iteratively constrains hit length across attention types and reuses cached full-attention hits when possible.

Updates imports to use SingleTypeKVCacheManager and removes the hardcoded 2-group/full-attn assumptions. Keeps divisibility validation and DCP/PCP constraints; returns per-group hit blocks as a tuple aligned to group indices.

Tests: adds helpers to build mixed-spec configs and a parameterized test_prefill_hybrid_model_combinations covering 2–4 groups, interleaving, sliding-window variants, and Mamba; updates existing hybrid tests and fixtures accordingly.

Written by Cursor Bugbot for commit 35de55507998323ec4bf15eac3f9cca8f5ff504a. This will update automatically on new commits. Configure here.
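The LCM-based alignment mentioned in the summary can be sketched as follows: with heterogeneous block sizes, a hit length common to all groups must be a multiple of the least common multiple of the block sizes. The helper below is a hedged illustration of that idea, not the coordinator's actual code.

```python
import math
from functools import reduce

def aligned_hit_length(hit_length: int, block_sizes: list[int]) -> int:
    """Round hit_length down to a multiple of the LCM of all block sizes,
    so the hit can be expressed in whole blocks for every group."""
    lcm = reduce(math.lcm, block_sizes)
    return (hit_length // lcm) * lcm
```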