Canonical KV Cache Allocation for HMA Models #37885

Open
Etelis wants to merge 28 commits into vllm-project:main from Etelis:canonical-kv-caches
Conversation

@Etelis (Contributor) commented Mar 23, 2026

This is the first phase of a multi-phase effort to enable contiguous KV cache allocation for all model architectures. Currently, only single-group (uniform) models benefit from contiguous cross-layer blocks. This PR extends that to HMA models with uniform page sizes. Future phases will broaden support to models with varying page sizes and additional architectures.

The existing allocate_uniform_kv_caches path only supports single-group models (all layers identical). HMA models like Gemma 3 have multiple KV cache groups (full attention + sliding window) with different eviction policies but the same page size. Previously, these models fell back to per-layer allocation, which scatters block data across non-contiguous memory regions, making RDMA transfers inefficient.

This PR extends contiguous KV cache allocation to HMA models where all KV cache groups share the same page size.
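
To make the intent concrete, here is a minimal sketch of the difference between per-layer and contiguous cross-layer allocation. This is illustrative only, not the PR's actual code; all sizes and names are assumptions.

```python
import torch

# Minimal sketch with hypothetical sizes; not the PR's actual code.
num_blocks, num_groups, page_size_bytes = 1024, 2, 4096

# Per-layer allocation: each group gets its own tensor, so block i of
# different groups lands in unrelated memory regions (bad for RDMA).
scattered = [torch.zeros(num_blocks, page_size_bytes, dtype=torch.int8)
             for _ in range(num_groups)]

# Contiguous cross-layer allocation: one buffer with num_blocks leading,
# so all data for block i is physically adjacent across groups.
contiguous = torch.zeros(num_blocks, num_groups, page_size_bytes,
                         dtype=torch.int8)
group_views = [contiguous.select(1, g) for g in range(num_groups)]
assert group_views[0].data_ptr() == contiguous.data_ptr()  # shared storage
```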

Test plan

  • Unit tests: pytest tests/v1/kv_connector/unit/test_canonical_kv_caches.py -v -s
    • Happy path for use_canonical_kv_caches
    • Parametrized rejection cases (single group, no connector, no HMA, mamba layers, no stride order)
    • Allocation correctness: shapes, memory sharing, physical contiguity, group refs, page sizes

Related PRs

  • #34373 — Original KVCacheTopology PR (closed, too complex).
  • #37339 — WorkerConnectorInitializationData pattern; we adopt its interface design. This PR should hopefully be merged after that one.

@mergify (bot) commented Mar 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Etelis.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@Etelis Etelis force-pushed the canonical-kv-caches branch from aa5d414 to 9697d1c on March 23, 2026 at 11:58
@mergify mergify Bot removed the needs-rebase label Mar 23, 2026
@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces canonical KV cache allocation for hybrid memory allocator (HMA) models, specifically targeting those with uniform page sizes. This is a significant improvement, as it enables contiguous cross-layer block allocation, which was previously limited to single-group models. The changes involve new data structures to represent canonical KV caches and their references, along with modifications to the KV cache allocation logic within the gpu_model_runner and kv_connector_model_runner_mixin. A comprehensive unit test suite has been added to validate the new allocation strategy under various conditions, including happy paths and rejection cases. The implementation appears well-considered and robust, addressing the stated goal of improving RDMA transfer efficiency by ensuring memory contiguity for HMA models.

@mergify (bot) commented Mar 23, 2026

Hi @Etelis, the pre-commit checks have failed. Please run:

uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

# The connector must support HMA
if not supports_hma(get_kv_transfer_group()):
    return False
if len(kv_cache_config.kv_cache_groups) <= 1:
Collaborator

Suggested change
if len(kv_cache_config.kv_cache_groups) <= 1:
if len(kv_cache_config.kv_cache_groups) < 1:

Contributor Author

Done, makes sense.

if len(kv_cache_config.kv_cache_groups) <= 1:
    return False

# All groups must use AttentionSpec with uniform page size
Collaborator

Suggested change
# All groups must use AttentionSpec with uniform page size
# Currently, all groups must use AttentionSpec with uniform page size
# We plan to gradually relax this requirement to support other cases

Contributor Author

Thanks, sorry.

Comment on lines +352 to +353
spec = kv_cache_config.kv_cache_groups[0].kv_cache_spec
assert isinstance(spec, AttentionSpec)
Collaborator

Can we remove this and use the spec inside the loop per each group?

Contributor Author

sure.

Comment thread vllm/v1/worker/gpu_model_runner.py Outdated
)
self.cross_layers_kv_cache = cross_layers_kv_cache
self.cross_layers_attn_backend = attn_backend
elif self.use_canonical_kv_caches(
Collaborator

Let's move this check before checking use_uniform_kv_cache.

Contributor Author

done.

Comment thread vllm/v1/worker/gpu_model_runner.py
kernel_num_blocks = num_blocks * num_blocks_per_kv_block

# prepend a group_size dimension into the shape
kv_cache_shape = attn_backend.get_kv_cache_shape(
Collaborator

Can we move this logic AFTER we allocate the single tensor?

Then, inside the layer loop, we can reshape?
I think we can also remove assert len(unique_kernel_bs) == 1.

I think it's better to also build the group_data_refs inside the same loop.
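
A rough sketch of the suggested single-pass structure (all names and sizes here are illustrative, not the PR's actual code): allocate the flat tensor once, then derive each layer's shaped view and its group's data ref inside the same loop.

```python
import torch

# Illustrative single-pass sketch; names and shapes are assumptions.
num_blocks, page_elems = 8, 16
layer_to_group_idx = {"layers.0.attn": 0, "layers.1.attn": 1}
layer_elems = num_blocks * page_elems

flat = torch.zeros(layer_elems * len(layer_to_group_idx), dtype=torch.int8)

kv_caches: dict[str, torch.Tensor] = {}
group_data_refs: dict[int, dict] = {}
for pos, (layer_name, gid) in enumerate(layer_to_group_idx.items()):
    chunk = flat[pos * layer_elems:(pos + 1) * layer_elems]
    kv_caches[layer_name] = chunk.view(num_blocks, page_elems)  # reshape here
    group_data_refs[gid] = {"tensor_idx": 0,                    # same loop
                            "page_size_bytes": page_elems}
```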

Comment thread vllm/v1/kv_cache_interface.py Outdated
@property
def needs_kv_cache_zeroing(self) -> bool:
    return self.has_mamba_layers

Collaborator

These classes are currently specific to connector usage.
I think we should move them to base.py.

Contributor Author

done

…nector

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis force-pushed the canonical-kv-caches branch from db277a8 to 5b2b3bc on March 23, 2026 at 13:54
Comment thread vllm/v1/worker/gpu_model_runner.py Outdated
WorkerConnectorInitializationData,
)

kv_transfer_group.initialize_worker_connector(
Collaborator

Actually, initialize_worker_connector is needed for the CacheBlend use-case.
Let's try to call it exactly as in #37339.
But keep this if here and simply pass, commenting that the canonical kv caches will be registered below.

Contributor Author

I thought they'd add it themselves afterwards; nvm, I'll fix it.

Combine the kv_caches population, block tensor splitting, and
layer-to-position mapping into a single pass over positions.
Remove the unique kernel block size assertion.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Call initialize_worker_connector unconditionally so connectors like
CacheBlend can use it regardless of the allocation path taken.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis force-pushed the canonical-kv-caches branch from 30cca9f to 432d002 on March 23, 2026 at 17:27
canonical_kv_caches is the CanonicalKVCaches wrapping
for the connector.
"""
# all tensors have the same size (validated by use_canonical_kv_caches)
Collaborator

Where did we validate this?

Contributor Author

@orozery sharp eye

fixed.

Move the uniform tensor size check into use_canonical_kv_caches
so the precondition is validated before entering the allocation
path, keeping the assert in allocate_canonical_kv_caches as a
safety net.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis requested a review from orozery March 26, 2026 14:26
Use contiguous_buffer.select(group_dim, i) to obtain per-position
canonical block tensors where num_blocks is always the leading
dimension. This eliminates the block_dim splitting loop and
multi-dimensional index arithmetic.

Also strengthen use_canonical_kv_caches to explicitly verify
num_blocks is the leading physical dimension, and restore the
single-group rejection (< 2) so single-group models correctly
use the uniform path.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
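
For reference, a small sketch of what "num_blocks is the leading physical dimension" means, mirroring the stride-order check quoted below. All shapes here are made up for illustration.

```python
# Hedged sketch with made-up shapes: a logical shape plus a backend stride
# order determine the physical layout; the canonical path requires that
# num_blocks comes first physically, so each block's data across groups
# and layers is one contiguous slab.
kv_cache_shape = (2, 1024, 16, 8, 128)   # (groups, blocks, block_sz, heads, hd)
kv_cache_stride_order = (1, 0, 2, 3, 4)  # blocks first in physical memory

physical_shape = tuple(kv_cache_shape[i] for i in kv_cache_stride_order)
assert physical_shape[0] == kv_cache_shape[1]  # num_blocks leads physically
```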
Comment on lines +442 to +467
kernel_block_size = kernel_block_sizes[0]
num_blocks_per_kv_block = kv_cache_spec.block_size // kernel_block_size
kernel_num_blocks = num_blocks * num_blocks_per_kv_block

attn_backend = attn_groups[0][0].backend
kv_cache_shape = attn_backend.get_kv_cache_shape(
    kernel_num_blocks,
    kernel_block_size,
    kv_cache_spec.num_kv_heads,
    kv_cache_spec.head_size,
    cache_dtype_str=cache_dtype,
)

# prepend a group_size dimension into the shape
kv_cache_shape = (group_size,) + kv_cache_shape

try:
    kv_cache_stride_order = attn_backend.get_kv_cache_stride_order(
        include_num_layers_dimension=True
    )
    assert len(kv_cache_stride_order) == len(kv_cache_shape)
except (AttributeError, NotImplementedError):
    kv_cache_stride_order = tuple(range(len(kv_cache_shape)))

physical_shape = tuple(kv_cache_shape[i] for i in kv_cache_stride_order)
assert physical_shape[0] == kernel_num_blocks
Collaborator

Can we move this logic inside the per-layer loop that sets kv_caches?

Contributor Author

Done.

@orozery (Collaborator) left a comment

Thanks @Etelis !
Can you please test this PR on top of this branch?
https://github.com/orozery/vllm/tree/kv-offload-hma
Specifically, verify that test_cpu_offloading.py passes, and check whether we see performance gains.

[] for _ in kv_cache_config.kv_cache_groups
]

kernel_block_size = kernel_block_sizes[0]
Collaborator

Can we initialize kernel_block_size = kernel_block_sizes[gid] inside the loop?

Contributor Author

Done. I hadn't considered that backends could have different kernel block sizes.

block_tensor = typed_buffer.select(group_dim, i)
tensor_idx = len(block_tensors)
page_bytes = block_tensor[0].numel() * block_tensor.element_size()
block_tensors.append(
Collaborator

Aren't we expecting a single cross-layers tensor? With shape (num_blocks, page_size) and dtype int8?

Contributor Author

Yeah, that's dumb.
Fixed.

Etelis and others added 3 commits April 12, 2026 15:56
Replace per-position KVCacheBlockTensor objects with a single
(num_blocks, cross_layer_page_size) int8 tensor. This avoids
recomputing block tensors per position and matches the pattern
used by the offloading connector's register_cross_layers_kv_cache.

Also use per-group kernel_block_sizes[gid] inside the loop instead
of hardcoded kernel_block_sizes[0].

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
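
The single-tensor form might look roughly like this (a sketch under assumed shapes; `buf` and all sizes are hypothetical, not the PR's actual code):

```python
import torch

# Hedged sketch: expose the whole cross-layer allocation to the connector
# as one (num_blocks, cross_layer_page_size) int8 tensor. Sizes are made up.
num_blocks, group_size, heads, head_size = 1024, 2, 8, 128
buf = torch.zeros(num_blocks, group_size, heads, head_size,
                  dtype=torch.bfloat16)

cross_layer_page_size = buf[0].numel() * buf.element_size()  # bytes per block
int8_view = buf.view(torch.int8).reshape(num_blocks, cross_layer_page_size)
assert int8_view.data_ptr() == buf.data_ptr()  # a view, not a copy
```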
@Etelis (Contributor Author) commented Apr 12, 2026

Thanks @Etelis ! Can you please test this PR on top of this branch? https://github.com/orozery/vllm/tree/kv-offload-hma Specifically, verify test_cpu_offloading.py passes, and whether we see performance gains.

Running on your branch, I hit some issues with the connector not implementing initialize_worker_connector.
I have fixed that here:
orozery#1

Ran on top of that branch with an A100
Gemma 3 (HMA)

| Metric | Baseline (per-layer allocation) | With Canonical Allocation | Improvement |
|---|---|---|---|
| Cold start | 56.12 ms | 51.60 ms | -8.1% |
| GPU hit | 12.65 ms | 12.42 ms | -1.8% |
| CPU hit | 20.68 ms | 18.75 ms | -9.3% |

I'm running other models as well, so I'll update soon.

Comment on lines +500 to +507
for layer_name in kv_cache_tensor.shared_by:
    layer_gid = layer_to_group_idx[layer_name]
    group_data_refs[layer_gid].append(
        KVCacheBlockDataRef(
            tensor_idx=0,
            page_size_bytes=page_size,
        )
    )
Collaborator

We should have a single data reference per group.

Contributor Author

Done

Drop the duplicate KVCacheBlockTensor / KVCacheBlockDataRef /
CanonicalKVCaches dataclasses from kv_connector/v1/base.py and import
the existing types from vllm.v1.kv_offload.spec. Emit a single data
reference per group instead of one per layer.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis requested a review from xuechendi as a code owner April 19, 2026 15:31
@Etelis Etelis requested a review from orozery April 19, 2026 15:35
Comment thread vllm/v1/worker/kv_connector_model_runner_mixin.py Outdated
…ge size

Restore KVCacheBlockTensor / KVCacheBlockDataRef / CanonicalKVCaches
in kv_connector/v1/base.py (these types are connector-owned) and fix
KVCacheBlockDataRef.page_size_bytes to cover all layers in the group
(page_size * group_size) now that we emit a single ref per group.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis requested a review from orozery April 23, 2026 19:10
Comment thread vllm/v1/worker/kv_connector_model_runner_mixin.py Outdated (3 threads)
EtelisIBM and others added 2 commits April 28, 2026 10:48
- use_canonical_kv_caches: < 2 -> < 1 to allow single-group HMA models
- allocate_canonical_kv_caches: read num_blocks from config and assert
- allocate_canonical_kv_caches: drop unreachable try/except around
  get_kv_cache_stride_order (guard already validates it succeeds)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@orozery (Collaborator) left a comment

LGTM. Thanks @Etelis !

@NickLucche @heheda12345 WDYT?

This PR extends the cross-layers layout to models with multiple KV cache groups that are all attention (e.g. gpt-oss).
More importantly, it defines a generic API (CanonicalKVCaches) for describing the KV caches (cross-layers or not) to the connector.
It is meant to replace register_cross_layers_kv_cache, which is kept for now for backward compatibility.
This API could support, without any extension, models using mamba or hybrid mamba/attention (we plan that as a follow-up to this PR).
Also, this API can later be extended to include striding information for connectors doing hetero-TP transfers.
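
For readers following along, the API's rough shape as inferred from this thread (a hedged sketch; only the class names and the tensor_idx / page_size_bytes fields appear in the PR discussion above, everything else is illustrative):

```python
from dataclasses import dataclass
import torch

# Hedged sketch inferred from this thread; field layout beyond the names
# mentioned above (tensor_idx, page_size_bytes) is an assumption.
@dataclass
class KVCacheBlockTensor:
    tensor: torch.Tensor  # e.g. (num_blocks, page_size_bytes), dtype int8

@dataclass
class KVCacheBlockDataRef:
    tensor_idx: int        # index into block_tensors for this group's data
    page_size_bytes: int   # bytes per block, covering all layers in the group

@dataclass
class CanonicalKVCaches:
    block_tensors: list[KVCacheBlockTensor]
    group_data_refs: list[KVCacheBlockDataRef]  # single ref per KV cache group
```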
