[Feature] Enable uniform KV cache allocation for multi-group HMA models #34373
Etelis wants to merge 33 commits into vllm-project:main from
Conversation
Relax the single-group constraint in use_uniform_kv_cache() so that hybrid-attention models (e.g. Gemma 2 with alternating full + sliding-window layers) can benefit from the contiguous cross-layer KV cache layout used for efficient KV transfers. Instead of requiring exactly one group, loop over all groups and verify they share the same backend shape and stride order. Also relax the kernel_block_sizes assertion in allocate_uniform_kv_caches() to accept multiple groups with the same block size. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Code Review
This pull request enables uniform KV cache allocation for models with multiple attention groups, such as Hybrid-Attention Models (HMA). This is achieved by relaxing the single-group constraint and instead checking for compatibility (e.g., same shape and stride order) across all groups. The changes in use_uniform_kv_cache are logical and well-supported by a comprehensive new test suite. I have one suggestion to make the compatibility check more robust by also ensuring all attention groups use the same backend, which is implied by the docstring and the subsequent allocation logic.
```diff
 if not attn_groups:
     return False
-attn_group = attn_groups[0][0]
-kv_cache_spec = attn_group.kv_cache_spec
-if not isinstance(kv_cache_spec, AttentionSpec):
-    return False
-attn_backend = attn_group.backend
-kv_cache_shape = attn_backend.get_kv_cache_shape(
-    1234,
-    kv_cache_spec.block_size,
-    kv_cache_spec.num_kv_heads,
-    kv_cache_spec.head_size,
-    cache_dtype_str=cache_dtype,
-)
-try:
-    kv_cache_stride_order = attn_backend.get_kv_cache_stride_order(
-        include_num_layers_dimension=True
-    )
-except (AttributeError, NotImplementedError):
-    return False
-# check that attention backend include a layers dimension
-return len(kv_cache_stride_order) == len(kv_cache_shape) + 1
+reference_shape = None
+reference_stride_order = None
+for subgroups in attn_groups:
+    if len(subgroups) != 1:
+        return False
+    attn_group = subgroups[0]
+    kv_cache_spec = attn_group.kv_cache_spec
+    if not isinstance(kv_cache_spec, AttentionSpec):
+        return False
+    attn_backend = attn_group.backend
+    kv_cache_shape = attn_backend.get_kv_cache_shape(
+        1234,
+        kv_cache_spec.block_size,
+        kv_cache_spec.num_kv_heads,
+        kv_cache_spec.head_size,
+        cache_dtype_str=cache_dtype,
+    )
+    try:
+        kv_cache_stride_order = attn_backend.get_kv_cache_stride_order(
+            include_num_layers_dimension=True
+        )
+    except (AttributeError, NotImplementedError):
+        return False
+    # check that attention backend includes a layers dimension
+    if len(kv_cache_stride_order) != len(kv_cache_shape) + 1:
+        return False
+    if reference_shape is None:
+        reference_shape = kv_cache_shape
+        reference_stride_order = kv_cache_stride_order
+    elif (
+        kv_cache_shape != reference_shape
+        or kv_cache_stride_order != reference_stride_order
+    ):
+        return False
+return True
```
The docstring for use_uniform_kv_cache states that for a uniform layout, all KV cache groups must have the same backend. However, the current implementation only checks for compatible kv_cache_shape and kv_cache_stride_order, but not that the attn_backend is the same across all groups.
The subsequent allocate_uniform_kv_caches function uses the backend from the first attention group, which could lead to incorrect behavior or runtime errors if other groups use a different backend.
To prevent this potential issue and align with the documentation, I suggest also checking that all attention groups share the same backend instance.
```python
if not attn_groups:
    return False
reference_shape = None
reference_stride_order = None
reference_backend = None
for subgroups in attn_groups:
    if len(subgroups) != 1:
        return False
    attn_group = subgroups[0]
    kv_cache_spec = attn_group.kv_cache_spec
    if not isinstance(kv_cache_spec, AttentionSpec):
        return False
    attn_backend = attn_group.backend
    kv_cache_shape = attn_backend.get_kv_cache_shape(
        1234,
        kv_cache_spec.block_size,
        kv_cache_spec.num_kv_heads,
        kv_cache_spec.head_size,
        cache_dtype_str=cache_dtype,
    )
    try:
        kv_cache_stride_order = attn_backend.get_kv_cache_stride_order(
            include_num_layers_dimension=True
        )
    except (AttributeError, NotImplementedError):
        return False
    # check that attention backend includes a layers dimension
    if len(kv_cache_stride_order) != len(kv_cache_shape) + 1:
        return False
    if reference_backend is None:
        reference_shape = kv_cache_shape
        reference_stride_order = kv_cache_stride_order
        reference_backend = attn_backend
    elif (
        kv_cache_shape != reference_shape
        or kv_cache_stride_order != reference_stride_order
        or attn_backend is not reference_backend
    ):
        return False
return True
```
Looks great for a start! Thanks @Etelis! IIUC right now you handle the case of multiple groups, but requiring:
Take a look at the current options for defining KV cache tensors. There are 2 cases:
For case 1 (which is less common, I think), we can group layers by their page size and create a single tensor per group of layers with the same page size. Does that make sense?
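A minimal sketch of that bucketing, assuming each spec exposes vLLM's `page_size_bytes` property (the helper name is hypothetical):

```python
from collections import defaultdict


def group_layers_by_page_size(layer_specs: dict) -> dict[int, list[str]]:
    """Bucket layer names by KV cache page size; each bucket can then
    share a single contiguous backing tensor."""
    groups: dict[int, list[str]] = defaultdict(list)
    for layer_name, spec in layer_specs.items():
        groups[spec.page_size_bytes].append(layer_name)
    return dict(groups)
```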
Force-pushed from 0de79af to 8fab499
…dels Group layers by page_size_bytes and allocate one contiguous int8 tensor per group. Each layer gets its own zero-copy view via view->slice->permute (attention) or as_strided (mamba). This relaxes the previous constraint that all layers must share identical shapes and stride orders. Introduces CrossLayerGroup dataclass to bundle backing tensors with metadata. Supports AttentionSpec (all subclasses) and MambaSpec layers. Signed-off-by: Itay Etlis <itayetlis@gmail.com> Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
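For orientation, a minimal sketch of what such a dataclass and its allocation could look like (field and function names here are guesses, not the PR's actual definitions):

```python
from dataclasses import dataclass

import torch


@dataclass
class CrossLayerGroup:
    """One contiguous int8 buffer shared by all layers with the same
    page size (field names are illustrative, not the PR's exact ones)."""

    layer_names: list[str]        # layers backed by this buffer
    page_size_bytes: int          # per-block byte footprint of each layer
    backing_tensor: torch.Tensor  # flat int8 buffer for the whole group


def allocate_group(layer_names: list[str], num_blocks: int,
                   page_size_bytes: int, device: str = "cpu") -> CrossLayerGroup:
    # One allocation for the group; each layer later gets a zero-copy view.
    numel = num_blocks * len(layer_names) * page_size_bytes
    buffer = torch.zeros(numel, dtype=torch.int8, device=device)
    return CrossLayerGroup(layer_names, page_size_bytes, buffer)
```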
Replace cross_layers_kv_cache + cross_layers_attn_backend with a list[CrossLayerGroup]. Single pure-attention groups use the optimized register_cross_layers_kv_cache path; otherwise fall back to register_kv_caches. Signed-off-by: Itay Etlis <itayetlis@gmail.com> Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Update existing tests for new CrossLayerGroup return type. Add Mamba allocation test with shape verification and data isolation. Replace incompatible-page-size rejection test with acceptance test (different page sizes now produce separate groups). Signed-off-by: Itay Etlis <itayetlis@gmail.com> Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Thanks @orozery! Went ahead and implemented this. The main idea is what you described: layers are grouped by page_size_bytes. For per-layer views I'm doing the view → slice → permute pipeline you suggested. I also extended this to handle Mamba layers (via as_strided). Tested on H100 with a hybrid setup (4 attention + 2 Mamba layers, 2 page_size groups), all passing. No connector changes yet — will handle
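A hedged sketch of that per-layer view pipeline, assuming a (num_blocks, num_layers, page_size_bytes) physical layout (the function name and layout are illustrative):

```python
import torch


def attention_layer_view(buffer: torch.Tensor, layer_idx: int,
                         num_blocks: int, num_layers: int,
                         page_size_bytes: int) -> torch.Tensor:
    """Zero-copy per-layer view of a shared int8 buffer:
    view -> slice (-> permute to the backend's logical shape)."""
    grouped = buffer.view(num_blocks, num_layers, page_size_bytes)  # view
    layer = grouped[:, layer_idx, :]                                # slice
    # A final reshape/permute to the backend's kv-cache shape (e.g.
    # 2 x num_blocks x block_size x heads x head_size), reinterpreting
    # the raw bytes with the right dtype, would follow here.
    return layer
```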
The backing tensor is always int8 by construction — storing it as a field adds no information. Remove from dataclass and test assertion. Signed-off-by: Itay Etlis <itayetlis@gmail.com> Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
When TP is active, attention layers are laid out as (num_blocks, num_kv_heads, num_layers, per_head_page_bytes) so head-based slicing is contiguous for RDMA transfers. Unifies sentinel probes into _find_kv_cache_dims, generalizes _per_layer_permutation for any number of extracted dims, and _create_attention_layer_view handles both layouts via tp flag. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
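To illustrate why that layout helps, a small self-contained check (the dimension sizes are arbitrary):

```python
import torch

# Illustrative TP layout from the commit above:
# (num_blocks, num_kv_heads, num_layers, per_head_page_bytes).
num_blocks, num_kv_heads, num_layers, per_head_page_bytes = 4, 8, 24, 256
buf = torch.zeros(
    num_blocks, num_kv_heads, num_layers, per_head_page_bytes, dtype=torch.int8
)

# Slicing a range of heads keeps each block's bytes contiguous, which is
# exactly what an RDMA transfer of a head subset needs.
head_slice = buf[:, 2:4]              # heads 2..3, all layers
assert head_slice[0].is_contiguous()  # per-block region is one memory run
```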
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Checked with GPT-OSS-20B, alternating 12 SW + 12 FA layers. All 24 layers share the same KV cache shape and stride order, so they should merge into a single cross-layer group.

H100 80GB — TP=1 (default layout): model loaded and generated text.

8x RTX 3090 — TP=2 (TP layout):
… order Replace the external tp flag with per-layer backend probing. Each layer is classified into one of three groups:

- ordered: blocks first, heads before layers (e.g. HND backends)
- default: grouped by page_size (e.g. NHD backends, Mamba)
- solo: blocks not outermost (fallback, one layer per group)

This removes the need for callers to know about TP configuration and lets the allocation follow the backend's preferred physical layout exactly.

Signed-off-by: Itay Etlis <itayetlis@gmail.com>
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
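A hedged sketch of that classification (the signature is illustrative; the PR's actual `_cross_layer_group_key` may differ), assuming `stride_order` lists logical dims from outermost to innermost physical position:

```python
def cross_layer_group_key(layer_name: str, stride_order: tuple[int, ...],
                          blocks_dim: int, heads_dim: int, layers_dim: int,
                          page_size_bytes: int) -> tuple:
    if stride_order[0] != blocks_dim:
        return ("solo", layer_name)          # blocks not outermost: isolate
    if stride_order.index(heads_dim) < stride_order.index(layers_dim):
        return ("ordered", page_size_bytes)  # HND-style: heads before layers
    return ("default", page_size_bytes)      # NHD-style backends, Mamba
```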
Force-pushed from e3027ec to d69c498
…roup Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Force-pushed from 4218db1 to 31e13a1
Hi @Etelis, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
All backends place blocks at physical position 0 in the with-layers stride order, making the blocks_phys != 0 guard unreachable. Remove the solo key, the blocks_phys check, and the now-unused tensor_idx parameter from _cross_layer_group_key. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
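To make the "unreachable" claim concrete, a tiny illustration under the same outermost-first reading of the stride order (the example orders are made up):

```python
# With-layers stride orders, outermost first; logical dim 0 is num_blocks.
hnd_order = (0, 2, 1, 3)  # blocks, heads, layers, rest
nhd_order = (0, 1, 2, 3)  # blocks, layers, heads, rest

# blocks_phys is always 0 for every observed backend, so the removed
# `if blocks_phys != 0` guard could never fire.
assert hnd_order.index(0) == 0 and nhd_order.index(0) == 0
```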
Merge register_kv_caches and register_cross_layers_kv_cache into a single register_kv_caches method that accepts an optional cross_layer_groups parameter. This enables connectors to handle multiple cross-layer groups for HMA models with heterogeneous attention types (full + sliding-window) and mixed layer kinds (attention + Mamba) without falling back to the per-layer path. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Add back register_cross_layers_kv_cache to nixl, offloading, and multi connectors alongside the unified register_kv_caches API. This restores the legacy single-tensor registration path for connectors that set prefer_cross_layer_blocks. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Add KVCacheTopology and register_hybrid_kv_caches to the connector base class for multi-group hybrid attention models. Dual-path gating in use_uniform_kv_cache:

- Hybrid path (register_hybrid_kv_caches): multi-group, Attention+Mamba
- Legacy path (prefer_cross_layer_blocks): single-group, AttentionSpec only

Restore allocate_uniform_kv_caches (original single-tensor allocation) and rename multi-group allocation to allocate_hybrid_kv_caches.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
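A sketch of how that gating might read (the SupportsHMA marker is borrowed from the test connector mentioned later in this PR; the exact predicate is an assumption, while prefer_cross_layer_blocks appears in the commits above):

```python
class SupportsHMA:
    """Hypothetical marker mixin for connectors that accept multi-group
    hybrid (Attention + Mamba) KV cache registration."""


def choose_registration_path(connector, kv_cache_groups):
    # Hybrid path: multiple groups, Attention + Mamba specs allowed.
    if isinstance(connector, SupportsHMA):
        return "register_hybrid_kv_caches"
    # Legacy path: single group, AttentionSpec only.
    if getattr(connector, "prefer_cross_layer_blocks", False) \
            and len(kv_cache_groups) == 1:
        return "register_cross_layers_kv_cache"
    return None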
…ides The base class provides a default no-op; connectors will add their own overrides independently when they adopt the legacy path. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Restore all connector implementations to their main branch state. The register_kv_caches base class signature is reverted to accept only kv_caches dict, matching the connector overrides. Cross-layer registration now uses register_cross_layers_kv_cache (legacy) or register_hybrid_kv_caches (new) instead. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
os.sched_setaffinity is not available on all platforms (e.g. macOS). Add a hasattr guard to avoid AttributeError at runtime and a clear NotImplementedError message. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
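The guard itself is simple; a sketch (the wrapper name is hypothetical):

```python
import os


def set_affinity(cpus: set[int]) -> None:
    # os.sched_setaffinity is Linux-only; guard for platforms like macOS.
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError(
            "CPU affinity pinning is not supported on this platform")
    os.sched_setaffinity(0, cpus)
```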
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
- Explicitly handle MambaSpec in _cross_layer_group_key; isolate unknown spec types instead of grouping them with others
- Return isolated key when blocks dim is not first or layers dim is not after blocks in the physical stride order
- Validate tensor size and key agreement for all layers sharing a tensor in allocate_hybrid_kv_caches
- Fill num_heads_dim and block_size_dim in KVCacheTopology for ordered groups using sentinel-value probing
- Set num_layers_dim=None for isolated (non-shared) tensors
- Remove dead fallback in _create_attention_layer_view (layers reaching that function are guaranteed to have a valid stride order)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
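The sentinel-value probing mentioned above can be sketched like this (the helper and constant names follow the later rename commit; the heads sentinel value is an arbitrary choice):

```python
_SENTINEL_BLOCKS = 1234  # same probe value used in use_uniform_kv_cache
_SENTINEL_HEADS = 7


def find_kv_cache_dims(attn_backend, spec, cache_dtype: str) -> tuple[int, int]:
    """Locate the num_blocks and num_kv_heads dimensions of a backend's
    KV cache shape by probing with distinctive sentinel values."""
    shape = attn_backend.get_kv_cache_shape(
        _SENTINEL_BLOCKS,
        spec.block_size,
        _SENTINEL_HEADS,
        spec.head_size,
        cache_dtype_str=cache_dtype,
    )
    return shape.index(_SENTINEL_BLOCKS), shape.index(_SENTINEL_HEADS)
```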
- Add _MockBlocksNotFirstBackend to verify layers with blocks not in physical dim 0 are isolated (no cross-layer sharing)
- Assert num_blocks_dim, num_layers_dim, num_heads_dim in HND topology
- Add test_blocks_not_first_is_isolated covering the isolated path
- Fix group.spec references to use group.page_size_bytes
…_mixin Rename cryptic variable names for readability:

- raw -> buffer, spec -> attn_spec/mamba_spec, el -> element_size
- npkb/knb -> kernel_blocks_per_spec_block/kernel_num_blocks
- rep_name/rep_spec -> representative_name/representative_spec
- gid -> group_id, log_to_phys -> logical_to_physical
- _B/_H -> _SENTINEL_BLOCKS/_SENTINEL_HEADS

Trim docstrings to match vLLM conventions: brief descriptions for private methods, concise Args/Returns for public methods.
Add Args section and trim to match vLLM docstring conventions. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
When get_kv_cache_stride_order(include_num_layers_dimension=True) is not supported, fall back to prepending the layers dimension to the base stride order instead of using an identity permutation. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
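A sketch of that fallback, assuming the with-layers shape puts the layers dimension at logical index 0 and the base get_kv_cache_stride_order() is always available (both are assumptions):

```python
def stride_order_with_layers(attn_backend) -> tuple[int, ...]:
    """Prefer the backend's with-layers stride order; otherwise prepend
    the layers dimension to the base order instead of using identity."""
    try:
        return attn_backend.get_kv_cache_stride_order(
            include_num_layers_dimension=True
        )
    except (AttributeError, NotImplementedError):
        base = attn_backend.get_kv_cache_stride_order()
        # New logical dim 0 is layers; existing logical dims shift by one.
        return (0, *(d + 1 for d in base))
```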
Resolved all CRs @orozery.
This pull request has merge conflicts that must be resolved before it can be merged.
…nectors Replace dimension-index-based topology metadata with explicit byte offset/length references. Connectors now receive KVCacheTensorReference (physical tensors with page sizes) and KVCacheDataReference (per-group chunk layout with unpadded sizes and head strides). Adds build_kv_cache_references to convert CrossLayerGroups into the new types at registration time. Handles attention chunks, Mamba per-state chunks (conv/ssm), byte-level padding, and layer-level padding from uneven HMA groups. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
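Illustrative shapes for the two new reference types (the fields are guesses at what "page sizes", "unpadded sizes", and "head strides" would translate to; the PR's actual definitions may differ):

```python
from dataclasses import dataclass


@dataclass
class KVCacheTensorReference:
    """A physical backing tensor as seen by a connector."""

    base_address: int      # device pointer of the backing int8 tensor
    total_bytes: int       # full allocation size
    page_size_bytes: int   # per-block byte footprint within this tensor


@dataclass
class KVCacheDataReference:
    """Per-group chunk layout in bytes rather than dimension indices."""

    tensor_index: int      # which KVCacheTensorReference this group lives in
    byte_offset: int       # start of the group's chunk within the tensor
    byte_length: int       # unpadded chunk size
    head_stride_bytes: int # stride between heads, for head-sliced transfers
```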
use_uniform_kv_cache() currently rejects any model with more than one KV cache group, which means hybrid-attention models (alternating full + sliding-window layers) cannot use the contiguous cross-layer layout for efficient KV transfers.

This PR relaxes the single-group gate: instead of requiring exactly one group, we loop over all groups and check that they share the same backend shape and stride order.
Test Plan

Unit tests (tests/v1/kv_connector/unit/test_uniform_kv_cache.py) — 4 tests:

Test Result

E2E on an H100 with google/gemma-2-2b:

Single-group regression (HMA disabled, OffloadingConnector):
Multi-group (HMA enabled, SupportsHMA test connector):