[3/N][KV-Cache Layout Refactor] Standardize Mamba cache; drop `get_transfer_cache_regions` by LucasWilkinson · Pull Request #44456 · vllm-project/vllm

LucasWilkinson · 2026-06-03T22:22:37Z

PR #42374 (first part of RFC #42082) has been split into 4 PRs:

#44454 [1/N][KV-Cache Layout Refactor] Refactor DSV4 KV cache config construction
#44455 [2/N][KV-Cache Layout Refactor] Pack K/V into the content dim across attention backends
-> #44456 [3/N][KV-Cache Layout Refactor] Standardize Mamba cache; drop get_transfer_cache_regions
#44458 [4/N][KV-Cache Layout Refactor] Standardize KV cache layout

Summary

Move Mamba cache reshaping (the conv/ssm state views) into a new bind_kv_cache hook; drop the now-unnecessary TransferTopology.get_transfer_cache_regions from the connector. This is an intermediate step toward #42374 that does not yet introduce the cross-layer L dim.

Stacked on top of #44455 (pack K/V into the content dim).

Testing

Mamba bind_kv_cache unpacking and the single-tensor NIXL/mooncake Mamba registration require GPU + multi-node P/D setups to exercise (hybrid attention+Mamba models hit both bind paths). The submitter will validate these on appropriate hardware before merge.

AI assistance

This PR was prepared with AI assistance (Claude).

…APIs Final step of the KV-cache layout standardization ladder, stacked on top of bind_kv_cache (#44456). Introduces the standardized layout resolution (KVCacheLayout / resolve_kv_cache_layout) and reshape_kv_cache, removes get_kv_cache_shape / get_kv_cache_stride_order entirely, and removes the remaining cross-layer block machinery from the connector. Co-authored-by: Claude Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

mergify · 2026-06-05T15:49:17Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Standardize the attention KV cache to a packed, blocks-first, head-major logical layout (num_blocks, num_kv_heads, block_size, 2*head_size) where K and V are concatenated in the content dim. Backends recover K/V via kv_cache.transpose(1, 2).split(head_size, dim=-1). nvfp4 instead stores K and V as separate head groups: (num_blocks, 2*num_kv_heads, block_size, full_dim), split on dim=1.

get_kv_cache_shape / get_kv_cache_stride_order are retained (stride orders updated to the 4D packed layout, plus the layers-dim variants for the cross-layer/uniform-cache path); the generic allocation/reshape path and NHD/HND layout handling are unchanged. Their removal is deferred to a later PR.

Backends: flash_attn, flex_attention, triton_attn (incl. the per-token-head-quant inline-scale path, packed to 4D), flashinfer (incl. nvfp4 head-group layout, trtllm, cascade, kvfp8 dequant), rocm_attn (+ PagedAttention.split_kv_cache), rocm_aiter_fa, rocm_aiter_unified_attn, turboquant_attn, flash_attn_diffkv, cpu_attn.

torch_utils exposes nvfp4_split_data_scale (single-side) and drops nvfp4_kv_cache_split_views. Tests updated to build the packed cache shapes.

AI assistance (Claude) was used for this change.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
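To make the packed layout concrete, here is a minimal pure-Python sketch of where the K and V elements for one (block, head, token, channel) land in a buffer of logical shape (num_blocks, num_kv_heads, block_size, 2*head_size). It assumes the buffer is dense and row-major in that logical order; the real physical order depends on the backend's stride order, and the helper names `contiguous_strides` and `packed_kv_offsets` are hypothetical, not vLLM APIs.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a dense tensor of this shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def packed_kv_offsets(block, head, token, channel, *, shape, head_size):
    """Flat offsets of the K and V elements for one (block, head, token, channel)."""
    s_block, s_head, s_token, s_content = contiguous_strides(shape)
    base = block * s_block + head * s_head + token * s_token
    k_off = base + channel * s_content                # K fills [0, head_size)
    v_off = base + (head_size + channel) * s_content  # V fills [head_size, 2*head_size)
    return k_off, v_off


# Illustrative sizes only (assumed, not taken from the PR).
num_blocks, num_kv_heads, block_size, head_size = 4, 2, 16, 8
shape = (num_blocks, num_kv_heads, block_size, 2 * head_size)
k_off, v_off = packed_kv_offsets(1, 1, 3, 5, shape=shape, head_size=head_size)
print(k_off, v_off)
```

Under these assumptions the V element always sits exactly head_size elements after its K counterpart, which is why a split on the last dim can hand back both halves as views without copying.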

With K and V packed into one tensor per layer, KV connectors no longer split K and V into separate regions. In TransferTopology, recognize the 4D packed attention shape as blocks-first and drop split_k_and_v; the per-block sub-split (virtually_split_kv_in_blocks) now applies only to Mamba's conv/ssm state. nixl/worker and mooncake register a single region per layer.

get_transfer_cache_regions and cache_list are retained (they still bridge Mamba's [conv, ssm] views to a registrable region); their removal (together with bind_kv_cache-based Mamba standardization) is deferred to a follow-up PR. Cross-layer block support is intentionally retained.

AI assistance (Claude) was used for this change.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

flex_attention.get_kv_cache_stride_order still returned the pre-packing 5D/6D orderings while its shape is now 4D (B, H, N, 2*C), which breaks the runtime assertion len(stride_order) == len(shape) and corrupts the physical layout. Drop the trailing index to match the packed shape (matching flash_attn's NHD ordering). Also update test_kv_cache_stride_order to use 4-element strides. Co-authored-by: Claude Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

With K/V packed into a single contiguous region per block, the NIXL and Mooncake transfer paths register one region per layer and coalesce block transfers instead of emitting separate K/V halves. Update the unit tests to match: detect the 4D blocks-first layout, expect one entry per tensor, and expect coalesced (non-split) block transfers. Co-authored-by: Claude Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
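As an illustration of the difference between split and coalesced block transfers, the sketch below builds (address, length) descriptors for a set of block IDs in one per-layer region. It is not the NIXL or Mooncake API: `block_descriptors`, its parameters, and the assumption that a split path sends each block as two equal halves are all hypothetical simplifications.

```python
def block_descriptors(base_addr, block_bytes, block_ids, split_kv=False):
    """Return (address, length) descriptors for the requested blocks.

    split_kv=True models a path that sends each block as two halves;
    split_kv=False models one coalesced transfer per block.
    """
    descriptors = []
    for block_id in block_ids:
        start = base_addr + block_id * block_bytes
        if split_kv:
            half = block_bytes // 2
            descriptors.append((start, half))
            descriptors.append((start + half, half))
        else:
            descriptors.append((start, block_bytes))
    return descriptors


# Assumed example values: one region at 0x1000 with 256-byte blocks.
coalesced = block_descriptors(0x1000, 256, [0, 2])
split = block_descriptors(0x1000, 256, [0, 2], split_kv=True)
print(len(coalesced), len(split))
```

Both variants cover the same bytes; the coalesced form simply needs half as many descriptors per block.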

The nixl connector was split into a `nixl/` package; the AMD CI source_file_dependencies still referenced the old single-file `nixl_connector.py` path, which no longer exists. Point them at the `nixl/` directory so the NIXL integration steps trigger correctly. Co-authored-by: Claude Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

Quantized KV caches (fp8, nvfp4) now use (B, 2*H, N, hs) in FlashInfer, storing K and V as separate head groups. This allows zero-copy .view() to (B, 2, H, N, hs) for trtllm_prefill_attn_kvfp8_dequant, avoiding a full-cache .contiguous() copy. Non-quantized caches keep (B, H, N, 2*hs).

Restores canonicalize_singleton_dim_strides in both TRTLLM paths (prefill fp8 dequant + decode) that were dropped during the layout refactor.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
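The zero-copy claim can be checked with index arithmetic alone. The sketch below assumes a dense, row-major buffer (the real cache may be stride-permuted) and uses a hypothetical helper `flat_offset`. It shows that element (b, kv, h, n, c) of the (B, 2, H, N, hs) view and element (b, kv*H + h, n, c) of the (B, 2*H, N, hs) layout share a flat offset, while the content-packed (B, H, N, 2*hs) layout does not line up the same way.

```python
from itertools import product


def flat_offset(index, shape):
    """Row-major flat offset of a multi-dimensional index."""
    offset = 0
    for i, dim in zip(index, shape):
        assert 0 <= i < dim, "index out of range"
        offset = offset * dim + i
    return offset


# Illustrative sizes only (assumed).
B, H, N, hs = 2, 3, 4, 5
head_group_shape = (B, 2 * H, N, hs)   # quantized layout: K heads, then V heads
five_d_shape = (B, 2, H, N, hs)        # shape wanted by the dequant path
packed_shape = (B, H, N, 2 * hs)       # non-quantized content-packed layout

# Every element of the 5D view maps to the same offset in the head-group layout.
head_group_matches = all(
    flat_offset((b, kv * H + h, n, c), head_group_shape)
    == flat_offset((b, kv, h, n, c), five_d_shape)
    for b, kv, h, n, c in product(range(B), range(2), range(H), range(N), range(hs))
)

# The first V element in the content-packed layout lands somewhere else.
packed_v0 = flat_offset((0, 0, 0, hs), packed_shape)
five_d_v0 = flat_offset((0, 1, 0, 0, 0), five_d_shape)
print(head_group_matches, packed_v0, five_d_v0)
```

Under these assumptions the head-group layout is a pure reinterpretation of the same bytes, which is consistent with the commit's statement that a plain view suffices.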

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

The old split_kv_cache used .view() on a non-contiguous tensor (slice + transpose), which would crash at runtime. Instead, split K/V directly on the content dim with kv_cache.split(head_size, dim=-1) producing zero-copy (B, H, N, C) views.

Updated the Triton kernels (kernel_paged_attention_2d, _fwd_kernel, _fwd_kernel_alibi) to use 4D stride-based K addressing, dropping the legacy x-factor tiling that was only needed for the old interleaved (B, H, C//x, N, x) layout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
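A rough sketch of the two addressing schemes, in plain Python rather than Triton: the legacy x-tiled layout needs the tiling factor to locate K[b, h, n, c], whereas stride-based addressing only needs the four strides of the view. The function names and the example sizes are assumptions; the strides shown are those of a K view taken from a dense (B, H, N, 2*C) buffer, with the V view starting C elements later.

```python
def legacy_k_offset(b, h, n, c, *, H, N, C, x):
    """Offset of K[b, h, n, c] in a dense (B, H, C // x, N, x) buffer."""
    return (((b * H + h) * (C // x) + c // x) * N + n) * x + c % x


def strided_offset(b, h, n, c, strides, storage_offset=0):
    """Offset of element [b, h, n, c] given explicit 4D strides."""
    s_b, s_h, s_n, s_c = strides
    return storage_offset + b * s_b + h * s_h + n * s_n + c * s_c


# Illustrative sizes only (assumed).
H, N, C, x = 2, 4, 16, 8

# Splitting a dense (B, H, N, 2*C) buffer on the last dim keeps the parent
# strides; K starts at storage offset 0 and V at storage offset C.
view_strides = (H * N * 2 * C, N * 2 * C, 2 * C, 1)

legacy = legacy_k_offset(1, 1, 2, 9, H=H, N=N, C=C, x=x)
k_new = strided_offset(1, 1, 2, 9, view_strides)
v_new = strided_offset(1, 1, 2, 9, view_strides, storage_offset=C)
print(legacy, k_new, v_new)
```

Because the strided form takes no tiling factor, the same addressing code works for any 4D view, contiguous or not, as long as the caller passes the view's actual strides.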

- TurboQuant: pass kv_cache tensor directly to Triton kernels instead of view(-1) which fails on non-contiguous stride-permuted tensors. The kernels already compute offsets via explicit strides.
- Fusion test: pass cache_dtype_str to get_kv_cache_shape so quantized caches (fp8) get the correct shape (2*H heads vs 2*C content dim).
- Kernel test: update KV cache layout from legacy x-factor tiled (B,H,C/8,N,8) to standard BHNC (B,H,N,C) matching the updated Triton kernels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…r_cache_regions

Introduce a polymorphic `AttentionLayerBase.bind_kv_cache(kv_cache)` so each layer unpacks its own allocation:

- default (attention/MLA): store the cache view as-is;
- Mamba: unpack a single ``[B, 1, 1, page_size_bytes]`` int8 page tensor into its per-state (conv/ssm) views.

The KV-cache bind orchestrator now calls `layer.bind_kv_cache(...)` instead of assigning `.kv_cache` directly, and the runner stores a single combined tensor per Mamba layer (rather than a [conv, ssm] view list).

Because the Mamba cache is now a single registrable tensor, the KV connector no longer needs the [conv, ssm] -> region bridge: remove `TransferTopology.get_transfer_cache_regions` and register one region per layer in nixl/worker.

Scope: standard attention `get_kv_cache_shape`/`get_kv_cache_stride_order` are unchanged (no reshape_kv_cache adoption, no cross-layer `L` dim); cross-layer block support is retained. Builds on the K/V content-packing PR.

AI assistance (Claude) was used for this change.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
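The shape of the hook can be sketched without PyTorch, using a `bytearray` as a stand-in for the int8 page tensor and `memoryview` slices as zero-copy views. Everything here is a simplified assumption: the class bodies, the `MambaLayer` constructor arguments, the `bind_all` helper, and the choice to place conv state before ssm state inside each page are illustrative and not taken from the vLLM source.

```python
class AttentionLayerBase:
    """Hypothetical stand-in: the default hook stores the cache as-is."""

    def bind_kv_cache(self, kv_cache):
        self.kv_cache = kv_cache


class MambaLayer(AttentionLayerBase):
    """Unpacks one combined page buffer into per-state views."""

    def __init__(self, conv_bytes, ssm_bytes):
        self.conv_bytes = conv_bytes
        self.ssm_bytes = ssm_bytes

    def bind_kv_cache(self, kv_cache):
        page = self.conv_bytes + self.ssm_bytes
        pages = memoryview(kv_cache)  # zero-copy window onto the buffer
        assert len(pages) % page == 0, "buffer must hold whole pages"
        num_blocks = len(pages) // page
        # Assumed page layout: conv state first, then ssm state.
        conv = [pages[b * page : b * page + self.conv_bytes] for b in range(num_blocks)]
        ssm = [pages[b * page + self.conv_bytes : (b + 1) * page] for b in range(num_blocks)]
        self.kv_cache = (conv, ssm)


def bind_all(layers, caches):
    """Orchestrator: delegate to each layer instead of assigning .kv_cache."""
    for name, layer in layers.items():
        layer.bind_kv_cache(caches[name])


# Usage: one attention-style layer and one Mamba layer with 2 pages of 6 bytes.
layers = {"attn": AttentionLayerBase(), "mamba": MambaLayer(conv_bytes=2, ssm_bytes=4)}
caches = {"attn": bytearray(8), "mamba": bytearray(12)}
bind_all(layers, caches)

conv_views, ssm_views = layers["mamba"].kv_cache
conv_views[1][0] = 7  # writes through to the combined buffer
ssm_views[0][0] = 9
```

The point of the design is that the caller keeps exactly one buffer per layer, so a connector can register it as a single region, while the layer still sees its own per-state views.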

…ing) With K and V packed into the content dim, attention KV caches are always blocks-first (num_blocks is dim 0), so get_kv_cache_block_dim returns 0 and _update_hybrid_attention_layout short-circuits for every group -- it never re-strides anything. Drop the now-dead MRV2 helper, its call site, and the unused has_attn/has_mamba bookkeeping. Co-authored-by: Claude Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

mergify · 2026-06-12T22:00:33Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the v1 and kv-connector labels Jun 3, 2026

LucasWilkinson mentioned this pull request Jun 3, 2026

[4/N][KV-Cache Layout Refactor] Standardize KV cache layout #44458

Draft

LucasWilkinson force-pushed the lwilkinson/kv-layout/bind-kv-cache branch from 6fd5bd1 to 217144b Compare June 4, 2026 00:08

LucasWilkinson force-pushed the lwilkinson/kv-layout/bind-kv-cache branch from 217144b to 3892e7c Compare June 4, 2026 03:24

LucasWilkinson changed the title ~~[KVCache] Standardize Mamba cache via bind_kv_cache; drop get_transfer_cache_regions~~ [3/N][KV-Cache Layout Refactor] Standardize Mamba cache; drop get_transfer_cache_regions Jun 4, 2026

This was referenced Jun 4, 2026

[1/N][KV-Cache Layout Refactor] Refactor DSV4 KV cache config construction #44454

Merged

[2/N][KV-Cache Layout Refactor] Pack K/V into the content dim across attention backends #44455

Open

LucasWilkinson mentioned this pull request Jun 4, 2026

[Core][WIP][1/N] Standardize kv layout #42374

Open

LucasWilkinson force-pushed the lwilkinson/kv-layout/kv-content-pack branch from 6bf3d4c to f8182a2 Compare June 4, 2026 04:10

LucasWilkinson force-pushed the lwilkinson/kv-layout/bind-kv-cache branch from 836522d to 5b204ae Compare June 4, 2026 04:10

LucasWilkinson force-pushed the lwilkinson/kv-layout/kv-content-pack branch 6 times, most recently from 3771e86 to 104e9bb Compare June 5, 2026 14:43

mergify Bot added the needs-rebase label Jun 5, 2026

LucasWilkinson and others added 7 commits June 7, 2026 17:22

cleanup

201628b

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

LucasWilkinson and others added 4 commits June 7, 2026 17:22

cleanup

2865af2

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

a7c6f83

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

94aba2f

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

LucasWilkinson force-pushed the lwilkinson/kv-layout/kv-content-pack branch from e010548 to dad7aa4 Compare June 7, 2026 21:23

LucasWilkinson force-pushed the lwilkinson/kv-layout/bind-kv-cache branch from 5b204ae to df58401 Compare June 7, 2026 21:38

mergify Bot removed the needs-rebase label Jun 7, 2026

LucasWilkinson and others added 3 commits June 7, 2026 23:26

LucasWilkinson force-pushed the lwilkinson/kv-layout/bind-kv-cache branch from df58401 to 9efca99 Compare June 8, 2026 03:27

LucasWilkinson force-pushed the lwilkinson/kv-layout/kv-content-pack branch from eebcf35 to 7bf955f Compare June 12, 2026 21:44

mergify Bot added the needs-rebase, ci/build, nvidia, and rocm labels Jun 12, 2026

github-project-automation Bot added this to AMD and NVIDIA Jun 12, 2026

github-project-automation Bot moved this to Todo in AMD Jun 12, 2026

LucasWilkinson force-pushed the lwilkinson/kv-layout/kv-content-pack branch 2 times, most recently from c61f39f to ca5cf8a Compare June 13, 2026 04:48

LucasWilkinson wants to merge 14 commits into lwilkinson/kv-layout/kv-content-pack from lwilkinson/kv-layout/bind-kv-cache
