[3/N][KV-Cache Layout Refactor] Standardize Mamba cache; drop get_transfer_cache_regions #44456

Draft
LucasWilkinson wants to merge 14 commits into lwilkinson/kv-layout/kv-content-pack from lwilkinson/kv-layout/bind-kv-cache

@LucasWilkinson LucasWilkinson (Collaborator) commented Jun 3, 2026

PR #42374 (first part of RFC #42082) has been split into 4 PRs:

#44454 [1/N][KV-Cache Layout Refactor] Refactor DSV4 KV cache config
#44455 [2/N][KV-Cache Layout Refactor] Pack K/V into the content dim across attention backends
-> #44456 [3/N][KV-Cache Layout Refactor] Standardize Mamba cache; drop get_transfer_cache_regions
#44458 [4/N][KV-Cache Layout Refactor] Standardize KV cache layout

Summary

Move Mamba (and conv/ssm) reshaping into a new bind_kv_cache hook; drop the now-unnecessary TransferTopology.get_transfer_cache_regions from the connector. This is an intermediate step toward #42374 that does not yet introduce the cross-layer L dim.

Stacked on top of #44455 (pack K/V into the content dim).

Testing

Mamba bind_kv_cache unpacking and the single-tensor NIXL/mooncake Mamba registration require GPU + multi-node P/D setups to exercise (hybrid attention+Mamba models hit both bind paths). The submitter will validate these on appropriate hardware before merge.

AI assistance

This PR was prepared with AI assistance (Claude).

@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/kv-layout/bind-kv-cache branch from 6fd5bd1 to 217144b Compare June 4, 2026 00:08
LucasWilkinson added a commit that referenced this pull request Jun 4, 2026
…APIs

Final step of the KV-cache layout standardization ladder, stacked on
top of bind_kv_cache (#44456). Introduces the standardized layout
resolution (KVCacheLayout / resolve_kv_cache_layout) and reshape_kv_cache,
removes get_kv_cache_shape / get_kv_cache_stride_order entirely, and
removes the remaining cross-layer block machinery from the connector.

Co-authored-by: Claude

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/kv-layout/bind-kv-cache branch from 217144b to 3892e7c Compare June 4, 2026 03:24
@LucasWilkinson LucasWilkinson changed the title from "[KVCache] Standardize Mamba cache via bind_kv_cache; drop get_transfer_cache_regions" to "[3/N][KV-Cache Layout Refactor] Standardize Mamba cache; drop get_transfer_cache_regions" on Jun 4, 2026
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/kv-layout/kv-content-pack branch from 6bf3d4c to f8182a2 Compare June 4, 2026 04:10
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/kv-layout/bind-kv-cache branch from 836522d to 5b204ae Compare June 4, 2026 04:10
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/kv-layout/kv-content-pack branch 6 times, most recently from 3771e86 to 104e9bb Compare June 5, 2026 14:43
@mergify mergify Bot (Contributor) commented Jun 5, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 5, 2026
LucasWilkinson and others added 7 commits June 7, 2026 17:22
Standardize the attention KV cache to a packed, blocks-first, head-major
logical layout (num_blocks, num_kv_heads, block_size, 2*head_size) where K
and V are concatenated in the content dim. Backends recover K/V via
kv_cache.transpose(1, 2).split(head_size, dim=-1). nvfp4 instead stores K
and V as separate head groups: (num_blocks, 2*num_kv_heads, block_size,
full_dim), split on dim=1.
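
As a rough illustration of the packed layout described above, the following pure-Python sketch mimics what `kv_cache.transpose(1, 2).split(head_size, dim=-1)` yields, using nested lists in place of tensors. The helper name `split_packed_kv` and the toy sizes are assumptions made for this example; unlike the tensor call, which returns views, this sketch copies data.

```python
def split_packed_kv(kv_cache, head_size):
    """Split a packed cache into K and V (hypothetical helper).

    kv_cache: nested lists shaped
        [num_blocks][num_kv_heads][block_size][2 * head_size]
    Returns (k, v), each shaped
        [num_blocks][block_size][num_kv_heads][head_size],
    i.e. heads and tokens are swapped, then the content dim is halved.
    """
    k, v = [], []
    for block in kv_cache:
        num_heads = len(block)
        block_size = len(block[0])
        # First half of the content dim holds K, second half holds V.
        k.append([[block[h][n][:head_size] for h in range(num_heads)]
                  for n in range(block_size)])
        v.append([[block[h][n][head_size:] for h in range(num_heads)]
                  for n in range(block_size)])
    return k, v


# Toy cache: 1 block, 2 heads, block_size 2, head_size 2.
num_heads, block_size, head_size = 2, 2, 2
kv_cache = [[[[f"k{h}{n}{i}" for i in range(head_size)]
              + [f"v{h}{n}{i}" for i in range(head_size)]
              for n in range(block_size)]
             for h in range(num_heads)]]

k, v = split_packed_kv(kv_cache, head_size)
print(k[0][1][0])  # K row for token 1, head 0
print(v[0][1][0])  # V row for token 1, head 0
```

The point of the sketch is only that K and V for a given (block, head, token) slot sit side by side in the last dimension, so a split on that dimension separates them.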

get_kv_cache_shape / get_kv_cache_stride_order are retained (stride orders
updated to the 4D packed layout, plus the layers-dim variants for the
cross-layer/uniform-cache path); the generic allocation/reshape path and
NHD/HND layout handling are unchanged. Their removal is deferred to a later
PR.

Backends: flash_attn, flex_attention, triton_attn (incl. the
per-token-head-quant inline-scale path, packed to 4D), flashinfer (incl.
nvfp4 head-group layout, trtllm, cascade, kvfp8 dequant), rocm_attn
(+ PagedAttention.split_kv_cache), rocm_aiter_fa, rocm_aiter_unified_attn,
turboquant_attn, flash_attn_diffkv, cpu_attn. torch_utils exposes
nvfp4_split_data_scale (single-side) and drops nvfp4_kv_cache_split_views.
Tests updated to build the packed cache shapes.

AI assistance (Claude) was used for this change.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With K and V packed into one tensor per layer, KV connectors no longer
split K and V into separate regions. In TransferTopology, recognize the 4D
packed attention shape as blocks-first and drop split_k_and_v; the per-block
sub-split (virtually_split_kv_in_blocks) now applies only to Mamba's
conv/ssm state. nixl/worker and mooncake register a single region per layer.

get_transfer_cache_regions and cache_list are retained (they still bridge
Mamba's [conv, ssm] views to a registrable region); their removal — together
with bind_kv_cache-based Mamba standardization — is deferred to a follow-up
PR. Cross-layer block support is intentionally retained.

AI assistance (Claude) was used for this change.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
flex_attention.get_kv_cache_stride_order still returned the pre-packing
5D/6D orderings while its shape is now 4D (B, H, N, 2*C), which breaks
the runtime assertion len(stride_order) == len(shape) and corrupts the
physical layout. Drop the trailing index to match the packed shape
(matching flash_attn's NHD ordering). Also update test_kv_cache_stride_order
to use 4-element strides.
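
The length mismatch described above can be sketched in a few lines. This is a minimal illustration, assuming the stride order is a permutation applied to the logical shape to obtain the physical allocation shape; `apply_stride_order`, the example sizes, and the specific orderings are hypothetical and not copied from the backends.

```python
from typing import Sequence, Tuple


def apply_stride_order(shape: Sequence[int],
                       stride_order: Sequence[int]) -> Tuple[int, ...]:
    """Permute a logical shape into a physical shape (hypothetical helper)."""
    if len(stride_order) != len(shape):
        raise ValueError(
            f"stride_order has {len(stride_order)} entries, "
            f"shape has {len(shape)} dims")
    if sorted(stride_order) != list(range(len(shape))):
        raise ValueError("stride_order is not a permutation of the dims")
    return tuple(shape[i] for i in stride_order)


packed_shape = (1024, 8, 16, 256)   # (B, H, N, 2*C), illustrative sizes
nhd_like_order = (0, 2, 1, 3)       # illustrative 4-element ordering
physical = apply_stride_order(packed_shape, nhd_like_order)
print(physical)  # (1024, 16, 8, 256)

stale_order = (0, 2, 1, 3, 4)       # 5 entries left over from a 5D layout
try:
    apply_stride_order(packed_shape, stale_order)
    stale_error = None
except ValueError as exc:
    stale_error = str(exc)
print(stale_error)
```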

Co-authored-by: Claude

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
With K/V packed into a single contiguous region per block, the NIXL and
Mooncake transfer paths register one region per layer and coalesce block
transfers instead of emitting separate K/V halves. Update the unit tests
to match: detect the 4D blocks-first layout, expect one entry per tensor,
and expect coalesced (non-split) block transfers.

Co-authored-by: Claude

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
The nixl connector was split into a `nixl/` package; the AMD CI
source_file_dependencies still referenced the old single-file
`nixl_connector.py` path, which no longer exists. Point them at the
`nixl/` directory so the NIXL integration steps trigger correctly.

Co-authored-by: Claude

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Quantized KV caches (fp8, nvfp4) now use (B, 2*H, N, hs) in FlashInfer,
storing K and V as separate head groups. This allows zero-copy .view()
to (B, 2, H, N, hs) for trtllm_prefill_attn_kvfp8_dequant, avoiding a
full-cache .contiguous() copy. Non-quantized caches keep (B, H, N, 2*hs).
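
One way to see why the head-group layout permits a zero-copy reinterpretation is to compare row-major offsets. The sketch below assumes a contiguous buffer in which the K heads occupy the first H head slots and the V heads the next H; the helper name and the toy sizes are illustrative only.

```python
from itertools import product


def row_major_offset(index, shape):
    """Flat element offset of `index` in a contiguous row-major array."""
    offset = 0
    for i, dim in zip(index, shape):
        offset = offset * dim + i
    return offset


B, H, N, hs = 2, 3, 4, 5
grouped_shape = (B, 2 * H, N, hs)   # K and V as separate head groups
viewed_shape = (B, 2, H, N, hs)     # explicit K/V axis

# Every element lands at the same flat offset under both shapes, so the
# 5D shape is just a different way of indexing the same memory.
offsets_match = all(
    row_major_offset((b, kv * H + h, n, d), grouped_shape)
    == row_major_offset((b, kv, h, n, d), viewed_shape)
    for b, kv, h, n, d in product(range(B), range(2), range(H),
                                  range(N), range(hs))
)
print(offsets_match)  # True
```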

Restores canonicalize_singleton_dim_strides in both TRTLLM paths
(prefill fp8 dequant + decode) that were dropped during the layout
refactor.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
LucasWilkinson and others added 4 commits June 7, 2026 17:22
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
The old split_kv_cache used .view() on a non-contiguous tensor
(slice + transpose), which would crash at runtime. Instead, split
K/V directly on the content dim with kv_cache.split(head_size, dim=-1)
producing zero-copy (B, H, N, C) views.

Updated the Triton kernels (kernel_paged_attention_2d, _fwd_kernel,
_fwd_kernel_alibi) to use 4D stride-based K addressing, dropping the
legacy x-factor tiling that was only needed for the old interleaved
(B, H, C//x, N, x) layout.
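
To make the addressing change concrete, the sketch below compares element offsets under the legacy x-factor tiled K layout (B, H, C//x, N, x) with generic stride-based addressing on a 4D (B, H, N, C) view. The function names, the toy sizes, and the assumption that the K view comes from a contiguous packed (B, H, N, 2*C) cache are illustrative and not taken from the kernels.

```python
def tiled_k_offset(b, h, c, n, *, H, C, N, x):
    """Offset of K[b, h, c, n] in a contiguous (B, H, C//x, N, x) buffer."""
    return ((((b * H + h) * (C // x) + c // x) * N + n) * x) + c % x


def strided_offset(index, strides):
    """Offset of an element given per-dimension strides (in elements)."""
    return sum(i * s for i, s in zip(index, strides))


H, C, N, x = 2, 16, 4, 8

# Strides of the K half of a contiguous (B, H, N, 2*C) cache, viewed as
# (B, H, N, C): the token stride is 2*C because V follows K in each row.
k_view_strides = (H * N * 2 * C, N * 2 * C, 2 * C, 1)

legacy = tiled_k_offset(0, 1, 9, 2, H=H, C=C, N=N, x=x)
packed = strided_offset((0, 1, 2, 9), k_view_strides)
print(legacy, packed)
```

With explicit strides the kernel needs no knowledge of `x`; the same addressing formula also covers the V half, which starts C elements later in each row.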

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/kv-layout/kv-content-pack branch from e010548 to dad7aa4 Compare June 7, 2026 21:23
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/kv-layout/bind-kv-cache branch from 5b204ae to df58401 Compare June 7, 2026 21:38
@mergify mergify Bot removed the needs-rebase label Jun 7, 2026
LucasWilkinson and others added 3 commits June 7, 2026 23:26
- TurboQuant: pass the kv_cache tensor directly to the Triton kernels
  instead of calling view(-1), which fails on non-contiguous,
  stride-permuted tensors. The kernels already compute offsets via
  explicit strides.
- Fusion test: pass cache_dtype_str to get_kv_cache_shape so quantized
  caches (fp8) get the correct shape (2*H heads vs 2*C content dim).
- Kernel test: update KV cache layout from legacy x-factor tiled
  (B,H,C/8,N,8) to standard BHNC (B,H,N,C) matching the updated
  Triton kernels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r_cache_regions

Introduce a polymorphic `AttentionLayerBase.bind_kv_cache(kv_cache)` so each
layer unpacks its own allocation:
  - default (attention/MLA): store the cache view as-is;
  - Mamba: unpack a single ``[B, 1, 1, page_size_bytes]`` int8 page tensor
    into its per-state (conv/ssm) views.
The KV-cache bind orchestrator now calls `layer.bind_kv_cache(...)` instead
of assigning `.kv_cache` directly, and the runner stores a single combined
tensor per Mamba layer (rather than a [conv, ssm] view list).
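
The dispatch pattern described above might look roughly like the following pure-Python sketch. Everything here is hypothetical: the class names, the use of `memoryview` in place of an int8 tensor, the byte sizes, and the assumption that the conv state precedes the ssm state inside each page.

```python
class LayerSketch:
    """Hypothetical stand-in for a layer base class."""

    def bind_kv_cache(self, kv_cache: memoryview) -> None:
        # Default (attention/MLA): keep the cache view as-is.
        self.kv_cache = kv_cache


class MambaLayerSketch(LayerSketch):
    def __init__(self, conv_bytes: int, ssm_bytes: int) -> None:
        self.conv_bytes = conv_bytes
        self.ssm_bytes = ssm_bytes

    def bind_kv_cache(self, kv_cache: memoryview) -> None:
        page = self.conv_bytes + self.ssm_bytes
        if len(kv_cache) % page != 0:
            raise ValueError("buffer is not a whole number of pages")
        num_blocks = len(kv_cache) // page
        # Slicing a memoryview does not copy, so these are views into the
        # single per-layer buffer.
        conv = [kv_cache[b * page:b * page + self.conv_bytes]
                for b in range(num_blocks)]
        ssm = [kv_cache[b * page + self.conv_bytes:(b + 1) * page]
               for b in range(num_blocks)]
        self.kv_cache = (conv, ssm)


def bind_all(layers, caches):
    """Orchestrator: let each layer unpack its own allocation."""
    for name, layer in layers.items():
        layer.bind_kv_cache(caches[name])


backing = bytearray(2 * 6)  # 2 blocks with 6-byte pages
layers = {"attn": LayerSketch(),
          "mamba": MambaLayerSketch(conv_bytes=2, ssm_bytes=4)}
caches = {"attn": memoryview(bytearray(8)), "mamba": memoryview(backing)}
bind_all(layers, caches)

conv_views, ssm_views = layers["mamba"].kv_cache
ssm_views[1][0] = 7  # writes through to the single backing buffer
```

Because the orchestrator only ever hands each layer one buffer, anything that needs to register memory (such as a connector) can work from that single buffer without knowing how the layer subdivides it.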

Because the Mamba cache is now a single registrable tensor, the KV connector
no longer needs the [conv, ssm] -> region bridge: remove
`TransferTopology.get_transfer_cache_regions` and register one region per
layer in nixl/worker.

Scope: standard attention `get_kv_cache_shape`/`get_kv_cache_stride_order`
are unchanged (no reshape_kv_cache adoption, no cross-layer `L` dim);
cross-layer block support is retained. Builds on the K/V content-packing PR.

AI assistance (Claude) was used for this change.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing)

With K and V packed into the content dim, attention KV caches are always
blocks-first (num_blocks is dim 0), so get_kv_cache_block_dim returns 0
and _update_hybrid_attention_layout short-circuits for every group --
it never re-strides anything. Drop the now-dead MRV2 helper, its call
site, and the unused has_attn/has_mamba bookkeeping.

Co-authored-by: Claude

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/kv-layout/bind-kv-cache branch from df58401 to 9efca99 Compare June 8, 2026 03:27
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/kv-layout/kv-content-pack branch from eebcf35 to 7bf955f Compare June 12, 2026 21:44
@mergify mergify Bot (Contributor) commented Jun 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@github-project-automation github-project-automation Bot moved this to Todo in AMD Jun 12, 2026
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/kv-layout/kv-content-pack branch 2 times, most recently from c61f39f to ca5cf8a Compare June 13, 2026 04:48