[3/N][KV-Cache Layout Refactor] Standardize Mamba cache; drop get_transfer_cache_regions #44456
Draft
LucasWilkinson wants to merge 14 commits into
LucasWilkinson
added a commit
that referenced
this pull request
Jun 4, 2026
…APIs
Final step of the KV-cache layout standardization ladder, stacked on top of bind_kv_cache (#44456). Introduces the standardized layout resolution (KVCacheLayout / resolve_kv_cache_layout) and reshape_kv_cache, removes get_kv_cache_shape / get_kv_cache_stride_order entirely, and removes the remaining cross-layer block machinery from the connector.
Co-authored-by: Claude
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Contributor
This pull request has merge conflicts that must be resolved before it can be
Standardize the attention KV cache to a packed, blocks-first, head-major logical layout (num_blocks, num_kv_heads, block_size, 2*head_size) where K and V are concatenated in the content dim. Backends recover K/V via kv_cache.transpose(1, 2).split(head_size, dim=-1).
nvfp4 instead stores K and V as separate head groups: (num_blocks, 2*num_kv_heads, block_size, full_dim), split on dim=1.
get_kv_cache_shape / get_kv_cache_stride_order are retained (stride orders updated to the 4D packed layout, plus the layers-dim variants for the cross-layer/uniform-cache path); the generic allocation/reshape path and NHD/HND layout handling are unchanged. Their removal is deferred to a later PR.
Backends: flash_attn, flex_attention, triton_attn (incl. the per-token-head-quant inline-scale path, packed to 4D), flashinfer (incl. nvfp4 head-group layout, trtllm, cascade, kvfp8 dequant), rocm_attn (+ PagedAttention.split_kv_cache), rocm_aiter_fa, rocm_aiter_unified_attn, turboquant_attn, flash_attn_diffkv, cpu_attn.
torch_utils exposes nvfp4_split_data_scale (single-side) and drops nvfp4_kv_cache_split_views. Tests updated to build the packed cache shapes.
AI assistance (Claude) was used for this change.
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
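To make the packed layout concrete, here is a minimal stdlib-only sketch of where K[b, h, n, c] and V[b, h, n, c] would sit in a flat buffer. It assumes a contiguous, row-major (num_blocks, num_kv_heads, block_size, 2*head_size) allocation; `packed_kv_offsets` is a hypothetical helper for illustration, not part of this PR, and real backends may permute the physical strides.

```python
def packed_kv_offsets(b, h, n, c, *, num_kv_heads, block_size, head_size):
    """Return flat element offsets of K[b, h, n, c] and V[b, h, n, c].

    Assumption: the buffer is contiguous and row-major with logical shape
    (num_blocks, num_kv_heads, block_size, 2 * head_size).
    """
    row = 2 * head_size  # K and V share one row of the content dim
    base = ((b * num_kv_heads + h) * block_size + n) * row
    # K occupies the first head_size elements of the row, V the rest.
    return base + c, base + head_size + c


k_off, v_off = packed_kv_offsets(
    1, 2, 3, 5, num_kv_heads=4, block_size=16, head_size=8
)
print(k_off, v_off)  # V sits exactly head_size elements after K
```

Under these assumptions, K and V for the same token and head are always `head_size` elements apart, which is why a split on the last dim can recover both without copying.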
With K and V packed into one tensor per layer, KV connectors no longer split K and V into separate regions.
In TransferTopology, recognize the 4D packed attention shape as blocks-first and drop split_k_and_v; the per-block sub-split (virtually_split_kv_in_blocks) now applies only to Mamba's conv/ssm state. nixl/worker and mooncake register a single region per layer.
get_transfer_cache_regions and cache_list are retained (they still bridge Mamba's [conv, ssm] views to a registrable region); their removal, together with bind_kv_cache-based Mamba standardization, is deferred to a follow-up PR. Cross-layer block support is intentionally retained.
AI assistance (Claude) was used for this change.
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
flex_attention.get_kv_cache_stride_order still returned the pre-packing 5D/6D orderings while its shape is now 4D (B, H, N, 2*C), which breaks the runtime assertion len(stride_order) == len(shape) and corrupts the physical layout. Drop the trailing index so the order matches the packed shape (the same as flash_attn's NHD ordering). Also update test_kv_cache_stride_order to use 4-element strides.
Co-authored-by: Claude
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
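The length mismatch described above can be illustrated with a small sketch. It assumes the stride order is a permutation that maps the logical shape to the allocation shape and is later undone by its inverse. `physical_shape_and_inverse` is a hypothetical helper, and both the NHD-style order `(0, 2, 1, 3)` and the sizes are assumptions chosen for illustration.

```python
def physical_shape_and_inverse(logical_shape, stride_order):
    """Map a logical shape to its allocation shape plus the permutation back."""
    rank = len(logical_shape)
    if len(stride_order) != rank:
        raise ValueError(
            f"stride order has {len(stride_order)} entries, shape has {rank} dims"
        )
    if sorted(stride_order) != list(range(rank)):
        raise ValueError("stride order must be a permutation of the dim indices")
    # The buffer is allocated in physical (permuted) order ...
    physical = tuple(logical_shape[i] for i in stride_order)
    # ... and the inverse permutation restores the logical dim order.
    inverse = tuple(stride_order.index(i) for i in range(rank))
    return physical, inverse


packed_shape = (8, 4, 16, 128)  # (B, H, N, 2*C), illustrative sizes
physical, inverse = physical_shape_and_inverse(packed_shape, (0, 2, 1, 3))
print(physical, inverse)
```

Passing a stale five-element order such as `(0, 2, 1, 3, 4)` with the 4D shape raises `ValueError` in this sketch, which mirrors the failed assertion the commit fixes.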
With K/V packed into a single contiguous region per block, the NIXL and Mooncake transfer paths register one region per layer and coalesce block transfers instead of emitting separate K/V halves. Update the unit tests to match: detect the 4D blocks-first layout, expect one entry per tensor, and expect coalesced (non-split) block transfers.
Co-authored-by: Claude
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
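A rough sketch of the difference in transfer descriptors follows. It does not use the NIXL or Mooncake APIs; `block_descriptors` is a hypothetical helper, and the `split_kv=True` branch is only an assumed stand-in for the earlier per-half behaviour.

```python
def block_descriptors(base_addr, block_ids, block_bytes, split_kv=False):
    """Return (address, length) pairs for the given blocks of one layer's region."""
    descs = []
    for block_id in block_ids:
        start = base_addr + block_id * block_bytes
        if split_kv:
            # Assumed pre-packing behaviour: one descriptor per K/V half.
            half = block_bytes // 2
            descs.append((start, half))
            descs.append((start + half, half))
        else:
            # Packed layout: one descriptor covers the whole block.
            descs.append((start, block_bytes))
    return descs


coalesced = block_descriptors(0x1000, [0, 3], block_bytes=4096)
split = block_descriptors(0x1000, [0, 3], block_bytes=4096, split_kv=True)
print(len(coalesced), len(split))
```

With the packed layout the descriptor count equals the block count, which is what the updated unit tests are described as expecting.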
The nixl connector was split into a `nixl/` package; the AMD CI source_file_dependencies still referenced the old single-file `nixl_connector.py` path, which no longer exists. Point them at the `nixl/` directory so the NIXL integration steps trigger correctly.
Co-authored-by: Claude
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Quantized KV caches (fp8, nvfp4) now use (B, 2*H, N, hs) in FlashInfer, storing K and V as separate head groups. This allows a zero-copy .view() to (B, 2, H, N, hs) for trtllm_prefill_attn_kvfp8_dequant, avoiding a full-cache .contiguous() copy. Non-quantized caches keep (B, H, N, 2*hs).
Restores the canonicalize_singleton_dim_strides calls in both TRTLLM paths (prefill fp8 dequant + decode) that were dropped during the layout refactor.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
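Why the head-group layout admits a zero-copy view can be seen from stride arithmetic alone. The sketch below is stdlib-only and uses hypothetical helper names: splitting one dim of size 2*H with stride s into (2, H) yields strides (H*s, s), so no data has to move.

```python
def contiguous_strides(shape):
    """Row-major element strides for a dense tensor of the given shape."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def split_head_groups(shape, strides):
    """Describe (B, 2*H, N, hs) as (B, 2, H, N, hs) without moving data."""
    b, two_h, n, hs = shape
    if two_h % 2:
        raise ValueError("dim 1 must hold an even number of head groups")
    h = two_h // 2
    sb, sh, sn, sc = strides
    # The new leading factor of the split dim steps over H heads at a time.
    return (b, 2, h, n, hs), (sb, h * sh, sh, sn, sc)


shape = (8, 8, 16, 64)  # (B, 2*H, N, hs) with H = 4, illustrative sizes
new_shape, new_strides = split_head_groups(shape, contiguous_strides(shape))
print(new_shape, new_strides)
```

For a dense input the resulting strides equal the contiguous strides of the 5D shape, which is consistent with the view needing no copy.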
The old split_kv_cache used .view() on a non-contiguous tensor (slice + transpose), which would crash at runtime. Instead, split K/V directly on the content dim with kv_cache.split(head_size, dim=-1) producing zero-copy (B, H, N, C) views.
Updated the Triton kernels (kernel_paged_attention_2d, _fwd_kernel, _fwd_kernel_alibi) to use 4D stride-based K addressing, dropping the legacy x-factor tiling that was only needed for the old interleaved (B, H, C//x, N, x) layout.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
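The two addressing schemes can be compared with plain index arithmetic. This is a sketch under stated assumptions, not the Triton kernel code: both helpers are hypothetical, the legacy formula assumes a dense (B, H, C//x, N, x) buffer, and the strides passed to the second helper assume K is a view into a dense packed (B, H, N, 2*C) buffer.

```python
def legacy_k_offset(b, h, n, c, *, num_heads, block_size, head_size, x):
    """Element offset of K[b, h, n, c] in a dense (B, H, C // x, N, x) buffer."""
    tile = (b * num_heads + h) * (head_size // x) + c // x
    return (tile * block_size + n) * x + c % x


def strided_k_offset(b, h, n, c, strides):
    """Element offset of K[b, h, n, c] given explicit 4D strides."""
    sb, sh, sn, sc = strides
    return b * sb + h * sh + n * sn + c * sc


H, N, C, X = 2, 4, 16, 8  # illustrative sizes
# K as a view of the packed cache: each row spans 2 * C elements.
k_strides = (H * N * 2 * C, N * 2 * C, 2 * C, 1)

old = legacy_k_offset(1, 1, 2, 9, num_heads=H, block_size=N, head_size=C, x=X)
new = strided_k_offset(1, 1, 2, 9, k_strides)
print(old, new)
```

The stride-based form needs no knowledge of `x`, so the same kernel arithmetic works for any layout that can be described by four strides.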
- TurboQuant: pass kv_cache tensor directly to Triton kernels instead of view(-1) which fails on non-contiguous stride-permuted tensors. The kernels already compute offsets via explicit strides.
- Fusion test: pass cache_dtype_str to get_kv_cache_shape so quantized caches (fp8) get the correct shape (2*H heads vs 2*C content dim).
- Kernel test: update KV cache layout from legacy x-factor tiled (B,H,C/8,N,8) to standard BHNC (B,H,N,C) matching the updated Triton kernels.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r_cache_regions
Introduce a polymorphic `AttentionLayerBase.bind_kv_cache(kv_cache)` so each
layer unpacks its own allocation:
- default (attention/MLA): store the cache view as-is;
- Mamba: unpack a single ``[B, 1, 1, page_size_bytes]`` int8 page tensor
into its per-state (conv/ssm) views.
The KV-cache bind orchestrator now calls `layer.bind_kv_cache(...)` instead
of assigning `.kv_cache` directly, and the runner stores a single combined
tensor per Mamba layer (rather than a [conv, ssm] view list).
Because the Mamba cache is now a single registrable tensor, the KV connector
no longer needs the [conv, ssm] -> region bridge: remove
`TransferTopology.get_transfer_cache_regions` and register one region per
layer in nixl/worker.
Scope: standard attention `get_kv_cache_shape`/`get_kv_cache_stride_order`
are unchanged (no reshape_kv_cache adoption, no cross-layer `L` dim);
cross-layer block support is retained. Builds on the K/V content-packing PR.
AI assistance (Claude) was used for this change.
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
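The polymorphic bind described in this commit message can be sketched with the standard library alone. Everything below is an illustrative assumption rather than the vLLM implementation: `LayerBase`, `MambaLikeLayer`, and `bind_all` are hypothetical names, a `bytearray` stands in for the int8 page tensor, and the conv state is assumed to come first within each page.

```python
class LayerBase:
    """Hypothetical stand-in for a layer that owns a KV cache."""

    def bind_kv_cache(self, kv_cache):
        # Default case (attention/MLA in the text): keep the cache as-is.
        self.kv_cache = kv_cache


class MambaLikeLayer(LayerBase):
    """Hypothetical layer whose page holds a conv state, then an ssm state."""

    def __init__(self, page_size_bytes, conv_bytes):
        if not 0 < conv_bytes < page_size_bytes:
            raise ValueError("conv_bytes must fall strictly inside the page")
        self.page_size_bytes = page_size_bytes
        self.conv_bytes = conv_bytes

    def bind_kv_cache(self, kv_cache):
        pages = memoryview(kv_cache)  # zero-copy window over the buffer
        if len(pages) % self.page_size_bytes:
            raise ValueError("buffer is not a whole number of pages")
        conv, ssm = [], []
        for start in range(0, len(pages), self.page_size_bytes):
            split = start + self.conv_bytes
            conv.append(pages[start:split])
            ssm.append(pages[split:start + self.page_size_bytes])
        self.kv_cache = (conv, ssm)


def bind_all(layers, caches):
    """Orchestrator: delegate unpacking to each layer."""
    for name, layer in layers.items():
        layer.bind_kv_cache(caches[name])


layers = {"attn": LayerBase(), "mamba": MambaLikeLayer(16, conv_bytes=4)}
caches = {"attn": bytearray(32), "mamba": bytearray(48)}
bind_all(layers, caches)
conv, ssm = layers["mamba"].kv_cache
ssm[1][0] = 7  # writes through to the single combined buffer
```

Because each layer still owns one underlying buffer, a connector in this sketch could register `caches["mamba"]` as a single region without knowing how the layer slices it.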
…ing)
With K and V packed into the content dim, attention KV caches are always blocks-first (num_blocks is dim 0), so get_kv_cache_block_dim returns 0 and _update_hybrid_attention_layout short-circuits for every group; it never re-strides anything. Drop the now-dead MRV2 helper, its call site, and the unused has_attn/has_mamba bookkeeping.
Co-authored-by: Claude
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
PR #42374 (first part of RFC #42082) has been split into 4 PRs:
#44454 [1/N][KV-Cache Layout Refactor] Refactor DSV4 KV cache config
#44455 [2/N][KV-Cache Layout Refactor] Pack K/V into the content dim across attention backends
-> #44456 [3/N][KV-Cache Layout Refactor] Standardize Mamba cache; drop get_transfer_cache_regions
#44458 [4/N][KV-Cache Layout Refactor] Standardize KV cache layout
Summary
Move Mamba (and conv/ssm) reshaping into a new `bind_kv_cache` hook; drop the now-unnecessary `TransferTopology.get_transfer_cache_regions` from the connector. This is an intermediate step toward #42374 that does not yet introduce the cross-layer `L` dim.
Stacked on top of #44455 (pack K/V into the content dim).
Testing
Mamba `bind_kv_cache` unpacking and the single-tensor NIXL/mooncake Mamba registration require GPU + multi-node P/D setups to exercise (hybrid attention+Mamba models hit both bind paths). The submitter will validate these on appropriate hardware before merge.
AI assistance
This PR was prepared with AI assistance (Claude).