[Core][WIP][1/N] Standardize kv layout by LucasWilkinson · Pull Request #42374 · vllm-project/vllm

LucasWilkinson · 2026-05-12T03:46:51Z

Part 1 of #42082.

NOTE: This PR has been split into 4 PRs:

#44454 [1/N][KV-Cache Layout Refactor] Refactor DSV4 KV cache config
#44455 [2/N][KV-Cache Layout Refactor] Pack K/V into the content dim across attention backends
#44456 [3/N][KV-Cache Layout Refactor] Standardize Mamba cache; drop get_transfer_cache_regions
#44458 [4/N][KV-Cache Layout Refactor] Standardize KV cache layout

mergify · 2026-05-12T03:47:31Z

Documentation preview: https://vllm--42374.org.readthedocs.build/en/42374/

mergify · 2026-05-12T03:47:43Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist

Code Review

This pull request standardizes the KV cache physical layout and allocation logic across all attention backends in the V1 engine, implementing RFC #42082. It introduces a unified KVCacheLayout descriptor and standardizes logical shapes to lead with the block dimension, typically packing key and value data into the final dimension. This consolidation simplifies the model runner and KV connector logic by removing backend-specific shape and stride overrides. Feedback highlights a critical regression in GPUModelRunner where a layout heuristic is ambiguous when num_blocks is 2, leading to potential crashes. Furthermore, performance regressions were identified in the FlashInfer and FlexAttention backends due to memory copies triggered by .contiguous() calls and non-contiguous reshape operations in the hot path.

gemini-code-assist · 2026-05-12T03:50:33Z

+                    assert kv_cache.shape[1] != 2, (
+                        f"Cannot determine layout for tensor of shape {kv_cache.shape}"
+                    )


The heuristic if kv_cache.shape[0] == 2 is ambiguous when num_blocks == 2. In this case, both the legacy (2, num_blocks, ...) and the standardized (num_blocks, 2, ...) layouts would have 2 as the first two dimensions. The current implementation uses an assertion that will cause a crash in this valid configuration. This is a regression from the previous implementation which used a mock num_blocks to determine the dimension index reliably. Consider using the KVCacheSpec or another reliable metadata source to distinguish the layout.
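The ambiguity can be reproduced from shapes alone. The sketch below is illustrative only: `guess_kv_dim` is a hypothetical helper, not the code in GPUModelRunner, and the trailing dimensions are made-up sizes.

```python
def guess_kv_dim(shape):
    """Guess which of the first two dims is the K/V dim (size 2).

    Returns 0 for the legacy (2, num_blocks, ...) layout, 1 for the
    standardized (num_blocks, 2, ...) layout, and None when the shape
    alone cannot decide.
    """
    first_is_two = shape[0] == 2
    second_is_two = shape[1] == 2
    if first_is_two and second_is_two:
        return None  # num_blocks == 2: both layouts look identical
    if first_is_two:
        return 0
    if second_is_two:
        return 1
    return None


# Hypothetical sizes: 1024 blocks, block_size 16, 8 KV heads, head_size 128.
legacy = (2, 1024, 16, 8, 128)
standard = (1024, 2, 16, 8, 128)
# With num_blocks == 2, the legacy and standardized shapes are the same tuple.
tiny = (2, 2, 16, 8, 128)

print(guess_kv_dim(legacy), guess_kv_dim(standard), guess_kv_dim(tiny))
```

Any shape-only check hits the same `None` case, which is why the comment suggests taking the layout from metadata such as the KVCacheSpec instead.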

gemini-code-assist · 2026-05-12T03:50:33Z

+        kv_cache_tuple = (
+            kv_split[:, :, :, 0, :].contiguous(),
+            kv_split[:, :, :, 1, :].contiguous(),
+        )


The explicit calls to .contiguous() on KV cache slices in the forward pass will trigger memory copies every time the model runs. This is a significant performance regression for the FlashInfer backend, especially since this is in the hot path. While the TODO acknowledges this, standardizing the layout should ideally be compatible with the backend's requirements without needing copies. Consider updating the FlashInfer wrapper or the standardized allocation layout to avoid this overhead.
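To see why these slices are non-contiguous, the stride arithmetic can be sketched in plain Python. This assumes `kv_split` is a freshly allocated row-major tensor of shape (num_blocks, block_size, num_kv_heads, 2, head_size); that shape and the sizes are assumptions inferred from the indexing pattern, and the helpers are hypothetical stand-ins for what the tensor library tracks internally.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a freshly allocated tensor."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def select(shape, strides, dim):
    """Shape and strides after integer-indexing `dim` (the dim is dropped)."""
    return shape[:dim] + shape[dim + 1:], strides[:dim] + strides[dim + 1:]


def is_contiguous(shape, strides):
    # Simplified check; adequate here because no dim has size 1.
    return strides == contiguous_strides(shape)


# Assumed sizes: 4 blocks, block_size 16, 8 KV heads, K/V pair, head_size 128.
kv_shape = (4, 16, 8, 2, 128)
kv_strides = contiguous_strides(kv_shape)

# Equivalent of kv_split[:, :, :, 0, :]: drop dim 3, keep the other strides.
k_shape, k_strides = select(kv_shape, kv_strides, 3)

print(k_shape, k_strides, is_contiguous(k_shape, k_strides))
```

The head dimension keeps a stride of 2 * head_size because the V data is interleaved with K, so a `.contiguous()` call on the slice has to materialize a copy.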

gemini-code-assist · 2026-05-12T03:50:33Z

+            key_cache = key_cache.reshape(-1, self.num_kv_heads, self.head_size)
+            value_cache = value_cache.reshape(-1, self.num_kv_heads, self.head_size)


Using reshape(-1, ...) on key_cache and value_cache will trigger a memory copy if the slices are not contiguous. This happens whenever the physical layout is not NHD (e.g., HND), as the N and H dimensions will be swapped in memory relative to the logical view. This introduces a performance regression in the forward pass for FlexAttention. Consider using a method that avoids copies or ensures the layout is compatible with flattening.
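A rough way to check when flattening is copy-free: two adjacent dimensions can be merged as a view only if the outer stride equals the inner size times the inner stride. The sketch below applies that rule to a logical (num_blocks, block_size, num_kv_heads, head_size) cache; the sizes and the unpacked layout are assumptions for illustration, and `can_merge_leading_dims` is a hypothetical helper.

```python
def can_merge_leading_dims(shape, strides):
    """True if dims 0 and 1 can be flattened into one dim without a copy."""
    return strides[0] == shape[1] * strides[1]


# Assumed sizes: B blocks, N tokens per block, H KV heads, D head_size.
B, N, H, D = 4, 16, 8, 128
logical_shape = (B, N, H, D)

# Physical NHD: memory order matches the logical order.
nhd_strides = (N * H * D, H * D, D, 1)
# Physical HND: heads are outermost within a block, tokens inside each head.
hnd_strides = (H * N * D, D, N * D, 1)

print(can_merge_leading_dims(logical_shape, nhd_strides))  # view is possible
print(can_merge_leading_dims(logical_shape, hnd_strides))  # reshape must copy
```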

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

PagedAttention.split_kv_cache() produced views with dimensions [blocks, block_size, head_size//x, num_kv_heads, x] but the chunked_prefill_paged_decode kernel expects [blocks, num_kv_heads, head_size//x, block_size, x]. Dims 1 and 3 were swapped, causing the kernel to read head data at block offsets and vice versa, producing garbage output. Build cache views directly from the allocated [blocks, heads, block_size, 2*head_size] layout instead.

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- NIXL transfer: Mamba cache is now a single 4D tensor, not a (conv, ssm) tuple. Remove old unpack and hybrid layout transpose.
- chunked_prefill_paged_decode: restore block_size = value_cache.shape[3] (paged V cache is [B, H, D, N], not [B, H, N, D]).
- ROCm attention: use PagedAttention.split_kv_cache to minimize diff with main.

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bigPYJ1151 · 2026-06-03T14:31:48Z

Hi @LucasWilkinson I have standardized kv layout of the cpu_attn in #44393
Please use it, thanks :)

depthfirst-app · 2026-06-03T14:58:40Z

            stride_v_cache_1=value_cache.stride(1),
-            stride_v_cache_2=value_cache.stride(2),
-            stride_v_cache_3=value_cache.stride(3),
+            stride_v_cache_2=value_cache.stride(3),


🟠 Severity: HIGH

The V-cache stride swap assigns the block-position stride to offs_d (element indexer) and the element stride to internal_offsets (position indexer). For the permuted [B, H, hs, N] view, this causes the Triton kernel to compute wrong memory offsets (e.g. offset 1283 vs correct 773), reading KV data from other requests' blocks.

💡 Fix Suggestion

Suggestion: Swap the V-cache stride assignments at lines 483-484. Currently stride_v_cache_2 is assigned value_cache.stride(3) (the position stride) and stride_v_cache_3 is assigned value_cache.stride(2) (the element stride), but in the kernel stride_v_cache_2 is used with offs_d (element indexer) and stride_v_cache_3 is used with internal_offsets (position indexer). The fix is to assign them in the correct order: stride_v_cache_2=value_cache.stride(2) and stride_v_cache_3=value_cache.stride(3).

⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.

Suggested change

-            stride_v_cache_2=value_cache.stride(3),
+            stride_v_cache_2=value_cache.stride(2),
+            stride_v_cache_3=value_cache.stride(3),
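The two offsets quoted in the comment can be reproduced with a small calculation. The sketch assumes head_size = 128 with K and V packed into a 2 * head_size row (so the position stride is 256), element index 5 and in-block position 3; those values are inferred because they match the quoted numbers, not taken from the kernel source, and `v_cache_offset` is a hypothetical helper.

```python
def v_cache_offset(d, n, stride_d, stride_n):
    """Offset inside one (block, head) slab: element index d, position n."""
    return d * stride_d + n * stride_n


head_size = 128                 # assumed
position_stride = 2 * head_size  # one packed K/V row per token position
element_stride = 1               # elements of one token are adjacent

d, n = 5, 3  # element index within the head, token position within the block

# Strides passed by role: element stride with d, position stride with n.
correct = v_cache_offset(d, n, element_stride, position_stride)
# Strides swapped, as described in the comment.
swapped = v_cache_offset(d, n, position_stride, element_stride)

print(correct, swapped)  # 773 1283
```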

depthfirst-app · 2026-06-03T14:58:40Z

            " falling back to Triton implementation."
        )
-        real_block_size = value_cache.shape[3]
+        real_block_size = value_cache.shape[2]


🟡 Severity: MEDIUM

After PagedAttention.split_kv_cache refactor, value_cache is [B, H, hs, N]. shape[2] yields head_size (e.g. 128), not block_size (e.g. 16). This PHYSICAL_BLOCK_SIZE error makes the ROCm Triton decode kernel iterate far beyond allocated block boundaries, causing out-of-bounds GPU memory reads.

💡 Fix Suggestion

Suggestion: Change value_cache.shape[2] to value_cache.shape[3] to correctly extract block_size from the value cache tensor. After the split_kv_cache refactor, value_cache has shape [B, H, head_size, block_size], so shape[3] is the block size, not shape[2] (which is head_size). This is consistent with the correct usage at line 324.

⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.

Suggested change

-        real_block_size = value_cache.shape[2]
+        real_block_size = value_cache.shape[3]
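One way to make this class of mistake harder is to look dimensions up by name rather than by position. The sketch below is a hypothetical pattern, not something this PR introduces: `V_CACHE_DIMS` and `dim_size` are invented names, the dimension order follows the [B, H, head_size, block_size] shape described above, and the sizes 128 and 16 are the examples from the comment.

```python
# Assumed order of the value-cache view after split_kv_cache: [B, H, hs, N].
V_CACHE_DIMS = ("block", "kv_head", "head_size", "block_size")


def dim_size(shape, name):
    """Size of the named dimension; raises ValueError on an unknown name."""
    return shape[V_CACHE_DIMS.index(name)]


# Hypothetical cache: 1024 blocks, 8 KV heads, head_size 128, block_size 16.
value_cache_shape = (1024, 8, 128, 16)

real_block_size = dim_size(value_cache_shape, "block_size")
print(real_block_size)        # 16
print(value_cache_shape[2])   # 128, the head_size the buggy line picked up
```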

depthfirst-app · 2026-06-03T15:32:55Z

            kv_cache.stride(0),
-            kv_cache.stride(1),
            kv_cache.stride(2),
+            kv_cache.stride(1),


🟡 Severity: MEDIUM

The _tq_full_dequant_kv kernel expects (stride_cache_block, stride_cache_pos, stride_cache_head). After transpose(1,2) gives [B,N,H,C], stride(2) is the small head-stride and stride(1) is the large position-stride — swapped from what the kernel needs, causing incorrect memory address computation and out-of-bounds reads.

💡 Fix Suggestion

Suggestion: After kv_cache.transpose(1, 2), the tensor shape is [B, N, H, C], so stride(1) is the position stride and stride(2) is the head stride. The current code passes them in the wrong order to the _tq_full_dequant_kv kernel (which expects stride_cache_block, stride_cache_pos, stride_cache_head). Swap kv_cache.stride(2) and kv_cache.stride(1) on lines 746-747 so that stride(1) (position) is passed as stride_cache_pos and stride(2) (head) is passed as stride_cache_head.

⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.

Suggested change

-            kv_cache.stride(0),
-            kv_cache.stride(2),
-            kv_cache.stride(1),
+            kv_cache.stride(0),
+            kv_cache.stride(1),
+            kv_cache.stride(2),
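The effect of passing strides out of role order can be sketched without a GPU. A transpose swaps shape entries and strides together, so after `transpose(1, 2)` the position stride is always `stride(1)` and the head stride is always `stride(2)`. The sizes below and the underlying memory order (chosen so that the position stride is the larger one, as the comment describes) are assumptions, and `transpose` and `offset` are hypothetical helpers.

```python
def transpose(shape, strides, a, b):
    """Swap dims a and b of a strided view; no data moves."""
    shape, strides = list(shape), list(strides)
    shape[a], shape[b] = shape[b], shape[a]
    strides[a], strides[b] = strides[b], strides[a]
    return tuple(shape), tuple(strides)


def offset(block, pos, head, stride_block, stride_pos, stride_head):
    """Address computation in the (block, pos, head) order the kernel expects."""
    return block * stride_block + pos * stride_pos + head * stride_head


# Assumed sizes: 4 blocks, 8 heads, 16 positions, 64 bytes of content.
B, H, N, C = 4, 8, 16, 64
# Logical [B, H, N, C] view over memory stored in [B, N, H, C] order.
shape = (B, H, N, C)
strides = (N * H * C, C, H * C, 1)

t_shape, t_strides = transpose(shape, strides, 1, 2)  # now [B, N, H, C]

right = offset(1, 2, 3, t_strides[0], t_strides[1], t_strides[2])
wrong = offset(1, 2, 3, t_strides[0], t_strides[2], t_strides[1])
print(t_shape, t_strides, right, wrong)
```

With the strides passed by role the address is 9408; with the last two swapped it is 9856, which belongs to a different (position, head) pair.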

mergify · 2026-06-03T16:12:44Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the documentation, performance, nvidia, rocm, and intel-gpu labels May 12, 2026

github-project-automation Bot added this to NVIDIA and AMD May 12, 2026

mergify Bot added the cpu label May 12, 2026

github-project-automation Bot moved this to Todo in AMD May 12, 2026

mergify Bot added v1 kv-connector labels May 12, 2026

mergify Bot added the needs-rebase label May 12, 2026

gemini-code-assist Bot reviewed May 12, 2026

LucasWilkinson marked this pull request as ready for review May 13, 2026 04:32

LucasWilkinson requested review from ApostaC, NickLucche, WoosukKwon, gshtras, heheda12345, mgoin, njhill, orozery, pavanimajety, tdoublep, tjtanaa and zyongye as code owners May 13, 2026 04:32

claude Bot reviewed May 13, 2026

LucasWilkinson requested a review from vadiklyutiy as a code owner May 13, 2026 04:32

MatthewBonanni and others added 15 commits June 2, 2026 10:51

Merge branch 'main' into lwilkinson/kv-layout/core-standardize-and-re…

3aeb6d6

…move-legacy Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Merge branch 'lwilkinson/kv-layout/core-standardize-and-remove-legacy…

31aae79

…' of https://github.com/neuralmagic/vllm into lwilkinson/kv-layout/core-standardize-and-remove-legacy

cleanup

e767c7b

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

2f01f38

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

0393c06

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

be85d37

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

5b40dc4

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

e021cd0

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

3d1363c

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

ed65016

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

9316862

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

8deb263

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

Merge remote-tracking branch 'origin/main' into lwilkinson/kv-layout/…

ba18262

…core-standardize-and-remove-legacy # Conflicts: # vllm/v1/attention/backends/cpu_attn.py Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

cleanup

5845525

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

depthfirst-app Bot reviewed Jun 3, 2026

cleanup

847b910

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

depthfirst-app Bot reviewed Jun 3, 2026

LucasWilkinson mentioned this pull request Jun 3, 2026

[Refactor] Extract _bucket_layers_by_page_size from DSV4 KV cache config neuralmagic/vllm#166

Closed

4 tasks

ZhanqiuHu mentioned this pull request Jun 4, 2026

Integrate New KV Cache Layout into NIXL worker ZhanqiuHu/vllm#10

Closed

sammshen mentioned this pull request Jun 6, 2026

[RFC]: Standardized KV Cache Layout LMCache/LMCache#3560

Open

ivanium mentioned this pull request Jun 10, 2026

[Attention] Re-enable cross-layer KV cache layout for MLA via stride-aware kernels #45111

Open

[Core][WIP][1/N] Standardize kv layout #42374
LucasWilkinson wants to merge 93 commits into vllm-project:main from neuralmagic:lwilkinson/kv-layout/core-standardize-and-remove-legacy


6 participants
