Skip to content

[Core][WIP][1/N] Standardize kv layout#42374

Open
LucasWilkinson wants to merge 93 commits into
vllm-project:mainfrom
neuralmagic:lwilkinson/kv-layout/core-standardize-and-remove-legacy
Open

[Core][WIP][1/N] Standardize kv layout#42374
LucasWilkinson wants to merge 93 commits into
vllm-project:mainfrom
neuralmagic:lwilkinson/kv-layout/core-standardize-and-remove-legacy

Conversation

@LucasWilkinson

@LucasWilkinson LucasWilkinson commented May 12, 2026

Copy link
Copy Markdown
Collaborator

part 1 of: #42082

NOTE! This PR has been broken up into 4

#44454 [1/N][KV-Cache Layout Refactor] Refactor DSV4 KV cache config
#44455 [2/N][KV-Cache Layout Refactor] Pack K/V into the content dim across attention backends
#44456 [3/N][KV-Cache Layout Refactor] Standardize Mamba cache; drop get_transfer_cache_regions
#44458 [4/N][KV-Cache Layout Refactor] Standardize KV cache layout

@mergify

mergify Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

Documentation preview: https://vllm--42374.org.readthedocs.build/en/42374/

@mergify mergify Bot added documentation Improvements or additions to documentation performance Performance-related issues nvidia rocm Related to AMD ROCm intel-gpu Related to Intel GPU labels May 12, 2026
@mergify mergify Bot added the cpu Related to CPU backends label May 12, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD May 12, 2026
@mergify

mergify Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 12, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request standardizes the KV cache physical layout and allocation logic across all attention backends in the V1 engine, implementing RFC #42082. It introduces a unified KVCacheLayout descriptor and standardizes logical shapes to lead with the block dimension, typically packing key and value data into the final dimension. This consolidation simplifies the model runner and KV connector logic by removing backend-specific shape and stride overrides. Feedback highlights a critical regression in GPUModelRunner where a layout heuristic is ambiguous when num_blocks is 2, leading to potential crashes. Furthermore, performance regressions were identified in the FlashInfer and FlexAttention backends due to memory copies triggered by .contiguous() calls and non-contiguous reshape operations in the hot path.

Comment thread vllm/v1/worker/gpu_model_runner.py Outdated
Comment on lines +6829 to +6831
assert kv_cache.shape[1] != 2, (
f"Cannot determine layout for tensor of shape {kv_cache.shape}"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The heuristic if kv_cache.shape[0] == 2 is ambiguous when num_blocks == 2. In this case, both the legacy (2, num_blocks, ...) and the standardized (num_blocks, 2, ...) layouts would have 2 as the first two dimensions. The current implementation uses an assertion that will cause a crash in this valid configuration. This is a regression from the previous implementation which used a mock num_blocks to determine the dimension index reliably. Consider using the KVCacheSpec or another reliable metadata source to distinguish the layout.

Comment on lines +1477 to +1480
kv_cache_tuple = (
kv_split[:, :, :, 0, :].contiguous(),
kv_split[:, :, :, 1, :].contiguous(),
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The explicit calls to .contiguous() on KV cache slices in the forward pass will trigger memory copies every time the model runs. This is a significant performance regression for the FlashInfer backend, especially since this is in the hot path. While the TODO acknowledges this, standardizing the layout should ideally be compatible with the backend's requirements without needing copies. Consider updating the FlashInfer wrapper or the standardized allocation layout to avoid this overhead.

Comment on lines +1110 to +1111
key_cache = key_cache.reshape(-1, self.num_kv_heads, self.head_size)
value_cache = value_cache.reshape(-1, self.num_kv_heads, self.head_size)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using reshape(-1, ...) on key_cache and value_cache will trigger a memory copy if the slices are not contiguous. This happens whenever the physical layout is not NHD (e.g., HND), as the N and H dimensions will be swapped in memory relative to the logical view. This introduces a performance regression in the forward pass for FlexAttention. Consider using a method that avoids copies or ensures the layout is compatible with flattening.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

MatthewBonanni and others added 15 commits June 2, 2026 10:51
…move-legacy

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
PagedAttention.split_kv_cache() produced views with dimensions
[blocks, block_size, head_size//x, num_kv_heads, x] but the
chunked_prefill_paged_decode kernel expects
[blocks, num_kv_heads, head_size//x, block_size, x].
Dims 1 and 3 were swapped, causing the kernel to read head data
at block offsets and vice versa — producing garbage output.

Build cache views directly from the allocated
[blocks, heads, block_size, 2*head_size] layout instead.

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NIXL transfer: Mamba cache is now a single 4D tensor, not (conv, ssm)
  tuple. Remove old unpack and hybrid layout transpose.
- chunked_prefill_paged_decode: restore block_size = value_cache.shape[3]
  (paged V cache is [B, H, D, N], not [B, H, N, D]).
- ROCm attention: use PagedAttention.split_kv_cache to minimize diff
  with main.

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…core-standardize-and-remove-legacy

# Conflicts:
#	vllm/v1/attention/backends/cpu_attn.py

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@bigPYJ1151

Copy link
Copy Markdown
Member

Hi @LucasWilkinson I have standardized kv layout of the cpu_attn in #44393
Please use it, thanks :)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
stride_v_cache_1=value_cache.stride(1),
stride_v_cache_2=value_cache.stride(2),
stride_v_cache_3=value_cache.stride(3),
stride_v_cache_2=value_cache.stride(3),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Severity: HIGH

The V-cache stride swap assigns the block-position stride to offs_d (element indexer) and the element stride to internal_offsets (position indexer). For the permuted [B, H, hs, N] view, this causes the Triton kernel to compute wrong memory offsets (e.g. offset 1283 vs correct 773), reading KV data from other requests' blocks.
Helpful? Add 👍 / 👎

💡 Fix Suggestion

Suggestion: Swap the V-cache stride assignments at lines 483-484. Currently stride_v_cache_2 is assigned value_cache.stride(3) (the position stride) and stride_v_cache_3 is assigned value_cache.stride(2) (the element stride), but in the kernel stride_v_cache_2 is used with offs_d (element indexer) and stride_v_cache_3 is used with internal_offsets (position indexer). The fix is to assign them in the correct order: stride_v_cache_2=value_cache.stride(2) and stride_v_cache_3=value_cache.stride(3).

⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.

Suggested change
stride_v_cache_2=value_cache.stride(3),
stride_v_cache_2=value_cache.stride(2),
stride_v_cache_3=value_cache.stride(3),

" falling back to Triton implementation."
)
real_block_size = value_cache.shape[3]
real_block_size = value_cache.shape[2]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Severity: MEDIUM

After PagedAttention.split_kv_cache refactor, value_cache is [B, H, hs, N]. shape[2] yields head_size (e.g. 128), not block_size (e.g. 16). This PHYSICAL_BLOCK_SIZE error makes the ROCm Triton decode kernel iterate far beyond allocated block boundaries, causing out-of-bounds GPU memory reads.
Helpful? Add 👍 / 👎

💡 Fix Suggestion

Suggestion: Change value_cache.shape[2] to value_cache.shape[3] to correctly extract block_size from the value cache tensor. After the split_kv_cache refactor, value_cache has shape [B, H, head_size, block_size], so shape[3] is the block size, not shape[2] (which is head_size). This is consistent with the correct usage at line 324.

⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.

Suggested change
real_block_size = value_cache.shape[2]
real_block_size = value_cache.shape[3]

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Comment on lines 745 to +747
kv_cache.stride(0),
kv_cache.stride(1),
kv_cache.stride(2),
kv_cache.stride(1),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Severity: MEDIUM

The _tq_full_dequant_kv kernel expects (stride_cache_block, stride_cache_pos, stride_cache_head). After transpose(1,2) gives [B,N,H,C], stride(2) is the small head-stride and stride(1) is the large position-stride — swapped from what the kernel needs, causing incorrect memory address computation and out-of-bounds reads.
Helpful? Add 👍 / 👎

💡 Fix Suggestion

Suggestion: After kv_cache.transpose(1, 2), the tensor shape is [B, N, H, C], so stride(1) is the position stride and stride(2) is the head stride. The current code passes them in the wrong order to the _tq_full_dequant_kv kernel (which expects stride_cache_block, stride_cache_pos, stride_cache_head). Swap kv_cache.stride(2) and kv_cache.stride(1) on lines 746-747 so that stride(1) (position) is passed as stride_cache_pos and stride(2) (head) is passed as stride_cache_head.

⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.

Suggested change
kv_cache.stride(0),
kv_cache.stride(1),
kv_cache.stride(2),
kv_cache.stride(1),
kv_cache.stride(0),
kv_cache.stride(1),
kv_cache.stride(2),

@mergify

mergify Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build cpu Related to CPU backends deepseek Related to DeepSeek models documentation Improvements or additions to documentation intel-gpu Related to Intel GPU kv-connector needs-rebase nvidia performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm v1

Projects

Status: Todo
Status: No status

Development

Successfully merging this pull request may close these issues.

6 participants