[Core][WIP][1/N] Standardize kv layout#42374
Conversation
|
Documentation preview: https://vllm--42374.org.readthedocs.build/en/42374/ |
|
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Code Review
This pull request standardizes the KV cache physical layout and allocation logic across all attention backends in the V1 engine, implementing RFC #42082. It introduces a unified KVCacheLayout descriptor and standardizes logical shapes to lead with the block dimension, typically packing key and value data into the final dimension. This consolidation simplifies the model runner and KV connector logic by removing backend-specific shape and stride overrides. Feedback highlights a critical regression in GPUModelRunner where a layout heuristic is ambiguous when num_blocks is 2, leading to potential crashes. Furthermore, performance regressions were identified in the FlashInfer and FlexAttention backends due to memory copies triggered by .contiguous() calls and non-contiguous reshape operations in the hot path.
| assert kv_cache.shape[1] != 2, ( | ||
| f"Cannot determine layout for tensor of shape {kv_cache.shape}" | ||
| ) |
There was a problem hiding this comment.
The heuristic if kv_cache.shape[0] == 2 is ambiguous when num_blocks == 2. In this case, both the legacy (2, num_blocks, ...) and the standardized (num_blocks, 2, ...) layouts would have 2 as the first two dimensions. The current implementation uses an assertion that will cause a crash in this valid configuration. This is a regression from the previous implementation which used a mock num_blocks to determine the dimension index reliably. Consider using the KVCacheSpec or another reliable metadata source to distinguish the layout.
| kv_cache_tuple = ( | ||
| kv_split[:, :, :, 0, :].contiguous(), | ||
| kv_split[:, :, :, 1, :].contiguous(), | ||
| ) |
There was a problem hiding this comment.
The explicit calls to .contiguous() on KV cache slices in the forward pass will trigger memory copies every time the model runs. This is a significant performance regression for the FlashInfer backend, especially since this is in the hot path. While the TODO acknowledges this, standardizing the layout should ideally be compatible with the backend's requirements without needing copies. Consider updating the FlashInfer wrapper or the standardized allocation layout to avoid this overhead.
| key_cache = key_cache.reshape(-1, self.num_kv_heads, self.head_size) | ||
| value_cache = value_cache.reshape(-1, self.num_kv_heads, self.head_size) |
There was a problem hiding this comment.
Using reshape(-1, ...) on key_cache and value_cache will trigger a memory copy if the slices are not contiguous. This happens whenever the physical layout is not NHD (e.g., HND), as the N and H dimensions will be swapped in memory relative to the logical view. This introduces a performance regression in the forward pass for FlexAttention. Consider using a method that avoids copies or ensures the layout is compatible with flattening.
…move-legacy Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
…' of https://github.com/neuralmagic/vllm into lwilkinson/kv-layout/core-standardize-and-remove-legacy
PagedAttention.split_kv_cache() produced views with dimensions [blocks, block_size, head_size//x, num_kv_heads, x] but the chunked_prefill_paged_decode kernel expects [blocks, num_kv_heads, head_size//x, block_size, x]. Dims 1 and 3 were swapped, causing the kernel to read head data at block offsets and vice versa — producing garbage output. Build cache views directly from the allocated [blocks, heads, block_size, 2*head_size] layout instead. Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- NIXL transfer: Mamba cache is now a single 4D tensor, not (conv, ssm) tuple. Remove old unpack and hybrid layout transpose. - chunked_prefill_paged_decode: restore block_size = value_cache.shape[3] (paged V cache is [B, H, D, N], not [B, H, N, D]). - ROCm attention: use PagedAttention.split_kv_cache to minimize diff with main. Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…core-standardize-and-remove-legacy # Conflicts: # vllm/v1/attention/backends/cpu_attn.py Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
Hi @LucasWilkinson I have standardized kv layout of the |
| stride_v_cache_1=value_cache.stride(1), | ||
| stride_v_cache_2=value_cache.stride(2), | ||
| stride_v_cache_3=value_cache.stride(3), | ||
| stride_v_cache_2=value_cache.stride(3), |
There was a problem hiding this comment.
🟠 Severity: HIGH
The V-cache stride swap assigns the block-position stride to offs_d (element indexer) and the element stride to internal_offsets (position indexer). For the permuted [B, H, hs, N] view, this causes the Triton kernel to compute wrong memory offsets (e.g. offset 1283 vs correct 773), reading KV data from other requests' blocks.
Helpful? Add 👍 / 👎
💡 Fix Suggestion
Suggestion: Swap the V-cache stride assignments at lines 483-484. Currently stride_v_cache_2 is assigned value_cache.stride(3) (the position stride) and stride_v_cache_3 is assigned value_cache.stride(2) (the element stride), but in the kernel stride_v_cache_2 is used with offs_d (element indexer) and stride_v_cache_3 is used with internal_offsets (position indexer). The fix is to assign them in the correct order: stride_v_cache_2=value_cache.stride(2) and stride_v_cache_3=value_cache.stride(3).
⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.
| stride_v_cache_2=value_cache.stride(3), | |
| stride_v_cache_2=value_cache.stride(2), | |
| stride_v_cache_3=value_cache.stride(3), |
| " falling back to Triton implementation." | ||
| ) | ||
| real_block_size = value_cache.shape[3] | ||
| real_block_size = value_cache.shape[2] |
There was a problem hiding this comment.
🟡 Severity: MEDIUM
After PagedAttention.split_kv_cache refactor, value_cache is [B, H, hs, N]. shape[2] yields head_size (e.g. 128), not block_size (e.g. 16). This PHYSICAL_BLOCK_SIZE error makes the ROCm Triton decode kernel iterate far beyond allocated block boundaries, causing out-of-bounds GPU memory reads.
Helpful? Add 👍 / 👎
💡 Fix Suggestion
Suggestion: Change value_cache.shape[2] to value_cache.shape[3] to correctly extract block_size from the value cache tensor. After the split_kv_cache refactor, value_cache has shape [B, H, head_size, block_size], so shape[3] is the block size, not shape[2] (which is head_size). This is consistent with the correct usage at line 324.
⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.
| real_block_size = value_cache.shape[2] | |
| real_block_size = value_cache.shape[3] |
| kv_cache.stride(0), | ||
| kv_cache.stride(1), | ||
| kv_cache.stride(2), | ||
| kv_cache.stride(1), |
There was a problem hiding this comment.
🟡 Severity: MEDIUM
The _tq_full_dequant_kv kernel expects (stride_cache_block, stride_cache_pos, stride_cache_head). After transpose(1,2) gives [B,N,H,C], stride(2) is the small head-stride and stride(1) is the large position-stride — swapped from what the kernel needs, causing incorrect memory address computation and out-of-bounds reads.
Helpful? Add 👍 / 👎
💡 Fix Suggestion
Suggestion: After kv_cache.transpose(1, 2), the tensor shape is [B, N, H, C], so stride(1) is the position stride and stride(2) is the head stride. The current code passes them in the wrong order to the _tq_full_dequant_kv kernel (which expects stride_cache_block, stride_cache_pos, stride_cache_head). Swap kv_cache.stride(2) and kv_cache.stride(1) on lines 746-747 so that stride(1) (position) is passed as stride_cache_pos and stride(2) (head) is passed as stride_cache_head.
⚠️ Experimental Feature: This code suggestion is automatically generated. Please review carefully.
| kv_cache.stride(0), | |
| kv_cache.stride(1), | |
| kv_cache.stride(2), | |
| kv_cache.stride(1), | |
| kv_cache.stride(0), | |
| kv_cache.stride(1), | |
| kv_cache.stride(2), |
|
This pull request has merge conflicts that must be resolved before it can be |
part 1 of: #42082
NOTE! This PR has been broken up into 4
#44454 [1/N][KV-Cache Layout Refactor] Refactor DSV4 KV cache config
#44455 [2/N][KV-Cache Layout Refactor] Pack K/V into the content dim across attention backends
#44456 [3/N][KV-Cache Layout Refactor] Standardize Mamba cache; drop
get_transfer_cache_regions#44458 [4/N][KV-Cache Layout Refactor] Standardize KV cache layout