[Hybrid] Map multiple FullAttn layers to a single page#35703
peakcrosser7 wants to merge 40 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a mechanism to group multiple FullAttention layers to share a single KV cache page, which is particularly useful for hybrid Mamba models. The changes are extensive, touching configuration, argument parsing, KV cache utilities, and the GPU model runner. While the overall direction is sound, I've identified a critical bug in the GPU model runner related to incorrect tensor stride calculations that could lead to memory corruption. I've also noted a minor logging inaccuracy in the KV cache utility.
```diff
 kv_cache.as_strided_(
     size=kv_cache.shape,
-    stride=(hidden_size, 2 * hidden_size, *kv_cache.stride()[2:]),
+    stride=(
+        hidden_size,
+        2 * hidden_size * attn_group_size,
+        *kv_cache.stride()[2:],
+    ),
 )
```
The as_strided_ call appears to be incorrect. The kv_cache tensor has a shape of (2, num_blocks, ...) for some attention backends, but the stride tuple (hidden_size, 2 * hidden_size * attn_group_size, ...) seems to be calculated for a tensor with shape (num_blocks, 2, ...). This mismatch between the tensor's shape and the provided strides can lead to incorrect memory access, data corruption in the KV cache, and ultimately incorrect model outputs. This is a critical issue that needs to be addressed.
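To make the failure mode concrete, here is a toy sketch (shapes, sizes, and variable names are illustrative, not vLLM's real layout) showing that `as_strided_` happily accepts a stride tuple computed for a differently ordered shape and silently produces a scrambled view instead of raising an error:

```python
import torch

# Toy dimensions: a buffer holding attn_group_size packed layers per block,
# each layer contributing a K plane and a V plane of hidden_size elements.
hidden_size, attn_group_size, num_blocks = 4, 2, 3
buf = torch.arange(num_blocks * 2 * hidden_size * attn_group_size)

# Strides that match a (2, num_blocks, hidden_size) view: the K/V dim advances
# by hidden_size, and the block dim skips one whole packed page.
stride = (hidden_size, 2 * hidden_size * attn_group_size, 1)
ok = buf.as_strided(size=(2, num_blocks, hidden_size), stride=stride)

# Applying the same strides to a (num_blocks, 2, hidden_size) shape does not
# error out -- it just reads the wrong elements for every block after the first.
bad = buf.as_strided(size=(num_blocks, 2, hidden_size), stride=stride)
assert not torch.equal(ok.permute(1, 0, 2), bad)
```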
@Etelis are there any places that can work automatically if we enable this mode but keep page_size_bytes as its previous definition? @peakcrosser7 has already updated gpu_model_runner, and I think KVConnectors should be aware of this mode when trying to support this model.
@peakcrosser7 Can you verify if my understanding below is correct?
```python
try:
    kv_cache_stride_order = backend.get_kv_cache_stride_order()
```
@heheda12345 @peakcrosser7 I think it's better to use backend.get_kv_cache_stride_order(include_num_layers_dimension=True) to pack the layers more efficiently for KV connectors.
Specifically, this will allow a (num_heads, num_layers, ...) page layout instead of the current proposed (num_layers, ...) page layout.
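For reference, a tiny sketch (dimension names only, mirroring the FlashAttention stride orders quoted later in this thread; the logical ordering with a prepended layers dimension is an assumption) of how a stride order permutes the logical KV-cache shape into a physical page layout, and how nesting `num_layers` under `num_kv_heads` yields the (num_heads, num_layers, ...) page layout described above:

```python
# Assumed logical shape once a layers dimension is prepended.
logical = ("num_layers", "2", "num_blocks", "block_size", "num_kv_heads", "head_size")

def physical(order: tuple[int, ...]) -> list[str]:
    # A stride order is the permutation from the logical shape to the memory layout.
    return [logical[i] for i in order]

# Keeping num_layers outermost within a block -> (num_layers, ...) pages:
print(physical((2, 0, 1, 3, 4, 5)))  # num_blocks, num_layers, 2, block_size, ...

# Nesting num_layers under num_kv_heads -> (num_heads, num_layers, ...) pages,
# so one head's data for all packed layers sits contiguously within the page:
print(physical((2, 4, 0, 1, 3, 5)))  # num_blocks, num_kv_heads, num_layers, 2, ...
```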
Thanks for the suggestion! I’ll look into the include_num_layers_dimension parameter and see how to adjust the code accordingly.
vllm/vllm/v1/worker/kv_connector_model_runner_mixin.py, lines 260 to 291 in ca1954d
@orozery Is the logic similar to this? We could add a num_layers dimension to the KV-Cache shape, where num_layers equals pack_size for packed Attention layers. When assigning to a specific layer, we would then index into the corresponding slice based on its attn_pack_idx. Is my understanding correct?
```python
# shape [num_layers=pack_size, 2, num_blocks, block_size, num_heads, head_size]
tensor = torch.as_strided(
    raw_tensor.view(dtype),
    size=kv_cache_shape,
    stride=kv_cache_stride,
    storage_offset=storage_offset,
).permute(*inv_order)
# shape [2, num_blocks, block_size, num_heads, head_size]
kv_caches[layer_name] = tensor[attn_pack_idx]
```
@heheda12345 Yes — almost everything works automatically if
Everything else — and particularly all KV connector/offloading code ( Keeping
Hi @orozery, thanks for the review! I believe there might be a slight misunderstanding regarding the implementation. Let me clarify the logic with two key points:
I hope this clears up your confusion! Let me know if you have further questions.
Hi @Etelis, thanks for your reply!
Thanks! This is great :)
This pull request has merge conflicts that must be resolved before it can be merged.
@orozery Yes, you're right!
We tested Qwen3.5-35B-A3B with TP2 and found that when mamba-num-attn-pages=16, sending duplicate requests resulted in a normal cache hit rate. However, when mamba-num-attn-pages=8, sending a duplicate request resulted in a prefix cache hit rate of only 1%. What could be the reason for this?
Hi @xhdidi, thanks for your feedback!
Hi @orozery, I've looked into using include_num_layers_dimension as you suggested.

Current Observations: It appears that the KV-Cache layout returned when include_num_layers_dimension=True is different from the one used today. Taking FlashAttentionBackend as an example:

```python
class FlashAttentionBackend(AttentionBackend):
    @staticmethod
    def get_kv_cache_shape(
        num_blocks: int,
        block_size: int,
        num_kv_heads: int,
        head_size: int,
        cache_dtype_str: str = "auto",
    ) -> tuple[int, ...]:
        if block_size % 16 != 0:
            raise ValueError("Block size must be a multiple of 16.")
        return (2, num_blocks, block_size, num_kv_heads, head_size)

    @staticmethod
    def get_kv_cache_stride_order(
        include_num_layers_dimension: bool = False,
    ) -> tuple[int, ...]:
        # `stride_order` indicates the permutation that gets
        # us from `get_kv_cache_shape` to the actual memory layout we want.
        cache_layout = get_kv_cache_layout()
        if cache_layout == "NHD" and include_num_layers_dimension:
            # (num_blocks, num_layers, 2, block_size, num_kv_heads, head_size)
            return (2, 0, 1, 3, 4, 5)
        elif cache_layout == "NHD":
            # (2, num_blocks, block_size, num_kv_heads, head_size)
            stride_order = (0, 1, 2, 3, 4)
        elif cache_layout == "HND" and include_num_layers_dimension:
            # (num_blocks, num_kv_heads, num_layers, 2, block_size, head_size)
            return (2, 4, 0, 1, 3, 5)
        elif cache_layout == "HND":
            # (2, num_blocks, num_kv_heads, block_size, head_size)
            stride_order = (0, 1, 3, 2, 4)
        else:
            raise ValueError(f"Unknown cache layout format {cache_layout}.")
        return stride_order
```
Proposed Alternative:

```python
# (2, num_blocks, block_size, num_kv_heads, head_size)
kv_cache_shape = attn_backend.get_kv_cache_shape(
    kernel_num_blocks,
    kernel_block_size,
    kv_cache_spec.num_kv_heads,
    kv_cache_spec.head_size,
    cache_dtype_str=self.cache_config.cache_dtype,
)
try:
    kv_cache_stride_order = attn_backend.get_kv_cache_stride_order()
    assert len(kv_cache_stride_order) == len(kv_cache_shape)
except (AttributeError, NotImplementedError):
    kv_cache_stride_order = tuple(range(len(kv_cache_shape)))
# (2, num_blocks, num_kv_heads, block_size, head_size)
kv_cache_shape = tuple(
    kv_cache_shape[i] for i in kv_cache_stride_order
)
# Add num_layers dimension (default=1)
# (num_layers=1, 2, num_blocks, num_kv_heads, block_size, head_size)
kv_cache_shape = (1,) + kv_cache_shape
kv_cache_stride = tuple(torch.empty(kv_cache_shape).stride())
num_layers_dim = 0
# Maintain original KV shape view.
inv_order = [
    kv_cache_stride_order.index(i)
    for i in range(len(kv_cache_stride_order))
]
if has_mamba:
    # Set num_layers=attn_pack_size and adjust strides.
    # Returns new strides and the current Attention layer index.
    # (num_layers=attn_pack_size, 2, num_blocks, num_kv_heads, block_size, head_size)
    kv_cache_stride, layer_index = (
        mamba_utils.get_hybrid_attention_mamba_layout(...)
    )
kv_caches[layer_name] = torch.as_strided(
    raw_tensor.view(dtype),
    size=kv_cache_shape,
    stride=kv_cache_stride,
)[layer_index].permute(*inv_order)
```

What are your thoughts on this issue and the proposed design?
This was actually the reason I wanted to use
@orozery Thanks for the response! I agree with your point. While enabling

@heheda12345, what's your take? Do you think it's necessary to add a
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
As Mamba models grow larger, their Mamba states are becoming increasingly large, which in turn drives up the `block_size`. Since input tokens are hashed at the granularity of `block_size`, an excessively large `block_size` is detrimental to prefix-caching hit rates.

Take Qwen3.5-397B-A17B-FP8 as an example: when deployed with TP2, the `block_size` can reach 2096, meaning only one hash is generated per 2096 tokens. Both caching and cache-hit lookups operate at this 2096-token granularity, which is highly unfavorable for prefix cache hits.

This PR reduces the effective `block_size` by packing multiple FullAttn layers onto a single page. Under the previous logic, a page (or block) was allocated to either one Mamba layer (for storing states) or one FullAttn layer (for storing KV-Cache). This PR introduces support for assigning a single page to N FullAttn layers, where each FullAttn layer uses 1/N of the page's space, thereby reducing the `block_size` to 1/N of its original size (consequently, Mamba states will occupy more blocks).

Usage: Add the `--attn-pack-size N` argument to the engine startup parameters.
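A rough back-of-the-envelope sketch of the intended effect (the helper and the per-token byte figure below are made up for illustration; only the 2096-token figure comes from the example above):

```python
def effective_block_size(page_size_bytes: int,
                         per_token_kv_bytes: int,
                         attn_pack_size: int) -> int:
    # Each of the attn_pack_size FullAttn layers sharing a page gets
    # 1/attn_pack_size of its bytes, so the token capacity of a block shrinks.
    return page_size_bytes // (per_token_kv_bytes * attn_pack_size)

# Assuming ~1 KiB of KV per token per layer (illustrative only):
page_size_bytes = 2096 * 1024
print(effective_block_size(page_size_bytes, 1024, 1))  # 2096
print(effective_block_size(page_size_bytes, 1024, 4))  # 524 -> finer-grained prefix hashing
```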
Test Plan

- Added unit tests in `test_kv_cache_utils.py`: `test_merge_attn_layers_into_pack`, `test_split_attn_layers_from_pack`, and `test_get_kv_cache_configs_with_mamba`.
- Added `tests/v1/worker/test_hybrid_kv_cache_layout.py` to ensure the Attention KV-Cache layout remains consistent when replacing `_update_hybrid_attention_mamba_layout()` with `_get_hybrid_attention_mamba_layout()`.
- Added `test_hybrid_attention_mamba_kv_cache_pack_size` in `test_gpu_model_runner.py` to verify the end-to-end logic of `get_kv_cache_configs()` and `initialize_kv_cache()` with the new Attention layer packing enabled.
- Added the `--attn-pack-size 4` argument to `tests/evals/gsm8k/configs/Qwen3-Next-FP8-EP2.yaml` for local accuracy validation (the `--moe-backend=flashinfer_trtllm` option was commented out as I am using H20 GPUs).

Test Result
Full Test Logs:
eval_pack4.log