[Bugfix] Fix V2 model runner crash on hybrid attention models (Qwen3.5)#38081
Lidang-Jiang wants to merge 4 commits into
Code Review
This pull request introduces support for Mamba-like models by adding a new MambaSpec type for KV cache handling. The _reshape_kv_cache function has been updated to differentiate between AttentionSpec and MambaSpec for KV cache processing. A review comment suggests an optimization for calculating tensor strides within the MambaSpec handling to avoid unnecessary temporary tensor allocations and improve performance.
    dtype_size = get_dtype_size(dtype)
    num_element_per_page = kv_cache_spec.page_size_bytes // dtype_size
    target_shape = (num_blocks, *shape)
    stride = torch.empty(target_shape).stride()
The use of torch.empty(target_shape).stride() allocates a temporary tensor just to retrieve its stride. While PyTorch may optimize this for small tensors, it is generally more efficient to calculate the stride of a C-contiguous tensor directly, especially in performance-critical loops within a library like vLLM. This avoids unnecessary memory allocations and potential overhead.
    # Calculate stride for a C-contiguous tensor
    current_stride = 1
    strides = [1] * len(target_shape)
    for i in range(len(target_shape) - 1, -1, -1):
        strides[i] = current_stride
        current_stride *= target_shape[i]
    stride = tuple(strides)
Thanks for the suggestion! This pattern (torch.empty(target_shape).stride()) is intentionally kept as-is because it's directly ported from V1 model runner's _reshape_kv_cache_tensors() at gpu_model_runner.py:6629 — I wanted to maintain consistency between V1 and V2 implementations.
If we want to optimize this to avoid the temporary allocation, it would be better to update both V1 and V2 together in a follow-up PR. Happy to do that if a maintainer thinks it's worth the change.
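For readers following the stride discussion, here is a minimal pure-Python sketch of the layout arithmetic around the quoted lines. It assumes that one page holds all state tensors of a block back to back, so the leading stride becomes the page size in elements while the inner strides stay contiguous. `contiguous_strides`, `StateView`, `plan_state_views` and the example sizes are hypothetical names chosen for illustration; the actual code works on `torch` tensors and `torch.as_strided` rather than on plain tuples.

```python
from dataclasses import dataclass
from math import prod


def contiguous_strides(shape: tuple[int, ...]) -> tuple[int, ...]:
    """Element strides of a C-contiguous tensor with the given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return tuple(strides)


@dataclass(frozen=True)
class StateView:
    """Hypothetical description of one strided view into the raw page buffer."""
    size: tuple[int, ...]
    stride: tuple[int, ...]
    storage_offset: int  # counted in elements of this state's dtype


def plan_state_views(
    num_blocks: int,
    page_size_bytes: int,
    states: list[tuple[tuple[int, ...], int]],
) -> list[StateView]:
    """Plan views for `states`, a list of (per-block shape, dtype size in bytes)."""
    views = []
    offset_bytes = 0
    for shape, dtype_size in states:
        assert page_size_bytes % dtype_size == 0
        assert offset_bytes % dtype_size == 0
        target_shape = (num_blocks, *shape)
        stride = contiguous_strides(target_shape)
        # Step a whole page between blocks; keep the inner strides contiguous.
        target_stride = (page_size_bytes // dtype_size, *stride[1:])
        views.append(
            StateView(target_shape, target_stride, offset_bytes // dtype_size)
        )
        # The next state starts right after this one inside the same page.
        offset_bytes += prod(shape) * dtype_size
    assert offset_bytes <= page_size_bytes, "states do not fit in one page"
    return views


views = plan_state_views(
    num_blocks=4,
    page_size_bytes=64,
    states=[((2, 3), 4), ((4,), 2)],  # e.g. a 4-byte and a 2-byte dtype
)
```

Only the leading stride differs from the contiguous one, which is why the question of how the contiguous strides are obtained (temporary tensor or arithmetic) does not change the resulting layout in this sketch.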
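@WoosukKwon Hi, could you please add the … This PR fixes the V2 model runner crash on hybrid attention models (Qwen3.5). The root cause is that `_reshape_kv_cache()` only handled `AttentionSpec`, while Qwen3.5's linear attention layers produce `MambaSpec`. The fix ports the `MambaSpec` handling from the V1 model runner's `_reshape_kv_cache_tensors()`. Thanks!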
Review from Codex:
I also got this error:
Force-pushed from c05081c to cb7dcf1.
@WoosukKwon Thanks for the thorough review! I've addressed all 3 issues in the latest commit:
1. Incomplete port: missing hybrid layout adjustment
2. Missing test
3. FlashInfer error. Fix: wired in `prepare_kernel_block_sizes()` for virtual block splitting
The …
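The idea behind virtual block splitting can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the behavior of `prepare_kernel_block_sizes()`: it assumes the cache block size of a hybrid model may not be a power of two, that a kernel accepts only a fixed set of block sizes, and that each cache block is then addressed as several smaller kernel blocks. `pick_kernel_block_size`, `split_block_ids` and the size values are hypothetical.

```python
def pick_kernel_block_size(cache_block_size: int, supported: list[int]) -> int:
    """Return the largest supported kernel block size dividing the cache block."""
    candidates = [s for s in supported if cache_block_size % s == 0]
    if not candidates:
        raise ValueError(
            f"no supported kernel block size divides {cache_block_size}"
        )
    return max(candidates)


def split_block_ids(
    block_ids: list[int], cache_block_size: int, kernel_block_size: int
) -> list[int]:
    """Expand cache block IDs into the IDs of the kernel blocks they contain."""
    assert cache_block_size % kernel_block_size == 0
    per_block = cache_block_size // kernel_block_size
    return [b * per_block + i for b in block_ids for i in range(per_block)]


# A cache block of 48 tokens is not a power of two, so a kernel limited to
# power-of-two page sizes sees it as three 16-token kernel blocks.
kernel_size = pick_kernel_block_size(48, supported=[16, 32, 64])
kernel_ids = split_block_ids([0, 2], cache_block_size=48, kernel_block_size=kernel_size)
```

Under these assumptions the kernel only ever sees a power-of-two page size, while the cache manager keeps allocating in the larger block size.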
Hi @Lidang-Jiang, the pre-commit checks have failed. Please run:
    uv pip install "pre-commit>=4.5.1"
    pre-commit install
    pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, … Tip: Is …
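Fixed. The mypy failure was a type annotation mismatch: `dict[str, AttentionBackend]` was used where `init_attn_backend()` actually returns `dict[str, type[AttentionBackend]]`.
Will push the fix shortly.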
Force-pushed from 3ac1139 to 3df9935.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 3df9935 to 420630c.
The V2 model runner's `_reshape_kv_cache()` only handled `AttentionSpec` but Qwen3.5's linear attention (Gated DeltaNet) layers produce `MambaSpec`, causing an `AssertionError` at startup.
Port MambaSpec handling from V1 model runner's `_reshape_kv_cache_tensors()` to V2's `_reshape_kv_cache()`, using the same `torch.as_strided` approach for state tensor reshaping.
Fixes vllm-project#38041
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>

Add type: ignore[assignment] for MambaSpec branch where list[torch.Tensor] is assigned to dict[str, torch.Tensor]. This matches V1 model runner (gpu_model_runner.py:6641).
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>

…t, missing test, FlashInfer error)
Address review feedback from @WoosukKwon:
1. Port _update_hybrid_attention_mamba_layout() from V1 for correct stride adjustment when attention and Mamba layers coexist
2. Wire in prepare_kernel_block_sizes() for virtual block splitting, fixing FlashInfer "numTokensPerPage must be power of 2" error
3. Restructure init_attn_backend() into 3 phases and return kernel_block_sizes for downstream use
4. Add regression test for hybrid AttentionSpec + MambaSpec KV cache initialization with virtual block splitting
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>

Change dict[str, AttentionBackend] to dict[str, type[AttentionBackend]] in init_kv_cache() and _reshape_kv_cache() to match the actual return type of init_attn_backend().
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
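
The annotation change in the last commit comes down to the difference between a dictionary of instances and a dictionary of classes. The sketch below uses hypothetical stand-ins (`AttentionBackend` here is a toy class, and `FakeFlashBackend`, `init_backends` and `backend_names` are invented for illustration) to show why a function that returns the classes themselves should be annotated with `type[...]`.

```python
class AttentionBackend:
    """Toy stand-in for a backend base class (not the vLLM class)."""

    @staticmethod
    def get_name() -> str:
        return "base"


class FakeFlashBackend(AttentionBackend):
    """Hypothetical concrete backend."""

    @staticmethod
    def get_name() -> str:
        return "fake-flash"


def init_backends(layer_names: list[str]) -> dict[str, type[AttentionBackend]]:
    # The values are the classes themselves, never instances of them.
    return {name: FakeFlashBackend for name in layer_names}


def backend_names(backends: dict[str, type[AttentionBackend]]) -> dict[str, str]:
    # Static methods can be called directly on the class objects.
    return {layer: cls.get_name() for layer, cls in backends.items()}


backends = init_backends(["layers.0.attn"])
names = backend_names(backends)
```

If `backend_names` were annotated as taking `dict[str, AttentionBackend]`, a type checker such as mypy would be expected to report the call as incompatible, since the dictionary holds class objects rather than instances.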
Force-pushed from 420630c to 6d644c5.
Are there any recent updates on this PR?
…ntion models (Qwen3.5)
- Add MambaSpec handling to _reshape_kv_cache in attn_utils.py to fix AssertionError
- Add _update_hybrid_attention_mamba_layout for hybrid attention/Mamba models
- Add virtual block splitting via kernel_block_sizes parameter
- Update init_kv_cache to compute kernel_block_sizes via prepare_kernel_block_sizes
- Pass attn_groups to init_kv_cache in model_runner.py
- Add regression test test_v2_reshape_kv_cache_hybrid_attention_mamba
Co-authored-by: GitHub Copilot
Agent-Logs-Url: https://github.com/Nekofish-L/vllm/sessions/2ce02e3b-348c-472e-a23f-c53db4db2d96
Co-authored-by: Nekofish-L <29830327+Nekofish-L@users.noreply.github.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Closing this PR because it is now superseded by the merged upstream work in #35520 and #42766. Current … I also checked a small ModelScope model (… Thanks for the reviews and the follow-up implementation work.
Summary
- Fix V2 model runner (`VLLM_USE_V2_MODEL_RUNNER=1`) crash on hybrid attention models like Qwen3.5
- `_reshape_kv_cache()` in `attn_utils.py` only handled `AttentionSpec`, but Qwen3.5's linear attention (Gated DeltaNet) layers produce `MambaSpec`, causing `AssertionError` at startup
- Port `MambaSpec` handling from V1 model runner's `_reshape_kv_cache_tensors()` to V2's `_reshape_kv_cache()`, using the same `torch.as_strided` approach for state tensor reshaping
Fixes #38041
Before Fix (crash log)
V2 model runner crashes with AssertionError on Qwen3.5
After Fix (successful run)
V2 model runner successfully loads and serves Qwen3.5
Test plan
- … `/v1/chat/completions`
- … `gh pr list --search`)
Notes
- V1 model runner (`gpu_model_runner.py`) already handles both `AttentionSpec` and `MambaSpec` correctly. This PR aligns V2 model runner behavior with V1.
vllm/v1/worker/gpu/attn_utils.py