Refactor paged attention dispatch to support multiple attention types #201
Merged
LxYuan0420 merged 7 commits into vllm-project:main on Mar 24, 2026
Conversation
Signed-off-by: ran <hzz5361@psu.edu>
laudney pushed a commit to mmonad/vllm-metal that referenced this pull request on Mar 23, 2026
Hybrid models like Qwen3.5 use mixed cache types (ArraysCache for
linear/SSM layers + KVCache for attention layers). BatchKVCache.offset
returns an mx.array, but the hybrid attention code uses cache.offset as a
Python int for mask slicing, causing:
ValueError: Slice indices must be integers or None.
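A minimal repro of that failure mode, with illustrative values (not taken from the actual code):

```python
# Illustrative repro of the slicing error above; the values are made up.
import mlx.core as mx

mask = mx.ones((1, 32))
offset = mx.array(8)                    # what BatchKVCache.offset returns: an mx.array
# mask[:, offset:]                      # ValueError: Slice indices must be integers or None
mask = mask[:, int(offset.item()):]     # works: convert to a Python int first
```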
Detect hybrid caches at model load time via make_prompt_cache() and
fall back to sequential decode for incompatible models.
Core detection logic lives in cache_utils.py to keep model_runner.py
minimal per vllm-project#122.
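A minimal sketch of that load-time probe, assuming mlx_lm's make_prompt_cache() and KVCache; the helper name is illustrative, not the actual cache_utils.py code:

```python
# Hypothetical sketch of hybrid-cache detection; only make_prompt_cache()
# and KVCache come from mlx_lm, the helper name is an assumption.
from mlx_lm.models.cache import KVCache, make_prompt_cache

def uses_hybrid_cache(model) -> bool:
    """True if the per-layer prompt cache mixes KVCache with other cache
    types (e.g. ArraysCache/MambaCache for linear or SSM layers)."""
    per_layer_caches = make_prompt_cache(model)
    return any(not isinstance(c, KVCache) for c in per_layer_caches)
```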
NOTE: This is an interim fix for the mlx-native (non-paged) path.
The proper solution is per-layer attention dispatching (vllm-project#201) plus a
paged linear attention kernel (roadmap vllm-project#148).
Signed-off-by: Bren Mada Bowen <bowen.bren@gmail.com>
Collaborator
Minor unused import in paged_attention.py; no unit tests for dispatch logic (only a slow integration test).
Signed-off-by: ran <hzz5361@psu.edu>
Collaborator Author
Thanks for the review! @ericcurtin Added 4 fast tests (…)
LxYuan0420 approved these changes on Mar 24, 2026
LxYuan0420 pushed a commit that referenced this pull request on Mar 30, 2026
…B) (#210)

## Summary

Allocate per-layer-type cache buffers for hybrid models (Qwen3.5) where SDPA and GDN linear attention layers coexist. This is Stage B of the Qwen3.5 roadmap (#194) and builds on the dispatch refactor (Stage A, #201).

- Unwrap `text_config` in `_extract_model_args` so Qwen3.5 dimensions are accessible
- Add `is_hybrid` detection and GDN dimensions to `_resolve_model_dims`
- Emit `FullAttentionSpec` for SDPA layers and `MambaSpec` for GDN layers in `get_kv_cache_spec`
- Fix `get_cache_block_size_bytes` to count only SDPA layers
- Add `LinearAttentionCache` with layout `[num_blocks, Hv, Dv, Dk]` per linear layer
- Add `HybridPagedAttentionBackend` that allocates both `MetalPagedKVCache` (SDPA) and `LinearAttentionCache` (GDN)
- Fail fast with `RuntimeError` when a hybrid model enables paged attention (gated until Stage C)
- Only SDPA layers are patched; linear layers keep the original mlx_lm forward

Ref: #194 (Stage B: Hybrid cache allocation)

## Cache layout

| Layer type | Cache class | Shape per layer |
|---|---|---|
| SDPA | `MetalPagedKVCache` | `[num_blocks, block_size, num_kv_heads, head_dim]` |
| Linear (GDN) | `LinearAttentionCache` | `[num_blocks, Hv, Dv, Dk]` |

Both caches use the same `num_blocks` from the scheduler's memory budget. `get_kv_cache_spec` emits `MambaSpec` for GDN layers so the scheduler groups them separately.

This PR delivers allocation infrastructure to unblock Stage C kernel work.

Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
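For orientation, a sketch of the two buffer shapes from the cache layout table above, using example dimensions; all names and values here are illustrative, not the actual backend code:

```python
# Hypothetical allocation sketch matching the Stage B cache layout table.
import mlx.core as mx

num_blocks, block_size = 1024, 16       # from the scheduler's memory budget (example)
num_kv_heads, head_dim = 8, 128         # SDPA layer dims (example values)
Hv, Dv, Dk = 32, 128, 128               # GDN linear-attention dims (example values)

# One buffer per SDPA layer (MetalPagedKVCache layout).
sdpa_cache = mx.zeros((num_blocks, block_size, num_kv_heads, head_dim), dtype=mx.float16)

# One buffer per GDN layer (LinearAttentionCache layout); the same num_blocks
# is used so both cache types share the scheduler's block accounting.
linear_cache = mx.zeros((num_blocks, Hv, Dv, Dk), dtype=mx.float16)
```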
Summary
- Split the attention implementations into per-type modules (`attention_sdpa.py`, `attention_linear.py`)
- Change dispatch from a single global assumption (`self_attn` on all layers) to per-layer lookup — required for hybrid models like Qwen3.5 where some layers use `self_attn` and others use `linear_attn`
- No new features. This is a refactor to unblock collaboration on Qwen3.5 linear attention support.
Why attribute-based detection?
We wrap mlx_lm/mlx_vlm attention modules at runtime without modifying their source. Since we don't own the model code, we detect the attention type by probing module attributes (e.g. `q_proj` + `o_proj` → SDPA, `conv1d` + no `q_proj` → linear). This should work across all known GatedDeltaNet variants (Qwen3.5).
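A minimal sketch of that probing; the function names are illustrative, not the actual attention_sdpa.py / attention_linear.py code:

```python
# Hypothetical detectors mirroring the attribute probes described above.
import mlx.nn as nn

def is_sdpa(module: nn.Module) -> bool:
    # Standard attention layers expose q_proj/o_proj projections.
    return hasattr(module, "q_proj") and hasattr(module, "o_proj")

def is_linear_attention(module: nn.Module) -> bool:
    # GatedDeltaNet-style linear attention has a conv1d but no q_proj.
    return hasattr(module, "conv1d") and not hasattr(module, "q_proj")
```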
How to add a new attention type
- Add `attention_<type>.py` with an `is_<type>()` detector + a `<type>_forward()` implementation
- Add an `elif` branch in the wrapper dispatch (`paged_attention.py`), as in the sketch below
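A sketch of the dispatch shape, reusing the detectors above; `sdpa_forward` and `dispatch_attention` are stand-ins, not the actual paged_attention.py functions:

```python
# Hypothetical dispatch wrapper; sdpa_forward is a placeholder for the real
# attention_sdpa.py forward, the detectors come from the sketch above.
def sdpa_forward(module, *args, **kwargs):
    ...  # placeholder for the paged SDPA implementation

def dispatch_attention(module, *args, **kwargs):
    if is_sdpa(module):
        return sdpa_forward(module, *args, **kwargs)
    elif is_linear_attention(module):
        # Linear attention is detected but its paged kernel lands in a later
        # stage, so fail loudly instead of computing a wrong result.
        raise NotImplementedError("paged linear attention not implemented yet")
    else:
        raise TypeError(f"unsupported attention module: {type(module).__name__}")
```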
Test plan
- Raises `NotImplementedError` on linear attention layers