
Refactor paged attention dispatch to support multiple attention types #201

Merged
LxYuan0420 merged 7 commits into vllm-project:main from WindChimeRan:attention_backend_dispatch
Mar 24, 2026
Conversation

Collaborator

@WindChimeRan WindChimeRan commented Mar 22, 2026

Summary

  • Extract attention-type-specific logic from the monolithic wrapper into per-type modules (attention_sdpa.py, attention_linear.py)
  • The wrapper now dispatches based on module attributes: the paged varlen SDPA path for standard dot-product attention (MHA/GQA/MQA) and a stub for linear attention (GatedDeltaNet); this will also unblock MLA for glm-4.7-flash and DeepSeek
  • Change layer patching from single-attribute (self_attn on all layers) to per-layer lookup, which is required for hybrid models like Qwen3.5 where some layers use self_attn and others use linear_attn (see the sketch after this summary)
  • Add an xfail integration test for Qwen/Qwen3.5-0.8B with paged attention

No new features. This is a refactor to unblock collaboration on Qwen3.5 linear attention support.
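For the per-layer lookup in particular, the idea is roughly the sketch below; the helper name `find_attention_modules` and the iteration details are illustrative assumptions, not this PR's actual code:

```python
# Illustrative per-layer lookup: hybrid models expose different attribute
# names (self_attn vs. linear_attn) on different layers, so each layer is
# probed individually instead of assuming self_attn everywhere.
def find_attention_modules(model):
    for idx, layer in enumerate(model.layers):
        for attr in ("self_attn", "linear_attn"):
            module = getattr(layer, attr, None)
            if module is not None:
                yield idx, attr, module
```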

Why attribute-based detection?

We wrap mlx_lm/mlx_vlm attention modules at runtime without modifying their source. Since we don't own the model code, we detect the attention type by probing module attributes (e.g. q_proj + o_proj → SDPA; conv1d and no q_proj → linear). This should work across all known GatedDeltaNet variants (Qwen3.5).
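A minimal sketch of the two probes, assuming hypothetical detector names rather than the exact identifiers in attention_sdpa.py / attention_linear.py:

```python
# Sketch of the attribute probes described above; names are illustrative.
def is_sdpa_attention(module) -> bool:
    # Standard dot-product attention (MHA/GQA/MQA) exposes q/o projections.
    return hasattr(module, "q_proj") and hasattr(module, "o_proj")

def is_linear_attention(module) -> bool:
    # GatedDeltaNet-style linear attention has a conv1d but no q_proj.
    return hasattr(module, "conv1d") and not hasattr(module, "q_proj")
```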

How to add a new attention type

  1. Create attention_<type>.py with is_<type>() detector + <type>_forward() implementation
  2. Add one elif branch in the wrapper dispatch (paged_attention.py)
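Step 2's branch could look like the following sketch (the wrapper shape and the `*_forward` signatures are assumptions based on the convention above; the real forward presumably threads cache and metadata arguments not shown here):

```python
# Hypothetical dispatch in the wrapper (paged_attention.py), using the
# detectors from the sketch above.
def dispatch(module, *args, **kwargs):
    if is_sdpa_attention(module):
        return sdpa_forward(module, *args, **kwargs)
    elif is_linear_attention(module):
        return linear_forward(module, *args, **kwargs)
    raise NotImplementedError(f"unsupported attention module: {type(module)}")
```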

Test plan

@WindChimeRan changed the title from "Attention backend dispatch" to "Refactor paged attention dispatch to support multiple attention types" on Mar 22, 2026
@WindChimeRan WindChimeRan marked this pull request as ready for review March 23, 2026 00:50
laudney pushed a commit to mmonad/vllm-metal that referenced this pull request Mar 23, 2026
Hybrid models like Qwen3.5 use mixed cache types (ArraysCache for
linear/SSM layers + KVCache for attention layers). BatchKVCache.offset
returns mx.array but hybrid attention code uses cache.offset as a
Python int for mask slicing, causing:

    ValueError: Slice indices must be integers or None.

Detect hybrid caches at model load time via make_prompt_cache() and
fall back to sequential decode for incompatible models.

Core detection logic lives in cache_utils.py to keep model_runner.py
minimal per vllm-project#122.

NOTE: This is an interim fix for the mlx-native (non-paged) path.
The proper solution is per-layer attention dispatching (vllm-project#201) plus a
paged linear attention kernel (roadmap vllm-project#148).

Signed-off-by: Bren Mada Bowen <bowen.bren@gmail.com>
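For illustration, the load-time probe described in that commit could be as simple as the sketch below; `has_hybrid_cache` is a hypothetical name, and only `make_prompt_cache` / `KVCache` are real mlx_lm APIs:

```python
from mlx_lm.models.cache import KVCache, make_prompt_cache

def has_hybrid_cache(model) -> bool:
    # Hybrid models return a mix of cache types: non-KVCache entries
    # (e.g. ArraysCache for linear/SSM layers) appear alongside the
    # KVCache entries used by attention layers.
    caches = make_prompt_cache(model)
    return any(not isinstance(c, KVCache) for c in caches)
```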
@ericcurtin
Collaborator

Minor: there's an unused import in paged_attention.py, and there are no unit tests for the dispatch logic (only the slow integration test).

@WindChimeRan
Collaborator Author

Thanks for the review! @ericcurtin

Added 4 fast tests (test_attention_dispatch.py) that verify the detection heuristics against real mlx_lm modules: qwen3.Attention, qwen3_5.DecoderLayer (both SDPA and GatedDeltaNet layers), and qwen3.Model for find_layers. No model weights are needed, and the suite runs in ~2 s. The full dispatch path (wrapper → forward → Metal kernel) is covered by the existing integration test.
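By way of illustration, such a detection test can stay weight-free; the miniature below uses lightweight stand-ins and the hypothetical detector names from the sketch in the PR description, whereas the actual suite asserts against real qwen3 / qwen3_5 modules:

```python
# Hypothetical miniature of a detection test; test_attention_dispatch.py
# exercises real mlx_lm modules instead of stand-ins.
from types import SimpleNamespace

def test_detection_heuristics():
    sdpa = SimpleNamespace(q_proj=object(), o_proj=object())
    gdn = SimpleNamespace(conv1d=object())
    assert is_sdpa_attention(sdpa)
    assert is_linear_attention(gdn)
    assert not is_linear_attention(sdpa)
```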

@LxYuan0420 LxYuan0420 merged commit 19a19c4 into vllm-project:main Mar 24, 2026
5 checks passed
LxYuan0420 added a commit that referenced this pull request Mar 27, 2026
#214)

This PR is:
- To remove `find_layers_and_attr` (deprecated in #201, zero callers)
- To delete `TestBatchSplitting` which tests a local reimplementation
with no production counterpart (`_run_packed_prefill` was removed in
5bf9536)

---------

Signed-off-by: Yuan Lik Xun <lxyuan0420@gmail.com>
LxYuan0420 pushed a commit that referenced this pull request Mar 30, 2026
…B) (#210)

## Summary

Allocate per-layer-type cache buffers for hybrid models (Qwen3.5) where
SDPA and GDN linear attention layers coexist. This is Stage B of the
Qwen3.5 roadmap (#194) and builds on the dispatch refactor (Stage A, #201).

- Unwrap `text_config` in `_extract_model_args` so Qwen3.5 dimensions
are accessible
- Add `is_hybrid` detection and GDN dimensions to `_resolve_model_dims`
- Emit `FullAttentionSpec` for SDPA layers and `MambaSpec` for GDN
layers in `get_kv_cache_spec`
- Fix `get_cache_block_size_bytes` to count only SDPA layers
- Add `LinearAttentionCache` with layout `[num_blocks, Hv, Dv, Dk]` per
linear layer
- Add `HybridPagedAttentionBackend` that allocates both
`MetalPagedKVCache` (SDPA) and `LinearAttentionCache` (GDN)
- Fail fast with `RuntimeError` when a hybrid model enables paged
attention (gated until Stage C)
- Only SDPA layers are patched; linear layers keep their original mlx_lm forward

Ref: #194 (Stage B: Hybrid cache allocation)

## Cache layout

| Layer type | Cache class | Shape per layer |
|---|---|---|
| SDPA | `MetalPagedKVCache` | `[num_blocks, block_size, num_kv_heads, head_dim]` |
| Linear (GDN) | `LinearAttentionCache` | `[num_blocks, Hv, Dv, Dk]` |

Both caches use the same `num_blocks` from the scheduler's memory
budget. `get_kv_cache_spec` emits `MambaSpec` for GDN layers so the
scheduler groups them separately.

This PR delivers allocation infrastructure to unblock Stage C kernel
work.

---------

Signed-off-by: RickyChen / 陳昭儒 <rickychen@infinirc.com>
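As a rough sketch of the dual allocation that commit describes (constructor signatures for `MetalPagedKVCache` and `LinearAttentionCache` are assumptions, not the PR's actual API):

```python
# Hypothetical sketch of HybridPagedAttentionBackend's allocation split;
# both caches draw num_blocks from the same scheduler memory budget.
class HybridPagedAttentionBackend:
    def __init__(self, num_blocks, block_size, sdpa_layers, gdn_layers,
                 num_kv_heads, head_dim, Hv, Dv, Dk):
        # Paged KV buffers for SDPA layers:
        # [num_blocks, block_size, num_kv_heads, head_dim] per layer.
        self.kv_cache = MetalPagedKVCache(
            num_blocks, block_size, num_kv_heads, head_dim,
            num_layers=len(sdpa_layers),
        )
        # Recurrent-state buffers for GDN layers:
        # [num_blocks, Hv, Dv, Dk] per layer.
        self.linear_cache = LinearAttentionCache(
            num_blocks, Hv, Dv, Dk, num_layers=len(gdn_layers),
        )
```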