Fix batched decode crash for hybrid cache models (Qwen3.5) #121

laudney wants to merge 1 commit into vllm-project:main
Conversation
Thanks for the PR @laudney! Is it possible to also add support for paged KV cache?
@laudney please sign your commit to pass the DCO build
laudney force-pushed from b70fa5d to 22b4ac0
Hi @laudney, please complete the DCO requirement:
laudney force-pushed from 22b4ac0 to dab8b47
Hey @ericcurtin @ricky-chaoju, thanks for the heads up! Rebased onto latest main and added the DCO sign-off. Should be all green now. Let me know if anything else needs updating!
@laudney sadly there are conflicts now
laudney force-pushed from dab8b47 to 291dab8
@ericcurtin Merge conflict is resolved: rebased onto latest main and force-pushed. DCO check is passing. To get this merged, I believe I still need:

Let me know if there's anything else needed!
```python
# scalar cache.offset which is incompatible with BatchKVCache's
# per-element mx.array offset. Determined in load_model().
self._supports_batched_decode: bool = True
```
Fixed: the init was placed after a `return` in the `is_stt` property. Now correctly in `__init__`.
Thanks for the PR @laudney! It's been a few weeks. Clarification: paged KV cache support is non-blocking. Also noticed the two test plan items are still unchecked. How's that going?
Hybrid models like Qwen3.5 use mixed cache types (ArraysCache for
linear/SSM layers + KVCache for attention layers). BatchKVCache.offset
returns mx.array but hybrid attention code uses cache.offset as a
Python int for mask slicing, causing:
ValueError: Slice indices must be integers or None.
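A minimal sketch of the failure mode, for illustration only (assumes MLX's slicing behavior; this snippet is not part of the commit):

```python
import mlx.core as mx

mask = mx.ones((1, 16))

# Plain KVCache keeps a scalar Python int offset; int slicing works.
int_offset = 4
ok = mask[:, int_offset:]

# BatchKVCache keeps a per-sequence mx.array offset. An element of it
# is still an mx.array, which MLX rejects as a slice bound:
#   ValueError: Slice indices must be integers or None.
array_offset = mx.array([4, 7])
bad = mask[:, array_offset[0]:]
```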
Detect hybrid caches at model load time via make_prompt_cache() and
fall back to sequential decode for incompatible models.
Core detection logic lives in cache_utils.py to keep model_runner.py
minimal per vllm-project#122.
NOTE: This is an interim fix for the mlx-native (non-paged) path.
The proper solution is per-layer attention dispatching (vllm-project#201) plus a
paged linear attention kernel (roadmap vllm-project#148).
Signed-off-by: Bren Mada Bowen <bowen.bren@gmail.com>
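A rough sketch of the detection helper described above (module path taken from this PR; only mlx_lm's `make_prompt_cache` and `KVCache` are real names, the rest is hypothetical):

```python
# vllm_metal/v1/cache_utils.py (sketch)
from mlx_lm.models.cache import KVCache, make_prompt_cache


def caches_support_batched_decode(caches) -> bool:
    """True only when every per-layer cache is a plain KVCache.

    Hybrid models (e.g. Qwen3.5) mix ArraysCache and KVCache entries,
    and BatchKVCache's mx.array offset breaks their int-based mask
    slicing, so they must decode sequentially instead.
    """
    return all(isinstance(c, KVCache) for c in caches)


def supports_batched_decode(model) -> bool:
    """Probe the model's prompt cache once, at load time."""
    return caches_support_batched_decode(make_prompt_cache(model))
```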
laudney force-pushed from 291dab8 to 251f0b8
@WindChimeRan Thanks for the thorough review. Addressed all feedback:

**Crash still reproduces on current main.** Confirmed with mlx-lm 0.31.1, tested with Qwen3.5.

**Refactored: detection logic moved out of model_runner.py.** Per your suggestion and #122's direction, the core logic now lives in vllm_metal/v1/cache_utils.py as a standalone pure function (rough sketch of the gating below).

**Fixed unreachable code.** Good catch on line 673: the `self._supports_batched_decode` init sat after a `return` in the `is_stt` property and never ran; it now lives in `__init__`.

**Scoping: this is an interim fix.** I understand this PR sits within a broader effort (per-layer attention dispatching in #201 and a paged linear attention kernel on the #148 roadmap). This PR is specifically an interim workaround for the mlx-native (non-paged) path so Qwen3.5 can serve today via sequential decode. Once #201 lands and a paged linear attention kernel is available, hybrid models will be handled properly at the dispatch level and this fallback becomes unnecessary.

**Test plan items:**
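The gating sketch referenced in the refactor note above, under the same assumptions (method names are hypothetical; this is not the PR's actual diff):

```python
# model_runner.py (sketch)
from vllm_metal.v1.cache_utils import supports_batched_decode


class ModelRunner:
    def load_model(self) -> None:
        self.model = self._load_weights()  # hypothetical helper
        # One-time probe: hybrid cache models can't batch decode.
        self._supports_batched_decode = supports_batched_decode(self.model)

    def decode(self, requests):
        if self._supports_batched_decode:
            return self._decode_batched(requests)  # hypothetical
        # Interim fallback: decode hybrid cache models one at a time.
        return [self._decode_one(r) for r in requests]  # hypothetical
```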
@WindChimeRan Re paged KV cache support: this PR intentionally targets only the mlx-native (non-paged) path as an interim fix. Proper paged attention support for hybrid models like Qwen3.5 requires the per-layer attention dispatching you're building in #201 (routing SDPA layers vs GatedDeltaNet linear attention layers separately) plus a paged linear attention kernel (roadmap #148). That's the right long-term approach, and this PR doesn't try to duplicate that effort. Happy to help with #201 or the linear attention kernel if useful.
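As a loose illustration of that dispatch-level idea, a per-layer routing sketch (entirely hypothetical; it does not reflect #201's actual design, and both cache classes are placeholders):

```python
class PagedKVCache: ...          # placeholder for a paged attention cache
class LinearAttentionState: ...  # placeholder for a linear-attention state


def build_layer_caches(model):
    """Give each layer the cache type it needs, instead of gating
    batching for the whole model."""
    return [
        LinearAttentionState()
        if getattr(layer, "is_linear_attention", False)  # assumed flag
        else PagedKVCache()
        for layer in model.layers
    ]
```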
I couldn't reproduce the crash. Could you please take a look, @LxYuan0420? It seems like the problem has already been fixed by #110, but I'm not very sure.
Could you provide exact repro commands on current main (including model and mlx-lm version)? Also, the current change looks like a temp workaround rather than a proper fix; ideally we want the fix to live in the attention dispatch layer for hybrid models, rather than globally gating batching.
Closing this due to conflicts and inactivity. Feel free to open a new PR in the future.
Summary
Interim fix for the mlx-native (non-paged) path so hybrid models like Qwen3.5 can serve today.
- Hybrid models like Qwen3.5 use mixed cache types (`ArraysCache` for linear/SSM layers + `KVCache` for attention layers).
- `BatchKVCache.offset` returns `mx.array`, but hybrid attention code uses `cache.offset` as a Python `int` for mask slicing, causing `ValueError: Slice indices must be integers or None.`
- Detects hybrid caches at model load time via `make_prompt_cache()` and falls back to sequential decode for incompatible models.
- Detection logic lives in `vllm_metal/v1/cache_utils.py` (standalone pure function), keeping `model_runner.py` changes minimal per [Refactor] Refactor model runner to keep it minimal and easy to read #122.

NOTE: This is a band-aid until per-layer attention dispatching (#201) and a paged linear attention kernel (roadmap #148) land, at which point hybrid models will be handled properly at the dispatch level.
Test plan
- [ ] No `ValueError` when serving Qwen3.5 via vllm-metal
- [ ] Unit tests for cache detection: uniform `KVCache` → `True`; hybrid `{ArraysCache, KVCache}` → `False` (sketch below)
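A sketch of what those detection unit tests could look like (assumes the hypothetical `caches_support_batched_decode` helper sketched earlier; `FakeArraysCache` stands in for mlx_lm's SSM/linear-layer cache):

```python
# test_cache_utils.py (sketch)
from mlx_lm.models.cache import KVCache

from vllm_metal.v1.cache_utils import caches_support_batched_decode


class FakeArraysCache:
    """Stand-in for the linear/SSM-layer cache type."""


def test_uniform_kv_cache_supports_batched_decode():
    assert caches_support_batched_decode([KVCache(), KVCache()])


def test_hybrid_cache_falls_back_to_sequential():
    assert not caches_support_batched_decode([FakeArraysCache(), KVCache()])
```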