
Fix batched decode crash for hybrid cache models (Qwen3.5) #121

Closed

laudney wants to merge 1 commit into vllm-project:main from mmonad:fix/hybrid-model-batched-decode

Conversation

laudney commented Feb 27, 2026

Summary

Interim fix for the mlx-native (non-paged) path so hybrid models like Qwen3.5 can serve today.

  • Hybrid models (e.g. Qwen3.5-35B-A3B) use mixed cache types (ArraysCache for linear/SSM layers + KVCache for attention layers). BatchKVCache.offset returns an mx.array, but hybrid attention code uses cache.offset as a Python int for mask slicing, causing ValueError: Slice indices must be integers or None.
  • Detects hybrid caches at model load time via make_prompt_cache() and falls back to sequential decode for incompatible models (see the sketch below).
  • Core detection logic lives in vllm_metal/v1/cache_utils.py as a standalone pure function, keeping model_runner.py changes minimal per #122 ([Refactor] Refactor model runner to keep it minimal and easy to read).

NOTE: This is a band-aid until per-layer attention dispatching (#201) and a paged linear attention kernel (roadmap #148) land, at which point hybrid models will be handled properly at the dispatch level.
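
For reference, a minimal sketch of the detection approach (the function name is an assumption, not the exact code in this PR; make_prompt_cache() and KVCache are the mlx-lm APIs named above):

# Sketch only: hypothetical function name; the actual code lives in
# vllm_metal/v1/cache_utils.py.
from mlx_lm.models.cache import KVCache, make_prompt_cache

def supports_batched_decode(model) -> bool:
    """Return True only if every per-layer cache is a plain KVCache.

    Hybrid models (e.g. Qwen3.5) mix ArraysCache (linear/SSM layers)
    with KVCache (attention layers), so any non-KVCache entry means
    batched decode cannot be used safely.
    """
    caches = make_prompt_cache(model)
    return all(type(c) is KVCache for c in caches)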

Test plan

  • Verified fix resolves ValueError when serving Qwen3.5 via vllm-metal
  • Verified standard (non-hybrid) models still use batched decode path (Qwen3-0.6B: all KVCache → True)
  • Verified Qwen3.5-35B-A3B hybrid model falls back gracefully ({ArraysCache, KVCache} → False)

@WindChimeRan (Collaborator)

Thanks for the PR @laudney

Is it possible to also add support for the paged KV cache?

VLLM_METAL_USE_PAGED_ATTENTION=1 VLLM_METAL_MEMORY_FRACTION=0.7

@ericcurtin (Collaborator)

@laudney please sign your commit to pass the DCO build

laudney force-pushed the fix/hybrid-model-batched-decode branch from b70fa5d to 22b4ac0 (March 6, 2026 13:15)
@ricky-chaoju (Contributor)

laudney commented Mar 12, 2026

Hey @ericcurtin @ricky-chaoju — thanks for the heads up! Rebased onto latest main and added the DCO sign-off. Should be all green now. Let me know if anything else needs updating!

@ericcurtin (Collaborator)

@laudney sadly there are conflicts now

laudney force-pushed the fix/hybrid-model-batched-decode branch from dab8b47 to 291dab8 (March 21, 2026 18:15)
laudney commented Mar 21, 2026

@ericcurtin Merge conflict is resolved — rebased onto latest main and force-pushed. DCO check is passing.

To get this merged, I believe I still need:

  1. Code review approval from a maintainer
  2. Paged KV cache support: @WindChimeRan asked about compatibility with VLLM_METAL_USE_PAGED_ATTENTION=1. I'll look into this as a follow-up unless it's a blocker for this PR.

Let me know if there's anything else needed!

# scalar cache.offset which is incompatible with BatchKVCache's
# per-element mx.array offset. Determined in load_model().
self._supports_batched_decode: bool = True

Collaborator

unreachable code?

Author

Fixed — the init was placed after a return in the is_stt property. Now correctly in __init__.

WindChimeRan commented Mar 21, 2026

Thanks for the PR @laudney ! It's been a few weeks and main has changed significantly. Could you check if the original crash still reproduces on current main? If it does, I think the fix should be more surgical and ideally live outside model_runner.py in a utils file.

Clarification: Paged KV cache support is non-blocking.

Also noticed the two test plan items are still unchecked. How's that going?

Hybrid models like Qwen3.5 use mixed cache types (ArraysCache for
linear/SSM layers + KVCache for attention layers). BatchKVCache.offset
returns mx.array but hybrid attention code uses cache.offset as a
Python int for mask slicing, causing:

    ValueError: Slice indices must be integers or None.

Detect hybrid caches at model load time via make_prompt_cache() and
fall back to sequential decode for incompatible models.

Core detection logic lives in cache_utils.py to keep model_runner.py
minimal per vllm-project#122.

NOTE: This is an interim fix for the mlx-native (non-paged) path.
The proper solution is per-layer attention dispatching (vllm-project#201) plus a
paged linear attention kernel (roadmap vllm-project#148).

Signed-off-by: Bren Mada Bowen <bowen.bren@gmail.com>
laudney force-pushed the fix/hybrid-model-batched-decode branch from 291dab8 to 251f0b8 (March 23, 2026 10:00)
laudney commented Mar 23, 2026

@WindChimeRan Thanks for the thorough review. Addressed all feedback:

Crash still reproduces on current main

Confirmed with mlx-lm 0.31.1:

  • KVCache.offset → int (e.g. 4)
  • BatchKVCache.offset → mx.array (e.g. array([4, 4]))
  • Hybrid model attention code does mask[..., :cache.offset] → ValueError: Slice indices must be integers or None.

Tested with mlx-community/Qwen3.5-35B-A3B-4bit: cache is {ArraysCache, KVCache} → detection correctly returns False.
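
A minimal sketch of that failure mode (illustrative shapes; the offsets mirror the values above, not the actual model code):

import mlx.core as mx

mask = mx.zeros((1, 1, 8, 8))

_ = mask[..., :4]              # KVCache-style int offset: slices fine
offset = mx.array([4, 4])      # BatchKVCache-style per-sequence offset
_ = mask[..., :offset]         # ValueError: Slice indices must be integers or None.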

Refactored: detection logic moved out of model_runner.py

Per your suggestion and #122's direction, the core logic now lives in vllm_metal/v1/cache_utils.py as a standalone pure function. model_runner.py only has a thin 7-line wrapper. Net change to model_runner.py is minimal (+22/-1).
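
Roughly, the split looks like this (a sketch under assumed names; only self._supports_batched_decode is confirmed by the diff above):

# model_runner.py: thin wrapper, all detection logic in cache_utils.
from vllm_metal.v1.cache_utils import supports_batched_decode  # hypothetical name

class ModelRunner:
    def load_model(self) -> None:
        ...
        # Hybrid caches (e.g. ArraysCache + KVCache) disable batched decode.
        self._supports_batched_decode = supports_batched_decode(self.model)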

Fixed unreachable code

Good catch on line 673 — the _supports_batched_decode init was after a return in the is_stt property. Now correctly in __init__.

Scoping: this is an interim fix

I understand this PR sits within a broader effort:

This PR is specifically an interim workaround for the mlx-native (non-paged) path so Qwen3.5 can serve today via sequential decode. Once #201 lands and a paged linear attention kernel is available, hybrid models will be handled properly at the dispatch level and this fallback becomes unnecessary.

Test plan items

  • Verified fix resolves ValueError when serving Qwen3.5 via vllm-metal
  • Verified standard (non-hybrid) models still use batched decode path — tested Qwen3-0.6B (all KVCache → supports_batched_decode = True)
  • Verified Qwen3.5-35B-A3B hybrid model falls back gracefully ({ArraysCache, KVCache} → supports_batched_decode = False)

laudney commented Mar 23, 2026

@WindChimeRan Re: paged KV cache support — this PR intentionally targets only the mlx-native (non-paged) path as an interim fix.

Proper paged attention support for hybrid models like Qwen3.5 requires the per-layer attention dispatching you're building in #201 (routing SDPA layers vs GatedDeltaNet linear attention layers separately) plus a paged linear attention kernel (roadmap #148). That's the right long-term approach and this PR doesn't try to duplicate that effort.

Happy to help with #201 or the linear attention kernel if useful.

WindChimeRan commented Mar 23, 2026

I couldn't reproduce the crash with Qwen3.5-35B-A3B on current main (M2 Max, 64GB).

Could you please take a look, @LxYuan0420? It seems the problem may have already been fixed by #110, but I'm not sure.

@LxYuan0420 (Collaborator)

Could you provide exact repro commands on current main (including model, mlx-lm version)?

Also, the current change looks like a temporary workaround rather than a proper fix. Ideally the fix should live in the attention dispatch layer for hybrid models, rather than globally gating batching.

@ericcurtin (Collaborator)

Closing this due to conflicts and inactivity. Feel free to open a new PR in the future.

ericcurtin closed this Apr 5, 2026