Fix batched decode crash for hybrid cache models (Qwen3.5) #121

laudney wants to merge 1 commit into vllm-project:main
Conversation
Thanks for the PR @laudney! Is it possible to also add support for paged KV cache?
@laudney please sign your commit to pass the DCO build
laudney force-pushed from b70fa5d to 22b4ac0
Hi @laudney, please complete the DCO requirement:
laudney force-pushed from 22b4ac0 to dab8b47
Hey @ericcurtin @ricky-chaoju, thanks for the heads up! Rebased onto latest main and added the DCO sign-off. Should be all green now. Let me know if anything else needs updating!
@laudney sadly there are conflicts now
laudney force-pushed from dab8b47 to 291dab8
@ericcurtin Merge conflict is resolved: rebased onto latest main and force-pushed. DCO check is passing. To get this merged, I believe I still need:

Let me know if there's anything else needed!
```python
# scalar cache.offset which is incompatible with BatchKVCache's
# per-element mx.array offset. Determined in load_model().
self._supports_batched_decode: bool = True
```
Fixed: the init was placed after a `return` in the `is_stt` property. Now correctly in `__init__`.
Thanks for the PR @laudney! It's been a few weeks. Clarification: paged KV cache support is non-blocking. Also noticed the two test plan items are still unchecked. How's that going?
Hybrid models like Qwen3.5 use mixed cache types (ArraysCache for
linear/SSM layers + KVCache for attention layers). BatchKVCache.offset
returns mx.array but hybrid attention code uses cache.offset as a
Python int for mask slicing, causing:
ValueError: Slice indices must be integers or None.
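A minimal sketch of the failure mode, for illustration only (assumes MLX's slicing behavior; this snippet is not part of the commit):

```python
import mlx.core as mx

mask = mx.ones((1, 16))

# Plain KVCache keeps a scalar Python int offset; int slicing works.
int_offset = 4
ok = mask[:, int_offset:]

# BatchKVCache keeps a per-sequence mx.array offset. An element of it
# is still an mx.array, which MLX rejects as a slice bound:
#   ValueError: Slice indices must be integers or None.
array_offset = mx.array([4, 7])
bad = mask[:, array_offset[0]:]
```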
Detect hybrid caches at model load time via make_prompt_cache() and
fall back to sequential decode for incompatible models.
Core detection logic lives in cache_utils.py to keep model_runner.py
minimal per vllm-project#122.
NOTE: This is an interim fix for the mlx-native (non-paged) path.
The proper solution is per-layer attention dispatching (vllm-project#201) plus a
paged linear attention kernel (roadmap vllm-project#148).
Signed-off-by: Bren Mada Bowen <bowen.bren@gmail.com>
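A rough sketch of the detection helper described above (module path taken from this PR; only mlx_lm's `make_prompt_cache` and `KVCache` are real names, the rest is hypothetical):

```python
# vllm_metal/v1/cache_utils.py (sketch)
from mlx_lm.models.cache import KVCache, make_prompt_cache


def caches_support_batched_decode(caches) -> bool:
    """True only when every per-layer cache is a plain KVCache.

    Hybrid models (e.g. Qwen3.5) mix ArraysCache and KVCache entries,
    and BatchKVCache's mx.array offset breaks their int-based mask
    slicing, so they must decode sequentially instead.
    """
    return all(isinstance(c, KVCache) for c in caches)


def supports_batched_decode(model) -> bool:
    """Probe the model's prompt cache once, at load time."""
    return caches_support_batched_decode(make_prompt_cache(model))
```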
laudney force-pushed from 291dab8 to 251f0b8
@WindChimeRan Thanks for the thorough review. Addressed all feedback:

**Crash still reproduces on current main.** Confirmed with mlx-lm 0.31.1, tested with Qwen3.5.

**Refactored: detection logic moved out of model_runner.py.** Per your suggestion and #122's direction, the core logic now lives in vllm_metal/v1/cache_utils.py as a standalone pure function (rough sketch of the gating below).

**Fixed unreachable code.** Good catch on line 673: the `self._supports_batched_decode` init sat after a `return` in the `is_stt` property and never ran; it now lives in `__init__`.

**Scoping: this is an interim fix.** I understand this PR sits within a broader effort (per-layer attention dispatching in #201 and a paged linear attention kernel on the #148 roadmap). This PR is specifically an interim workaround for the mlx-native (non-paged) path so Qwen3.5 can serve today via sequential decode. Once #201 lands and a paged linear attention kernel is available, hybrid models will be handled properly at the dispatch level and this fallback becomes unnecessary.

**Test plan items:**
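The gating sketch referenced in the refactor note above, under the same assumptions (method names are hypothetical; this is not the PR's actual diff):

```python
# model_runner.py (sketch)
from vllm_metal.v1.cache_utils import supports_batched_decode


class ModelRunner:
    def load_model(self) -> None:
        self.model = self._load_weights()  # hypothetical helper
        # One-time probe: hybrid cache models can't batch decode.
        self._supports_batched_decode = supports_batched_decode(self.model)

    def decode(self, requests):
        if self._supports_batched_decode:
            return self._decode_batched(requests)  # hypothetical
        # Interim fallback: decode hybrid cache models one at a time.
        return [self._decode_one(r) for r in requests]  # hypothetical
```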
@WindChimeRan Re paged KV cache support: this PR intentionally targets only the mlx-native (non-paged) path as an interim fix. Proper paged attention support for hybrid models like Qwen3.5 requires the per-layer attention dispatching you're building in #201 (routing SDPA layers vs GatedDeltaNet linear attention layers separately) plus a paged linear attention kernel (roadmap #148). That's the right long-term approach, and this PR doesn't try to duplicate that effort. Happy to help with #201 or the linear attention kernel if useful.
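As a loose illustration of that dispatch-level idea, a per-layer routing sketch (entirely hypothetical; it does not reflect #201's actual design, and both cache classes are placeholders):

```python
class PagedKVCache: ...          # placeholder for a paged attention cache
class LinearAttentionState: ...  # placeholder for a linear-attention state


def build_layer_caches(model):
    """Give each layer the cache type it needs, instead of gating
    batching for the whole model."""
    return [
        LinearAttentionState()
        if getattr(layer, "is_linear_attention", False)  # assumed flag
        else PagedKVCache()
        for layer in model.layers
    ]
```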
I couldn't reproduce the crash. Could you please take a look, @LxYuan0420? It seems like the problem has already been fixed by #110, but I'm not very sure.
Could you provide exact repro commands on current main (including model and mlx-lm version)? Also, the current change looks like a temp workaround rather than a proper fix; ideally we want the fix to live in the attention dispatch layer for hybrid models, rather than globally gating batching.
Closing this due to conflicts and inactivity. Feel free to open a new PR in the future.
Summary
Interim fix for the mlx-native (non-paged) path so hybrid models like Qwen3.5 can serve today.
- Hybrid models like Qwen3.5 use mixed cache types (`ArraysCache` for linear/SSM layers + `KVCache` for attention layers).
- `BatchKVCache.offset` returns `mx.array`, but hybrid attention code uses `cache.offset` as a Python `int` for mask slicing, causing `ValueError: Slice indices must be integers or None.`
- Detects hybrid caches at model load time via `make_prompt_cache()` and falls back to sequential decode for incompatible models.
- Detection logic lives in `vllm_metal/v1/cache_utils.py` (standalone pure function), keeping `model_runner.py` changes minimal per [Refactor] Refactor model runner to keep it minimal and easy to read #122.

NOTE: This is a band-aid until per-layer attention dispatching (#201) and a paged linear attention kernel (roadmap #148) land, at which point hybrid models will be handled properly at the dispatch level.
Test plan
- [ ] No `ValueError` when serving Qwen3.5 via vllm-metal
- [ ] Unit tests for cache detection: uniform `KVCache` → `True`; hybrid `{ArraysCache, KVCache}` → `False` (sketch below)
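A sketch of what those detection unit tests could look like (assumes the hypothetical `caches_support_batched_decode` helper sketched earlier; `FakeArraysCache` stands in for mlx_lm's SSM/linear-layer cache):

```python
# test_cache_utils.py (sketch)
from mlx_lm.models.cache import KVCache

from vllm_metal.v1.cache_utils import caches_support_batched_decode


class FakeArraysCache:
    """Stand-in for the linear/SSM-layer cache type."""


def test_uniform_kv_cache_supports_batched_decode():
    assert caches_support_batched_decode([KVCache(), KVCache()])


def test_hybrid_cache_falls_back_to_sequential():
    assert not caches_support_batched_decode([FakeArraysCache(), KVCache()])
```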