fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 hybrid models (#160)
Conversation
Qwen3.5 uses a hybrid architecture (Attention + Mamba/SSM layers), where `model.make_cache()` returns a mix of `KVCache` and `ArraysCache` objects. `ArraysCache.__init__()` requires a `size` parameter, but `BatchMambaCache` conditionally skipped it when `HAS_MAMBA_CACHE=True`. Since `MambaCache` was removed in mlx-lm >= 0.30.6 and falls back to `ArraysCache`, the `HAS_MAMBA_CACHE` flag is unreliable. This caused `--continuous-batching` mode to crash in an infinite error loop: `ArraysCache.__init__() missing 1 required positional argument: 'size'`.

The fix unconditionally passes `size` to `super().__init__()`, which is safe for both `ArraysCache` (requires it) and legacy `MambaCache` (accepts it). Without this fix, continuous batching and prefix caching are completely broken for Qwen3.5 models on Apple Silicon.

Related upstream issues:
- ml-explore/mlx-lm#980 (prefix cache fails for hybrid models)
- QwenLM/Qwen3.5#37 (ArraysCache vs KVCache in hybrid arch)
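The failure mode can be reproduced with a minimal, self-contained sketch. `ArraysCache`, `MambaCache`, and `BatchMambaCache` here are simplified stand-ins whose signatures mirror the description above, not the real mlx-lm classes, and the `has_mamba_cache` parameter is illustrative:

```python
class ArraysCache:
    """Stand-in: like mlx-lm >= 0.30.6, `size` is a required positional arg."""
    def __init__(self, size):
        self.cache = [None] * size

# MambaCache was removed upstream, so a typical fallback aliases it:
MambaCache = ArraysCache

class BatchMambaCache(ArraysCache):
    """Buggy variant: skips `size` when it believes the legacy class exists."""
    def __init__(self, size, has_mamba_cache=True):
        if has_mamba_cache:          # stale flag: the legacy class is gone
            super().__init__()       # TypeError: ArraysCache needs `size`
        else:
            super().__init__(size)

try:
    BatchMambaCache(4)
except TypeError as e:
    print(e)  # ...missing 1 required positional argument: 'size'
```

In the real server this exception is raised on every scheduling step, which is what produces the infinite error loop rather than a single crash.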
Nice catch. I've been running Qwen3.5-122B-A10B on M2 Ultra 128GB and can confirm this exact error class. Our PR #165 fixes hybrid cache handling in the MLLM batch generator. +1 for merge. This is a straightforward, well-tested bugfix that unblocks continuous batching on hybrid architectures.
Reviewed the change. The fix is correct: `ArraysCache` requires `size` as a positional argument, and the old `HAS_MAMBA_CACHE` conditional was skipping it on the `True` branch. Since mlx-lm >= 0.30.6 removed `MambaCache` entirely, that flag is always `False` anyway, so the conditional was dead code hiding a real bug. Always passing `size` is safe for both `ArraysCache` (requires it) and the legacy `MambaCache` (inherited it from `ArraysCache`). One minor note: after this PR, the `HAS_MAMBA_CACHE` variable is still assigned in the try/except import block but never read anywhere; it could be cleaned up in a follow-up.
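Under the same simplification, the fix the reviewer describes amounts to forwarding `size` unconditionally. This is a hedged sketch with stand-in classes and an assumed `extract()` signature and cache layout, not the actual patch:

```python
class ArraysCache:
    """Stand-in for mlx-lm's ArraysCache: `size` is required."""
    def __init__(self, size):
        self.size = size
        self.cache = [None] * size

MambaCache = ArraysCache  # fallback alias on mlx-lm >= 0.30.6

class BatchMambaCache(ArraysCache):
    def __init__(self, size):
        # Fixed: always forward `size`. ArraysCache requires it and the
        # legacy MambaCache accepted it, so this works on both branches.
        super().__init__(size)

    def extract(self, index):
        # Hypothetical per-sequence extraction: the same fix applies here,
        # since MambaCache() without `size` raises the same TypeError.
        out = MambaCache(self.size)
        out.cache = [None if slot is None else slot[index]
                     for slot in self.cache]
        return out
```

With this shape, `BatchMambaCache(4).extract(0)` constructs cleanly whether `MambaCache` is the legacy class or the `ArraysCache` fallback.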
Brings in: prompt_tokens fix (waybarrios#236), ArraysCache batching (waybarrios#160), platform rename (waybarrios#185), mlx-lm 0.31 compat (waybarrios#183, waybarrios#227), base64 hash fix (waybarrios#206), streaming UTF-8 detokenizer (waybarrios#109), and cleanup commits. Conflicts resolved: - scheduler.py: keep make_logits_processors import (fork feature) - mllm_scheduler.py: take upstream stop-token skip in detokenizer - models/mllm.py: keep SHA256 hash (fork fix for collision) - utils/tokenizer.py: merge upstream error message with fork elif chain Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Fixes `--continuous-batching` mode crashing with Qwen3.5 models (hybrid Attention + Mamba/SSM architecture).

Root cause: `BatchMambaCache.__init__()` conditionally skips passing `size` to `ArraysCache.__init__()` when `HAS_MAMBA_CACHE=True`. Since `MambaCache` was removed in mlx-lm >= 0.30.6, the fallback assigns `ArraysCache` as `MambaCache`, but the `HAS_MAMBA_CACHE` flag becomes unreliable. `ArraysCache.__init__()` requires `size` as a positional argument, causing `ArraysCache.__init__() missing 1 required positional argument: 'size'`. This error loops infinitely, making the server unresponsive.
The same issue exists in `BatchMambaCache.extract()`, which calls `MambaCache()` without `size`.

Fix
Always pass `size` to `super().__init__()`; this is safe for both `ArraysCache` (requires it) and legacy `MambaCache` (accepts it). Remove the `HAS_MAMBA_CACHE` conditional branches, which are no longer reliable.

Impact
Without this fix, `--continuous-batching` with Qwen3.5 crashes in an infinite error loop and the server hangs.

Tested on M1 Pro 32GB with `mlx-community/Qwen3.5-4B-4bit`.

Context
Qwen3.5 uses a hybrid architecture where `model.make_cache()` returns a mix of `KVCache` (attention layers) and `ArraysCache` (Mamba/SSM layers). This is a known challenge upstream (see the related issues linked above).

Test plan
- `BatchGenerator` with Qwen3.5-4B-4bit no longer crashes
- `--continuous-batching --enable-prefix-cache` serves requests correctly
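The hybrid cache mix described in the Context section can be exercised with a toy `make_cache`. Layer types, constructor signatures, and slot counts here are illustrative assumptions, not Qwen3.5's actual configuration:

```python
class KVCache:
    """Stand-in for an attention-layer cache: no constructor arguments."""
    pass

class ArraysCache:
    """Stand-in for a Mamba/SSM-layer cache: `size` is required."""
    def __init__(self, size):
        self.cache = [None] * size

def make_cache(layer_types, ssm_slots=2):
    # Hybrid models return one cache object per layer, mixing both types,
    # so batching wrappers must dispatch on type rather than assume KVCache.
    return [KVCache() if t == "attention" else ArraysCache(ssm_slots)
            for t in layer_types]

caches = make_cache(["attention", "mamba", "attention", "mamba"])
print([type(c).__name__ for c in caches])
# ['KVCache', 'ArraysCache', 'KVCache', 'ArraysCache']
```

Any code that constructs or rebuilds these caches for a batch has to pass `size` on the `ArraysCache` arm, which is exactly the invariant this PR restores.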