
fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 hybrid models#160

Merged
waybarrios merged 2 commits into waybarrios:main from neomody77:fix/qwen35-arrayscache-batching
Mar 31, 2026

Conversation

@neomody77
Contributor

Summary

Fixes --continuous-batching mode crashing with Qwen3.5 models (hybrid Attention + Mamba/SSM architecture).

Root cause: BatchMambaCache.__init__() conditionally skips passing size to ArraysCache.__init__() when HAS_MAMBA_CACHE=True. MambaCache was removed in mlx-lm >= 0.30.6, so the import fallback aliases MambaCache to ArraysCache, and the HAS_MAMBA_CACHE flag no longer reflects which class is actually in use. Because ArraysCache.__init__() requires size as a positional argument, the skipped argument raises:

Engine loop error: ArraysCache.__init__() missing 1 required positional argument: 'size'

This error loops infinitely, making the server unresponsive.

The same issue exists in BatchMambaCache.extract(), which calls MambaCache() without size.
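The failure mode can be reproduced with simplified stand-in classes (a minimal sketch, not the actual mlx-lm implementations; class bodies and the flag value are illustrative):

```python
# Simplified stand-ins illustrating the bug, not the real mlx-lm classes.

class ArraysCache:
    def __init__(self, size):          # `size` is a required positional arg
        self.cache = [None] * size

# mlx-lm >= 0.30.6 removed MambaCache; the import fallback aliases it.
MambaCache = ArraysCache
HAS_MAMBA_CACHE = True                 # stale flag: no longer tracks reality

class BatchMambaCache(MambaCache):
    def __init__(self, size):
        if HAS_MAMBA_CACHE:
            super().__init__()         # buggy branch: `size` is skipped
        else:
            super().__init__(size)

try:
    BatchMambaCache(2)
except TypeError as err:
    print(err)  # __init__() missing 1 required positional argument: 'size'
```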

Fix

Always pass size to super().__init__() — this is safe for both ArraysCache (requires it) and legacy MambaCache (accepts it). Remove the HAS_MAMBA_CACHE conditional branches that are no longer reliable.
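On the same simplified stand-in classes, the fix looks roughly like this (a sketch of the approach, not the actual patched mlx-lm code; the extract() body here is a hypothetical simplification):

```python
# Sketch of the fix on simplified stand-in classes.

class ArraysCache:
    def __init__(self, size):
        self.cache = [None] * size

MambaCache = ArraysCache  # fallback alias in mlx-lm >= 0.30.6

class BatchMambaCache(MambaCache):
    def __init__(self, size):
        # Always pass size: required by ArraysCache,
        # accepted by the legacy MambaCache.
        super().__init__(size)

    def extract(self, index):
        # Rebuild a single-sequence cache. Previously MambaCache() was
        # called here without size and failed the same way.
        out = MambaCache(len(self.cache))
        out.cache = [
            state[index : index + 1] if state is not None else None
            for state in self.cache
        ]
        return out

batch = BatchMambaCache(2)   # no TypeError
single = batch.extract(0)
```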

Impact

  • Before: --continuous-batching with Qwen3.5 → infinite error loop, server hangs
  • After: continuous batching + prefix cache works correctly

Tested on M1 Pro 32GB with mlx-community/Qwen3.5-4B-4bit:

| Request | Before (SimpleEngine, no cache) | After (BatchedEngine + prefix cache) |
| --- | --- | --- |
| Cold | 4.9s | 1.17s |
| Warm | 4.9s (no cache) | 1.08s (cache hit) |

Context

Qwen3.5 uses a hybrid architecture where model.make_cache() returns a mix of KVCache (attention layers) and ArraysCache (Mamba/SSM layers). Caching for such hybrid models is a known upstream challenge (see ml-explore/mlx-lm#980 and QwenLM/Qwen3.5#37).

Test plan

  • BatchGenerator with Qwen3.5-4B-4bit no longer crashes
  • --continuous-batching --enable-prefix-cache serves requests correctly
  • Prefix cache provides ~2x speedup on repeated prompts
  • Tool calling (qwen3_coder parser) works with batched engine

fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 hybrid models

Qwen3.5 uses a hybrid architecture (Attention + Mamba/SSM layers), where
`model.make_cache()` returns a mix of `KVCache` and `ArraysCache` objects.

`ArraysCache.__init__()` requires a `size` parameter, but `BatchMambaCache`
conditionally skipped it when `HAS_MAMBA_CACHE=True`. Since `MambaCache`
was removed in mlx-lm >= 0.30.6 and falls back to `ArraysCache`, the
`HAS_MAMBA_CACHE` flag is unreliable.

This caused `--continuous-batching` mode to crash in an infinite error loop:
  `ArraysCache.__init__() missing 1 required positional argument: 'size'`

The fix unconditionally passes `size` to `super().__init__()`, which is
safe for both `ArraysCache` (requires it) and legacy `MambaCache`
(accepts it).

Without this fix, continuous batching and prefix caching are completely
broken for Qwen3.5 models on Apple Silicon.

Related upstream issues:
- ml-explore/mlx-lm#980 (prefix cache fails for hybrid models)
- QwenLM/Qwen3.5#37 (ArraysCache vs KVCache in hybrid arch)
@Thump604
Collaborator

Nice catch. The HAS_MAMBA_CACHE conditional branches are indeed unreliable since mlx-lm >= 0.30.6 made MambaCache just an alias for ArraysCache. The fix is clean — always passing size is correct for both code paths since MambaCache (when it existed separately) also accepted size, and ArraysCache requires it.

I've been running Qwen3.5-122B-A10B on M2 Ultra 128GB and can confirm this exact error class. Our PR #165 fixes hybrid cache handling in the MLLM batch generator (_make_batch_cache, MLLMBatch.extend(), etc.) but doesn't touch mamba_cache.py — so these two PRs are complementary, not conflicting. Both are needed for full continuous batching support on Qwen3.5 hybrid models.

One small note: the extract() fix is especially important. Without size, the extracted single-sequence MambaCache would also fail during batch operations that need to reconstitute individual caches.

+1 for merge. This is a straightforward, well-tested bugfix that unblocks continuous batching on hybrid architectures.

@waybarrios
Owner

Reviewed the change. The fix is correct: ArraysCache requires size as a positional argument, and the old HAS_MAMBA_CACHE conditional skipped it on the True branch. Since mlx-lm >= 0.30.6 removed MambaCache entirely, that flag is always False anyway, so the conditional was dead code hiding a real bug.

Always passing size is safe for both ArraysCache (requires it) and the legacy MambaCache (inherited it from ArraysCache).

One minor note: after this PR, the HAS_MAMBA_CACHE variable is still assigned in the try/except import block but never read anywhere. It could be cleaned up in a follow-up.
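The stale-flag pattern behind that leftover variable can be demonstrated with a self-contained sketch. A deliberately missing stdlib name stands in for the removed MambaCache; the real code imports from mlx-lm's cache module, so module and class names below are stand-ins:

```python
# Stand-in demonstration of the flag-goes-stale import pattern.
try:
    from collections import MambaCache  # stand-in: this name does not exist
    HAS_MAMBA_CACHE = True
except ImportError:
    # Stand-in fallback alias (the real code aliases ArraysCache here).
    from collections import OrderedDict as MambaCache
    HAS_MAMBA_CACHE = False

# The alias exists either way, so code branching on HAS_MAMBA_CACHE can
# diverge from which class MambaCache actually refers to. Once the flag is
# never read, the block can shrink to a single unconditional import.
print(HAS_MAMBA_CACHE)  # False on any current environment
```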

@waybarrios waybarrios merged commit 80d1cbf into waybarrios:main Mar 31, 2026
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Apr 1, 2026
Brings in: prompt_tokens fix (waybarrios#236), ArraysCache batching (waybarrios#160),
platform rename (waybarrios#185), mlx-lm 0.31 compat (waybarrios#183, waybarrios#227),
base64 hash fix (waybarrios#206), streaming UTF-8 detokenizer (waybarrios#109),
and cleanup commits.

Conflicts resolved:
- scheduler.py: keep make_logits_processors import (fork feature)
- mllm_scheduler.py: take upstream stop-token skip in detokenizer
- models/mllm.py: keep SHA256 hash (fork fix for collision)
- utils/tokenizer.py: merge upstream error message with fork elif chain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sysit pushed a commit to sysit/vllm-mlx that referenced this pull request Apr 1, 2026

fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 hybrid models