
fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 hybrid models#160

Merged
waybarrios merged 2 commits into waybarrios:main from neomody77:fix/qwen35-arrayscache-batching
Mar 31, 2026

Conversation

@neomody77
Contributor

Summary

Fixes --continuous-batching mode crashing with Qwen3.5 models (hybrid Attention + Mamba/SSM architecture).

Root cause: BatchMambaCache.__init__() conditionally skips passing size to ArraysCache.__init__() when HAS_MAMBA_CACHE=True. MambaCache was removed in mlx-lm >= 0.30.6, so the import fallback aliases MambaCache to ArraysCache, and the HAS_MAMBA_CACHE flag no longer reflects which class is actually in use. Because ArraysCache.__init__() requires size as a positional argument, the skipped argument raises:

Engine loop error: ArraysCache.__init__() missing 1 required positional argument: 'size'

This error loops infinitely, making the server unresponsive.

The same issue exists in BatchMambaCache.extract(), which calls MambaCache() without size.
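The failure mode can be reproduced with simplified stand-in classes (a minimal sketch, not the actual mlx-lm implementations; class bodies and the flag value are illustrative):

```python
# Simplified stand-ins illustrating the bug, not the real mlx-lm classes.

class ArraysCache:
    def __init__(self, size):          # `size` is a required positional arg
        self.cache = [None] * size

# mlx-lm >= 0.30.6 removed MambaCache; the import fallback aliases it.
MambaCache = ArraysCache
HAS_MAMBA_CACHE = True                 # stale flag: no longer tracks reality

class BatchMambaCache(MambaCache):
    def __init__(self, size):
        if HAS_MAMBA_CACHE:
            super().__init__()         # buggy branch: `size` is skipped
        else:
            super().__init__(size)

try:
    BatchMambaCache(2)
except TypeError as err:
    print(err)  # __init__() missing 1 required positional argument: 'size'
```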

Fix

Always pass size to super().__init__() — this is safe for both ArraysCache (requires it) and legacy MambaCache (accepts it). Remove the HAS_MAMBA_CACHE conditional branches that are no longer reliable.
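On the same simplified stand-in classes, the fix looks roughly like this (a sketch of the approach, not the actual patched mlx-lm code; the extract() body here is a hypothetical simplification):

```python
# Sketch of the fix on simplified stand-in classes.

class ArraysCache:
    def __init__(self, size):
        self.cache = [None] * size

MambaCache = ArraysCache  # fallback alias in mlx-lm >= 0.30.6

class BatchMambaCache(MambaCache):
    def __init__(self, size):
        # Always pass size: required by ArraysCache,
        # accepted by the legacy MambaCache.
        super().__init__(size)

    def extract(self, index):
        # Rebuild a single-sequence cache. Previously MambaCache() was
        # called here without size and failed the same way.
        out = MambaCache(len(self.cache))
        out.cache = [
            state[index : index + 1] if state is not None else None
            for state in self.cache
        ]
        return out

batch = BatchMambaCache(2)   # no TypeError
single = batch.extract(0)
```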

Impact

  • Before: --continuous-batching with Qwen3.5 → infinite error loop, server hangs
  • After: continuous batching + prefix cache works correctly

Tested on M1 Pro 32GB with mlx-community/Qwen3.5-4B-4bit:

| Request | Before (SimpleEngine, no cache) | After (BatchedEngine + prefix cache) |
| --- | --- | --- |
| Cold | 4.9s | 1.17s |
| Warm | 4.9s (no cache) | 1.08s (cache hit) |

Context

Qwen3.5 uses a hybrid architecture where model.make_cache() returns a mix of KVCache (attention layers) and ArraysCache (Mamba/SSM layers). Caching for such hybrid models is a known upstream challenge (see ml-explore/mlx-lm#980 and QwenLM/Qwen3.5#37).

Test plan

  • BatchGenerator with Qwen3.5-4B-4bit no longer crashes
  • --continuous-batching --enable-prefix-cache serves requests correctly
  • Prefix cache provides ~2x speedup on repeated prompts
  • Tool calling (qwen3_coder parser) works with batched engine

fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 hybrid models

Qwen3.5 uses a hybrid architecture (Attention + Mamba/SSM layers), where
`model.make_cache()` returns a mix of `KVCache` and `ArraysCache` objects.

`ArraysCache.__init__()` requires a `size` parameter, but `BatchMambaCache`
conditionally skipped it when `HAS_MAMBA_CACHE=True`. Since `MambaCache`
was removed in mlx-lm >= 0.30.6 and falls back to `ArraysCache`, the
`HAS_MAMBA_CACHE` flag is unreliable.

This caused `--continuous-batching` mode to crash in an infinite error loop:
  `ArraysCache.__init__() missing 1 required positional argument: 'size'`

The fix unconditionally passes `size` to `super().__init__()`, which is
safe for both `ArraysCache` (requires it) and legacy `MambaCache`
(accepts it).

Without this fix, continuous batching and prefix caching are completely
broken for Qwen3.5 models on Apple Silicon.

Related upstream issues:
- ml-explore/mlx-lm#980 (prefix cache fails for hybrid models)
- QwenLM/Qwen3.5#37 (ArraysCache vs KVCache in hybrid arch)
@Thump604
Collaborator

Nice catch. The HAS_MAMBA_CACHE conditional branches are indeed unreliable since mlx-lm >= 0.30.6 made MambaCache just an alias for ArraysCache. The fix is clean — always passing size is correct for both code paths since MambaCache (when it existed separately) also accepted size, and ArraysCache requires it.

I've been running Qwen3.5-122B-A10B on M2 Ultra 128GB and can confirm this exact error class. Our PR #165 fixes hybrid cache handling in the MLLM batch generator (_make_batch_cache, MLLMBatch.extend(), etc.) but doesn't touch mamba_cache.py — so these two PRs are complementary, not conflicting. Both are needed for full continuous batching support on Qwen3.5 hybrid models.

One small note: the extract() fix is especially important. Without size, the extracted single-sequence MambaCache would also fail during batch operations that need to reconstitute individual caches.

+1 for merge. This is a straightforward, well-tested bugfix that unblocks continuous batching on hybrid architectures.

@waybarrios
Owner

Reviewed the change. The fix is correct: ArraysCache requires size as a positional argument, and the old HAS_MAMBA_CACHE conditional skipped it on the True branch. Since mlx-lm >= 0.30.6 removed MambaCache entirely, that flag is always False anyway, so the conditional was dead code hiding a real bug.

Always passing size is safe for both ArraysCache (requires it) and the legacy MambaCache (inherited it from ArraysCache).

One minor note: after this PR, the HAS_MAMBA_CACHE variable is still assigned in the try/except import block but never read anywhere. It could be cleaned up in a follow-up.
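The stale-flag pattern behind that leftover variable can be demonstrated with a self-contained sketch. A deliberately missing stdlib name stands in for the removed MambaCache; the real code imports from mlx-lm's cache module, so module and class names below are stand-ins:

```python
# Stand-in demonstration of the flag-goes-stale import pattern.
try:
    from collections import MambaCache  # stand-in: this name does not exist
    HAS_MAMBA_CACHE = True
except ImportError:
    # Stand-in fallback alias (the real code aliases ArraysCache here).
    from collections import OrderedDict as MambaCache
    HAS_MAMBA_CACHE = False

# The alias exists either way, so code branching on HAS_MAMBA_CACHE can
# diverge from which class MambaCache actually refers to. Once the flag is
# never read, the block can shrink to a single unconditional import.
print(HAS_MAMBA_CACHE)  # False on any current environment
```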

@waybarrios waybarrios merged commit 80d1cbf into waybarrios:main Mar 31, 2026
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Apr 1, 2026
Brings in: prompt_tokens fix (waybarrios#236), ArraysCache batching (waybarrios#160),
platform rename (waybarrios#185), mlx-lm 0.31 compat (waybarrios#183, waybarrios#227),
base64 hash fix (waybarrios#206), streaming UTF-8 detokenizer (waybarrios#109),
and cleanup commits.

Conflicts resolved:
- scheduler.py: keep make_logits_processors import (fork feature)
- mllm_scheduler.py: take upstream stop-token skip in detokenizer
- models/mllm.py: keep SHA256 hash (fork fix for collision)
- utils/tokenizer.py: merge upstream error message with fork elif chain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sysit pushed a commit to sysit/vllm-mlx that referenced this pull request Apr 1, 2026

fix: pass size to ArraysCache in BatchMambaCache for Qwen3.5 hybrid models