fix: chunked prefill compat with mlx-lm >= 0.31.0 #169

Closed

Thump604 wants to merge 19 commits into waybarrios:main from
Thump604:fix/chunked-prefill-mlx-lm-compat

Conversation

Collaborator

@Thump604 Thump604 commented Mar 16, 2026

Fix chunked prefill compatibility with mlx-lm >= 0.31.0.

What: mlx-lm 0.31.0 changed the generate_step signature. This patch handles both old and new signatures so vllm-mlx works with mlx-lm 0.30.x and 0.31.x.

Files: engine/simple.py

Test: Start server with mlx-lm >= 0.31.0, send any chat request.
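As a hedged sketch of the dual-signature handling (the helper name and filtering approach are illustrative, not the actual engine/simple.py code), one way to support both mlx-lm signatures is to introspect generate_step and pass only the keyword arguments it accepts:

```python
import inspect

def call_generate_step(generate_step, prompt, model, **kwargs):
    """Invoke generate_step across mlx-lm versions by passing only
    the keyword arguments its current signature accepts."""
    params = inspect.signature(generate_step).parameters
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return generate_step(prompt, model, **accepted)
```

inspect.signature is cheap enough to call once at import time and cache if the hot path matters.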

Thump604 added 19 commits March 16, 2026 13:54
Comprehensive tests for MLLM continuous batching with hybrid model
caches (KVCache + ArraysCache). Covers merge, filter, extract, extend
operations on mixed cache lists, plus message normalization for
real-world client formats (OpenCode consecutive same-role messages).

Tests written first — implementation follows.
Support ArraysCache (SSM layers), RotatingKVCache, and CacheList
in the MLLM batch cache factory. Matches the pattern from mlx-lm's
native BatchGenerator. Uses type(c) is KVCache (strict identity)
to avoid catching the QuantizedKVCache subclass.
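A minimal illustration of the strict-identity check, using stub classes in place of mlx-lm's real cache types:

```python
# Stub classes standing in for mlx_lm.models.cache types.
class KVCache:
    pass

class QuantizedKVCache(KVCache):
    pass

def is_plain_kv(c):
    # isinstance() would also match the QuantizedKVCache subclass,
    # which needs different batching; type() identity is strict.
    return type(c) is KVCache
```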
Replace isinstance(sample_cache, KVCache) with a capability check
(hasattr merge). This is the actual crash point for hybrid models
like Qwen 3.5 where layer 0 is ArraysCache (GatedDeltaNet).
The merge loop is already polymorphic — each cache type's merge()
returns the correct batched representation.
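A sketch of the capability check described above (function name and iteration shape are assumptions):

```python
def can_merge(sample_cache):
    """Duck-typed check: any cache layer exposing merge() can be
    batched, whether it is KVCache, ArraysCache, or RotatingKVCache.
    An isinstance(..., KVCache) check rejects hybrid models whose
    first layer is ArraysCache (e.g. a GatedDeltaNet SSM layer)."""
    return all(hasattr(layer, "merge") for layer in sample_cache)
```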
Replace the hasattr(o, 'keys') check with empty(), which is universal
across all cache types via _BaseCache. ArraysCache uses
cache[0] is None for empty(); BatchKVCache uses keys is None.
This fixes silent skip of SSM layer caches during batch extension.
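Illustrative stub of the empty()-based selection; the stub classes mirror the emptiness conventions noted above, and the helper name is an assumption:

```python
def layers_to_extend(batch_cache):
    """Select non-empty layer caches for batch extension. empty() is
    defined on every cache type via the shared base class, whereas
    hasattr(c, 'keys') silently skipped ArraysCache (SSM) layers."""
    return [c for c in batch_cache if not c.empty()]
```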
Add _normalize_messages() preprocessing that merges consecutive
messages with the same role. Prevents chat template failures when
clients like OpenCode send system+system+user+user format that
Qwen 3.5 and other templates reject. Only merges string content;
multimodal list content is preserved as-is.
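A self-contained sketch of the merge logic (the real helper is _normalize_messages; the joining separator here is an assumption):

```python
def normalize_messages(messages):
    """Merge consecutive messages sharing a role so chat templates
    that reject system+system+user+user sequences still apply.
    Only string content is joined; multimodal list content is
    left untouched."""
    out = []
    for msg in messages:
        prev = out[-1] if out else None
        if (prev is not None and prev["role"] == msg["role"]
                and isinstance(prev.get("content"), str)
                and isinstance(msg.get("content"), str)):
            prev["content"] = prev["content"] + "\n\n" + msg["content"]
        else:
            out.append(dict(msg))
    return out
```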
OpenAI Responses API and clients like Claude Code send messages with
role "developer" instead of "system". Chat templates (Qwen 3.5, Llama,
etc.) don't recognize this role, causing template failure → raw prefill
fallback → potential crash during generation.

Add a _ROLE_MAP dict to _normalize_messages() that maps non-standard
roles before the merge logic runs. This ensures developer + system
messages also merge correctly when consecutive.

Closes waybarrios#137
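A minimal sketch of the role remapping, run before the same-role merge (any map entries beyond developer → system are unspecified here):

```python
_ROLE_MAP = {"developer": "system"}

def remap_roles(messages):
    """Map non-standard roles (OpenAI Responses API 'developer') to
    ones chat templates recognize, before the merge logic runs so
    that developer + system messages can coalesce when consecutive."""
    return [{**m, "role": _ROLE_MAP.get(m["role"], m["role"])}
            for m in messages]
```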
SimpleEngine.stream_chat() for MLLM models ran asyncio.to_thread(
run_stream) without acquiring self._generation_lock. The non-streaming
chat() path and the LLM stream_generate() path both hold the lock,
but MLLM streaming was completely unprotected.

When clients like OpenCode send concurrent streaming requests (e.g.
title generation + main prompt simultaneously), both requests would
execute self._model.stream_chat() in separate thread pool threads,
causing concurrent Metal operations and SIGSEGV/_MTLCommandBuffer
assertion crashes.

Wrap the MLLM streaming path in async with self._generation_lock to
serialize Metal access, matching the behavior of all other engine
methods.
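A condensed sketch of the locking pattern (class and method names are illustrative; the real path streams chunks rather than returning a single result):

```python
import asyncio

class Engine:
    def __init__(self, model):
        self._model = model
        self._generation_lock = asyncio.Lock()

    async def stream_chat_mllm(self, messages):
        # Previously this ran without the lock, letting two thread-pool
        # threads issue Metal ops concurrently. Holding the lock across
        # the blocking call serializes generation.
        async with self._generation_lock:
            return await asyncio.to_thread(self._model.stream_chat, messages)
```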
Fixes ruff F401 lint error.
_can_trim_cache() only checked the first cache layer's is_trimmable().
For hybrid models (Qwen 3.5 MoE, Nemotron Mamba+Attention), the
prompt_cache mixes KVCache (trimmable) and ArraysCache (not trimmable).
The first layer happened to be KVCache, so _can_trim_cache returned
True. _trim_cache then trimmed KVCache layers but silently skipped
ArraysCache layers, leaving KV and SSM/MoE state inconsistent.

Fix: check ALL cache layers, not just the first. Also add is_trimmable
guard in _trim_cache to skip non-trimmable caches explicitly.

Relates to waybarrios#145
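Sketched fix, with method names matching the description above (the surrounding engine code is elided):

```python
def can_trim_cache(prompt_cache):
    # Every layer must be trimmable. Checking only prompt_cache[0]
    # passes for hybrid models whose first layer is KVCache, and the
    # resulting partial trim desynchronizes KV vs SSM/MoE state.
    return all(c.is_trimmable() for c in prompt_cache)

def trim_cache(prompt_cache, n):
    for c in prompt_cache:
        if c.is_trimmable():  # explicit guard; skip ArraysCache layers
            c.trim(n)
```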
…s#142, waybarrios#136)

Add _is_kv_layer() to classify positional (KVCache) vs non-positional
(ArraysCache) cache layers. _extract_block_tensor_slice() now skips
non-KV layers instead of crashing with 'Too many indices for array
with 3 dimensions'.

NonKVCacheData dataclass added for storing non-positional state
(used by subsequent commits for full hybrid reconstruction).
store_cache() separates KV (block-sliced) and non-KV (stored whole)
layers. reconstruct_cache() rebuilds both: KV via block concatenation,
non-KV via from_state(). If non-KV states are missing for a hybrid
model, returns None to force safe recomputation.
… robustness

fetch_cache() now rejects partial prefix matches when non-KV states
are present but don't match the candidate block set. release_cache()
and clear() clean up non-KV state. scheduler fallback guards against
non-4D tensors in cache state reconstruction.
- fork_cache() now copies has_non_kv from source entry
- reconstruct_cache() uses dict lookup instead of list.index()
  in the per-layer loop (O(n) → O(1) per iteration)
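The lookup change can be illustrated as (identifier names are assumptions):

```python
# Before: candidate_ids.index(block_id) inside the per-layer loop,
# an O(n) scan on every lookup.
# After: build the position map once, then O(1) dict lookups.
def build_block_index(candidate_ids):
    return {block_id: pos for pos, block_id in enumerate(candidate_ids)}
```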
mlx-lm 0.31.0 (PR ml-explore/mlx-lm#911) added prompt_checkpoints as
a 7th element to BatchGenerator.unprocessed_prompts tuples. The chunked
prefill code in _install_chunked_prefill hardcoded a 6-element zip
unpacking which crashed with ValueError.

Fix: use *_extra catch-all in the zip unpacking so extra fields from
mlx-lm are captured but ignored. This is forward-compatible with any
future tuple additions.

The "small prompt" path already works because it passes tuples directly
to mlx-lm's own _process_prompts which handles 7 elements. Only the
chunked prefill path (total_tokens > budget) did its own unpacking.

Closes waybarrios#155
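A runnable sketch of the tolerant unpacking (field names are illustrative; only the tuple arity matters):

```python
def iter_unprocessed(unprocessed_prompts):
    """Unpack BatchGenerator.unprocessed_prompts tuples across mlx-lm
    versions. The *_extra catch-all absorbs prompt_checkpoints
    (added in 0.31.x) and any future trailing fields, instead of
    crashing with ValueError on a hardcoded 6-element unpack."""
    for uid, prompt, step, max_tokens, sampler, logits_proc, *_extra in unprocessed_prompts:
        yield uid, prompt, step, max_tokens, sampler, logits_proc
```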
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 21, 2026
…nizer for Unicode (waybarrios#130)

Two fixes in scheduler.py:

1. _chunked_next tuple unpack (issue waybarrios#178): mlx-lm 0.31.x added
   prompt_checkpoints as a 7th tuple element. _chunked_next only
   unpacked 6, crashing when prefix cache triggers chunked prefill.
   Same class of bug as PR waybarrios#169 but in a different code path.

2. NaiveStreamingDetokenizer for BatchedEngine (issue waybarrios#130): Raw
   tokenizer.decode([token]) splits multi-byte codepoints (emoji,
   CJK) into surrogate pairs in streaming output. Replace with
   NaiveStreamingDetokenizer that buffers incomplete UTF-8 byte
   sequences and only emits valid segments. Matches the fix applied
   to mllm_scheduler.py in commit d2ea97c.
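A sketch of the buffering idea behind NaiveStreamingDetokenizer (this shows the general decode-and-hold-back pattern, not mlx-lm's actual implementation):

```python
def stream_decode(tokenizer, token_ids):
    """Decode incrementally without splitting multi-byte codepoints.
    A trailing U+FFFD replacement char marks an incomplete UTF-8
    sequence, so that text is held back until more tokens arrive;
    only the newly completed suffix is emitted each step."""
    emitted = ""
    ids = []
    for t in token_ids:
        ids.append(t)
        text = tokenizer.decode(ids)
        if text.endswith("\ufffd"):
            continue  # incomplete byte sequence; wait for more tokens
        yield text[len(emitted):]
        emitted = text
```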
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 22, 2026
…nizer for Unicode (waybarrios#130)
@Thump604
Collaborator Author

Superseded — the chunked prefill tuple unpack fix is now in PR #194 (standalone, rebased against current main). The hybrid cache and prefix cache changes are in PR #165.

@Thump604 Thump604 closed this Mar 22, 2026