fix: chunked prefill compat with mlx-lm >= 0.31.0 #169

Closed

Thump604 wants to merge 19 commits into waybarrios:main from
Thump604:fix/chunked-prefill-mlx-lm-compat

Conversation

Collaborator

@Thump604 Thump604 commented Mar 16, 2026

Fix chunked prefill compatibility with mlx-lm >= 0.31.0.

What: mlx-lm 0.31.0 changed the generate_step signature. This patch handles both old and new signatures so vllm-mlx works with mlx-lm 0.30.x and 0.31.x.

Files: engine/simple.py

Test: Start server with mlx-lm >= 0.31.0, send any chat request.
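As a hedged sketch of the dual-signature handling (the helper name and filtering approach are illustrative, not the actual engine/simple.py code), one way to support both mlx-lm signatures is to introspect generate_step and pass only the keyword arguments it accepts:

```python
import inspect

def call_generate_step(generate_step, prompt, model, **kwargs):
    """Invoke generate_step across mlx-lm versions by passing only
    the keyword arguments its current signature accepts."""
    params = inspect.signature(generate_step).parameters
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return generate_step(prompt, model, **accepted)
```

inspect.signature is cheap enough to call once at import time and cache if the hot path matters.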

Thump604 added 19 commits March 16, 2026 13:54
Comprehensive tests for MLLM continuous batching with hybrid model
caches (KVCache + ArraysCache). Covers merge, filter, extract, extend
operations on mixed cache lists, plus message normalization for
real-world client formats (OpenCode consecutive same-role messages).

Tests written first — implementation follows.
Support ArraysCache (SSM layers), RotatingKVCache, and CacheList
in the MLLM batch cache factory. Matches the pattern from mlx-lm's
native BatchGenerator. Uses type(c) is KVCache (strict identity)
to avoid catching the QuantizedKVCache subclass.
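A minimal illustration of the strict-identity check, using stub classes in place of mlx-lm's real cache types:

```python
# Stub classes standing in for mlx_lm.models.cache types.
class KVCache:
    pass

class QuantizedKVCache(KVCache):
    pass

def is_plain_kv(c):
    # isinstance() would also match the QuantizedKVCache subclass,
    # which needs different batching; type() identity is strict.
    return type(c) is KVCache
```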
Replace isinstance(sample_cache, KVCache) with a capability check
(hasattr merge). This is the actual crash point for hybrid models
like Qwen 3.5 where layer 0 is ArraysCache (GatedDeltaNet).
The merge loop is already polymorphic — each cache type's merge()
returns the correct batched representation.
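A sketch of the capability check described above (function name and iteration shape are assumptions):

```python
def can_merge(sample_cache):
    """Duck-typed check: any cache layer exposing merge() can be
    batched, whether it is KVCache, ArraysCache, or RotatingKVCache.
    An isinstance(..., KVCache) check rejects hybrid models whose
    first layer is ArraysCache (e.g. a GatedDeltaNet SSM layer)."""
    return all(hasattr(layer, "merge") for layer in sample_cache)
```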
Replace the hasattr(o, 'keys') check with empty(), which is universal
across all cache types via _BaseCache. ArraysCache uses
cache[0] is None for empty(); BatchKVCache uses keys is None.
This fixes silent skip of SSM layer caches during batch extension.
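Illustrative stub of the empty()-based selection; the stub classes mirror the emptiness conventions noted above, and the helper name is an assumption:

```python
def layers_to_extend(batch_cache):
    """Select non-empty layer caches for batch extension. empty() is
    defined on every cache type via the shared base class, whereas
    hasattr(c, 'keys') silently skipped ArraysCache (SSM) layers."""
    return [c for c in batch_cache if not c.empty()]
```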
Add _normalize_messages() preprocessing that merges consecutive
messages with the same role. Prevents chat template failures when
clients like OpenCode send system+system+user+user format that
Qwen 3.5 and other templates reject. Only merges string content;
multimodal list content is preserved as-is.
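A self-contained sketch of the merge logic (the real helper is _normalize_messages; the joining separator here is an assumption):

```python
def normalize_messages(messages):
    """Merge consecutive messages sharing a role so chat templates
    that reject system+system+user+user sequences still apply.
    Only string content is joined; multimodal list content is
    left untouched."""
    out = []
    for msg in messages:
        prev = out[-1] if out else None
        if (prev is not None and prev["role"] == msg["role"]
                and isinstance(prev.get("content"), str)
                and isinstance(msg.get("content"), str)):
            prev["content"] = prev["content"] + "\n\n" + msg["content"]
        else:
            out.append(dict(msg))
    return out
```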
OpenAI Responses API and clients like Claude Code send messages with
role "developer" instead of "system". Chat templates (Qwen 3.5, Llama,
etc.) don't recognize this role, causing template failure → raw prefill
fallback → potential crash during generation.

Add a _ROLE_MAP dict to _normalize_messages() that maps non-standard
roles before the merge logic runs. This ensures developer + system
messages also merge correctly when consecutive.

Closes waybarrios#137
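A minimal sketch of the role remapping, run before the same-role merge (any map entries beyond developer → system are unspecified here):

```python
_ROLE_MAP = {"developer": "system"}

def remap_roles(messages):
    """Map non-standard roles (OpenAI Responses API 'developer') to
    ones chat templates recognize, before the merge logic runs so
    that developer + system messages can coalesce when consecutive."""
    return [{**m, "role": _ROLE_MAP.get(m["role"], m["role"])}
            for m in messages]
```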
SimpleEngine.stream_chat() for MLLM models ran asyncio.to_thread(
run_stream) without acquiring self._generation_lock. The non-streaming
chat() path and the LLM stream_generate() path both hold the lock,
but MLLM streaming was completely unprotected.

When clients like OpenCode send concurrent streaming requests (e.g.
title generation + main prompt simultaneously), both requests would
execute self._model.stream_chat() in separate thread pool threads,
causing concurrent Metal operations and SIGSEGV/_MTLCommandBuffer
assertion crashes.

Wrap the MLLM streaming path in async with self._generation_lock to
serialize Metal access, matching the behavior of all other engine
methods.
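A condensed sketch of the locking pattern (class and method names are illustrative; the real path streams chunks rather than returning a single result):

```python
import asyncio

class Engine:
    def __init__(self, model):
        self._model = model
        self._generation_lock = asyncio.Lock()

    async def stream_chat_mllm(self, messages):
        # Previously this ran without the lock, letting two thread-pool
        # threads issue Metal ops concurrently. Holding the lock across
        # the blocking call serializes generation.
        async with self._generation_lock:
            return await asyncio.to_thread(self._model.stream_chat, messages)
```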
Fixes ruff F401 lint error.
_can_trim_cache() only checked the first cache layer's is_trimmable().
For hybrid models (Qwen 3.5 MoE, Nemotron Mamba+Attention), the
prompt_cache mixes KVCache (trimmable) and ArraysCache (not trimmable).
The first layer happened to be KVCache, so _can_trim_cache returned
True. _trim_cache then trimmed KVCache layers but silently skipped
ArraysCache layers, leaving KV and SSM/MoE state inconsistent.

Fix: check ALL cache layers, not just the first. Also add is_trimmable
guard in _trim_cache to skip non-trimmable caches explicitly.

Relates to waybarrios#145
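Sketched fix, with method names matching the description above (the surrounding engine code is elided):

```python
def can_trim_cache(prompt_cache):
    # Every layer must be trimmable. Checking only prompt_cache[0]
    # passes for hybrid models whose first layer is KVCache, and the
    # resulting partial trim desynchronizes KV vs SSM/MoE state.
    return all(c.is_trimmable() for c in prompt_cache)

def trim_cache(prompt_cache, n):
    for c in prompt_cache:
        if c.is_trimmable():  # explicit guard; skip ArraysCache layers
            c.trim(n)
```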
…s#142, waybarrios#136)

Add _is_kv_layer() to classify positional (KVCache) vs non-positional
(ArraysCache) cache layers. _extract_block_tensor_slice() now skips
non-KV layers instead of crashing with 'Too many indices for array
with 3 dimensions'.

NonKVCacheData dataclass added for storing non-positional state
(used by subsequent commits for full hybrid reconstruction).
store_cache() separates KV (block-sliced) and non-KV (stored whole)
layers. reconstruct_cache() rebuilds both: KV via block concatenation,
non-KV via from_state(). If non-KV states are missing for a hybrid
model, returns None to force safe recomputation.
… robustness

fetch_cache() now rejects partial prefix matches when non-KV states
are present but don't match the candidate block set. release_cache()
and clear() clean up non-KV state. scheduler fallback guards against
non-4D tensors in cache state reconstruction.
- fork_cache() now copies has_non_kv from source entry
- reconstruct_cache() uses dict lookup instead of list.index()
  in the per-layer loop (O(n) → O(1) per iteration)
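The lookup change can be illustrated as (identifier names are assumptions):

```python
# Before: candidate_ids.index(block_id) inside the per-layer loop,
# an O(n) scan on every lookup.
# After: build the position map once, then O(1) dict lookups.
def build_block_index(candidate_ids):
    return {block_id: pos for pos, block_id in enumerate(candidate_ids)}
```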
mlx-lm 0.31.0 (PR ml-explore/mlx-lm#911) added prompt_checkpoints as
a 7th element to BatchGenerator.unprocessed_prompts tuples. The chunked
prefill code in _install_chunked_prefill hardcoded a 6-element zip
unpacking which crashed with ValueError.

Fix: use *_extra catch-all in the zip unpacking so extra fields from
mlx-lm are captured but ignored. This is forward-compatible with any
future tuple additions.

The "small prompt" path already works because it passes tuples directly
to mlx-lm's own _process_prompts which handles 7 elements. Only the
chunked prefill path (total_tokens > budget) did its own unpacking.

Closes waybarrios#155
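A runnable sketch of the tolerant unpacking (field names are illustrative; only the tuple arity matters):

```python
def iter_unprocessed(unprocessed_prompts):
    """Unpack BatchGenerator.unprocessed_prompts tuples across mlx-lm
    versions. The *_extra catch-all absorbs prompt_checkpoints
    (added in 0.31.x) and any future trailing fields, instead of
    crashing with ValueError on a hardcoded 6-element unpack."""
    for uid, prompt, step, max_tokens, sampler, logits_proc, *_extra in unprocessed_prompts:
        yield uid, prompt, step, max_tokens, sampler, logits_proc
```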
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 21, 2026
…nizer for Unicode (waybarrios#130)

Two fixes in scheduler.py:

1. _chunked_next tuple unpack (issue waybarrios#178): mlx-lm 0.31.x added
   prompt_checkpoints as a 7th tuple element. _chunked_next only
   unpacked 6, crashing when prefix cache triggers chunked prefill.
   Same class of bug as PR waybarrios#169 but in a different code path.

2. NaiveStreamingDetokenizer for BatchedEngine (issue waybarrios#130): Raw
   tokenizer.decode([token]) splits multi-byte codepoints (emoji,
   CJK) into surrogate pairs in streaming output. Replace with
   NaiveStreamingDetokenizer that buffers incomplete UTF-8 byte
   sequences and only emits valid segments. Matches the fix applied
   to mllm_scheduler.py in commit d2ea97c.
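A sketch of the buffering idea behind NaiveStreamingDetokenizer (this shows the general decode-and-hold-back pattern, not mlx-lm's actual implementation):

```python
def stream_decode(tokenizer, token_ids):
    """Decode incrementally without splitting multi-byte codepoints.
    A trailing U+FFFD replacement char marks an incomplete UTF-8
    sequence, so that text is held back until more tokens arrive;
    only the newly completed suffix is emitted each step."""
    emitted = ""
    ids = []
    for t in token_ids:
        ids.append(t)
        text = tokenizer.decode(ids)
        if text.endswith("\ufffd"):
            continue  # incomplete byte sequence; wait for more tokens
        yield text[len(emitted):]
        emitted = text
```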
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 22, 2026
…nizer for Unicode (waybarrios#130)
@Thump604
Collaborator Author

Superseded — the chunked prefill tuple unpack fix is now in PR #194 (standalone, rebased against current main). The hybrid cache and prefix cache changes are in PR #165.

@Thump604 Thump604 closed this Mar 22, 2026