fix: chunked prefill compat with mlx-lm >= 0.31.0 #169
Closed
Thump604 wants to merge 19 commits into waybarrios:main
Conversation
Comprehensive tests for MLLM continuous batching with hybrid model caches (KVCache + ArraysCache). Covers merge, filter, extract, extend operations on mixed cache lists, plus message normalization for real-world client formats (OpenCode consecutive same-role messages). Tests written first — implementation follows.
Support ArraysCache (SSM layers), RotatingKVCache, and CacheList in the MLLM batch cache factory. Matches the pattern from mlx-lm's native BatchGenerator. Uses type(c) is KVCache (strict identity) to avoid catching QuantizedKVCache subclass.
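The strict-identity check can be illustrated in isolation. The class definitions below are stand-ins, not the real mlx-lm classes; only the `type(c) is KVCache` vs `isinstance` distinction mirrors the commit:

```python
# Stand-in class hierarchy mirroring mlx-lm's shape: QuantizedKVCache
# subclasses KVCache, so isinstance() cannot tell them apart.
class KVCache:
    pass

class QuantizedKVCache(KVCache):
    pass

c = QuantizedKVCache()
print(isinstance(c, KVCache))  # True: would wrongly route the quantized cache down the KVCache path
print(type(c) is KVCache)      # False: strict identity lets QuantizedKVCache take its own branch
```

This is why the factory uses `type(c) is KVCache` rather than `isinstance`: the quantized subclass needs a different batched representation.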
Replace isinstance(sample_cache, KVCache) with a capability check (hasattr merge). This is the actual crash point for hybrid models like Qwen 3.5 where layer 0 is ArraysCache (GatedDeltaNet). The merge loop is already polymorphic — each cache type's merge() returns the correct batched representation.
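A minimal sketch of the capability check, using stand-in cache classes (the real `merge()` implementations live in mlx-lm's cache types):

```python
# Fake caches standing in for mlx-lm's KVCache and ArraysCache; each
# knows how to merge itself into a batched representation.
class FakeKVCache:
    def merge(self, others):
        return ("kv", len(others) + 1)

class FakeArraysCache:  # stands in for an SSM-layer cache (e.g. GatedDeltaNet)
    def merge(self, others):
        return ("ssm", len(others) + 1)

def merge_batched(sample_cache, other_caches):
    # Old: isinstance(sample_cache, KVCache) -> crashed on ArraysCache layers.
    # New: any cache exposing merge() participates; merge() is polymorphic.
    if not hasattr(sample_cache, "merge"):
        raise TypeError(f"{type(sample_cache).__name__} is not batchable")
    return sample_cache.merge(other_caches)
```

Because each cache type's `merge()` returns its own batched form, the caller never needs to know which layer type it is holding.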
Replace hasattr(o, 'keys') check with empty() which is universal across all cache types via _BaseCache. ArraysCache uses cache[0] is None for empty(), BatchKVCache uses keys is None. This fixes silent skip of SSM layer caches during batch extension.
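The silent-skip failure mode is easy to reproduce with stand-in classes (the real contracts live in mlx-lm's `_BaseCache` hierarchy; these are illustrations):

```python
# An ArraysCache-like object has state slots but no .keys attribute,
# so an attribute-based filter drops it; empty() is answered by both.
class ArraysCacheLike:
    def __init__(self):
        self.cache = [None]          # SSM state slots

    def empty(self):
        return self.cache[0] is None

class BatchKVCacheLike:
    def __init__(self):
        self.keys = None             # attention K tensor

    def empty(self):
        return self.keys is None

caches = [ArraysCacheLike(), BatchKVCacheLike()]
old_filter = [c for c in caches if hasattr(c, "keys")]  # silently drops the SSM layer
new_filter = [c for c in caches if c.empty()]           # both layers report state uniformly
```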
Add _normalize_messages() preprocessing that merges consecutive messages with the same role. Prevents chat template failures when clients like OpenCode send system+system+user+user format that Qwen 3.5 and other templates reject. Only merges string content; multimodal list content is preserved as-is.
OpenAI Responses API and clients like Claude Code send messages with role "developer" instead of "system". Chat templates (Qwen 3.5, Llama, etc.) don't recognize this role, causing template failure → raw prefill fallback → potential crash during generation. Add a _ROLE_MAP dict to _normalize_messages() that maps non-standard roles before the merge logic runs. This ensures developer + system messages also merge correctly when consecutive. Closes waybarrios#137
SimpleEngine.stream_chat() for MLLM models ran asyncio.to_thread(run_stream) without acquiring self._generation_lock. The non-streaming chat() path and the LLM stream_generate() path both hold the lock, but MLLM streaming was completely unprotected. When clients like OpenCode send concurrent streaming requests (e.g. title generation + main prompt simultaneously), both requests would execute self._model.stream_chat() in separate thread pool threads, causing concurrent Metal operations and SIGSEGV/_MTLCommandBuffer assertion crashes. Wrap the MLLM streaming path in async with self._generation_lock to serialize Metal access, matching the behavior of all other engine methods.
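The locking pattern can be sketched with a toy engine, assuming an `asyncio.Lock` named `_generation_lock` as on the other paths (the logging here only stands in for `self._model.stream_chat()`):

```python
import asyncio

class Engine:
    def __init__(self):
        self._generation_lock = asyncio.Lock()
        self.log = []

    def _run_stream(self, tag):
        # Stands in for self._model.stream_chat(); its Metal work must
        # never overlap with another request's.
        self.log.append(("start", tag))
        self.log.append(("end", tag))
        return tag

    async def stream_chat(self, tag):
        # The fix: hold the async lock across the thread-pool call so
        # only one generation touches Metal at a time.
        async with self._generation_lock:
            return await asyncio.to_thread(self._run_stream, tag)

async def main():
    e = Engine()
    # Concurrent requests, as OpenCode issues them (title + main prompt).
    await asyncio.gather(e.stream_chat("title"), e.stream_chat("main"))
    return e.log

log = asyncio.run(main())
# Each ("start", x) is immediately followed by its ("end", x): no interleaving.
```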
Fixes ruff F401 lint error.
_can_trim_cache() only checked the first cache layer's is_trimmable(). For hybrid models (Qwen 3.5 MoE, Nemotron Mamba+Attention), the prompt_cache mixes KVCache (trimmable) and ArraysCache (not trimmable). The first layer happened to be KVCache, so _can_trim_cache returned True. _trim_cache then trimmed KVCache layers but silently skipped ArraysCache layers, leaving KV and SSM/MoE state inconsistent. Fix: check ALL cache layers, not just the first. Also add is_trimmable guard in _trim_cache to skip non-trimmable caches explicitly. Relates to waybarrios#145
…s#142, waybarrios#136) Add _is_kv_layer() to classify positional (KVCache) vs non-positional (ArraysCache) cache layers. _extract_block_tensor_slice() now skips non-KV layers instead of crashing with 'Too many indices for array with 3 dimensions'. NonKVCacheData dataclass added for storing non-positional state (used by subsequent commits for full hybrid reconstruction).
store_cache() separates KV (block-sliced) and non-KV (stored whole) layers. reconstruct_cache() rebuilds both: KV via block concatenation, non-KV via from_state(). If non-KV states are missing for a hybrid model, returns None to force safe recomputation.
… robustness fetch_cache() now rejects partial prefix matches when non-KV states are present but don't match the candidate block set. release_cache() and clear() clean up non-KV state. scheduler fallback guards against non-4D tensors in cache state reconstruction.
- fork_cache() now copies has_non_kv from the source entry
- reconstruct_cache() uses a dict lookup instead of list.index() in the per-layer loop (O(n) → O(1) per iteration)
mlx-lm 0.31.0 (PR ml-explore/mlx-lm#911) added prompt_checkpoints as a 7th element to BatchGenerator.unprocessed_prompts tuples. The chunked prefill code in _install_chunked_prefill hardcoded a 6-element zip unpacking which crashed with ValueError. Fix: use *_extra catch-all in the zip unpacking so extra fields from mlx-lm are captured but ignored. This is forward-compatible with any future tuple additions. The "small prompt" path already works because it passes tuples directly to mlx-lm's own _process_prompts which handles 7 elements. Only the chunked prefill path (total_tokens > budget) did its own unpacking. Closes waybarrios#155
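A minimal illustration of the catch-all unpack; the tuple field names are illustrative, and only the `*_extra` pattern mirrors the actual fix:

```python
def unpack(entry):
    # Fixed unpacking: six named fields, anything beyond them (e.g. the
    # prompt_checkpoints element added in mlx-lm 0.31.0) lands in _extra
    # and is ignored, so future tuple growth cannot raise ValueError.
    uid, tokens, max_tok, sampler, logits_proc, offset, *_extra = entry
    return uid, len(_extra)

old_style = ("req-1", [1, 2], 64, None, None, 0)                 # mlx-lm <= 0.30.x: 6 elements
new_style = ("req-2", [3, 4], 64, None, None, 0, "checkpoints")  # 0.31.0: 7th element added
```

With the old hardcoded six-name unpack, `new_style` would raise `ValueError: too many values to unpack`.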
This was referenced Mar 21, 2026
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request on Mar 21, 2026
…nizer for Unicode (waybarrios#130) Two fixes in scheduler.py: 1. _chunked_next tuple unpack (issue waybarrios#178): mlx-lm 0.31.x added prompt_checkpoints as a 7th tuple element. _chunked_next only unpacked 6, crashing when prefix cache triggers chunked prefill. Same class of bug as PR waybarrios#169 but in a different code path. 2. NaiveStreamingDetokenizer for BatchedEngine (issue waybarrios#130): Raw tokenizer.decode([token]) splits multi-byte codepoints (emoji, CJK) into surrogate pairs in streaming output. Replace with NaiveStreamingDetokenizer that buffers incomplete UTF-8 byte sequences and only emits valid segments. Matches the fix applied to mllm_scheduler.py in commit d2ea97c.
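The buffering idea behind the detokenizer fix can be shown with a toy class: accumulate raw bytes and only emit complete UTF-8 sequences. The real `NaiveStreamingDetokenizer` lives in mlx-lm; this is not its implementation:

```python
class Utf8StreamBuffer:
    """Emit only complete UTF-8 sequences; hold back partial ones."""

    def __init__(self):
        self._buf = b""

    def push(self, chunk: bytes) -> str:
        self._buf += chunk
        # A valid UTF-8 sequence is at most 4 bytes, so at most the last
        # 3 bytes can be an incomplete tail; try cutting it off.
        for cut in range(len(self._buf), max(len(self._buf) - 4, -1), -1):
            try:
                text = self._buf[:cut].decode("utf-8")
            except UnicodeDecodeError:
                continue
            self._buf = self._buf[cut:]
            return text
        return ""
```

For example, an emoji whose 4 bytes arrive in two chunks produces no output on the first chunk and the whole character on the second, instead of two broken surrogate halves.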
Fix chunked prefill compatibility with mlx-lm >= 0.31.0.
What: mlx-lm 0.31.0 changed the generate_step signature. This patch handles both old and new signatures so vllm-mlx works with mlx-lm 0.30.x and 0.31.x.
Files: engine/simple.py
Test: Start server with mlx-lm >= 0.31.0, send any chat request.