fix: MLLM continuous batching — system prompt, routing, and KV cache #76
Conversation
Three related bugs caused multimodal models (e.g. Qwen2.5-VL, Qwen3-VL)
to produce garbage output when running with --continuous-batching:
1. Chat template drops system prompt and conversation history
_apply_chat_template used mlx_vlm.prompt_utils.apply_chat_template
which only extracted the last user message text, discarding system
prompts and all prior conversation turns. Fixed by using the
processor's (or tokenizer's) apply_chat_template with the full
message list. Also removed hardcoded enable_thinking=True which
caused issues with non-thinking models.
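A minimal toy sketch of the difference (the renderers below only stand in for the real chat-template call; role markers and message content are illustrative):

```python
# Toy contrast between the buggy and fixed template paths. These renderers
# stand in for apply_chat_template; the markers are illustrative only.
def render_full_history(messages):
    # Fixed path: serialise every turn, keeping system prompt and history
    return "\n".join(f"<|{m['role']}|>{m['content']}" for m in messages)

def render_last_user_only(messages):
    # Buggy path: only the last user message's text survives
    last_user = [m for m in messages if m["role"] == "user"][-1]
    return f"<|user|>{last_user['content']}"

messages = [
    {"role": "system", "content": "Answer in French."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Bonjour."},
    {"role": "user", "content": "Describe the image."},
]

assert "Answer in French." in render_full_history(messages)
assert "Answer in French." not in render_last_user_only(messages)
```

The second renderer is why system prompts and prior turns vanished: nothing before the final user message ever reached the model.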
2. MLLM text-only requests crash with NoneType error
generate() and stream_generate() only routed to _mllm_scheduler
when images or videos were present, but for MLLM models _engine
is None (only _mllm_scheduler is initialised). Text-only requests
to MLLM models fell through to self._engine which is None.
Fixed by routing all requests through _mllm_scheduler when the
model is multimodal.
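The routing change amounts to dropping one condition; a sketch with illustrative names (the real logic lives on the server object, not a free function):

```python
# Sketch of the routing fix. Attribute and parameter names mirror the
# description above, but this helper is illustrative, not the real API.
def pick_backend(is_mllm, mllm_scheduler, engine, images=None, videos=None):
    # Old: `if is_mllm and mllm_scheduler and (images or videos)` — a
    # text-only request to an MLLM model fell through to engine, which
    # is None for MLLM models.
    # New: the model being multimodal alone decides the route.
    if is_mllm and mllm_scheduler is not None:
        return mllm_scheduler
    return engine

scheduler = object()
# A text-only request to an MLLM model now reaches the scheduler, not None
assert pick_backend(True, scheduler, engine=None) is scheduler
```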
3. KV cache from VLM prefill not transferred to BatchKVCache
_process_prompts called _run_vision_encoding without passing a
cache, so the VLM's language model created temporary internal
caches that were discarded. The code then tried to transfer KV
state from the model's internal layer caches to a pre-created
BatchKVCache, but BatchKVCache.insert_single doesn't exist.
Fixed by:
- Passing a per-request KVCache to _run_vision_encoding, which
flows through to the VLM's language_model(cache=...) call
- Using KVCache.merge() to combine per-request caches into a
properly aligned BatchKVCache for generation
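The per-request-then-merge flow can be sketched with toy stand-ins (ToyKVCache and merge() only mimic the shape of mlx-lm's KVCache/BatchKVCache; the real caches hold per-layer arrays, not token lists):

```python
# Toy sketch of the Bug 3 fix: per-request prefill caches are merged with
# left-padding into one batch. Class and function names are illustrative.
from dataclasses import dataclass, field

@dataclass
class ToyKVCache:
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def update(self, k, v):
        # called once per prefill token by the language model
        self.keys.append(k)
        self.values.append(v)

def merge(caches, pad=0):
    """Left-pad each request's KV state to the longest prompt, then stack.

    Mirrors the KVCache.merge() -> BatchKVCache step described above.
    """
    max_len = max(len(c.keys) for c in caches)
    batch_keys, batch_values = [], []
    for c in caches:
        n_pad = max_len - len(c.keys)
        batch_keys.append([pad] * n_pad + c.keys)
        batch_values.append([pad] * n_pad + c.values)
    return batch_keys, batch_values

# Two requests with different prompt lengths
a, b = ToyKVCache(), ToyKVCache()
for tok in (1, 2, 3):
    a.update(tok, tok * 10)
b.update(7, 70)

keys, values = merge([a, b])
# keys == [[1, 2, 3], [0, 0, 7]]  (shorter request left-padded)
```

Left-padding keeps every request's most recent token aligned at the right edge, which is what lets a single batched decode step append one token per request.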
Tested with Qwen2.5-VL-32B and Qwen3-VL-30B — both produce correct
output with --continuous-batching, including system prompt retention,
multi-turn conversation history, and concurrent request batching.
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Add a guard to catch QuantizedKVCache early, since it does not support merge and would crash at runtime. Wrap the cache merge in try-except with proper logging so failures are not silent. Validate input types in _prepare_mllm_messages to reject malformed messages, and limit total prompt tokens before merging to prevent memory exhaustion. Also fix the missing type annotation on the cache parameter, log the TypeError fallback in chat template application, complete the _prepare_mllm_messages docstring, and update several stale docstrings that still described the old hybrid routing instead of the current MLLMScheduler approach.
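A sketch of that guard, with stand-in cache classes (the real check would inspect mlx-lm's cache objects; the helper name and error wording are assumptions):

```python
# Stand-ins for the mlx-lm cache classes; only the isinstance check matters.
class KVCache:
    pass

class QuantizedKVCache:
    pass

def ensure_mergeable(cache):
    # merge() only exists on plain KVCache, so fail fast with a clear
    # message instead of an AttributeError deep inside the batching path.
    if isinstance(cache, QuantizedKVCache):
        raise ValueError(
            "KV cache quantization is not supported with MLLM continuous "
            "batching; disable the quantization flag"
        )
    return cache
```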
Pushed two commits with some hardening fixes on top of your changes. The most important one is a guard for QuantizedKVCache. The merge() method only exists on KVCache, so if someone enables kv-cache-quantization with an MLLM model and continuous batching, it would crash with an AttributeError at runtime. Now it raises a clear ValueError telling them to disable that flag.

I also wrapped the cache merge in a try-except with logging so any unexpected failures get surfaced instead of silently breaking, and added a prompt token limit check before the merge to prevent memory exhaustion from oversized batches. On the input validation side, _prepare_mllm_messages now skips non-dict messages and filters out content parts that are not dicts or strings, which avoids passing unexpected types to the processor.

The rest is smaller stuff: added a type annotation on the cache parameter, logged the TypeError fallback in the chat template path instead of swallowing it silently, completed the _prepare_mllm_messages docstring, and updated a few stale docstrings that still referenced the old hybrid routing instead of the current MLLMScheduler approach.
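The validation pass can be sketched as a filter (the helper name and exact policy approximate what the comment above describes):

```python
# Sketch of the input validation in _prepare_mllm_messages: skip messages
# that are not dicts, and drop content parts that are neither dicts nor
# strings. The function name and policy are approximations of the commit.
def sanitize_messages(messages):
    clean = []
    for msg in messages:
        if not isinstance(msg, dict):
            continue  # malformed message: drop it entirely
        content = msg.get("content")
        if isinstance(content, list):
            # keep text parts ("...") and structured parts ({"type": ...})
            content = [p for p in content if isinstance(p, (dict, str))]
        clean.append({**msg, "content": content})
    return clean
```

Filtering up front means the processor only ever sees well-formed message dicts, instead of raising an opaque error mid-encoding.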
Merge 17 upstream commits including:
- KV cache quantization for prefix cache memory reduction (waybarrios#62)
- Streaming tool call parsing via ToolParser integration (waybarrios#46)
- MTP speculative decoding for Qwen3-Next (waybarrios#82)
- GPT-OSS reasoning parser and Harmony format parsers
- mlx-lm >= 0.30.5 requirement, transformers >= 5.0.0
- BatchMambaCache fix for mlx-lm >= 0.30.6 (waybarrios#89)
- MLLM continuous batching fixes (waybarrios#76)
- Force MLLM mode option (waybarrios#81)
- Various bug fixes

Conflict resolution:
- server.py: Replaced local tool_call_buffering with upstream's ToolParser-based streaming (more robust)
- cli.py: Deduplicated --mllm, --default-temperature, --default-top-p args (upstream already added them), kept local --embedding-model
- mamba_cache.py: Took upstream's conditional HAS_MAMBA_CACHE approach
- pyproject.toml: Took upstream's version and dependency changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Fixes three related bugs that caused multimodal models (e.g. Qwen2.5-VL, Qwen3-VL) to produce garbage output when running with `--continuous-batching`:

1. `_apply_chat_template` used `mlx_vlm.prompt_utils.apply_chat_template`, which only extracted the last user message text, discarding system prompts and all prior turns. Fixed by using the processor/tokenizer's `apply_chat_template` with the full message list.
2. `generate()` and `stream_generate()` only routed to `_mllm_scheduler` when images/videos were present, but `_engine` is `None` for MLLM models. Fixed by routing all requests through `_mllm_scheduler` when the model is multimodal.
3. `_run_vision_encoding` was called without a cache argument, so prefill KV state was discarded. The code then tried to copy state via `BatchKVCache.insert_single`, which doesn't exist. Fixed by passing per-request `KVCache` objects to the VLM forward pass, then merging them into a `BatchKVCache` via `KVCache.merge()`.

Details
Bug 1: Chat template (`batched.py`)

The MLLM path called `mlx_vlm.prompt_utils.apply_chat_template(processor, config, text_prompt)` with only the extracted last-user-message text. That function is designed for single-turn VLM inference and doesn't accept a messages list. The fix uses `processor.apply_chat_template(messages, ...)` (the HuggingFace standard), which preserves system prompts, assistant turns, and multi-turn history. Also removed the hardcoded `enable_thinking=True`, which isn't supported by all model templates.

Bug 2: MLLM routing (`batched.py`)

For MLLM models, only `_mllm_scheduler` is initialised (not `_engine`). The condition `if self._is_mllm and self._mllm_scheduler and (images or videos)` meant text-only chat requests fell through to `self._engine.add_request()`, which is `None`, causing an `AttributeError`. Removed the `(images or videos)` guard so all requests route through the MLLM scheduler when the model is multimodal.

Bug 3: KV cache (`mllm_batch_generator.py`)

`_process_prompts` created an empty `BatchKVCache` upfront, ran VLM encoding without passing any cache, then attempted to extract KV state from the model's internal layer caches. This failed because (a) the VLM discards its internal cache after the forward pass without a `cache=` argument, and (b) `BatchKVCache` doesn't have an `insert_single` method. The fix:

- Create a per-request `KVCache` (from `mlx_lm.models.cache.make_prompt_cache`)
- Call `_run_vision_encoding(req, cache=request_cache)` — the VLM model's `__call__` passes `cache=` through to `self.language_model()`
- Combine the per-request caches via `KVCache.merge()` → `BatchKVCache` with proper left-padding alignment

Test plan
Tested on Mac Studio M3 Ultra (96GB):
- `Qwen2.5-VL-32B-Instruct-4bit` with `--continuous-batching` — correct output
- `Qwen3-VL-30B-A3B-Instruct-4bit` with `--continuous-batching` — correct output

🤖 Generated with Claude Code