feat: MTP per-request routing + system KV cache in BatchedEngine #192
Closed
Thump604 wants to merge 12 commits into waybarrios:main from
Conversation
When both --mllm and --enable-mtp are set, SimpleEngine builds a parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy). Text-only requests route to mlx_lm with MTP speculative decoding; media requests route to the mlx_vlm MLLM path.

Key components:
- text_model_from_vlm.py: build an mlx_lm TextModel from VLM weights
- Per-request routing in stream_chat() via _has_media_content()
- _stream_generate_text() for MTP-accelerated text generation
- MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM

Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit):
- Text (MTP): 65.3 tok/s
- Vision (MLLM): 63.8 tok/s
- Memory: 38.7 GB (zero-copy, same as single model)
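The routing check described above can be sketched as follows. This is a hypothetical reconstruction, not the PR's actual code: the `_has_media_content` name comes from the commit message, but the content-part shapes and the `route` helper are assumptions based on the OpenAI-style message format.

```python
def _has_media_content(messages: list[dict]) -> bool:
    """Return True if any message carries image/video/audio parts.

    OpenAI-style messages use either a plain string `content` (text only)
    or a list of typed parts, e.g. {"type": "image_url", ...}.
    """
    media_types = {"image_url", "image", "video_url", "video", "audio", "input_audio"}
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") in media_types:
                    return True
    return False


def route(messages: list[dict]) -> str:
    # Text-only -> mlx_lm TextModel with MTP; media -> mlx_vlm MLLM path.
    return "mllm" if _has_media_content(messages) else "mtp_text"
```

Because the check only inspects message structure, plain-string contents never trigger the MLLM path, which is what makes per-request routing safe for mixed workloads.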
Multi-turn conversations with tool results can produce messages where system messages are out of order or consecutive same-role messages appear. Qwen 3.5's chat template rejects these with "System message must be at the beginning", crashing CLI agents on turn 2.

Add _normalize_messages() to SimpleEngine.stream_chat() to:
1. Map developer -> system (OpenAI Responses API compat)
2. Merge consecutive same-role messages (alternating-role requirement)

This matches the normalization already done in the BatchedEngine paths (PR waybarrios#165) but was missing from SimpleEngine's MTP text path.
Many CLIs (OpenCode, Qwen Code, Kilo) send system messages mid-conversation. The Qwen 3.5 chat template enforces system messages at position [0], causing TemplateError crashes and dropped connections (2600+ occurrences in logs).

Changes:
- _normalize_messages() now hoists all system messages to position [0] and merges their content, after the existing role-mapping and same-role merge
- Added a _normalize_messages() call to SimpleEngine.chat() (the non-streaming path was missing it)

The hoist only triggers when system messages are out of position (more than one system message, or merged[0] is not a system message), so well-ordered messages pass through unchanged.
Add _hoist_system_messages() to mllm.py for the MLLM-internal get_chat_template calls. These go through mlx_vlm's template application, which also enforces system-at-position-[0]. Defense in depth: SimpleEngine already normalizes before calling MLLM methods, but direct MLLM usage (benchmarks, tests) and edge cases during startup now also handle out-of-order system messages correctly.
SimpleEngine.chat() lacked the MLLM+MTP per-request routing that stream_chat() already had. Text-only requests via the non-streaming API went to the mlx_vlm MLLM path, causing 160 GB Metal buffer allocation attempts and server crashes. This change routes text-only chat() requests through _stream_generate_text() (the MTP path), matching stream_chat() behavior.
Extract _normalize_messages() into a shared message_utils.py module and add calls in BatchedEngine.chat() and BatchedEngine.stream_chat(). Qwen 3.5 templates reject the developer role, consecutive same-role messages, and system messages out of position [0].
Build an mlx_lm TextModel with MTP weights from the VLM backbone at load time (zero-copy weight sharing). Route text-only requests to the TextModel with MTP speculative decoding, and media requests to the MLLM path. Dual-backend architecture: MLLMScheduler for media; direct TextModel generation guarded by an asyncio.Lock for text. This serialization prevents concurrent Metal GPU access. Depends on PR waybarrios#171 (text_model_from_vlm.py).
Add specprefill_enabled, specprefill_draft_model_path, specprefill_threshold, and specprefill_keep_pct to load_model() and SimpleEngine.__init__. Previously these were parsed from the CLI but never passed to the engine, so _specprefill_threshold and _specprefill_keep_pct would raise AttributeError if triggered.
On the first text-only request, prefill the system prompt tokens and snapshot the backbone KV state. Subsequent requests with the same system prompt restore the snapshot and only prefill the suffix tokens, saving ~57 s per request on the 122B model.

Cache hit path: restores the KV snapshot into a fresh cache, then passes the suffix tokens plus the primed prompt_cache to mlx_lm stream_generate with MTP.
Cache miss path: runs full generation, then separately prefills the system prefix on a fresh cache and saves the snapshot for next time.

Handles both KVCache (Qwen 3.5) and ArraysCache (hybrid) state formats. The system prefix boundary is detected via the ChatML <|im_start|>user marker. Stats are exposed via get_stats() for /v1/status visibility.
Load the draft model at init and dispatch SpecPrefill for prompts exceeding the threshold. Uses score_tokens/select_chunks/sparse_prefill from the specprefill.py module. Composes with the system KV cache: on a cache hit, sparse-prefills only the suffix with a position_offset. Per-request API: extra_body specprefill / specprefill_keep_pct. Falls back gracefully to the normal MTP path on any error. SpecPrefill generates autoregressively (no MTP) because the sparse cache is incompatible with MTP speculative decoding. _SPECPREFILL_MAX_TOKENS = 196608 (supports up to 192K context).
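The dispatch and chunk-selection policy above can be sketched as pure functions. The constant value is from the commit message; `should_specprefill` and `select_keep_indices` are illustrative stand-ins for the dispatch check and select_chunks(), not the module's actual API.

```python
_SPECPREFILL_MAX_TOKENS = 196608  # supports up to 192K context

def should_specprefill(num_tokens: int, enabled: bool, threshold: int) -> bool:
    """Dispatch SpecPrefill only for long prompts within the supported range."""
    return enabled and threshold < num_tokens <= _SPECPREFILL_MAX_TOKENS


def select_keep_indices(scores: list[float], keep_pct: float) -> list[int]:
    """Keep the highest-scoring fraction of chunks (draft-model importance
    scores), preserving original chunk order for the sparse prefill."""
    k = max(1, int(len(scores) * keep_pct))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Restoring original order after the top-k selection matters: the kept chunks are prefilled positionally, so shuffling them would corrupt the sparse cache.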
Update: SpecPrefill support added (commit 1244d93)
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request on Mar 22, 2026:
Port SimpleEngine features to BatchedEngine for continuous batching:
- Per-request MTP routing: text-only → TextModel (MTP), media → MLLM
- message_utils.py: shared _normalize_messages (developer→system, merge consecutive same-role, hoist system to [0])
- SpecPrefill config + draft model lifecycle in BatchedEngine
- System KV cache with ChatML boundary detection

Replaces PR waybarrios#192 (rebased against main after merge of waybarrios#180 and waybarrios#97).
Port MTP routing, SpecPrefill, and system KV caching to BatchedEngine for continuous batching parity with SimpleEngine.
What:
Results (M2 Ultra 128GB, Qwen3.5-122B):
Depends on: #171, #180
Files:
engine/batched.py, server.py, cli.py, api/models.py, specprefill.py, message_utils.py