feat: MLLM+MTP per-request routing for text and vision #171
Closed
Thump604 wants to merge 5 commits into waybarrios:main from
Conversation
When both `--mllm` and `--enable-mtp` are set, SimpleEngine builds a parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy). Text-only requests route to mlx_lm with MTP speculative decoding; media requests route to the mlx_vlm MLLM path.

Key components:
- `text_model_from_vlm.py`: build an mlx_lm TextModel from VLM weights
- Per-request routing in `stream_chat()` via `_has_media_content()`
- `_stream_generate_text()` for MTP-accelerated text generation
- MTP passthrough: `--enable-mtp` flag through CLI/server/engine/LLM

Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit):
- Text (MTP): 65.3 tok/s
- Vision (MLLM): 63.8 tok/s
- Memory: 38.7 GB (zero-copy, same as single model)
This was referenced Mar 17, 2026
Multi-turn conversations with tool results can produce message lists where system messages are out of order or consecutive same-role messages appear. Qwen 3.5's chat template rejects these with "System message must be at the beginning", crashing CLI agents on turn 2.

Add `_normalize_messages()` to `SimpleEngine.stream_chat()` to:
1. Map `developer` -> `system` (OpenAI Responses API compat)
2. Merge consecutive same-role messages (alternating-role requirement)

This matches the normalization already done in the BatchedEngine paths (PR waybarrios#165) but was missing from SimpleEngine's MTP text path.
Many CLIs (OpenCode, Qwen Code, Kilo) send system messages mid-conversation. The Qwen 3.5 chat template enforces system messages at position [0], causing TemplateError crashes and dropped connections (2600+ occurrences in logs).

Changes:
- `_normalize_messages()` now hoists all system messages to position [0] and merges their content, after the existing role mapping and same-role merge
- Added a `_normalize_messages()` call to `SimpleEngine.chat()` (the non-streaming path was missing it)

The condition only triggers when system messages are out of position (`len > 1` or `merged[0]` is not a system message), so well-ordered messages pass through unchanged.
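The hoisting step can be sketched as below, assuming string content and a `"\n\n"` merge separator; the function name mirrors `_hoist_system_messages()` from a later commit, but the body is an illustration, not the PR's code.

```python
def _hoist_system_messages(messages: list[dict]) -> list[dict]:
    """Move all system messages to position [0], merging their content."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if not system_parts:
        return messages
    # Well-ordered input (a single system message already at [0])
    # passes through unchanged.
    if len(system_parts) == 1 and messages[0]["role"] == "system":
        return messages
    hoisted = {"role": "system", "content": "\n\n".join(system_parts)}
    return [hoisted] + rest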
Add `_hoist_system_messages()` to `mllm.py` for the MLLM-internal `get_chat_template` calls. These go through mlx_vlm's template application, which also enforces system-at-position-[0].

Defense in depth: SimpleEngine already normalizes before calling MLLM methods, but direct MLLM usage (benchmarks, tests) and edge cases during startup now also handle out-of-order system messages correctly.
`SimpleEngine.chat()` lacked the MLLM+MTP per-request routing that `stream_chat()` already had. Text-only requests via the non-streaming API went down the mlx_vlm MLLM path, causing 160 GB Metal buffer allocation attempts and server crashes.

Routes text-only `chat()` requests through `_stream_generate_text()` (the MTP path), matching `stream_chat()` behavior.
This was referenced Mar 21, 2026
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request on Mar 21, 2026
Build an mlx_lm TextModel with MTP weights from the VLM backbone at load time (zero-copy weight sharing). Route text-only requests to the TextModel with MTP speculative decoding, media requests to the MLLM path.

Dual-backend architecture: MLLMScheduler for media, direct TextModel generation with an asyncio.Lock for text. Metal serialization prevents concurrent GPU access.

Depends on PR waybarrios#171 (`text_model_from_vlm.py`).
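The asyncio.Lock serialization mentioned above can be sketched as follows. This is a toy model of the idea, not the commit's code: the class name, counters, and placeholder generation are invented to show that at most one coroutine enters the "GPU" section at a time.

```python
import asyncio

class TextBackend:
    """Toy sketch: an asyncio.Lock serializes direct TextModel generation
    so only one request touches the Metal device at a time (media requests
    would go through a separate scheduler)."""

    def __init__(self):
        self._gpu_lock = asyncio.Lock()
        self.active = 0       # requests currently in the critical section
        self.max_active = 0   # high-water mark, should stay at 1

    async def generate(self, prompt: str) -> str:
        async with self._gpu_lock:
            # Critical section: stands in for mlx_lm MTP generation.
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            await asyncio.sleep(0)  # yield to the event loop mid-section
            self.active -= 1
            return prompt.upper()
```

Even when many requests are gathered concurrently, the lock guarantees `max_active` never exceeds 1, which is the Metal-serialization property the commit message describes.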
This was referenced Mar 21, 2026
Collaborator · Author

Superseded — all changes merged upstream via PR #180 (SpecPrefill, which included MTP routing, system KV cache, and prefill step size).
Per-request routing for VLM+MTP models: text-only requests use the mlx_lm TextModel with MTP speculative decoding; media requests use the mlx_vlm MLLM path.

What: `text_model_from_vlm.py` builds an mlx_lm TextModel from the VLM backbone (zero-copy weight sharing, ~0 extra RAM). At request time, `_has_media_content()` routes accordingly. Includes system message hoisting and developer→system mapping in both SimpleEngine paths.

Why: MTP speculative decoding only works through mlx_lm, not mlx_vlm. Without routing, text-only requests on VLM models don't get MTP — losing ~30% generation speed.

Depends on: #165

Files: new `text_model_from_vlm.py`; modified `engine/simple.py`, `server.py`

Test: