
feat: MLLM+MTP per-request routing for text and vision #171

Closed
Thump604 wants to merge 5 commits into waybarrios:main from Thump604:feat/mllm-mtp-per-request-routing


Conversation

Collaborator

@Thump604 Thump604 commented Mar 17, 2026

Per-request routing for VLM+MTP models: text-only requests use mlx_lm TextModel with MTP speculative decoding, media requests use mlx_vlm MLLM path.

What: text_model_from_vlm.py builds an mlx_lm TextModel from the VLM backbone (zero-copy weight sharing, ~0 extra RAM). At request time, _has_media_content() routes accordingly. Includes system message hoisting and developer→system mapping in both SimpleEngine paths.

Why: MTP speculative decoding only works through mlx_lm, not mlx_vlm. Without routing, text-only requests on VLM models don't get MTP, costing roughly 30% of generation speed.

Depends on: #165

Files: new text_model_from_vlm.py, modified engine/simple.py, server.py

Test:

# Start with --mllm --enable-mtp
# Text-only → check server log for "TextModel" routing
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [{"role":"user","content":"Hello"}],
    "max_tokens": 20
  }'

When both --mllm and --enable-mtp are set, SimpleEngine builds a
parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy).
Text-only requests route to mlx_lm with MTP speculative decoding;
media requests route to the mlx_vlm MLLM path.

Key components:
- text_model_from_vlm.py: Build mlx_lm TextModel from VLM weights
- Per-request routing in stream_chat() via _has_media_content()
- _stream_generate_text() for MTP-accelerated text generation
- MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM
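
The per-request routing above can be sketched as follows. The helper name _has_media_content() comes from the PR, but its body and the exact media part-type strings here are assumptions, not the real implementation:

```python
def _has_media_content(messages):
    """True if any message carries structured (image/video/audio) content parts."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):  # OpenAI-style multi-part content
            for part in content:
                if part.get("type") in ("image_url", "input_image",
                                        "video_url", "input_audio"):
                    return True
    return False


def pick_backend(messages):
    # Media -> mlx_vlm MLLM path; text-only -> mlx_lm TextModel with MTP.
    return "mlx_vlm_mllm" if _has_media_content(messages) else "mlx_lm_mtp"
```

Plain string content (the common text-only case) never enters the part loop, so routing stays O(1) for typical chat requests.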

Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit):
- Text (MTP): 65.3 tok/s
- Vision (MLLM): 63.8 tok/s
- Memory: 38.7 GB (zero-copy, same as single model)

Multi-turn conversations with tool results can produce messages where
system messages are out of order or consecutive same-role messages appear.
Qwen 3.5's chat template rejects these with "System message must be at
the beginning", crashing CLI agents on turn 2.

Add _normalize_messages() to SimpleEngine.stream_chat() to:
1. Map developer -> system (OpenAI Responses API compat)
2. Merge consecutive same-role messages (alternating-role requirement)

This matches the normalization already done in BatchedEngine paths
(PR waybarrios#165) but was missing from SimpleEngine's MTP text path.

Many CLIs (OpenCode, Qwen Code, Kilo) send system messages mid-conversation.
The Qwen 3.5 chat template enforces system messages at position [0], causing
TemplateError crashes and dropped connections (2600+ occurrences in logs).

Changes:
- _normalize_messages() now hoists all system messages to position [0] and
  merges their content, after the existing role-mapping and same-role merge
- Added _normalize_messages() call to SimpleEngine.chat() (non-streaming
  path was missing it)

The condition only triggers when system messages are out of position
(len > 1 or merged[0] != system), so well-ordered messages pass through
unchanged.

Add _hoist_system_messages() to mllm.py for the MLLM-internal
get_chat_template calls. These go through mlx_vlm's template
application which also enforces system-at-position-[0].

Defense-in-depth: SimpleEngine already normalizes before calling
MLLM methods, but direct MLLM usage (benchmarks, tests) and
edge cases during startup now also handle out-of-order system
messages correctly.

SimpleEngine.chat() lacked MLLM+MTP per-request routing that
stream_chat() already had. Text-only requests via non-streaming API
went to mlx_vlm MLLM path, causing 160GB Metal buffer allocation
attempts and server crashes.

Routes text-only chat() requests through _stream_generate_text()
(MTP path) matching stream_chat() behavior.

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 21, 2026

Build mlx_lm TextModel with MTP weights from VLM backbone at load time
(zero-copy weight sharing). Route text-only requests to TextModel with
MTP speculative decoding, media requests to MLLM path.

Dual-backend architecture: MLLMScheduler for media, direct TextModel
generation with asyncio.Lock for text. Metal serialization prevents
concurrent GPU access.

Depends on PR waybarrios#171 (text_model_from_vlm.py).

@Thump604 (Collaborator Author)

Superseded: all changes were merged upstream via PR #180 (SpecPrefill), which included MTP routing, the system KV cache, and the prefill step size.

@Thump604 Thump604 closed this Mar 22, 2026