
feat: MLLM+MTP per-request routing for text and vision #171

Closed
Thump604 wants to merge 5 commits into waybarrios:main from Thump604:feat/mllm-mtp-per-request-routing


Conversation

Collaborator

@Thump604 Thump604 commented Mar 17, 2026

Per-request routing for VLM+MTP models: text-only requests use mlx_lm TextModel with MTP speculative decoding, media requests use mlx_vlm MLLM path.

What: text_model_from_vlm.py builds an mlx_lm TextModel from the VLM backbone (zero-copy weight sharing, ~0 extra RAM). At request time, _has_media_content() routes accordingly. Includes system message hoisting and developer→system mapping in both SimpleEngine paths.

Why: MTP speculative decoding only works through mlx_lm, not mlx_vlm. Without routing, text-only requests on VLM models don't get MTP, costing roughly 30% of generation speed.

Depends on: #165

Files: new text_model_from_vlm.py, modified engine/simple.py, server.py

Test:

# Start with --mllm --enable-mtp
# Text-only → check server log for "TextModel" routing
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [{"role":"user","content":"Hello"}],
    "max_tokens": 20
  }'

When both --mllm and --enable-mtp are set, SimpleEngine builds a
parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy).
Text-only requests route to mlx_lm with MTP speculative decoding;
media requests route to the mlx_vlm MLLM path.

Key components:
- text_model_from_vlm.py: Build mlx_lm TextModel from VLM weights
- Per-request routing in stream_chat() via _has_media_content()
- _stream_generate_text() for MTP-accelerated text generation
- MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM
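
The per-request routing above can be sketched as follows. The helper name _has_media_content() comes from the PR, but its body and the exact media part-type strings here are assumptions, not the real implementation:

```python
def _has_media_content(messages):
    """True if any message carries structured (image/video/audio) content parts."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):  # OpenAI-style multi-part content
            for part in content:
                if part.get("type") in ("image_url", "input_image",
                                        "video_url", "input_audio"):
                    return True
    return False


def pick_backend(messages):
    # Media -> mlx_vlm MLLM path; text-only -> mlx_lm TextModel with MTP.
    return "mlx_vlm_mllm" if _has_media_content(messages) else "mlx_lm_mtp"
```

Plain string content (the common text-only case) never enters the part loop, so routing stays O(1) for typical chat requests.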

Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit):
- Text (MTP): 65.3 tok/s
- Vision (MLLM): 63.8 tok/s
- Memory: 38.7 GB (zero-copy, same as single model)

Multi-turn conversations with tool results can produce messages where
system messages are out of order or consecutive same-role messages appear.
Qwen 3.5's chat template rejects these with "System message must be at
the beginning", crashing CLI agents on turn 2.

Add _normalize_messages() to SimpleEngine.stream_chat() to:
1. Map developer -> system (OpenAI Responses API compat)
2. Merge consecutive same-role messages (alternating-role requirement)

This matches the normalization already done in BatchedEngine paths
(PR waybarrios#165) but was missing from SimpleEngine's MTP text path.

Many CLIs (OpenCode, Qwen Code, Kilo) send system messages mid-conversation.
The Qwen 3.5 chat template enforces system messages at position [0], causing
TemplateError crashes and dropped connections (2600+ occurrences in logs).

Changes:
- _normalize_messages() now hoists all system messages to position [0] and
  merges their content, after the existing role-mapping and same-role merge
- Added _normalize_messages() call to SimpleEngine.chat() (non-streaming
  path was missing it)

The condition only triggers when system messages are out of position
(len > 1 or merged[0] != system), so well-ordered messages pass through
unchanged.

Add _hoist_system_messages() to mllm.py for the MLLM-internal
get_chat_template calls. These go through mlx_vlm's template
application which also enforces system-at-position-[0].

Defense-in-depth: SimpleEngine already normalizes before calling
MLLM methods, but direct MLLM usage (benchmarks, tests) and
edge cases during startup now also handle out-of-order system
messages correctly.

SimpleEngine.chat() lacked MLLM+MTP per-request routing that
stream_chat() already had. Text-only requests via non-streaming API
went to mlx_vlm MLLM path, causing 160GB Metal buffer allocation
attempts and server crashes.

Routes text-only chat() requests through _stream_generate_text()
(MTP path) matching stream_chat() behavior.

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 21, 2026

Build mlx_lm TextModel with MTP weights from VLM backbone at load time
(zero-copy weight sharing). Route text-only requests to TextModel with
MTP speculative decoding, media requests to MLLM path.

Dual-backend architecture: MLLMScheduler for media, direct TextModel
generation with asyncio.Lock for text. Metal serialization prevents
concurrent GPU access.

Depends on PR waybarrios#171 (text_model_from_vlm.py).

@Thump604 (Collaborator Author)

Superseded: all changes were merged upstream via PR #180 (SpecPrefill), which included MTP routing, the system KV cache, and the prefill step size.

@Thump604 Thump604 closed this Mar 22, 2026