feat: MTP per-request routing in BatchedEngine #223

Closed
Thump604 wants to merge 2 commits into waybarrios:main from Thump604:feat/batched-mtp-routing
Conversation

@Thump604
Collaborator

Summary

Port SimpleEngine's MLLM+MTP per-request routing to BatchedEngine for continuous batching.

  • Text-only requests → mlx_lm TextModel with MTP speculative decoding (1.1-1.3x generation speed)
  • Media requests → MLLM path (unchanged)

Routing decision is per-request based on _has_media_content(). Uses text_model_from_vlm.py (upstream from PR #180) to build a zero-copy TextModel from VLM backbone weights (~0 extra RAM).

Changes (+226 lines in batched.py, +107 lines of tests)

  • _has_media_content() helper + _MEDIA_TYPES constant (mirrors SimpleEngine)
  • mtp and prefill_step_size params in BatchedEngine.__init__
  • TextModel build in _start_mllm() with Qwen3.5 eos_token fix
  • Per-request routing in chat() and stream_chat()
  • _chat_text_model() / _stream_chat_text_model() for mlx_lm generation under lock
  • Cleanup in stop()

Design decisions

  • Focused scope: No SpecPrefill, no system KV cache — those can come in follow-up PRs
  • Lock-based serialization: _text_generation_lock serializes Metal operations for text model path (same pattern as SimpleEngine's _generation_lock)
  • Graceful degradation: If MTP weights don't exist, build_text_model returns None and all requests go to MLLM
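The lock-based serialization and graceful-degradation decisions can be sketched as follows. This is illustrative only, with stub method bodies; the real implementation is in batched.py, and the class and helper names beyond those named in the PR are hypothetical:

```python
import threading


class BatchedEngineSketch:
    """Illustrative stub of the design decisions above, not the real engine."""

    def __init__(self):
        # Serializes Metal GPU work on the text-model path, mirroring
        # SimpleEngine's _generation_lock (per the PR description).
        self._text_generation_lock = threading.Lock()
        # Stays None when MTP weights are absent (build_text_model returns None).
        self._text_model = None

    def _chat_text_model(self, prompt):
        # Only one thread may drive mlx_lm generation at a time.
        with self._text_generation_lock:
            return f"generated:{prompt}"  # stand-in for mlx_lm generation

    def _chat_mllm(self, prompt):
        return f"mllm:{prompt}"  # stand-in for the unchanged MLLM path

    def chat(self, prompt, has_media):
        # Graceful degradation: with no TextModel, everything routes to MLLM.
        if self._text_model is None or has_media:
            return self._chat_mllm(prompt)
        return self._chat_text_model(prompt)
```

The lock trades per-request text throughput for safety: concurrent batched MLLM requests proceed as before, while text-model generations queue behind one another rather than interleaving Metal operations.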

Test plan

  • 8 unit tests for _has_media_content (text, image, video, audio, multi-turn, mixed)
  • Black + ruff clean
  • Integration test with VLM+MTP model (text routing + media routing)

Port SimpleEngine's MLLM+MTP per-request routing to BatchedEngine.
Text-only requests route to mlx_lm TextModel with MTP speculative
decoding; media requests route to MLLM path.

Uses text_model_from_vlm.py (already upstream from PR waybarrios#180) to build
a zero-copy TextModel from VLM backbone weights. Routing decision is
per-request based on message content via _has_media_content().

Changes:
- Add mtp/prefill_step_size params to BatchedEngine.__init__
- Build TextModel in _start_mllm() when mtp=True
- Route text-only to _stream_chat_text_model in chat()/stream_chat()
- Add _chat_text_model/_stream_chat_text_model for mlx_lm generation
- Add _has_media_content helper (mirrors SimpleEngine)
- Add test_batched_mtp_routing.py (8 tests)

Removed the manual make_prompt_cache + make_mtp_cache concatenation that
caused an AttributeError (keys=None) during generate_step. mlx_lm's
stream_generate is MTP-aware and creates the correct cache internally.
@Thump604
Collaborator Author

Production evidence from M2 Ultra 128GB, Qwen3.5-122B-A10B-VLM-MTP-5bit, BatchedEngine:

This PR enables MTP speculative decoding in BatchedEngine (continuous batching mode); without it, MTP only works in SimpleEngine. The routing logic mirrors SimpleEngine: text-only requests go to TextModel with MTP, media requests go to the MLLM path without MTP. The two paths share weights zero-copy.

Tested with continuous_batching=true, mtp=true, mllm=true in production. MTP tokens accepted at ~60% rate, giving ~1.4x decode speedup on text-only requests while media requests work correctly through the MLLM path.

@Thump604
Collaborator Author

Thump604 commented Apr 1, 2026

Closing in favor of #245 which covers the same ground with a broader scope (MTP injection, weight extraction script, batch auto-skip, MLLM path support). Reviewed and approved #245.

@Thump604 Thump604 closed this Apr 1, 2026
