feat: MTP per-request routing in BatchedEngine #223

Closed
Thump604 wants to merge 2 commits into waybarrios:main from Thump604:feat/batched-mtp-routing
Conversation

@Thump604
Collaborator

Summary

Port SimpleEngine's MLLM+MTP per-request routing to BatchedEngine for continuous batching.

  • Text-only requests → mlx_lm TextModel with MTP speculative decoding (1.1-1.3x generation speed)
  • Media requests → MLLM path (unchanged)

Routing decision is per-request based on _has_media_content(). Uses text_model_from_vlm.py (upstream from PR #180) to build a zero-copy TextModel from VLM backbone weights (~0 extra RAM).

Changes (+226 lines in batched.py, +107 lines of tests)

  • _has_media_content() helper + _MEDIA_TYPES constant (mirrors SimpleEngine)
  • mtp and prefill_step_size params in BatchedEngine.__init__
  • TextModel build in _start_mllm() with Qwen3.5 eos_token fix
  • Per-request routing in chat() and stream_chat()
  • _chat_text_model() / _stream_chat_text_model() for mlx_lm generation under lock
  • Cleanup in stop()

Design decisions

  • Focused scope: No SpecPrefill, no system KV cache — those can come in follow-up PRs
  • Lock-based serialization: _text_generation_lock serializes Metal operations for text model path (same pattern as SimpleEngine's _generation_lock)
  • Graceful degradation: If MTP weights don't exist, build_text_model returns None and all requests go to MLLM
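The lock-based serialization and graceful-degradation decisions can be sketched as follows. This is illustrative only, with stub method bodies; the real implementation is in batched.py, and the class and helper names beyond those named in the PR are hypothetical:

```python
import threading


class BatchedEngineSketch:
    """Illustrative stub of the design decisions above, not the real engine."""

    def __init__(self):
        # Serializes Metal GPU work on the text-model path, mirroring
        # SimpleEngine's _generation_lock (per the PR description).
        self._text_generation_lock = threading.Lock()
        # Stays None when MTP weights are absent (build_text_model returns None).
        self._text_model = None

    def _chat_text_model(self, prompt):
        # Only one thread may drive mlx_lm generation at a time.
        with self._text_generation_lock:
            return f"generated:{prompt}"  # stand-in for mlx_lm generation

    def _chat_mllm(self, prompt):
        return f"mllm:{prompt}"  # stand-in for the unchanged MLLM path

    def chat(self, prompt, has_media):
        # Graceful degradation: with no TextModel, everything routes to MLLM.
        if self._text_model is None or has_media:
            return self._chat_mllm(prompt)
        return self._chat_text_model(prompt)
```

The lock trades per-request text throughput for safety: concurrent batched MLLM requests proceed as before, while text-model generations queue behind one another rather than interleaving Metal operations.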

Test plan

  • 8 unit tests for _has_media_content (text, image, video, audio, multi-turn, mixed)
  • Black + ruff clean
  • Integration test with VLM+MTP model (text routing + media routing)

Port SimpleEngine's MLLM+MTP per-request routing to BatchedEngine.
Text-only requests route to mlx_lm TextModel with MTP speculative
decoding; media requests route to MLLM path.

Uses text_model_from_vlm.py (already upstream from PR waybarrios#180) to build
a zero-copy TextModel from VLM backbone weights. Routing decision is
per-request based on message content via _has_media_content().

Changes:
- Add mtp/prefill_step_size params to BatchedEngine.__init__
- Build TextModel in _start_mllm() when mtp=True
- Route text-only to _stream_chat_text_model in chat()/stream_chat()
- Add _chat_text_model/_stream_chat_text_model for mlx_lm generation
- Add _has_media_content helper (mirrors SimpleEngine)
- Add test_batched_mtp_routing.py (8 tests)

Removed the manual make_prompt_cache + make_mtp_cache concatenation that
caused an AttributeError (keys=None) during generate_step. mlx_lm's
stream_generate is MTP-aware and creates the correct cache internally.
@Thump604
Collaborator Author

Production evidence from M2 Ultra 128GB, Qwen3.5-122B-A10B-VLM-MTP-5bit, BatchedEngine:

This PR enables MTP speculative decoding in BatchedEngine (continuous batching mode); without it, MTP only works in SimpleEngine. The routing logic mirrors SimpleEngine: text-only requests go to TextModel with MTP, media requests go to the MLLM path without MTP. The two paths share weights zero-copy.

Tested with continuous_batching=true, mtp=true, mllm=true in production. MTP tokens accepted at ~60% rate, giving ~1.4x decode speedup on text-only requests while media requests work correctly through the MLLM path.

@Thump604
Collaborator Author

Thump604 commented Apr 1, 2026

Closing in favor of #245 which covers the same ground with a broader scope (MTP injection, weight extraction script, batch auto-skip, MLLM path support). Reviewed and approved #245.

@Thump604 Thump604 closed this Apr 1, 2026
