
feat: wire MTP through SimpleEngine for native speculative decoding #170

Closed

Thump604 wants to merge 1 commit into waybarrios:main from Thump604:feat/mtp-simple-engine

Conversation

@Thump604
Collaborator

Summary

  • Passes the existing --enable-mtp flag through SimpleEngine → MLXLanguageModel → mlx_lm.stream_generate(mtp=True)
  • Uses mlx-lm PR #990's native MTP speculative decoding for Qwen3.5 models
  • MTP predicts token t+2 using the model's built-in MTP head; no external draft model is needed, at the cost of ~150 MB of extra memory

Currently, --enable-mtp affects only BatchedEngine (scheduler-based MTP). This PR adds the SimpleEngine path, which is the production path for single-user serving.
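The flag flow described above can be sketched as follows. The class and method bodies here are illustrative simplifications, not the actual vllm-mlx code; the only real external detail assumed is that mlx_lm.stream_generate accepts an mtp keyword once mlx-lm PR #990 is applied.

```python
# Minimal sketch of the --enable-mtp plumbing:
# cli.py → SimpleEngine → MLXLanguageModel → mlx_lm.stream_generate(mtp=True)

class MLXLanguageModel:
    """Stands in for models/llm.py; holds the mtp flag."""

    def __init__(self, mtp: bool = False):
        self.mtp = mtp

    def generate_kwargs(self) -> dict:
        # Extra kwargs forwarded to mlx_lm.stream_generate.
        # The mtp kwarg exists only with mlx-lm PR #990 applied.
        return {"mtp": True} if self.mtp else {}


class SimpleEngine:
    """Stands in for engine/simple.py; threads mtp down to the model."""

    def __init__(self, mtp: bool = False):
        self.model = MLXLanguageModel(mtp=mtp)


# cli.py equivalent: mtp=args.enable_mtp
engine = SimpleEngine(mtp=True)
print(engine.model.generate_kwargs())  # {'mtp': True}
```

When the flag is off, no extra kwargs are passed, so baseline behavior is unchanged.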

Changes

File Change
cli.py Pass mtp=args.enable_mtp to load_model(), log MTP status for SimpleEngine
server.py Accept mtp param in load_model(), forward to SimpleEngine
engine/simple.py Accept mtp in constructor, pass to MLXLanguageModel, log MTP status
models/llm.py Accept mtp param, pass mtp=True to mlx_lm.stream_generate()

Benchmark (M2 Ultra, 35B-A3B-8bit, streaming API, greedy)

Config tok/s Speedup
Baseline (no MTP) 53.0 1.00x
MTP enabled 71.5 1.35x

mlx-lm CLI benchmarks (all models, greedy):

Model Baseline MTP Speedup
27B-8bit (dense) 20.6 27.1 1.32x
35B-A3B-8bit (MoE) 74.4 82.3 1.11x
122B-A10B-5bit (MoE) 43.0 46.7 1.09x
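The speedup columns in both tables follow directly from the tok/s figures; a quick arithmetic check:

```python
# Speedup = MTP tok/s divided by baseline tok/s (values from the tables above).
rows = [
    ("streaming API 35B-A3B-8bit", 53.0, 71.5),
    ("27B-8bit (dense)",           20.6, 27.1),
    ("35B-A3B-8bit (MoE)",         74.4, 82.3),
    ("122B-A10B-5bit (MoE)",       43.0, 46.7),
]
for name, base, mtp in rows:
    print(f"{name}: {mtp / base:.2f}x")
# → 1.35x, 1.32x, 1.11x, 1.09x respectively
```

As expected for speculative decoding, the dense model benefits most; the MoE models already decode cheaply per token, so the relative gain is smaller.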

Requirements

  • Model converted with MTP weights preserved (see mlx-lm PR #990)

Usage

vllm-mlx serve <model-with-mtp-weights> --enable-mtp

Test plan

  • Verified MTP flag flows through to mlx_lm.stream_generate(mtp=True)
  • Streaming chat completions work with MTP enabled
  • Baseline (no MTP) still works unchanged
  • --enable-mtp with --continuous-batching uses existing scheduler MTP path (no conflict)

Partially addresses #56

Pass --enable-mtp flag through SimpleEngine → MLXLanguageModel →
mlx_lm.stream_generate(mtp=True). Uses mlx-lm PR #990's native
MTP speculative decoding for Qwen3.5 models.

Requires model converted with MTP weights preserved (see mlx-lm
PR #990). MTP only affects the streaming LLM path (SimpleEngine),
not BatchedEngine (which has its own scheduler-based MTP).

Tested on M2 Ultra: 35B-A3B-8bit 53→71.5 tok/s (1.35x) through
the streaming API.

Closes waybarrios#56 (partially — SimpleEngine path only)
@Thump604
Collaborator Author

Closing — fully subsumed by #171 which adds per-request MLLM+MTP routing on top of the basic MTP wiring introduced here.

@Thump604 Thump604 closed this Mar 17, 2026
