
feat: wire MTP through SimpleEngine for native speculative decoding #170

Closed

Thump604 wants to merge 1 commit into waybarrios:main from Thump604:feat/mtp-simple-engine

Conversation

@Thump604
Collaborator

Summary

  • Passes the existing --enable-mtp flag through SimpleEngine → MLXLanguageModel → mlx_lm.stream_generate(mtp=True)
  • Uses mlx-lm PR #990's native MTP speculative decoding for Qwen3.5 models
  • MTP predicts token t+2 using the model's built-in MTP head; no external draft model is needed, at the cost of ~150 MB of extra memory

Currently, --enable-mtp affects only BatchedEngine (scheduler-based MTP). This PR adds the SimpleEngine path, which is the production path for single-user serving.
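The flag flow described above can be sketched as follows. The class and method bodies here are illustrative simplifications, not the actual vllm-mlx code; the only real external detail assumed is that mlx_lm.stream_generate accepts an mtp keyword once mlx-lm PR #990 is applied.

```python
# Minimal sketch of the --enable-mtp plumbing:
# cli.py → SimpleEngine → MLXLanguageModel → mlx_lm.stream_generate(mtp=True)

class MLXLanguageModel:
    """Stands in for models/llm.py; holds the mtp flag."""

    def __init__(self, mtp: bool = False):
        self.mtp = mtp

    def generate_kwargs(self) -> dict:
        # Extra kwargs forwarded to mlx_lm.stream_generate.
        # The mtp kwarg exists only with mlx-lm PR #990 applied.
        return {"mtp": True} if self.mtp else {}


class SimpleEngine:
    """Stands in for engine/simple.py; threads mtp down to the model."""

    def __init__(self, mtp: bool = False):
        self.model = MLXLanguageModel(mtp=mtp)


# cli.py equivalent: mtp=args.enable_mtp
engine = SimpleEngine(mtp=True)
print(engine.model.generate_kwargs())  # {'mtp': True}
```

When the flag is off, no extra kwargs are passed, so baseline behavior is unchanged.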

Changes

File Change
cli.py Pass mtp=args.enable_mtp to load_model(), log MTP status for SimpleEngine
server.py Accept mtp param in load_model(), forward to SimpleEngine
engine/simple.py Accept mtp in constructor, pass to MLXLanguageModel, log MTP status
models/llm.py Accept mtp param, pass mtp=True to mlx_lm.stream_generate()

Benchmark (M2 Ultra, 35B-A3B-8bit, streaming API, greedy)

Config tok/s Speedup
Baseline (no MTP) 53.0 1.00x
MTP enabled 71.5 1.35x

mlx-lm CLI benchmarks (all models, greedy):

Model Baseline MTP Speedup
27B-8bit (dense) 20.6 27.1 1.32x
35B-A3B-8bit (MoE) 74.4 82.3 1.11x
122B-A10B-5bit (MoE) 43.0 46.7 1.09x
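The speedup columns in both tables follow directly from the tok/s figures; a quick arithmetic check:

```python
# Speedup = MTP tok/s divided by baseline tok/s (values from the tables above).
rows = [
    ("streaming API 35B-A3B-8bit", 53.0, 71.5),
    ("27B-8bit (dense)",           20.6, 27.1),
    ("35B-A3B-8bit (MoE)",         74.4, 82.3),
    ("122B-A10B-5bit (MoE)",       43.0, 46.7),
]
for name, base, mtp in rows:
    print(f"{name}: {mtp / base:.2f}x")
# → 1.35x, 1.32x, 1.11x, 1.09x respectively
```

As expected for speculative decoding, the dense model benefits most; the MoE models already decode cheaply per token, so the relative gain is smaller.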

Requirements

  • Model converted with MTP weights preserved (see mlx-lm PR #990)

Usage

vllm-mlx serve <model-with-mtp-weights> --enable-mtp

Test plan

  • Verified MTP flag flows through to mlx_lm.stream_generate(mtp=True)
  • Streaming chat completions work with MTP enabled
  • Baseline (no MTP) still works unchanged
  • --enable-mtp with --continuous-batching uses existing scheduler MTP path (no conflict)

Partially addresses #56

Pass --enable-mtp flag through SimpleEngine → MLXLanguageModel →
mlx_lm.stream_generate(mtp=True). Uses mlx-lm PR #990's native
MTP speculative decoding for Qwen3.5 models.

Requires model converted with MTP weights preserved (see mlx-lm
PR #990). MTP only affects the streaming LLM path (SimpleEngine),
not BatchedEngine (which has its own scheduler-based MTP).

Tested on M2 Ultra: 35B-A3B-8bit 53→71.5 tok/s (1.35x) through
the streaming API.

Closes waybarrios#56 (partially — SimpleEngine path only)
@Thump604
Collaborator Author

Closing — fully subsumed by #171 which adds per-request MLLM+MTP routing on top of the basic MTP wiring introduced here.

@Thump604 Thump604 closed this Mar 17, 2026
