
feat: Add speculative decoding support with draft models#45

Closed
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feature/speculative-decoding

Conversation

@janhilgard
Collaborator

@janhilgard janhilgard commented Feb 5, 2026

Summary

Add support for speculative decoding using mlx-lm's draft model feature, including a new HybridEngine that shares a single model instance between speculative and batched modes.

Features

  • Speculative decoding: 1.2-1.4x throughput improvement with draft models
  • HybridEngine: Combines speculative decoding with continuous batching using shared model (~44GB RAM savings)

New CLI arguments

  • --draft-model: Path to draft model (must share tokenizer with main model)
  • --num-draft-tokens: Tokens to speculate per step (default: 4)

Usage modes

Simple mode (speculative only):

```bash
vllm-mlx serve mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit \
    --draft-model mlx-community/Qwen3-0.6B-4bit \
    --num-draft-tokens 5
```

Hybrid mode (speculative + batching with shared model):

```bash
vllm-mlx serve mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit \
    --continuous-batching \
    --draft-model mlx-community/Qwen3-0.6B-4bit \
    --num-draft-tokens 5
```

HybridEngine Architecture

When both --continuous-batching and --draft-model are specified, the server uses HybridEngine which:

```
HybridEngine
├── Shared: _shared_model (44GB), _shared_tokenizer
├── SimpleEngine (speculative, 80+ tok/s) + draft_model (350MB)
└── BatchedEngine (batching, lazy start)
```

Mode switching:

  • active_requests < 2 → SimpleEngine (speculative decoding)
  • active_requests >= 2 → BatchedEngine (continuous batching)
  • BatchedEngine starts lazily on first switch (avoids throughput regression)
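The switching policy above can be sketched as follows. This is a minimal illustration, not code from the PR; the names `decide_mode` and `HybridEngineSketch` are hypothetical stand-ins for the real `_decide_and_switch_mode` logic in HybridEngine:

```python
# Minimal sketch of HybridEngine's mode-switching policy (illustrative names).

SIMPLE = "simple"    # speculative decoding path (SimpleEngine)
BATCHED = "batched"  # continuous batching path (BatchedEngine)

def decide_mode(active_requests: int, threshold: int = 2) -> str:
    """Pick the engine mode from the number of in-flight requests."""
    return BATCHED if active_requests >= threshold else SIMPLE

class HybridEngineSketch:
    def __init__(self):
        self.mode = SIMPLE
        self.batched_started = False  # BatchedEngine starts lazily

    def on_request_count_change(self, active_requests: int) -> str:
        new_mode = decide_mode(active_requests)
        if new_mode == BATCHED and not self.batched_started:
            # First switch only: start the batched scheduler now, so its
            # background loop never runs while SimpleEngine handles traffic.
            self.batched_started = True
        self.mode = new_mode
        return self.mode
```

Once started, the batched scheduler stays up; only the routing of new requests flips back to the speculative path when concurrency drops below the threshold.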

RAM usage: ~45GB (vs ~90GB if running separate engines)

Benchmark results

Tested with Qwen3-Next-80B-6bit + Qwen3-0.6B-4bit:

| Configuration | Throughput | Notes |
| --- | --- | --- |
| Baseline (no draft) | 49-53 tok/s | mlx_lm.server |
| SimpleEngine + speculative | 58 tok/s | +18% |
| HybridEngine + speculative | 80 tok/s | Repetitive content |
| HybridEngine + speculative | 35 tok/s | Complex content |

Speculative decoding acceptance rates vary by content type; they are higher for repetitive text (lists, numbers).

Implementation details

  • HybridEngine: Manages shared model between SimpleEngine and BatchedEngine
  • _inject_shared_model(start_engine=False): Lazy start for HybridEngine
  • _decide_and_switch_mode(): Dynamic mode switching based on concurrent requests
  • _switch_to_mode(): Handles ownership transfer via ModelRegistry
  • Uses mlx-lm's native stream_generate() with draft_model parameter

Recent fixes

  • Lazy start BatchedEngine: Fixed throughput regression where BatchedEngine's scheduler was running in background even when not used. Now starts only on first switch to batched mode.

Limitations

  • Speculative decoding not supported for MLLM (multimodal) models
  • Draft model must have same tokenizer vocab_size as main model
  • Throughput varies based on content type (speculative acceptance rate)

Test plan

  • Tested with Qwen3-Next-80B-6bit + Qwen3-0.6B-4bit
  • Verified throughput improvement (80 tok/s for repetitive content)
  • Verified HybridEngine shares model instance (~45GB RAM)
  • Verified lazy start BatchedEngine (no throughput regression)
  • Verified mode switching works based on concurrent requests
  • Verified graceful degradation when draft model has different vocab size
  • Verified MLLM models warn and ignore draft model

🤖 Generated with Claude Code

@enryold

enryold commented Feb 9, 2026

This is super!
@janhilgard what are your machine specs, and how did you benchmark it?
I have an M1 Max 64GB, so collecting benchmarks across different systems could be helpful.

@janhilgard
Collaborator Author

Thanks @enryold!

Machine: Mac Studio M3 Ultra, 256GB unified memory

Benchmark method: Custom Python script using async aiohttp against the /v1/chat/completions endpoint. Tests include:

  • Single-stream throughput — 3 runs, median tok/s
  • Multi-turn caching — 8-turn conversation measuring TTFT with prefix cache
  • Parallel requests — concurrency 1/2/4/8, aggregate vs per-request throughput
  • Cache verification — TTFT ratio between turns to confirm cache hits
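The single-stream measurement can be sketched with a stdlib-only stand-in (the actual script uses async aiohttp against the server; here the request function is injected so the timing and median logic are shown without the HTTP plumbing):

```python
# Stdlib-only sketch of the single-stream throughput measurement: time each
# completion request, compute tokens/second, take the median over 3 runs.
import time
from statistics import median

def measure_run(send_request) -> float:
    """Time one completion request and return tokens per second."""
    start = time.perf_counter()
    completion_tokens = send_request()  # e.g. a POST to /v1/chat/completions
    elapsed = time.perf_counter() - start
    return completion_tokens / elapsed

def single_stream_throughput(send_request, runs: int = 3) -> float:
    """Median tokens/second over several runs, as in the method above."""
    return median(measure_run(send_request) for _ in range(runs))
```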

Results with GPT-OSS-20B-4bit (~11GB model):

| Metric | vllm-mlx | llama.cpp |
| --- | --- | --- |
| Single-stream | 143 tok/s | 115 tok/s |
| TTFT (cached turn) | 0.04s | 0.94s |
| 8-concurrent aggregate | 320 tok/s | 67 tok/s |

M1 Max 64GB benchmarks would be great! The 80B model won't fit (~55GB weights), but GPT-OSS-20B-4bit (~11GB) or Qwen3-8B would be good candidates. Happy to help with setup.

@waybarrios
Owner

Great work on this. I have a question about the accuracy tradeoff when using small draft models.

In theory, speculative decoding should be lossless because the target model verifies every token the draft model proposes. If the draft model guesses wrong, the token gets rejected and the target model generates the correct one. So the final output quality should be identical to running the target model alone; you just get a variable speedup depending on how well the draft model predicts.

But in practice I'm wondering about a few things:

  1. Does mlx-lm's implementation guarantee mathematically lossless decoding (full rejection sampling), or does it use some approximation/threshold that could affect output quality?

  2. Looking at the benchmarks, the throughput drops from 80 tok/s (repetitive content) to 35 tok/s (complex content). That 35 tok/s is actually below the 49-53 tok/s baseline without speculative decoding. So for complex tasks (reasoning, code generation) the draft model hurts throughput instead of helping. Is there a way to detect this at runtime and fall back to non-speculative mode when the acceptance rate drops below some threshold?

  3. With a 0.6B draft model paired with an 80B target, the vocabulary and knowledge gap is massive. For coding or domain-specific tasks where the small model has weak coverage, would we see mostly rejections? Has anyone measured acceptance rates across different task types?

My concern is that for production use cases where accuracy matters most (agentic tool calling, code generation), the small draft models might not help much and could even hurt. The sweet spot seems to be repetitive or predictable content, which is not where we typically need the most performance.

Would be useful to add some acceptance rate logging so users can see whether speculative decoding is actually helping for their specific workload.

@janhilgard
Collaborator Author

Great questions @waybarrios, let me address them one by one.

1. Mathematically lossless — Yes

mlx-lm uses full rejection sampling. The implementation in mlx_lm/generate.py (speculative_generate_step) works as follows:

  • Draft model proposes k tokens
  • Main model runs a single forward pass over all k draft tokens + 1
  • Token-by-token exact match comparison — first mismatch stops the chain
  • Non-speculative fallback token is always sampled from the main model

The output distribution is identical to non-speculative decoding: accepted tokens match main model output exactly, and rejected tokens trigger immediate fallback to the main model's sample. This is genuine rejection sampling, not an approximation.
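The accept/reject step can be illustrated with a toy sketch. This is not mlx-lm's code; `draft_tokens` and `main_predictions` are stand-ins for the draft proposals and the main model's per-position predictions from its single verification pass:

```python
# Toy sketch of exact-match draft verification (greedy decoding case).
# draft_tokens: the k tokens proposed by the draft model.
# main_predictions: the main model's predicted token at each of the k+1
# positions, obtained from one forward pass over the draft tokens.

def verify_draft(draft_tokens, main_predictions):
    """Return the tokens actually emitted this step.

    Accept draft tokens while they match the main model's own prediction;
    at the first mismatch (or after all k are accepted) emit the main
    model's token, so output is identical to non-speculative decoding.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok == main_predictions[i]:
            accepted.append(tok)
        else:
            break
    # The "+1" token: the main model's prediction at the first unverified slot.
    accepted.append(main_predictions[len(accepted)])
    return accepted
```

Note the worst case still makes progress: even if every draft token is rejected, the step emits one main-model token, which is why quality is unchanged and only throughput varies.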

2. Runtime fallback — not yet, but feasible

Currently mlx-lm's speculative_generate_step doesn't track acceptance metrics — it just yields tokens. The PromptLookupDecoding path in this PR already has acceptance tracking (record_accepted(), get_stats()), but the draft model path via mlx-lm doesn't expose this.

A practical approach for a follow-up:

  • Instrument the wrapper in HybridEngine to count accepted vs rejected tokens per request (we can infer from the is_draft flag mlx-lm yields)
  • Add adaptive threshold — if acceptance rate drops below ~40% over a sliding window, fall back to non-speculative for the rest of that request
  • Per-request logging — emit acceptance stats at end of each generation

The main challenge is that mlx-lm's generator is all-or-nothing per session — switching mid-stream would require restarting the generator. But we could at least measure and log, and disable speculation for subsequent requests if a workload consistently underperforms.
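The measure-and-log part could use a sliding-window tracker along these lines (a hypothetical helper, not code from the PR; the window size and the ~40% threshold are illustrative defaults):

```python
# Hypothetical sliding-window acceptance tracker for the follow-up described
# above: record per-token accept/reject events and flag when speculation
# should be disabled for subsequent requests.
from collections import deque

class AcceptanceTracker:
    def __init__(self, window: int = 64, min_rate: float = 0.4):
        self.events = deque(maxlen=window)  # 1 = accepted, 0 = rejected
        self.min_rate = min_rate

    def record(self, accepted: bool) -> None:
        self.events.append(1 if accepted else 0)

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 1.0

    def should_disable_speculation(self) -> bool:
        """True once the window is full and the rate fell below threshold."""
        return (len(self.events) == self.events.maxlen
                and self.rate() < self.min_rate)
```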

3. Acceptance rates by task type

From our benchmarks (Mac Studio M3 Ultra, Qwen3-Next-80B + Qwen3-0.6B draft):

| k | Throughput | Speedup | Acceptance |
| --- | --- | --- | --- |
| 3 | 62 tok/s | 1.00x | 50% |
| 5 | 80 tok/s | 1.28x | 70% |
| 7 | 70 tok/s | 1.12x | 70% |

The 70% acceptance rate was on general instruction-following. You're right that coding/reasoning tasks will likely see lower acceptance — the 0.6B model simply can't predict complex reasoning chains. The speedup is real for conversational/predictable content but marginal-to-negative for complex generation.

Why it still works for the 80B MoE case: Qwen3-Next only activates ~3B params per token despite 80B total. The 0.6B draft model has surprisingly decent overlap for common patterns, and same tokenizer (151,643 vocab) ensures zero alignment overhead.

Important caveat: speculative decoding is workload-dependent

From our own production usage, the impact varies dramatically by scenario:

Where it helps significantly:

  • Conversational chat, Q&A, summarization — 1.28x speedup (62→80 tok/s), the draft model predicts common patterns well
  • Repetitive/templated content — peak 136 tok/s (2.2x), the draft model nails predictable sequences
  • Long-form text generation — consistent 70%+ acceptance rate

Where it actually hurts:

  • Complex reasoning and code generation — 35 tok/s vs 49-53 tok/s baseline — that's a ~30% slowdown because the draft model guesses wrong on nearly every token, and the verification overhead adds up
  • Multi-tool agentic workflows — the verification latency per step compounds across chained calls
  • High-concurrency scenarios — the draft model consumes additional compute per request

The takeaway is clear: speculative decoding is not a universal win. It can deliver substantial speedups for the right workload, but it can equally slow things down for the wrong one. Users absolutely need to benchmark their specific use case. The acceptance rate statistics are essential — they tell you immediately whether speculation is helping or hurting.

Next steps

I agree acceptance rate logging would be valuable. I'll add it as a follow-up:

  1. Per-request acceptance rate logging in HybridEngine
  2. Aggregate stats endpoint (similar to /v1/stats)
  3. Optional --spec-min-acceptance threshold flag
  4. Documentation with guidance on when to enable/disable based on workload characteristics

This keeps the current PR focused on the core feature, with observability as a clean follow-up.

@janhilgard janhilgard mentioned this pull request Feb 10, 2026
@enryold

enryold commented Feb 11, 2026

> Thanks @enryold!
>
> Machine: Mac Studio M3 Ultra, 256GB unified memory
>
> Benchmark method: Custom Python script using async aiohttp against the /v1/chat/completions endpoint. Tests include:
>
> | Metric | vllm-mlx | llama.cpp |
> | --- | --- | --- |
> | Single-stream | 143 tok/s | 115 tok/s |
> | TTFT (cached turn) | 0.04s | 0.94s |
> | 8-concurrent aggregate | 320 tok/s | 67 tok/s |
>
> M1 Max 64GB benchmarks would be great! The 80B model won't fit (~55GB weights), but GPT-OSS-20B-4bit (~11GB) or Qwen3-8B would be good candidates. Happy to help with setup.

Can you share the script and the setup so we can start running benchmarks on different hardware?
Thanks!

@TomLucidor

@janhilgard is there a standard/template/example command for toby1991/Qwen3-Coder-Next-REAP-48B-A3B-4bit-mlx alongside mlx-community/Qwen3-0.6B-4bit for the speculative decoding + "HybridEngine" design, specifically for speeding up prompt digestion (output parsers, e.g. tool-use parser, reasoning parser)?

@janhilgard janhilgard force-pushed the feature/speculative-decoding branch from 1967ac9 to 04252f6 on February 12, 2026 17:52
@TomLucidor

TomLucidor commented Feb 13, 2026

@janhilgard thanks for the updates; it seems like things are going smoothly. To avoid this whole PR only working toward Qwen3-Next, I have some unusual test items.

Just to stress-test this feature:

  • Trying something outside Qwen: Nemotron, lmstudio-community/NVIDIA-Nemotron-3-Nano-30B-A3B-MLX-4bit with draft model nvidia/Nemotron-H-4B-Instruct-128K (the draft model needs a quantized version for acceleration)
  • Also for Granite: lmstudio-community/granite-4.0-h-small-MLX-4bit and either mlx-community/granite-4.0-h-350m-8bit or mlx-community/granite-4.0-h-1b-6bit (should draft models have a higher bitrate?)
  • And for Falcon: NexVeridian/Falcon-H1-34B-Instruct-4bit alongside mlx-community/Falcon-H1-0.5B-Instruct-4bit

To check whether they can handle models with unusual tokenization file formats, maybe mlx-community/Kimi-Linear-48B-A3B-Instruct-4bit could use a slower draft model based on mlx-community/Moonlight-16B-A3B-Instruct-4-bit (for compatibility testing rather than acceleration).

As for dedicated models like Eagle3:

  • Checking against Qwen3 (not enough linear attention SLMs have them), use lmsys/SGLang-EAGLE3-Qwen3-Next-80B-A3B-Instruct-FP8-SpecForge-Meituan or huang3eng/SGLang-EAGLE3-Qwen3-Next-80B-A3B-Instruct-SpecForge
  • Maybe some non-linear models like GPT-OSS could get an Eagle3 treatment with FlatFootInternational/Huihui-gpt-oss-20b-mxfp4-abliterated-v2-mlx and RedHatAI/gpt-oss-20b-speculator.eagle3

And I wonder how tool use and code generation can be made to speed up as well (outside of Eagle3).

- Speculative decoding with mlx-lm draft models (1.2-1.4x speedup)
- HybridEngine: shared model between speculative + batched modes
- JSON schema enforcement with guided generation support
- Fix false positive tool call detection for regular JSON
- Strip <think> tags from API responses to prevent JSON parse errors

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@TomLucidor

TomLucidor commented Feb 19, 2026

@janhilgard @enryold I need to clarify that most Eagle3 models are NOT compatible with vLLM-MLX, since they are not MLX-based. I tried running them to see if they are functional, but it did not work out well (for Qwen3 and GPT-OSS-20B), e.g. the RedHatAI speculator models and the AngelSlim Eagle3 models. If it does work with Qwen3-30B variants, please show me the commands so I can test it ASAP #57

@waybarrios
Owner

This is absolutely true @TomLucidor

@janhilgard
Collaborator Author

Right, Eagle3 models ship as PyTorch safetensors with a custom architecture (fusion head on top of the base model) — there's no MLX equivalent yet, so they can't be loaded by mlx-lm or vllm-mlx.

What works today for speculative decoding in vllm-mlx:

| Approach | How | Speedup | Models |
| --- | --- | --- | --- |
| MTP (native) | --enable-mtp | ~1.4x | Qwen3-Next (has built-in MTP heads) |
| Draft model | --draft-model via mlx-lm | ~1.2-1.3x | Any model pair with same tokenizer (e.g. Qwen3-80B + Qwen3-0.6B) |

Re: Qwen3-30B variants: @TomLucidor, Qwen3-Coder-Next-48B-A3B should work with Qwen3-0.6B as the draft model (same tokenizer, 151K vocab). For the 30B dense Qwen3 models, MTP won't work (no MTP heads), but draft-model speculation should. Example:

```bash
vllm-mlx serve mlx-community/Qwen3-30B-A3B-4bit \
    --continuous-batching \
    --draft-model mlx-community/Qwen3-0.6B-4bit \
    --num-draft-tokens 5
```

For Eagle3 to work in MLX, someone would need to:

  1. Port the Eagle3 fusion head architecture to MLX (model definition + weight conversion)
  2. Implement the multi-level feature extraction (Eagle3 uses low/mid/high-level features, not just the last layer)
  3. Add a verification loop compatible with continuous batching

This is non-trivial and blocked on the fact that no Eagle3 models exist in MLX format. If someone converts the weights, the integration side would be feasible as a follow-up to this PR.

In the meantime, the draft model approach gives comparable speedups (1.2-1.3x) for models with a matching smaller variant, which covers Qwen3, Granite, and Falcon families.

@waybarrios
Owner

waybarrios commented Feb 19, 2026

@janhilgard How difficult is it to port Eagle3 to MLX format? I am a bit curious about how it could work.

@TomLucidor

@waybarrios even AngelSlim is asking for community feedback on how the weights could be converted from "standard" format to MLX with a custom converter.

@janhilgard
Collaborator Author

Closing: Superseded by merged #82 (MTP-based speculative decoding approach). This draft-model approach is no longer needed.

@janhilgard janhilgard closed this Mar 21, 2026
@Vigilans

Vigilans commented Apr 1, 2026

Since Qwen3-Coder-Next does not have MTP layer, and MLX doesn't support EAGLE3 yet, isn't draft model the only way to enable speculative decoding for it?

