Feat/speculative decoding by cubist38 · Pull Request #195 · cubist38/mlx-openai-server

cubist38 · 2026-02-10T14:30:34Z

feat: add speculative decoding support for lm models

Adds speculative decoding for text-only (lm) models: a smaller draft model proposes tokens and the main model verifies them, speeding up generation.
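The draft-then-verify loop described above can be sketched with toy models. This is a minimal greedy illustration, not the code in this PR: `speculative_step`, `generate`, `toy_main`, and `toy_draft` are hypothetical names, and a real implementation would have the main model score all drafted positions in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List

# A "model" here is any function that maps a token sequence to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(main: NextToken, draft: NextToken,
                     tokens: List[int], num_draft_tokens: int) -> List[int]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The draft model proposes num_draft_tokens tokens autoregressively.
    proposed: List[int] = []
    for _ in range(num_draft_tokens):
        proposed.append(draft(tokens + proposed))

    # 2. The main model checks the proposals in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = main(tokens + accepted)
        if tok != expected:
            # First mismatch: keep the main model's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the main model contributes one extra token.
    accepted.append(main(tokens + accepted))
    return accepted


def generate(main: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, num_draft_tokens: int = 2) -> List[int]:
    """Generate max_new_tokens tokens using repeated speculative rounds."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens += speculative_step(main, draft, tokens, num_draft_tokens)
    return tokens[len(prompt):len(prompt) + max_new_tokens]


def toy_main(seq: List[int]) -> int:
    # Toy main model: counts upward modulo 10.
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    # Toy draft model: agrees with toy_main except after the token 4.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

With greedy verification the output matches what the main model would produce on its own; the draft model only changes how many main-model rounds are needed.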

Changes

CLI: New options --draft-model <path> and --num-draft-tokens <n> (default: 2). Only applied when --model-type lm; other model types ignore them and log a warning.
Config: MLXServerConfig gains draft_model_path and num_draft_tokens with lm-only validation.
Handler: MLXLMHandler passes draft model and num draft tokens into MLX_LM and uses keyword args for context_length to match the updated constructor.
Docs: README updated with Quick Start example, server parameters, and an "Advanced configuration → Speculative decoding (lm)" section.
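As a rough illustration of the lm-only validation listed above, the following sketch uses a hypothetical dataclass, `ServerConfigSketch`, standing in for `MLXServerConfig`. Only the field names `draft_model_path` and `num_draft_tokens` and the default of 2 come from the PR description; the positivity check and the exact warning text are assumptions.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class ServerConfigSketch:
    """Hypothetical stand-in for MLXServerConfig; not the project's class."""

    model_path: str
    model_type: str = "lm"
    draft_model_path: Optional[str] = None
    num_draft_tokens: int = 2  # default stated in the PR description

    def __post_init__(self) -> None:
        # Assumption: a non-positive draft length is rejected outright.
        if self.num_draft_tokens < 1:
            raise ValueError("num_draft_tokens must be a positive integer")

        # Draft settings only apply to lm models; otherwise warn and ignore.
        if self.draft_model_path is not None and self.model_type != "lm":
            logger.warning(
                "Draft model is only supported for model_type 'lm'; "
                "ignoring it for model_type %r",
                self.model_type,
            )
            self.draft_model_path = None
```

Resetting the field to `None` after the warning is one way to make "ignored" explicit, so downstream code needs only a single `is not None` check.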

Example

mlx-openai-server launch --model-path /path/to/main --model-type lm \
  --draft-model /path/to/draft --num-draft-tokens 4
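To show how the flags in this command might map onto parsed values, here is a small sketch using the standard library's `argparse`. It is an illustration only: the project's actual CLI code is not shown in this PR page, and `build_parser` is a hypothetical helper. The flag names and the default of 2 are taken from the description above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags named in the PR description."""
    parser = argparse.ArgumentParser(prog="mlx-openai-server")
    sub = parser.add_subparsers(dest="command", required=True)

    launch = sub.add_parser("launch")
    launch.add_argument("--model-path", required=True)
    launch.add_argument("--model-type", default="lm")
    # Speculative decoding options; only meaningful when --model-type is lm.
    launch.add_argument("--draft-model", default=None)
    launch.add_argument("--num-draft-tokens", type=int, default=2)
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args([
    "launch",
    "--model-path", "/path/to/main",
    "--model-type", "lm",
    "--draft-model", "/path/to/draft",
    "--num-draft-tokens", "4",
])
```

Here `args.draft_model` holds the draft path and `args.num_draft_tokens` holds the integer 4; omitting `--num-draft-tokens` falls back to 2.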

Related issues

#177

This commit updates the MLX_LM class to include support for draft models, allowing for speculative decoding. A new method, _load_draft_model, is introduced to load the draft model and its tokenizer, with validation to ensure compatibility. The constructor is modified to accept draft model parameters, and the create_prompt_cache method is updated to incorporate caches from both the main and draft models. Additionally, logging is added to warn about potential tokenizer mismatches.
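The cache handling described in this commit message can be pictured with toy objects. The sketch below is not the `MLX_LM` implementation: `ToyModel`, `make_layer_caches`, and `LMWrapperSketch` are hypothetical, and comparing vocabulary sizes is only an assumed way of detecting a tokenizer mismatch. It shows one plausible layout, in which the draft model's per-layer caches are appended after the main model's.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional, Tuple

logger = logging.getLogger(__name__)


@dataclass
class ToyModel:
    """Hypothetical model description: layer count and vocabulary size."""
    num_layers: int
    vocab_size: int


def make_layer_caches(model: ToyModel) -> List[dict]:
    # One empty cache object per layer (a dict stands in for a KV cache).
    return [{} for _ in range(model.num_layers)]


class LMWrapperSketch:
    """Hypothetical wrapper holding a main model and an optional draft model."""

    def __init__(self, model: ToyModel, draft_model: Optional[ToyModel] = None,
                 num_draft_tokens: int = 2) -> None:
        self.model = model
        self.draft_model = draft_model
        self.num_draft_tokens = num_draft_tokens
        # Assumed compatibility check: warn when vocabulary sizes differ.
        if draft_model is not None and draft_model.vocab_size != model.vocab_size:
            logger.warning(
                "Draft vocab size (%d) differs from main vocab size (%d); "
                "the tokenizers may not match",
                draft_model.vocab_size, model.vocab_size,
            )

    def create_prompt_cache(self) -> List[dict]:
        # Main-model caches first, then draft-model caches if a draft is set.
        caches = make_layer_caches(self.model)
        if self.draft_model is not None:
            caches += make_layer_caches(self.draft_model)
        return caches

    def split_prompt_cache(self, caches: List[dict]) -> Tuple[List[dict], List[dict]]:
        # Recover the two halves by the main model's layer count.
        n = self.model.num_layers
        return caches[:n], caches[n:]
```

Keeping both sets of caches in one list lets the same cache object be passed around whether or not a draft model is configured.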

…onfiguration

This commit introduces new CLI options for specifying a draft model and the number of draft tokens for speculative decoding. The MLXServerConfig class is updated to handle these new parameters, including validation for model type compatibility. Additionally, the MLXLMHandler is modified to accept draft model parameters, enhancing the model's capabilities for speculative decoding. This change improves flexibility in model usage and aligns with recent enhancements in the MLX_LM class.

TomLucidor · 2026-02-13T05:12:46Z

I wrote up a shortlist of models that could be used to test speculative decoding, including Qwen3, Nemotron, and GPT-OSS-20B: waybarrios/vllm-mlx#45 (comment)

cubist38 added 3 commits February 10, 2026 21:20

Update README to document speculative decoding feature and new CLI options for draft models

5dd80a9

cubist38 merged commit 3c2a549 into main Feb 10, 2026

cubist38 mentioned this pull request Feb 10, 2026

Q: Speculative Decoding Support? #177

Open

cubist38 deleted the feat/speculative_decoding branch February 10, 2026 14:33

cubist38 merged 3 commits into main from feat/speculative_decoding
