
Feat/speculative decoding#195

Merged
cubist38 merged 3 commits into main from feat/speculative_decoding on Feb 10, 2026

Conversation

@cubist38 (Owner) commented on Feb 10, 2026

feat: add speculative decoding support for lm models

Adds speculative decoding for text-only (lm) models: a smaller draft model proposes tokens and the main model verifies them, speeding up generation.
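The propose-and-verify loop can be sketched in plain Python. This is a greedy toy model of the technique, not the MLX_LM implementation; `main_next` and `draft_next` are hypothetical stand-ins for the two models' next-token functions:

```python
from typing import Callable, List

def speculative_step(
    main_next: Callable[[List[int]], int],   # main model: context -> next token (greedy)
    draft_next: Callable[[List[int]], int],  # smaller draft model: context -> next token
    context: List[int],
    num_draft_tokens: int,
) -> List[int]:
    """One speculative-decoding step: the draft model proposes a short run of
    tokens and the main model verifies them left to right, keeping the longest
    agreeing prefix plus one token from the main model."""
    # 1) Draft model proposes num_draft_tokens tokens autoregressively.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(num_draft_tokens):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2) Main model verifies each proposal in order; stop at the first mismatch
    #    and emit the main model's correction instead.
    accepted: List[int] = []
    ctx = list(context)
    for t in proposed:
        expected = main_next(ctx)
        if expected != t:
            accepted.append(expected)
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3) All proposals accepted: the main model still yields one extra token.
    accepted.append(main_next(ctx))
    return accepted
```

When the draft model agrees with the main model, each step yields `num_draft_tokens + 1` tokens per main-model pass instead of one, which is where the speed-up comes from.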

Changes

  • CLI: New options --draft-model <path> and --num-draft-tokens <n> (default: 2). They apply only with --model-type lm; other model types ignore them and log a warning.
  • Config: MLXServerConfig gains draft_model_path and num_draft_tokens with lm-only validation.
  • Handler: MLXLMHandler passes the draft model and number of draft tokens into MLX_LM and uses keyword arguments for context_length to match the updated constructor.
  • Docs: README updated with Quick Start example, server parameters, and an "Advanced configuration → Speculative decoding (lm)" section.
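The lm-only validation described above might look like the following sketch. `ServerConfig` is a hypothetical stand-in for `MLXServerConfig`; the real field names and checks may differ:

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class ServerConfig:
    """Illustrative mirror of MLXServerConfig's speculative-decoding fields."""
    model_path: str
    model_type: str = "lm"
    draft_model_path: Optional[str] = None
    num_draft_tokens: int = 2  # matches the --num-draft-tokens CLI default

    def __post_init__(self) -> None:
        if self.num_draft_tokens < 1:
            raise ValueError("num_draft_tokens must be >= 1")
        # Draft-model options only apply to text-only (lm) models; other
        # model types ignore them and log a warning.
        if self.model_type != "lm" and self.draft_model_path is not None:
            logger.warning(
                "--draft-model is only supported with --model-type lm; ignoring."
            )
            self.draft_model_path = None
```

Dropping the option (rather than raising) for non-lm model types keeps existing launch commands working unchanged.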

Example

```
mlx-openai-server launch --model-path /path/to/main --model-type lm \
  --draft-model /path/to/draft --num-draft-tokens 4
```

Related issues

#177

This commit updates the MLX_LM class to support draft models for speculative decoding. A new method, _load_draft_model, loads the draft model and its tokenizer, with validation to ensure compatibility. The constructor now accepts draft-model parameters, and create_prompt_cache incorporates caches from both the main and draft models. Logging is added to warn about potential tokenizer mismatches.
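The tokenizer-mismatch warning mentioned above could be implemented along these lines. The `Tokenizer` protocol and `check_draft_tokenizer` helper are illustrative, not the actual MLX_LM code:

```python
import logging
from typing import Dict, Protocol

logger = logging.getLogger(__name__)

class Tokenizer(Protocol):
    """Minimal tokenizer surface needed for the compatibility check."""
    def get_vocab(self) -> Dict[str, int]: ...

def check_draft_tokenizer(main_tok: Tokenizer, draft_tok: Tokenizer) -> bool:
    """Return True if the draft tokenizer's vocabulary matches the main
    model's. A mismatched vocabulary makes draft proposals unverifiable
    token-for-token, so we warn rather than fail hard."""
    same = main_tok.get_vocab() == draft_tok.get_vocab()
    if not same:
        logger.warning(
            "Draft model tokenizer differs from the main model tokenizer; "
            "speculative decoding may reject most proposals."
        )
    return same
```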

This commit introduces new CLI options for specifying a draft model and the number of draft tokens for speculative decoding. The MLXServerConfig class is updated to handle these parameters, including validation for model-type compatibility, and MLXLMHandler is modified to accept them. This aligns the server configuration with the recent changes to the MLX_LM class.
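An argparse equivalent of the new launch flags is sketched below. This is an assumption about the option surface, not the project's actual CLI code (which may use another framework such as click):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for the speculative-decoding launch options."""
    parser = argparse.ArgumentParser(prog="mlx-openai-server launch")
    parser.add_argument("--model-path", required=True,
                        help="Path to the main model")
    parser.add_argument("--model-type", default="lm",
                        help="Model type; draft options apply to 'lm' only")
    # New speculative-decoding options:
    parser.add_argument("--draft-model", default=None,
                        help="Path to a smaller draft model")
    parser.add_argument("--num-draft-tokens", type=int, default=2,
                        help="Tokens the draft model proposes per step (default: 2)")
    return parser
```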
@cubist38 cubist38 merged commit 3c2a549 into main Feb 10, 2026
@cubist38 cubist38 deleted the feat/speculative_decoding branch February 10, 2026 14:33
@TomLucidor

I wrote a shortlist of possible tests for speculative decoding, including Qwen3, Nemotron, and GPT-OSS-20B: waybarrios/vllm-mlx#45 (comment)

