[SpecDecoding] extend mtp support for mimo 2.5 by ZJY0516 · Pull Request #41905 · vllm-project/vllm

ZJY0516 · 2026-05-07T06:34:43Z

Purpose

Support num_speculative_tokens > 1 for MiMo 2.5 MTP.

Test Plan

vllm serve XiaomiMiMo/MiMo-V2.5 -tp 4 --trust-remote-code \
  --speculative_config '{"method":"mtp","num_speculative_tokens":3}' \
  --no-async-scheduling

lm_eval --model local-completions --model_args "model=XiaomiMiMo/MiMo-V2.5,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=256,timeout=5000,max_length=40960" --tasks gsm8k --num_fewshot 5 --gen_kwargs max_gen_toks=5120

Test Result

w/o mtp

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.9530	±	0.0058
		strict-match	5	exact_match	↑	0.9538	±	0.0058

mtp3

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.9454	±	0.0063
		strict-match	5	exact_match	↑	0.9469	±	0.0062

mtp3 w/o async-scheduling

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.9522	±	0.0059
		strict-match	5	exact_match	↑	0.9522	±	0.0059

Note: async scheduling may affect MTP accuracy.
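
One rough way to read the three tables is to compare the gap between two runs with their combined standard error. The sketch below does this for the flexible-extract scores reported above. It treats the runs as independent samples, which is an assumption (all runs share the same GSM8K test set), and `z_score` is a hypothetical helper, not part of lm_eval or vLLM.

```python
import math


def z_score(a: float, se_a: float, b: float, se_b: float) -> float:
    """Gap between two scores in units of their combined standard error."""
    return (a - b) / math.sqrt(se_a**2 + se_b**2)


# flexible-extract exact_match values and stderr copied from the tables above
baseline = (0.9530, 0.0058)       # w/o mtp
mtp3_async = (0.9454, 0.0063)     # mtp3
mtp3_no_async = (0.9522, 0.0059)  # mtp3 w/o async-scheduling

# Assumption: runs are treated as independent, so variances simply add.
z_async = z_score(*baseline, *mtp3_async)
z_no_async = z_score(*baseline, *mtp3_no_async)

print(f"baseline vs mtp3: {z_async:.2f} combined standard errors")
print(f"baseline vs mtp3 w/o async: {z_no_async:.2f} combined standard errors")
```

Under that assumption both gaps come out below one combined standard error, so the tables are consistent with the hedged wording of the note ("may affect") rather than demonstrating a clear regression.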

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 16e86d293a

chatgpt-codex-connector · 2026-05-07T06:38:27Z

+        current_step_idx = spec_step_idx % self.num_mtp_layers
+        return self.mtp.layers[str(current_step_idx)](


Load the remaining MTP layers for multi-step drafts

When num_speculative_tokens > 1, this still sets self.num_mtp_layers to 1, so every spec_step_idx maps back to model.mtp.layers.0; load_weights then ignores the checkpoint's later MTP layers. The MiMo-V2.5-Pro model card lists 3 MTP layers and its deployment example uses 3 speculative steps (https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro/blob/main/README.md), so enabling multi-token drafting here runs all draft steps through the first-layer weights instead of the trained layer sequence, which makes the new >1 support materially incorrect/low-acceptance. Please instantiate/load the available MTP layers or keep rejecting num_speculative_tokens > 1.
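
The concern above can be illustrated with a minimal sketch of the modulo lookup from the diff. `layer_for_step` is a hypothetical stand-in for that lookup, not vLLM code, and it assumes `spec_step_idx` starts at 0.

```python
def layer_for_step(spec_step_idx: int, num_mtp_layers: int) -> str:
    """Mirror the indexing in the diff: step index modulo layer count."""
    return str(spec_step_idx % num_mtp_layers)


num_speculative_tokens = 3

# With one instantiated layer, every draft step resolves to layer "0".
single = [layer_for_step(i, 1) for i in range(num_speculative_tokens)]

# With three instantiated layers, each draft step gets its own layer.
triple = [layer_for_step(i, 3) for i in range(num_speculative_tokens)]

print(single)  # ['0', '0', '0']
print(triple)  # ['0', '1', '2']
```

With `num_mtp_layers` fixed at 1, the modulo is always 0, which is exactly the reuse of the first layer that the review describes.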

gemini-code-assist

Code Review

This pull request removes the restriction limiting MiMo-V2 MTP to a single speculative token and implements modulo-based indexing for MTP layers. Feedback indicates that the module-level documentation needs updating to reflect this change. Additionally, a critical issue was identified where num_mtp_layers remains hardcoded to 1, which would cause all speculative steps to incorrectly reuse the first layer instead of utilizing the appropriate layers for each step.

gemini-code-assist · 2026-05-07T06:40:03Z

 # MiMo-V2 checkpoints contain multiple MTP layers, but vLLM currently supports
-# only the first layer and only one speculative token.
+# only the first layer


The comment still states that vLLM only supports the first MTP layer. Since this PR aims to support multiple speculative tokens, this comment should be updated to reflect that multiple layers are now supported (assuming the hardcoded layer count is also addressed).

Suggested change

-# MiMo-V2 checkpoints contain multiple MTP layers, but vLLM currently supports
-# only the first layer
+# MiMo-V2 checkpoints contain multiple MTP layers, and vLLM supports
+# multiple speculative tokens by using these layers.

gemini-code-assist · 2026-05-07T06:40:03Z

-            raise ValueError(
-                "MiMo-V2 MTP in vLLM only supports num_speculative_tokens=1."
-            )
        num_mtp_layers = 1


The variable num_mtp_layers is still hardcoded to 1. This contradicts the PR's objective of supporting num_speculative_tokens > 1. If num_mtp_layers remains 1, only the first MTP layer will be initialized and loaded. When num_speculative_tokens > 1, the forward method (line 204) will reuse this single layer for all speculative steps due to the modulo operation (spec_step_idx % 1). This is mathematically incorrect for MTP architectures like DeepSeek/MiMo where each speculative step typically requires a distinct layer trained for that specific offset. You should derive num_mtp_layers from the model configuration (e.g., config.num_nextn_predict_layers) or set it based on spec_cfg.num_speculative_tokens while ensuring it does not exceed the model's actual capacity.

Suggested change

-        num_mtp_layers = 1
+        num_mtp_layers = spec_cfg.num_speculative_tokens
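
The review also asks that the layer count not exceed what the checkpoint provides. A minimal sketch of that capping follows. `resolve_num_mtp_layers` is a hypothetical helper, and `config.num_nextn_predict_layers` is only the attribute name the review offers as an example; whether the MiMo config exposes it under that name is an assumption.

```python
from types import SimpleNamespace


def resolve_num_mtp_layers(config, spec_cfg) -> int:
    """Pick how many MTP layers to instantiate, capped by the checkpoint."""
    # Assumption: the config reports its MTP layer count under this name;
    # fall back to 1 (the previous hardcoded value) when it is absent.
    available = getattr(config, "num_nextn_predict_layers", 1)
    requested = spec_cfg.num_speculative_tokens
    if requested < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    return min(requested, available)


# Hypothetical stand-ins for the model config and the speculative config.
config = SimpleNamespace(num_nextn_predict_layers=3)
spec_cfg = SimpleNamespace(num_speculative_tokens=5)

print(resolve_num_mtp_layers(config, spec_cfg))  # 3
```

If more draft steps are requested than layers are available, the modulo indexing shown in the diff would wrap around and reuse earlier layers for the extra steps.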

jeejeelee · 2026-05-07T07:25:11Z

Can you provide the accuracy result here?

ZJY0516 · 2026-05-07T09:15:48Z

> Can you provide the accuracy result here?

Updated, PTAL

jeejeelee

thank you

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

update

16e86d2

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

claude Bot reviewed May 7, 2026

chatgpt-codex-connector Bot reviewed May 7, 2026

gemini-code-assist Bot reviewed May 7, 2026

Isotr0py approved these changes May 7, 2026

jeejeelee approved these changes May 7, 2026

jeejeelee added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) May 7, 2026

jeejeelee enabled auto-merge (squash) May 7, 2026 13:40

ZJY0516 added 2 commits May 8, 2026 22:47

Merge branch 'main' into mimo-mtp

91bc62e

Merge branch 'main' into mimo-mtp

01147e1

jeejeelee merged commit 2ee8c2a into vllm-project:main May 9, 2026
60 checks passed

yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request May 11, 2026

[SpecDecoding] extend mtp support for mimo 2.5 (vllm-project#41905)

b0ade8f

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026

[SpecDecoding] extend mtp support for mimo 2.5 (vllm-project#41905)

5b645f9

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

[SpecDecoding] extend mtp support for mimo 2.5 (vllm-project#41905)

b9e1578

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[SpecDecoding] extend mtp support for mimo 2.5 (vllm-project#41905)

de60908

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026

[SpecDecoding] extend mtp support for mimo 2.5 (vllm-project#41905)

187171a

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

jeejeelee merged 3 commits into vllm-project:main from ZJY0516:mimo-mtp
