[SpecDecoding] extend mtp support for mimo 2.5#41905
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 16e86d293a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| current_step_idx = spec_step_idx % self.num_mtp_layers | ||
| return self.mtp.layers[str(current_step_idx)]( |
There was a problem hiding this comment.
Load the remaining MTP layers for multi-step drafts
When num_speculative_tokens > 1, this still sets self.num_mtp_layers to 1, so every spec_step_idx maps back to model.mtp.layers.0; load_weights then ignores the checkpoint's later MTP layers. The MiMo-V2.5-Pro model card lists 3 MTP layers and its deployment example uses 3 speculative steps (https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro/blob/main/README.md), so enabling multi-token drafting here runs all draft steps through the first-layer weights instead of the trained layer sequence, which makes the new >1 support materially incorrect/low-acceptance. Please instantiate/load the available MTP layers or keep rejecting num_speculative_tokens > 1.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request removes the restriction limiting MiMo-V2 MTP to a single speculative token and implements modulo-based indexing for MTP layers. Feedback indicates that the module-level documentation needs updating to reflect this change. Additionally, a critical issue was identified where num_mtp_layers remains hardcoded to 1, which would cause all speculative steps to incorrectly reuse the first layer instead of utilizing the appropriate layers for each step.
| # MiMo-V2 checkpoints contain multiple MTP layers, but vLLM currently supports | ||
| # only the first layer and only one speculative token. | ||
| # only the first layer |
There was a problem hiding this comment.
The comment still states that vLLM only supports the first MTP layer. Since this PR aims to support multiple speculative tokens, this comment should be updated to reflect that multiple layers are now supported (assuming the hardcoded layer count is also addressed).
| # MiMo-V2 checkpoints contain multiple MTP layers, but vLLM currently supports | |
| # only the first layer and only one speculative token. | |
| # only the first layer | |
| # MiMo-V2 checkpoints contain multiple MTP layers, and vLLM supports | |
| # multiple speculative tokens by using these layers. |
| raise ValueError( | ||
| "MiMo-V2 MTP in vLLM only supports num_speculative_tokens=1." | ||
| ) | ||
| num_mtp_layers = 1 |
There was a problem hiding this comment.
The variable num_mtp_layers is still hardcoded to 1. This contradicts the PR's objective of supporting num_speculative_tokens > 1. If num_mtp_layers remains 1, only the first MTP layer will be initialized and loaded. When num_speculative_tokens > 1, the forward method (line 204) will reuse this single layer for all speculative steps due to the modulo operation (spec_step_idx % 1). This is mathematically incorrect for MTP architectures like DeepSeek/MiMo where each speculative step typically requires a distinct layer trained for that specific offset. You should derive num_mtp_layers from the model configuration (e.g., config.num_nextn_predict_layers) or set it based on spec_cfg.num_speculative_tokens while ensuring it does not exceed the model's actual capacity.
| num_mtp_layers = 1 | |
| num_mtp_layers = spec_cfg.num_speculative_tokens |
|
Can you provide the accuracy result here? |
Updated, PTAL |
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Purpose
support
num_speculative_tokens>1for mimo 2.5 mtpTest Plan
Test Result
w/o mtp
mtp3
mtp3 w/o async-scheduling
Note: async shceduling may affect mtp accuracy
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.