[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support#33726
Merged
vllm-bot merged 69 commits into vllm-project:main on Feb 24, 2026
Conversation
Signed-off-by: Shahar Mor <smor@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
mgoin approved these changes on Feb 24, 2026
atalman added a commit to atalman/vllm that referenced this pull request on Feb 26, 2026: "…oding Support (vllm-project#33726)" This reverts commit f5972a8.
tom-zju pushed a commit to tom-zju/vllm that referenced this pull request on Feb 26, 2026: "…pport (vllm-project#33726)" Signed-off-by: Shahar Mor <smor@nvidia.com> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Shahar Mor <smor@nvidia.com> Co-authored-by: Roi Koren <roik@nvidia.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
atalman added a commit to atalman/vllm that referenced this pull request on Feb 26, 2026: "…coding Support (vllm-project#33726)" This reverts commit ffee670.
llsj14 pushed a commit to llsj14/vllm that referenced this pull request on Mar 1, 2026: "…pport (vllm-project#33726)"
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request on Mar 4, 2026: "…pport (vllm-project#33726)"
askliar pushed a commit to askliar/vllm that referenced this pull request on Mar 9, 2026: "…pport (vllm-project#33726)" Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Copilot AI pushed a commit to machov/vllm that referenced this pull request on Mar 10, 2026: "…pport (vllm-project#33726)"
Purpose
This PR adds Multi-Token Prediction (MTP) support for the Nemotron-H model family, which will be introduced with Nemotron V3 Super.
To facilitate this, we also implement speculative decoding support for the Mamba attention backends. Previously, Mamba-style speculative decoding support was limited to Qwen3-Next. This PR implements the attention metadata in a simple, unified manner that avoids adding undue complexity to the backend.
Co-authored with @shaharmor98
Design
The core change to Mamba attention is to separate prefill and decode state indices, since decode state indices gain a second axis for speculative tokens. When using specdec, we also add `num_accepted_tokens` and `query_start_loc_d` (decode) for indexing into the decode batch and selecting the right SSM cache states. Kernel support for this already exists.
Test Plan
Testing is currently all local, using checkpoints of Nemotron V3 Super MTP. GSM8K results are consistent with MTP on and off.
Open to suggestions for a test plan for the Mamba specdec pathway. I'm not sure there are any good target/draft model combinations we could use. Maybe n-gram with Nemotron Nano v3?
Example run command:
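The original command was not captured here, so the following is only a hypothetical reconstruction: the checkpoint path is a placeholder, and the `--speculative-config` values follow vLLM's generic speculative-decoding interface rather than anything confirmed in this PR.

```shell
# Hypothetical example only: model path and config values are placeholders.
vllm serve <path-to-nemotron-v3-super-mtp-checkpoint> \
  --trust-remote-code \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```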
Limitations
Most of the feature incompatibilities in this PR stem from gaps in how vLLM handles speculative decoding for linear-attention models. Specifically, the following features are confirmed to cause bugs and/or crashes when used with speculative decoding and Mamba2:
- `supports_update_block_table` (will be disabled for Mamba2 when specdec is on)

Also, asynchronous scheduling is ineffective with linear+specdec, since `_update_states_after_model_execute` causes a synchronization. I alleviated this by moving it to after the drafting. Full overlap could be achieved by reusing logic from `_copy_valid_sampled_token_count`, but my preliminary attempt crashed due to a suspected race condition, so I think full overlap can be left as a future performance optimization.
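The decode-side index bookkeeping from the Design section can be sketched in a few lines of Python. This is a hand-rolled illustration, not code from the PR: the names mirror `num_accepted_tokens` and `query_start_loc_d`, but the shapes and the string "states" are invented for clarity (real SSM cache entries are tensors).

```python
# Illustrative sketch only: with specdec, each decode request contributes
# num_spec + 1 query positions (draft tokens plus the bonus token), and the
# SSM state to keep is the one written after the last accepted token.
num_spec = 2       # draft tokens per step (hypothetical)
num_decodes = 3    # decode requests in the batch (hypothetical)

# State written after each decode position, flattened across the decode batch.
states = [f"req{r}_pos{p}" for r in range(num_decodes) for p in range(num_spec + 1)]

# query_start_loc_d: start offset of each decode request in the flattened batch.
query_start_loc_d = [r * (num_spec + 1) for r in range(num_decodes + 1)]

# Tokens accepted per request; at least 1 (bonus token), at most num_spec + 1.
num_accepted_tokens = [1, 3, 2]

# Select, per request, the state produced after its last accepted token.
selected = [
    states[query_start_loc_d[r] + num_accepted_tokens[r] - 1]
    for r in range(num_decodes)
]
print(selected)  # ['req0_pos0', 'req1_pos2', 'req2_pos1']
```

The same gather is what the existing kernels already support; the PR's contribution on this axis is wiring the metadata (`num_accepted_tokens`, `query_start_loc_d`) through the Mamba attention backends.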