[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support#33726

Merged
vllm-bot merged 69 commits into vllm-project:main from CentML:bchislett/mamba-nemotron-mtp
Feb 24, 2026
Conversation

@benchislett (Collaborator) commented Feb 3, 2026

Purpose

This PR adds MTP support for the Nemotron-H model family, which will be introduced with Nemotron V3 Super.

To facilitate this, we also implement speculative decoding support for the Mamba attention backends. Previously, Mamba-style speculative decoding was limited to Qwen3-Next. This PR implements the attention metadata in a simple, unified manner that avoids adding significant complexity to the backend.

Co-authored with @shaharmor98

Design

The core change to Mamba attention is to separate the prefill and decode state indices, since the decode state indices gain a second axis for speculative tokens. When speculative decoding is enabled, we also add num_accepted_tokens and query_start_loc_d (for the decode batch) to index into the decode batch and select the correct SSM cache states. Kernel support for this already exists.
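As a minimal illustration of the indexing described above (names, shapes, and values are illustrative and do not mirror vLLM's actual metadata structures): each decode request carries one bonus token plus its speculative tokens, and num_accepted_tokens selects the SSM state written by the last accepted token of each request.

```python
import numpy as np

# Illustrative sketch only; the real vLLM metadata layout differs.
num_decode_reqs, num_spec = 3, 2
tokens_per_req = 1 + num_spec  # one bonus token plus speculative tokens

# query_start_loc_d: cumulative offsets into the flattened decode token batch
query_start_loc_d = np.arange(num_decode_reqs + 1) * tokens_per_req

# one candidate SSM cache slot per (request, token position) in the decode batch
state_indices_d = np.arange(num_decode_reqs * tokens_per_req).reshape(
    num_decode_reqs, tokens_per_req
)

# number of tokens the verifier accepted for each request (always >= 1)
num_accepted_tokens = np.array([1, 3, 2])

# keep the SSM state written by the last accepted token of each request
final_states = state_indices_d[np.arange(num_decode_reqs), num_accepted_tokens - 1]
print(query_start_loc_d.tolist())  # [0, 3, 6, 9]
print(final_states.tolist())       # [0, 5, 7]
```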

Test Plan

Testing is currently all local, using checkpoints of Nemotron V3 Super MTP. GSM8K accuracy is consistent with MTP on and off.

Open to suggestions for a test plan for the Mamba specdec pathway. I'm not sure whether there are any good model/draft combinations we could use for testing. Maybe n-gram with Nemotron Nano v3?

Example run command:

```
vllm serve $MODEL --no-enable-chunked-prefill --no-enable-prefix-caching --max-model-len 32768 --max-num-batched-tokens 32768 --trust-remote-code -tp 2 --speculative-config '{"method": "nemotron_h_mtp", "num_speculative_tokens": 7}'
```
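Since --speculative-config takes a JSON string, it can help to generate and shell-quote it from a script rather than hand-write it. A small standard-library sketch (the keys simply mirror the example command above):

```python
import json
import shlex

# Speculative-decoding config matching the example run command above.
spec_config = {"method": "nemotron_h_mtp", "num_speculative_tokens": 7}

# Serialize and shell-quote it so it can be spliced into a `vllm serve` invocation.
arg = shlex.quote(json.dumps(spec_config))
print(arg)
```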

Limitations

Most of the feature incompatibilities in this PR stem from gaps in how vLLM handles speculative decoding for linear attention models. Specifically, the following features are confirmed to cause bugs and/or crashes when combined with speculative decoding and Mamba2:

  • Chunked Prefill
  • Prefix Caching (old style; 'align' mode has not been tested)
  • supports_update_block_table (will be disabled for mamba2 when specdec is on)

Also, asynchronous scheduling is ineffective with linear attention plus speculative decoding, since _update_states_after_model_execute forces a synchronization. I alleviated this by moving it to after drafting. Full overlap could be achieved by reusing logic from _copy_valid_sampled_token_count, but my preliminary attempt crashed due to a suspected race condition, so I think full overlap can be left as a future performance optimization.

shaharmor98 and others added 30 commits December 21, 2025 08:37
mergify bot commented Feb 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 20, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify bot removed the needs-rebase label Feb 20, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett benchislett added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 23, 2026
mergify bot commented Feb 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 23, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify bot removed the needs-rebase label Feb 23, 2026
@vllm-bot vllm-bot merged commit f5972a8 into vllm-project:main Feb 24, 2026
67 of 70 checks passed
atalman added a commit to atalman/vllm that referenced this pull request Feb 26, 2026
tom-zju pushed a commit to tom-zju/vllm that referenced this pull request Feb 26, 2026
…pport (vllm-project#33726)

Signed-off-by: Shahar Mor <smor@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Shahar Mor <smor@nvidia.com>
Co-authored-by: Roi Koren <roik@nvidia.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>

Labels: new-model, ready, v1

9 participants