[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support#33726

Merged
vllm-bot merged 69 commits into vllm-project:main from CentML:bchislett/mamba-nemotron-mtp
Feb 24, 2026
Conversation

@benchislett (Collaborator) commented Feb 3, 2026

Purpose

This PR adds MTP support for the Nemotron-H model family, which will be introduced with Nemotron V3 Super.

To facilitate this, we also implement speculative decoding support for the Mamba attention backends. Previously, Mamba-style speculative decoding was limited to Qwen3-Next. This PR implements the attention metadata in a simple, unified manner that avoids adding significant complexity to the backend.

Co-authored with @shaharmor98

Design

The core change to Mamba attention is to separate the prefill and decode state indices, since the decode state indices gain a second axis for speculative tokens. When speculative decoding is enabled, we also add num_accepted_tokens and query_start_loc_d (for the decode batch) to index into the decode batch and select the correct SSM cache states. Kernel support for this already exists.
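As a minimal illustration of the indexing described above (names, shapes, and values are illustrative and do not mirror vLLM's actual metadata structures): each decode request carries one bonus token plus its speculative tokens, and num_accepted_tokens selects the SSM state written by the last accepted token of each request.

```python
import numpy as np

# Illustrative sketch only; the real vLLM metadata layout differs.
num_decode_reqs, num_spec = 3, 2
tokens_per_req = 1 + num_spec  # one bonus token plus speculative tokens

# query_start_loc_d: cumulative offsets into the flattened decode token batch
query_start_loc_d = np.arange(num_decode_reqs + 1) * tokens_per_req

# one candidate SSM cache slot per (request, token position) in the decode batch
state_indices_d = np.arange(num_decode_reqs * tokens_per_req).reshape(
    num_decode_reqs, tokens_per_req
)

# number of tokens the verifier accepted for each request (always >= 1)
num_accepted_tokens = np.array([1, 3, 2])

# keep the SSM state written by the last accepted token of each request
final_states = state_indices_d[np.arange(num_decode_reqs), num_accepted_tokens - 1]
print(query_start_loc_d.tolist())  # [0, 3, 6, 9]
print(final_states.tolist())       # [0, 5, 7]
```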

Test Plan

Testing is currently all local, using checkpoints of Nemotron V3 Super MTP. GSM8K accuracy is consistent with MTP on and off.

Open to suggestions for a test plan for the Mamba specdec pathway. I'm not sure whether there are any good model/draft combinations we could use for testing. Maybe n-gram with Nemotron Nano v3?

Example run command:

```
vllm serve $MODEL --no-enable-chunked-prefill --no-enable-prefix-caching --max-model-len 32768 --max-num-batched-tokens 32768 --trust-remote-code -tp 2 --speculative-config '{"method": "nemotron_h_mtp", "num_speculative_tokens": 7}'
```
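Since --speculative-config takes a JSON string, it can help to generate and shell-quote it from a script rather than hand-write it. A small standard-library sketch (the keys simply mirror the example command above):

```python
import json
import shlex

# Speculative-decoding config matching the example run command above.
spec_config = {"method": "nemotron_h_mtp", "num_speculative_tokens": 7}

# Serialize and shell-quote it so it can be spliced into a `vllm serve` invocation.
arg = shlex.quote(json.dumps(spec_config))
print(arg)
```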

Limitations

Most of the feature incompatibilities in this PR stem from gaps in how vLLM handles speculative decoding for linear attention models. Specifically, the following features are confirmed to cause bugs and/or crashes when combined with speculative decoding and Mamba2:

  • Chunked Prefill
  • Prefix Caching (old style; 'align' mode has not been tested)
  • supports_update_block_table (will be disabled for mamba2 when specdec is on)

Also, asynchronous scheduling is ineffective with linear attention plus speculative decoding, since _update_states_after_model_execute forces a synchronization. I alleviated this by moving it to after drafting. Full overlap could be achieved by reusing logic from _copy_valid_sampled_token_count, but my preliminary attempt crashed due to a suspected race condition, so I think full overlap can be left as a future performance optimization.

shaharmor98 and others added 30 commits December 21, 2025 08:37
mergify bot commented Feb 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 20, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify bot removed the needs-rebase label Feb 20, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@benchislett benchislett added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 23, 2026
mergify bot commented Feb 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 23, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify bot removed the needs-rebase label Feb 23, 2026
@vllm-bot vllm-bot merged commit f5972a8 into vllm-project:main Feb 24, 2026
67 of 70 checks passed
atalman added a commit to atalman/vllm that referenced this pull request Feb 26, 2026
tom-zju pushed a commit to tom-zju/vllm that referenced this pull request Feb 26, 2026
…pport (vllm-project#33726)

Signed-off-by: Shahar Mor <smor@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Shahar Mor <smor@nvidia.com>
Co-authored-by: Roi Koren <roik@nvidia.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>

Labels: new-model, ready, v1

9 participants