[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support #33726
Merged: vllm-bot merged 69 commits into vllm-project:main from CentML:bchislett/mamba-nemotron-mtp on Feb 24, 2026.
Commits (69):
c488653  fully working mixed requests version (shaharmor98)
d7ee37c  add working real checkpoint code (shaharmor98)
02c1b02  CUDA graphs work (roikoren755)
c84fec2  Prefix caching works (roikoren755)
8250af1  Fix triton kernel to support varlen, and update call site (roikoren755)
a807460  Full CUDA graphs (roikoren755)
c62c4ff  Prefix caching + refactored code (shaharmor98)
6c08722  added updated load weights (shaharmor98)
3d6a23b  Fix CUDA graphs compat (roikoren755)
4db7e20  Working speculative code with multiple MTP layers (shaharmor98)
d27da73  running mtp for num_speculative > 2 (shaharmor98)
2c4f1c7  remove redundant ids handling, add eagle3 support flag (shaharmor98)
5aabfd5  temporarily disable update_block_table (shaharmor98)
f3918a1  remove redundant code, support cuda graph for target excluding drafter (shaharmor98)
c35cf11  move faulty assertion (shaharmor98)
965a716  remove full cg support for mamba (shaharmor98)
43728d8  commenting eagle changes (shaharmor98)
a72852c  fix block size used in EAGLE slot mapping (benchislett)
d0f85ad  remove eagle multi layer support (shaharmor98)
4e3ff41  remove multi layer spec step idx from eagle (shaharmor98)
fe2ff13  update code to a single MTP layer, remove speculative enforce eager, … (shaharmor98)
fd28da9  change mtp layer count (shaharmor98)
45fbd1b  final cleanup (shaharmor98)
3b211ea  polished implementation (benchislett)
5a6131d  tweaks for perf (benchislett)
f213300  tweak (benchislett)
f461834  tweaks for non-spec and spec compat (benchislett)
6a27a0f  Merge branch 'main' into bchislett/nemotron-h-mtp-old-rebased (benchislett)
f56420e  simple patches for rebase (benchislett)
cc9b29f  patch (benchislett)
896e01f  remove unused files (benchislett)
b96184d  update mamba backend with specdec support (benchislett)
6d59c03  refactor mamba attn cudagraph logic (benchislett)
8819eb7  Merge branch 'main' into bchislett/mamba-nemotron-mtp (benchislett)
b75e7ac  cleanup (benchislett)
bca337b  cleanup (benchislett)
7da121d  revert unneeded refactor (benchislett)
5db7692  rename prefill_state_indices_tensor and decode_state_indices_tensor (benchislett)
b1e927e  rewrite todo comment for known issue (benchislett)
03ef64d  change max_query_len overwrite to assert (benchislett)
8b7c03f  update cuda graph padding boundaries (benchislett)
c67fccb  avoid slicing state indices tensor for 'all' prefix cache mode (benchislett)
0739591  remove validation mode leftover from debugging (benchislett)
0a5f1a3  revert no-longer-needed diff in layer.py (benchislett)
063f54b  update layer.py from main to fix conflict (benchislett)
d66babd  Merge branch 'main' into bchislett/mamba-nemotron-mtp (benchislett)
f7fef88  adjust other mamba-base backends to the new state_indices_tensor repr… (shaharmor98)
e71fb7a  merge main (shaharmor98)
b642f2e  Merge remote-tracking branch 'origin/main' into bchislett/mamba-nemot… (shaharmor98)
a02ac98  [Bugfix] Fix assertion error in _dummy_run for MTP speculative decoding (LucasWilkinson)
ce598de  Merge remote-tracking branch 'origin/main' into bchislett/mamba-nemot… (shaharmor98)
79b6e40  fix wrong max num tokens init in GPUModelRunner (shaharmor98)
4a9274f  revert assetion relaxation (shaharmor98)
77d5cc1  fix cr comments (shaharmor98)
dc29214  Merge commit 'dc5fa77a4eb6680339cb77abe713fb22d7795560' into bchislet… (benchislett)
6b50bd0  fix comment (benchislett)
1c45407  simplify prefix caching slicing (benchislett)
05efc54  adjust decode threshold based on specdec existance in the request (shaharmor98)
10cb13e  fix frivolous assert (benchislett)
516c7b5  Merge branch 'main' into bchislett/mamba-nemotron-mtp (benchislett)
4797d1d  Merge branch 'main' into bchislett/mamba-nemotron-mtp (benchislett)
f77866f  use new vllmconfig getter for num_spec (benchislett)
810caf1  use new vllmconfig getter for num_spec (benchislett)
d511f19  remove no-longer-necessary bug workaround (benchislett)
623c1f7  add placeholder model for NemotronH MTP (benchislett)
1a688ba  fix prefix caching for non-MTP case (benchislett)
685fe39  Merge branch 'main' into bchislett/mamba-nemotron-mtp (benchislett)
1a28f23  Merge branch 'main' into bchislett/mamba-nemotron-mtp (benchislett)
cb56163  update mamba block table test config (benchislett)
Review discussion:
@shaharmor98 Could you explain the motivation for this change? I'm not sure of its purpose
As far as I can recall, in GDN's original PR they modified causal_conv1d_update so that the underlying Triton kernel caches the conv state for every speculated token and, during verification, restores the correct cache according to the number of accepted tokens.
One of the things they had to change, and so did we, was accounting for the num_spec tokens in the conv_state shape declaration.
I made this change quite a long time ago, but IIRC the outputs came out as garbage when that value wasn't added.
Reference: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/mamba/mamba_utils.py#L185-L188
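To make the shape point concrete, here is a minimal, hypothetical sketch (the function name and parameters are illustrative, not vLLM's actual API): the idea described above is that the conv state must hold the usual kernel_size - 1 past activations plus one extra cached slot per speculated token, so the kernel can roll the state back to the last accepted token during verification.

```python
def conv_state_width(conv_kernel: int, num_spec_tokens: int) -> int:
    """Illustrative only: width of the per-sequence causal-conv1d state.

    A causal conv1d needs (kernel - 1) past activations. With speculative
    decoding, one additional cached column per speculated token lets the
    kernel restore the state corresponding to the accepted-token count
    instead of keeping all draft tokens' updates.
    """
    return (conv_kernel - 1) + num_spec_tokens

# e.g. kernel size 4 with 3 speculated tokens per step -> width 6;
# without the num_spec term the state would be too small and, as noted
# above, the outputs would be corrupted.
print(conv_state_width(4, 3))
```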