[Bugfix][MTP][Sparse MLA] Allow sparse MLA with MTP to run with FULL cudagraphs#34457
Merged
robertgshaw2-redhat merged 5 commits into vllm-project:main on Feb 17, 2026
Conversation
Contributor
Code Review
This pull request correctly enables FULL cudagraph support for sparse MLA models with MTP by changing _cudagraph_support to UNIFORM_BATCH. It also adds a necessary safeguard to prevent crashes by raising a ValueError for unsupported num_speculative_tokens > 1, which is a limitation of the fp8_paged_mqa_logits kernel. The changes are well-implemented, improving both functionality and robustness. The code is clean and the logic is sound.
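A minimal sketch of the two changes the review describes: reporting UNIFORM_BATCH cudagraph support and failing fast on unsupported speculative-token counts. Class, enum, and attribute names here are illustrative stand-ins, not the exact vLLM identifiers:

```python
from enum import Enum, auto


class CudagraphSupport(Enum):
    # Illustrative subset of the support levels discussed in this PR.
    UNIFORM_SINGLE_TOKEN_DECODE = auto()
    UNIFORM_BATCH = auto()


class SparseMLAIndexerMetadataBuilder:
    # Before this PR the builder reported UNIFORM_SINGLE_TOKEN_DECODE,
    # so FULL cudagraphs were never captured when MTP was enabled.
    _cudagraph_support = CudagraphSupport.UNIFORM_BATCH

    def __init__(self, num_speculative_tokens: int) -> None:
        # The fp8_paged_mqa_logits kernel only handles
        # num_speculative_tokens == 1 (i.e. next_n = 2), so reject
        # anything larger instead of letting the kernel itself crash.
        if num_speculative_tokens > 1:
            raise ValueError(
                "fp8_paged_mqa_logits only supports "
                f"num_speculative_tokens <= 1, got {num_speculative_tokens}"
            )
        self.num_speculative_tokens = num_speculative_tokens
```

With `num_speculative_tokens=1` construction succeeds; anything larger raises before the kernel is ever invoked.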
jeejeelee reviewed Feb 13, 2026
Collaborator
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Contributor
Two questions:
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Contributor
Documentation preview: https://vllm--34457.org.readthedocs.build/en/34457/
Collaborator
Author
LucasWilkinson approved these changes Feb 17, 2026
Merged commit dc5fa77 into vllm-project:main with 47 of 51 checks passed
Commits referencing this pull request, "…cudagraphs (vllm-project#34457)" (Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>, Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>), were pushed to forks:
- wzhao18 pushed a commit to wzhao18/vllm, Feb 18, 2026
- jasonozuzu-cohere pushed a commit to jasonozuzu-cohere/vllm, Feb 18, 2026
- eldarkurtic pushed a commit to eldarkurtic/vllm, Feb 19, 2026
- ZJY0516 pushed a commit to ZJY0516/vllm, Feb 23, 2026
- llsj14 pushed a commit to llsj14/vllm, Mar 1, 2026
- tunglinwood pushed a commit to tunglinwood/vllm, Mar 4, 2026
- askliar pushed a commit to askliar/vllm, Mar 9, 2026
- EricccYang pushed a commit to EricccYang/vllm, Apr 1, 2026
- liuchenbing2026 pushed a commit to liuchenbing2026/vllm, Apr 4, 2026
This was referenced Apr 8, 2026
simone-chen added a commit to ai-dynamo/aiconfigurator that referenced this pull request, Apr 9, 2026:
vLLM 0.17.0's FlashInfer sparse MLA backend (vllm-project/vllm#33451) and DSA CUDA graph support (vllm-project/vllm#34457) leave CUDA graph RNG offset tracking active after DeepseekV2MLAAttention construction. Any subsequent RNG call crashes with "Offset increment outside graph capture". enforce_eager and manual_seed() do not clear it.

Changes:
- Replace all post-construction RNG (normal_, uniform_, randn, randint) with deterministic fill_/torch.full in weight init, KV cache, and input tensors
- Include buffers in the init loop (k_scale/v_scale are buffers, not parameters; process_weights_after_loading asserts k_scale > 0)
- Strip auto_map from config.json: HuggingFace AutoConfig tries to import the custom class (configuration_deepseek.py) from the temp directory where it doesn't exist; vLLM doesn't need it
- Wrap MLA backend selection in a set_current_vllm_config() context (vLLM 0.17.0 calls get_current_vllm_config() during backend selection)

See: vllm-project/vllm#39371

Signed-off-by: Simone Chen <simonec@nvidia.com>
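The deterministic-initialization pattern in the first bullet above can be sketched as follows, using NumPy as a stand-in for torch (the real change operates on torch tensors with `fill_`/`torch.full`; the function name and shapes here are hypothetical):

```python
import numpy as np


def init_weights_deterministic(
    shapes: dict[str, tuple[int, ...]], fill_value: float = 0.5
) -> dict[str, np.ndarray]:
    """Build tensors with a constant fill instead of RNG draws.

    Avoiding normal_/uniform_/randn/randint after model construction
    means no RNG call can trip the "Offset increment outside graph
    capture" error left behind by CUDA graph RNG offset tracking.
    """
    return {
        name: np.full(shape, fill_value, dtype=np.float32)
        for name, shape in shapes.items()
    }


# Buffers are included alongside parameters; a positive fill keeps the
# k_scale > 0 assertion in process_weights_after_loading satisfied.
weights = init_weights_deterministic({"k_scale": (1,), "q_proj": (8, 8)})
assert all((t > 0).all() for t in weights.values())
```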
Purpose
DeepseekV32IndexerMetadataBuilder currently only reports support for UNIFORM_SINGLE_TOKEN_DECODE, so when running a sparse MLA model with MTP, FULL cudagraphs are never captured. In reality, the DeepGEMM kernel fp8_paged_mqa_logits supports MTP with num_speculative_tokens=1 (i.e. next_n = 2). This PR changes the reported support to UNIFORM_BATCH and adds an explicit error for num_speculative_tokens > 1, rather than letting the kernel itself crash.
Test Plan
with
Test Result
Main:
PR: