
[Bugfix][MTP][Sparse MLA] Allow sparse MLA with MTP to run with FULL cudagraphs #34457

Merged
robertgshaw2-redhat merged 5 commits into vllm-project:main from MatthewBonanni:fix_sparse_mtp
Feb 17, 2026

Conversation

@MatthewBonanni
Collaborator

MatthewBonanni commented Feb 12, 2026

Purpose

DeepseekV32IndexerMetadataBuilder currently only reports support for UNIFORM_SINGLE_TOKEN_DECODE. Therefore, when running a sparse MLA model with MTP, FULL cudagraphs are never captured.

In reality, the DeepGEMM kernel fp8_paged_mqa_logits supports MTP with num_speculative_tokens=1 (i.e. next_n = 2).

This PR changes the reported support to UNIFORM_BATCH and adds an explicit error for num_speculative_tokens > 1, rather than letting the kernel itself crash.
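The change described above can be sketched roughly as follows. This is a simplified stand-in, not the actual vLLM source: the enum, class layout, and constructor signature are hypothetical, and only the two behaviors named in the PR (reporting UNIFORM_BATCH support, and failing early for num_speculative_tokens > 1) are modeled.

```python
from enum import Enum


class AttentionCGSupport(Enum):
    """Simplified stand-in for vLLM's cudagraph-support levels."""
    NEVER = 0
    UNIFORM_SINGLE_TOKEN_DECODE = 1
    UNIFORM_BATCH = 2
    ALWAYS = 3


class DeepseekV32IndexerMetadataBuilder:
    # Before this PR: UNIFORM_SINGLE_TOKEN_DECODE, so FULL cudagraphs were
    # never captured with MTP (decode batches have 2 tokens per request).
    # After: UNIFORM_BATCH, so uniform multi-token decode batches qualify.
    cudagraph_support = AttentionCGSupport.UNIFORM_BATCH

    def __init__(self, num_speculative_tokens: int = 0):
        # fp8_paged_mqa_logits only supports next_n <= 2, i.e. at most one
        # speculative token; raise a clear error instead of letting the
        # kernel crash on an internal assertion.
        if num_speculative_tokens > 1:
            raise ValueError(
                "fp8_paged_mqa_logits supports at most "
                f"num_speculative_tokens=1 (next_n=2), got {num_speculative_tokens}"
            )
        self.num_speculative_tokens = num_speculative_tokens
```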

Test Plan

vllm serve deepseek-ai/DeepSeek-V3.2 \
    -tp 8 -ep \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
    --no-enable-prefix-caching

with

vllm bench serve \
    --dataset-name random \
    --input-len 128 \
    --output-len 2048 \
    --seed 42 \
    --ignore-eos \
    --temperature 0 \
    --num-prompts 200 \
    --request-rate inf \
    --max-concurrency 64
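As a sanity check on the workload above: with --ignore-eos every request generates exactly --output-len tokens, so the token totals in the benchmark tables below follow directly from the bench parameters.

```python
# Workload parameters from the vllm bench serve command above.
num_prompts = 200
input_len = 128
output_len = 2048

total_input = num_prompts * input_len    # matches "Total input tokens"
total_output = num_prompts * output_len  # matches "Total generated tokens"
print(total_input, total_output)  # 25600 409600
```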

Test Result

Main:

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  434.34    
Total input tokens:                      25600     
Total generated tokens:                  409600    
Request throughput (req/s):              0.46      
Output token throughput (tok/s):         943.04    
Peak output token throughput (tok/s):    640.00    
Peak concurrent requests:                109.00    
Total token throughput (tok/s):          1001.98   
---------------Time to First Token----------------
Mean TTFT (ms):                          1372.90   
Median TTFT (ms):                        1520.53   
P99 TTFT (ms):                           2167.85   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.12     
Median TPOT (ms):                        53.95     
P99 TPOT (ms):                           57.18     
---------------Inter-token Latency----------------
Mean ITL (ms):                           107.91    
Median ITL (ms):                         102.92    
P99 ITL (ms):                            165.42    
---------------Speculative Decoding---------------
Acceptance rate (%):                     99.43     
Acceptance length:                       1.99      
Drafts:                                  205340    
Draft tokens:                            205340    
Accepted tokens:                         204174    
Per-position acceptance (%):
  Position 0:                            99.43     
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  193.55    
Total input tokens:                      25600     
Total generated tokens:                  409600    
Request throughput (req/s):              1.03      
Output token throughput (tok/s):         2116.20   
Peak output token throughput (tok/s):    1472.00   
Peak concurrent requests:                120.00    
Total token throughput (tok/s):          2248.46   
---------------Time to First Token----------------
Mean TTFT (ms):                          1238.56   
Median TTFT (ms):                        1223.26   
P99 TTFT (ms):                           2081.36   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.73     
Median TPOT (ms):                        24.87     
P99 TPOT (ms):                           26.77     
---------------Inter-token Latency----------------
Mean ITL (ms):                           49.27     
Median ITL (ms):                         45.99     
P99 ITL (ms):                            59.30     
---------------Speculative Decoding---------------
Acceptance rate (%):                     99.26     
Acceptance length:                       1.99      
Drafts:                                  205513    
Draft tokens:                            205513    
Accepted tokens:                         203997    
Per-position acceptance (%):
  Position 0:                            99.26     
==================================================
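Comparing the two runs, the quoted numbers work out to roughly a 2.2x end-to-end speedup; a quick check, using only figures taken from the tables above:

```python
# Headline numbers copied from the Main and PR benchmark tables.
main = {"duration_s": 434.34, "output_tps": 943.04, "mean_tpot_ms": 54.12}
pr = {"duration_s": 193.55, "output_tps": 2116.20, "mean_tpot_ms": 24.73}

throughput_speedup = pr["output_tps"] / main["output_tps"]   # higher is better
duration_speedup = main["duration_s"] / pr["duration_s"]     # higher is better
tpot_improvement = main["mean_tpot_ms"] / pr["mean_tpot_ms"] # higher is better

print(f"Output throughput: {throughput_speedup:.2f}x")
print(f"Benchmark duration: {duration_speedup:.2f}x")
print(f"Mean TPOT: {tpot_improvement:.2f}x")
```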

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@mergify mergify bot added the nvidia, v1, and bug (Something isn't working) labels Feb 12, 2026
Contributor

gemini-code-assist bot left a comment


Code Review

This pull request correctly enables FULL cudagraph support for sparse MLA models with MTP by changing _cudagraph_support to UNIFORM_BATCH. It also adds a necessary safeguard to prevent crashes by raising a ValueError for unsupported num_speculative_tokens > 1, which is a limitation of the fp8_paged_mqa_logits kernel. The changes are well-implemented, improving both functionality and robustness. The code is clean and the logic is sound.

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Comment thread on vllm/v1/attention/backends/mla/indexer.py (outdated)
@jeejeelee
Collaborator

cc @WoosukKwon @zyongye

MatthewBonanni and others added 2 commits February 13, 2026 09:16
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@ElizaWszola
Contributor

Two questions:

  1. What was the previous behavior when running with num_speculative_tokens > 1? Was it a kernel crash, or did the limited reported support prevent it from running at all?
  2. Does anything need to be updated in cuda_graphs.md for this PR?

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@mergify
Contributor

mergify bot commented Feb 16, 2026

Documentation preview: https://vllm--34457.org.readthedocs.build/en/34457/

@mergify mergify bot added the documentation (Improvements or additions to documentation) label Feb 16, 2026
@MatthewBonanni
Collaborator Author

@ElizaWszola

  1. Previously there was an assertion within DeepGEMM as in [Bug]: [H200] DeepSeek V3.2 MTP > 1 run into error (FLASHMLA_SPARSE backend) #31845. I was under the impression that this was due to a kernel limitation because comments throughout vLLM seem to indicate that the indexer kernel doesn't support num_speculative_tokens > 1. This may not be true, though, and it may have actually just been an issue with metadata building. @LucasWilkinson has a fix here: [BugFix] Add support for MTP num_speculative_tokens > 1 with sparse MLA #34552
  2. Sparse MLA backends weren't previously discussed in cuda_graphs.md but I've added FlashInferMLASparse to it now - good catch

github-project-automation bot moved this to Ready in NVIDIA Feb 17, 2026
LucasWilkinson added the ready (ONLY add when PR is ready to merge/full CI is needed) label Feb 17, 2026
robertgshaw2-redhat merged commit dc5fa77 into vllm-project:main Feb 17, 2026
47 of 51 checks passed
github-project-automation bot moved this from Ready to Done in NVIDIA Feb 17, 2026
MatthewBonanni deleted the fix_sparse_mtp branch February 17, 2026 19:03
wzhao18 pushed a commit to wzhao18/vllm that referenced this pull request Feb 18, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
jasonozuzu-cohere pushed a commit to jasonozuzu-cohere/vllm that referenced this pull request Feb 18, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jason Ozuzu <jasonozuzu@cohere.com>
eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Feb 19, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
ZJY0516 pushed a commit to ZJY0516/vllm that referenced this pull request Feb 23, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
simone-chen added a commit to ai-dynamo/aiconfigurator that referenced this pull request Apr 9, 2026
vLLM 0.17.0's FlashInfer sparse MLA backend (vllm-project/vllm#33451)
and DSA CUDA graph support (vllm-project/vllm#34457) leave CUDA graph
RNG offset tracking active after DeepseekV2MLAAttention construction.
Any subsequent RNG call crashes with "Offset increment outside graph
capture".  enforce_eager and manual_seed() do not clear it.

Changes:
- Replace all post-construction RNG (normal_, uniform_, randn, randint)
  with deterministic fill_/torch.full in weight init, KV cache, and
  input tensors
- Include buffers in init loop (k_scale/v_scale are buffers, not
  parameters; process_weights_after_loading asserts k_scale > 0)
- Strip auto_map from config.json — HuggingFace AutoConfig tries to
  import the custom class (configuration_deepseek.py) from the temp
  directory where it doesn't exist; vLLM doesn't need it
- Wrap MLA backend selection in set_current_vllm_config() context
  (vLLM 0.17.0 calls get_current_vllm_config() during backend selection)

See: vllm-project/vllm#39371
Signed-off-by: Simone Chen <simonec@nvidia.com>

Labels

bug (Something isn't working), documentation (Improvements or additions to documentation), nvidia, performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done

5 participants