
[Bugfix][MTP][Sparse MLA] Allow sparse MLA with MTP to run with FULL cudagraphs #34457

Merged
robertgshaw2-redhat merged 5 commits into vllm-project:main from MatthewBonanni:fix_sparse_mtp
Feb 17, 2026

Conversation

@MatthewBonanni
Collaborator

MatthewBonanni commented Feb 12, 2026

Purpose

DeepseekV32IndexerMetadataBuilder currently only reports support for UNIFORM_SINGLE_TOKEN_DECODE. Therefore, when running a sparse MLA model with MTP, FULL cudagraphs are never captured.

In reality, the DeepGEMM kernel fp8_paged_mqa_logits supports MTP with num_speculative_tokens=1 (i.e. next_n = 2).

This PR changes the reported support to UNIFORM_BATCH and adds an explicit error for num_speculative_tokens > 1, rather than letting the kernel itself crash.
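The change described above can be sketched roughly as follows. This is a simplified stand-in, not the actual vLLM source: the enum, class layout, and constructor signature are hypothetical, and only the two behaviors named in the PR (reporting UNIFORM_BATCH support, and failing early for num_speculative_tokens > 1) are modeled.

```python
from enum import Enum


class AttentionCGSupport(Enum):
    """Simplified stand-in for vLLM's cudagraph-support levels."""
    NEVER = 0
    UNIFORM_SINGLE_TOKEN_DECODE = 1
    UNIFORM_BATCH = 2
    ALWAYS = 3


class DeepseekV32IndexerMetadataBuilder:
    # Before this PR: UNIFORM_SINGLE_TOKEN_DECODE, so FULL cudagraphs were
    # never captured with MTP (decode batches have 2 tokens per request).
    # After: UNIFORM_BATCH, so uniform multi-token decode batches qualify.
    cudagraph_support = AttentionCGSupport.UNIFORM_BATCH

    def __init__(self, num_speculative_tokens: int = 0):
        # fp8_paged_mqa_logits only supports next_n <= 2, i.e. at most one
        # speculative token; raise a clear error instead of letting the
        # kernel crash on an internal assertion.
        if num_speculative_tokens > 1:
            raise ValueError(
                "fp8_paged_mqa_logits supports at most "
                f"num_speculative_tokens=1 (next_n=2), got {num_speculative_tokens}"
            )
        self.num_speculative_tokens = num_speculative_tokens
```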

Test Plan

vllm serve deepseek-ai/DeepSeek-V3.2 \
    -tp 8 -ep \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
    --no-enable-prefix-caching

with

vllm bench serve \
    --dataset-name random \
    --input-len 128 \
    --output-len 2048 \
    --seed 42 \
    --ignore-eos \
    --temperature 0 \
    --num-prompts 200 \
    --request-rate inf \
    --max-concurrency 64
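As a sanity check on the workload above: with --ignore-eos every request generates exactly --output-len tokens, so the token totals in the benchmark tables below follow directly from the bench parameters.

```python
# Workload parameters from the vllm bench serve command above.
num_prompts = 200
input_len = 128
output_len = 2048

total_input = num_prompts * input_len    # matches "Total input tokens"
total_output = num_prompts * output_len  # matches "Total generated tokens"
print(total_input, total_output)  # 25600 409600
```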

Test Result

Main:

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  434.34    
Total input tokens:                      25600     
Total generated tokens:                  409600    
Request throughput (req/s):              0.46      
Output token throughput (tok/s):         943.04    
Peak output token throughput (tok/s):    640.00    
Peak concurrent requests:                109.00    
Total token throughput (tok/s):          1001.98   
---------------Time to First Token----------------
Mean TTFT (ms):                          1372.90   
Median TTFT (ms):                        1520.53   
P99 TTFT (ms):                           2167.85   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          54.12     
Median TPOT (ms):                        53.95     
P99 TPOT (ms):                           57.18     
---------------Inter-token Latency----------------
Mean ITL (ms):                           107.91    
Median ITL (ms):                         102.92    
P99 ITL (ms):                            165.42    
---------------Speculative Decoding---------------
Acceptance rate (%):                     99.43     
Acceptance length:                       1.99      
Drafts:                                  205340    
Draft tokens:                            205340    
Accepted tokens:                         204174    
Per-position acceptance (%):
  Position 0:                            99.43     
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Maximum request concurrency:             64        
Benchmark duration (s):                  193.55    
Total input tokens:                      25600     
Total generated tokens:                  409600    
Request throughput (req/s):              1.03      
Output token throughput (tok/s):         2116.20   
Peak output token throughput (tok/s):    1472.00   
Peak concurrent requests:                120.00    
Total token throughput (tok/s):          2248.46   
---------------Time to First Token----------------
Mean TTFT (ms):                          1238.56   
Median TTFT (ms):                        1223.26   
P99 TTFT (ms):                           2081.36   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.73     
Median TPOT (ms):                        24.87     
P99 TPOT (ms):                           26.77     
---------------Inter-token Latency----------------
Mean ITL (ms):                           49.27     
Median ITL (ms):                         45.99     
P99 ITL (ms):                            59.30     
---------------Speculative Decoding---------------
Acceptance rate (%):                     99.26     
Acceptance length:                       1.99      
Drafts:                                  205513    
Draft tokens:                            205513    
Accepted tokens:                         203997    
Per-position acceptance (%):
  Position 0:                            99.26     
==================================================
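Comparing the two runs, the quoted numbers work out to roughly a 2.2x end-to-end speedup; a quick check, using only figures taken from the tables above:

```python
# Headline numbers copied from the Main and PR benchmark tables.
main = {"duration_s": 434.34, "output_tps": 943.04, "mean_tpot_ms": 54.12}
pr = {"duration_s": 193.55, "output_tps": 2116.20, "mean_tpot_ms": 24.73}

throughput_speedup = pr["output_tps"] / main["output_tps"]   # higher is better
duration_speedup = main["duration_s"] / pr["duration_s"]     # higher is better
tpot_improvement = main["mean_tpot_ms"] / pr["mean_tpot_ms"] # higher is better

print(f"Output throughput: {throughput_speedup:.2f}x")
print(f"Benchmark duration: {duration_speedup:.2f}x")
print(f"Mean TPOT: {tpot_improvement:.2f}x")
```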

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@mergify mergify bot added the nvidia, v1, and bug (Something isn't working) labels Feb 12, 2026
Contributor

gemini-code-assist bot left a comment


Code Review

This pull request correctly enables FULL cudagraph support for sparse MLA models with MTP by changing _cudagraph_support to UNIFORM_BATCH. It also adds a necessary safeguard to prevent crashes by raising a ValueError for unsupported num_speculative_tokens > 1, which is a limitation of the fp8_paged_mqa_logits kernel. The changes are well-implemented, improving both functionality and robustness. The code is clean and the logic is sound.

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Comment thread on vllm/v1/attention/backends/mla/indexer.py (outdated)
@jeejeelee
Collaborator

cc @WoosukKwon @zyongye

MatthewBonanni and others added 2 commits February 13, 2026 09:16
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@ElizaWszola
Contributor

Two questions:

  1. What was the previous behavior when running with num_speculative_tokens > 1? Was it a kernel crash, or did the limited reported support prevent it from running at all?
  2. Does anything need to be updated in cuda_graphs.md for this PR?

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@mergify
Contributor

mergify bot commented Feb 16, 2026

Documentation preview: https://vllm--34457.org.readthedocs.build/en/34457/

@mergify mergify bot added the documentation (Improvements or additions to documentation) label Feb 16, 2026
@MatthewBonanni
Collaborator Author

@ElizaWszola

  1. Previously there was an assertion within DeepGEMM as in [Bug]: [H200] DeepSeek V3.2 MTP > 1 run into error (FLASHMLA_SPARSE backend) #31845. I was under the impression that this was due to a kernel limitation because comments throughout vLLM seem to indicate that the indexer kernel doesn't support num_speculative_tokens > 1. This may not be true, though, and it may have actually just been an issue with metadata building. @LucasWilkinson has a fix here: [BugFix] Add support for MTP num_speculative_tokens > 1 with sparse MLA #34552
  2. Sparse MLA backends weren't previously discussed in cuda_graphs.md but I've added FlashInferMLASparse to it now - good catch

github-project-automation bot moved this to Ready in NVIDIA Feb 17, 2026
LucasWilkinson added the ready (ONLY add when PR is ready to merge/full CI is needed) label Feb 17, 2026
robertgshaw2-redhat merged commit dc5fa77 into vllm-project:main Feb 17, 2026
47 of 51 checks passed
github-project-automation bot moved this from Ready to Done in NVIDIA Feb 17, 2026
MatthewBonanni deleted the fix_sparse_mtp branch February 17, 2026 19:03
wzhao18 pushed a commit to wzhao18/vllm that referenced this pull request Feb 18, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
jasonozuzu-cohere pushed a commit to jasonozuzu-cohere/vllm that referenced this pull request Feb 18, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jason Ozuzu <jasonozuzu@cohere.com>
eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Feb 19, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
ZJY0516 pushed a commit to ZJY0516/vllm that referenced this pull request Feb 23, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026
…cudagraphs (vllm-project#34457)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
simone-chen added a commit to ai-dynamo/aiconfigurator that referenced this pull request Apr 9, 2026
vLLM 0.17.0's FlashInfer sparse MLA backend (vllm-project/vllm#33451)
and DSA CUDA graph support (vllm-project/vllm#34457) leave CUDA graph
RNG offset tracking active after DeepseekV2MLAAttention construction.
Any subsequent RNG call crashes with "Offset increment outside graph
capture".  enforce_eager and manual_seed() do not clear it.

Changes:
- Replace all post-construction RNG (normal_, uniform_, randn, randint)
  with deterministic fill_/torch.full in weight init, KV cache, and
  input tensors
- Include buffers in init loop (k_scale/v_scale are buffers, not
  parameters; process_weights_after_loading asserts k_scale > 0)
- Strip auto_map from config.json — HuggingFace AutoConfig tries to
  import the custom class (configuration_deepseek.py) from the temp
  directory where it doesn't exist; vLLM doesn't need it
- Wrap MLA backend selection in set_current_vllm_config() context
  (vLLM 0.17.0 calls get_current_vllm_config() during backend selection)

See: vllm-project/vllm#39371
Signed-off-by: Simone Chen <simonec@nvidia.com>

Labels

bug (Something isn't working), documentation (Improvements or additions to documentation), nvidia, performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done

5 participants