[Perf] Add decode full-graph support to FlashInfer-MLA backend #26313
LucasWilkinson merged 1 commit into vllm-project:main
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request correctly enables full CUDA graph support for decode operations in the FlashInfer-MLA attention backend. The change is implemented by creating a new FlashInferMLAMetadataBuilder class that inherits from MLACommonMetadataBuilder and sets the cudagraph_support attribute to AttentionCGSupport.UNIFORM_BATCH. The FlashInferMLABackend is then updated to use this new builder. The approach is clean, follows the existing design patterns in the codebase, and seems to correctly enable the feature as described. The changes are minimal and well-targeted. I found no issues of high or critical severity.
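For context, here is a minimal sketch of the change as the review summary describes it. The class and attribute names come from the summary above; the import paths and the `get_builder_cls` hook are assumptions about vLLM's source layout, not the exact diff.

```python
# Sketch only: names follow the review summary above; the module paths
# below are assumptions about vLLM's v1 attention backend layout.
from vllm.v1.attention.backends.mla.common import (MLACommonBackend,
                                                   MLACommonMetadataBuilder)
from vllm.v1.attention.backends.utils import AttentionCGSupport


class FlashInferMLAMetadataBuilder(MLACommonMetadataBuilder):
    # Advertise full CUDA graph support for uniform (decode-only) batches.
    cudagraph_support = AttentionCGSupport.UNIFORM_BATCH


class FlashInferMLABackend(MLACommonBackend):
    @staticmethod
    def get_builder_cls() -> type[FlashInferMLAMetadataBuilder]:
        # Point the backend at the new builder so the annotation takes effect.
        return FlashInferMLAMetadataBuilder
```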
💡 Codex Review
Here are some automated review suggestions for this pull request.
Purpose
The full CUDA graph support annotation was missing from the FlashInfer-MLA backend, even though the implementation already supports it.
Running DSR1-FP4 on 4xB200 gets me 97 TPS:
I also tested on a local development branch for MTP containing #25984 and #25987.
On that branch, with 3 MTP speculative tokens, I get 165 TPS and passing GSM8k evals.
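For reproduction, full-graph capture has to be enabled on the serving side. A hedged sketch follows: the `CompilationConfig`/`CUDAGraphMode` names are assumed from vLLM's public config API, and the model id and exact settings used for the numbers above are placeholders, not the author's actual invocation.

```python
# Sketch only: enabling full CUDA graph capture for decode batches.
# CompilationConfig / CUDAGraphMode are assumed names from vLLM's config
# API; the model id below is a hypothetical stand-in for DSR1-FP4.
from vllm import LLM
from vllm.config import CompilationConfig, CUDAGraphMode

llm = LLM(
    model="nvidia/DeepSeek-R1-FP4",  # hypothetical model id
    tensor_parallel_size=4,          # matches the 4xB200 setup above
    compilation_config=CompilationConfig(
        # Capture decode steps as full graphs; prefill stays piecewise.
        cudagraph_mode=CUDAGraphMode.FULL_AND_PIECEWISE,
    ),
)
```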
Test Plan
GSM8k was run as follows:
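The exact command is not preserved here; below is a hedged sketch of a typical GSM8k eval via lm-evaluation-harness against a running vLLM server. The `simple_evaluate` call and `local-completions` model type are from lm-eval's public API; the model id, URL, and concurrency are placeholders.

```python
# Sketch only: GSM8k via lm-evaluation-harness against a local vLLM
# server. Model id, base_url, and num_concurrent are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=nvidia/DeepSeek-R1-FP4,"                     # hypothetical
        "base_url=http://localhost:8000/v1/completions,"
        "num_concurrent=32"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])  # accuracy should match the baseline
```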
Test Result
Matches the baseline: