
[Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache#37252

Merged
MatthewBonanni merged 3 commits into vllm-project:main from wzhao18:wzhao/update-sparse-mla-priority on Mar 17, 2026

Conversation

wzhao18 (Contributor) commented Mar 17, 2026

Purpose

This PR sets Flashinfer sparse MLA as the default backend for FP8 KV cache, for better performance.

Test Plan

Test Result

Kernel microbenchmark results: #35807

E2E results with different TP (with EP enabled)
[Figure: nvidia/DeepSeek-V3.2-NVFP4 backend comparison, ISL=8192 / OSL=1024]

  • Flashinfer shows significantly better performance across the Pareto front for TP=1, 4, and 8.
  • For TP=2, it shows slightly worse performance than FlashMLA at higher concurrency. This matches a known issue from the kernel microbenchmarks, where the kernel performs unusually poorly at TP=2. Given the small E2E performance gap, and for simplicity, we set Flashinfer as the default backend for all TP sizes when the KV cache dtype is FP8.

Command:

vllm serve \
    nvidia/DeepSeek-V3.2-NVFP4 \
    --trust-remote-code \
    --stream-interval 20 \
    --no-enable-prefix-caching \
    --kv-cache-dtype fp8 \
    --attention-backend {FLASHMLA_SPARSE, FLASHINFER_MLA_SPARSE} \
    --data-parallel-size {1,2,4,8} \
    --tensor-parallel-size {8,4,2,1} \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.8
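The braced flags above denote a sweep; the DP and TP values appear to be paired positionally so that each run occupies 8 GPUs total (an assumption based on the matched orderings, not stated in the PR). One concrete point from the sweep, as a sketch:

```shell
# Sketch of a single run from the sweep above (assumes an 8-GPU node):
# FlashInfer sparse MLA backend with TP=8, DP=1.
vllm serve \
    nvidia/DeepSeek-V3.2-NVFP4 \
    --trust-remote-code \
    --stream-interval 20 \
    --no-enable-prefix-caching \
    --kv-cache-dtype fp8 \
    --attention-backend FLASHINFER_MLA_SPARSE \
    --data-parallel-size 1 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.8
```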

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

wzhao18 added 2 commits March 17, 2026 04:16
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
mergify bot commented Mar 17, 2026

Documentation preview: https://vllm--37252.org.readthedocs.build/en/37252/

mergify bot added the documentation (Improvements or additions to documentation) and nvidia labels Mar 17, 2026
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request modifies the attention backend selection logic to prioritize Flashinfer sparse MLA for FP8 KV cache on Blackwell GPUs. The change is implemented in vllm/platforms/cuda.py by updating _get_backend_priorities to consider the kv_cache_dtype. The corresponding documentation in docs/design/attention_backends.md has been updated to reflect the new backend priority. The pull request also refactors type hints in vllm/platforms/cuda.py by introducing from __future__ import annotations.
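The dtype-aware priority logic described above can be sketched as follows. This is a hypothetical illustration, not vLLM's actual implementation; the function name is invented, and only the preference order (FlashInfer sparse MLA first when the KV cache is FP8) is taken from the PR.

```python
# Hypothetical sketch of dtype-aware backend prioritization, loosely
# mirroring what this PR describes for _get_backend_priorities in
# vllm/platforms/cuda.py. Names are illustrative only.

def sparse_mla_backend_priorities(kv_cache_dtype: str) -> list[str]:
    """Return candidate sparse-MLA backends, highest priority first."""
    if kv_cache_dtype.startswith("fp8"):
        # With an FP8 KV cache, prefer the FlashInfer sparse MLA kernel,
        # which benchmarked faster across the Pareto front for TP=1/4/8.
        return ["FLASHINFER_MLA_SPARSE", "FLASHMLA_SPARSE"]
    # Otherwise keep FlashMLA sparse as the top candidate.
    return ["FLASHMLA_SPARSE", "FLASHINFER_MLA_SPARSE"]

print(sparse_mla_backend_priorities("fp8")[0])   # FLASHINFER_MLA_SPARSE
print(sparse_mla_backend_priorities("auto")[0])  # FLASHMLA_SPARSE
```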

wzhao18 (Contributor, Author) commented Mar 17, 2026

cc @MatthewBonanni

MatthewBonanni (Collaborator) left a comment


LGTM, thanks! Can you update the documentation to discuss when one backend is preferred over the other? I should have done this when I conditioned the choice on num_heads, but neglected to. You'll need to modify the generator script; it can be as simple as adding an asterisk to each of those entries and a footnote somewhere.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 17, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@wzhao18 wzhao18 requested a review from hmellor as a code owner March 17, 2026 16:52
wzhao18 (Contributor, Author) commented Mar 17, 2026

@MatthewBonanni Done. Thanks!

@MatthewBonanni MatthewBonanni added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 17, 2026
@MatthewBonanni MatthewBonanni enabled auto-merge (squash) March 17, 2026 17:26
@MatthewBonanni MatthewBonanni merged commit b36adfa into vllm-project:main Mar 17, 2026
45 of 46 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 17, 2026
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026

Labels

documentation — Improvements or additions to documentation
nvidia
ready — ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done
