
[Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache#37252

Merged
MatthewBonanni merged 3 commits into vllm-project:main from wzhao18:wzhao/update-sparse-mla-priority on Mar 17, 2026

Conversation

wzhao18 (Contributor) commented Mar 17, 2026

Purpose

This PR sets Flashinfer sparse MLA as the default backend for FP8 KV cache, for better performance.

Test Plan

Test Result

Kernel microbenchmark results: #35807

E2E results with different TP (with EP enabled)
[Figure: nvidia/DeepSeek-V3.2-NVFP4 backend comparison, ISL=8192 / OSL=1024]

  • Flashinfer shows significantly better performance across the Pareto front for TP=1, 4, and 8.
  • For TP=2, it shows slightly worse performance than FlashMLA at higher concurrency. This matches a known issue from the kernel microbenchmarks, where the kernel performs unusually poorly at TP=2. Given the small E2E performance gap, and for simplicity, we set Flashinfer as the default backend for all TP sizes when the KV cache dtype is FP8.

Command:

vllm serve \
    nvidia/DeepSeek-V3.2-NVFP4 \
    --trust-remote-code \
    --stream-interval 20 \
    --no-enable-prefix-caching \
    --kv-cache-dtype fp8 \
    --attention-backend {FLASHMLA_SPARSE, FLASHINFER_MLA_SPARSE} \
    --data-parallel-size {1,2,4,8} \
    --tensor-parallel-size {8,4,2,1} \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.8
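The braced flags above denote a sweep; the DP and TP values appear to be paired positionally so that each run occupies 8 GPUs total (an assumption based on the matched orderings, not stated in the PR). One concrete point from the sweep, as a sketch:

```shell
# Sketch of a single run from the sweep above (assumes an 8-GPU node):
# FlashInfer sparse MLA backend with TP=8, DP=1.
vllm serve \
    nvidia/DeepSeek-V3.2-NVFP4 \
    --trust-remote-code \
    --stream-interval 20 \
    --no-enable-prefix-caching \
    --kv-cache-dtype fp8 \
    --attention-backend FLASHINFER_MLA_SPARSE \
    --data-parallel-size 1 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.8
```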

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

wzhao18 added 2 commits March 17, 2026 04:16
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
mergify bot commented Mar 17, 2026

Documentation preview: https://vllm--37252.org.readthedocs.build/en/37252/

mergify bot added the documentation (Improvements or additions to documentation) and nvidia labels Mar 17, 2026
gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request modifies the attention backend selection logic to prioritize Flashinfer sparse MLA for FP8 KV cache on Blackwell GPUs. The change is implemented in vllm/platforms/cuda.py by updating _get_backend_priorities to consider the kv_cache_dtype. The corresponding documentation in docs/design/attention_backends.md has been updated to reflect the new backend priority. The pull request also refactors type hints in vllm/platforms/cuda.py by introducing from __future__ import annotations.
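The dtype-aware priority logic described above can be sketched as follows. This is a hypothetical illustration, not vLLM's actual implementation; the function name is invented, and only the preference order (FlashInfer sparse MLA first when the KV cache is FP8) is taken from the PR.

```python
# Hypothetical sketch of dtype-aware backend prioritization, loosely
# mirroring what this PR describes for _get_backend_priorities in
# vllm/platforms/cuda.py. Names are illustrative only.

def sparse_mla_backend_priorities(kv_cache_dtype: str) -> list[str]:
    """Return candidate sparse-MLA backends, highest priority first."""
    if kv_cache_dtype.startswith("fp8"):
        # With an FP8 KV cache, prefer the FlashInfer sparse MLA kernel,
        # which benchmarked faster across the Pareto front for TP=1/4/8.
        return ["FLASHINFER_MLA_SPARSE", "FLASHMLA_SPARSE"]
    # Otherwise keep FlashMLA sparse as the top candidate.
    return ["FLASHMLA_SPARSE", "FLASHINFER_MLA_SPARSE"]

print(sparse_mla_backend_priorities("fp8")[0])   # FLASHINFER_MLA_SPARSE
print(sparse_mla_backend_priorities("auto")[0])  # FLASHMLA_SPARSE
```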

wzhao18 (Contributor, Author) commented Mar 17, 2026

cc @MatthewBonanni

MatthewBonanni (Collaborator) left a comment


LGTM, thanks! Can you update the documentation to discuss when one backend is preferred over the other? I should have done this when I conditioned the choice on num_heads, but neglected to. You'll need to modify the generator script; it can be as simple as adding an asterisk to each of those entries and a footnote somewhere.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 17, 2026
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@wzhao18 wzhao18 requested a review from hmellor as a code owner March 17, 2026 16:52
wzhao18 (Contributor, Author) commented Mar 17, 2026

@MatthewBonanni Done. Thanks!

@MatthewBonanni MatthewBonanni added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 17, 2026
@MatthewBonanni MatthewBonanni enabled auto-merge (squash) March 17, 2026 17:26
@MatthewBonanni MatthewBonanni merged commit b36adfa into vllm-project:main Mar 17, 2026
45 of 46 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 17, 2026
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026

Labels

documentation — Improvements or additions to documentation
nvidia
ready — ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done
