
[Attention] Use FA4 for MLA prefill#34732

Merged

tlrmchlsmth merged 18 commits into vllm-project:main from MatthewBonanni:fa4_mla_prefill on Mar 12, 2026

Conversation

@MatthewBonanni (Collaborator) commented Feb 17, 2026

Purpose

Depends on vllm-project/flash-attention#123 and vllm-project/flash-attention#126

Enable using FA4 as an MLA prefill (forward_mha) backend. This PR makes FA4 the default MLA prefill backend on Blackwell due to its improved performance over TRT-LLM, especially for the extend operations used in vLLM's chunked prefill.
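For context, the "extend" pattern this refers to is: during chunked prefill, queries for the newly arrived chunk attend to the already-cached prefix plus the new chunk itself, with causality enforced only inside the new chunk. A minimal NumPy sketch of that computation (illustrative only; names are hypothetical and this is not vLLM or FA4 code):

```python
import numpy as np

def extend_attention(q_new, k_cache, v_cache, k_new, v_new):
    """'Extend' step of chunked prefill: new-chunk queries attend to the
    cached prefix plus the new chunk, causal within the new chunk."""
    k = np.concatenate([k_cache, k_new], axis=0)  # (n_ctx + n_new, d)
    v = np.concatenate([v_cache, v_new], axis=0)
    n_ctx, n_new = k_cache.shape[0], q_new.shape[0]
    scores = q_new @ k.T / np.sqrt(q_new.shape[-1])
    # query i sits at global position n_ctx + i; it may not see later keys
    key_pos = np.arange(k.shape[0])[None, :]
    q_pos = (n_ctx + np.arange(n_new))[:, None]
    scores = np.where(key_pos > q_pos, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d, n_ctx, n_new = 8, 5, 3
out = extend_attention(
    rng.standard_normal((n_new, d)),
    rng.standard_normal((n_ctx, d)), rng.standard_normal((n_ctx, d)),
    rng.standard_normal((n_new, d)), rng.standard_normal((n_new, d)),
)
print(out.shape)  # (3, 8)
```

The rectangular (n_new × n_ctx+n_new) score matrix, rather than a square causal one, is what distinguishes extend from a from-scratch prefill, which is why kernel performance on this shape matters for chunked prefill.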

Test Plan

python benchmarks/attention_benchmarks/benchmark.py --config benchmarks/attention_benchmarks/configs/mla_prefill.yaml

Test Result

(Screenshot of benchmark results, 2026-03-06)

Accuracy

DeepSeek-V2-Lite-Chat

FA4 (PR)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6732|±  |0.0129|
|     |       |strict-match    |     5|exact_match|↑  |0.6657|±  |0.0130|

TRTLLM (Reference)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6694|±  | 0.013|
|     |       |strict-match    |     5|exact_match|↑  |0.6649|±  | 0.013|
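The FA4 and TRT-LLM scores agree within measurement noise. A quick sanity check on the flexible-extract numbers from the tables above (a simple two-sample z on the difference; this calculation is mine, not part of the PR):

```python
import math

# gsm8k flexible-extract, 5-shot exact_match (values from the tables above)
fa4, fa4_se = 0.6732, 0.0129
trt, trt_se = 0.6694, 0.0130

# |z| well below 2 means the two backends agree within stderr
z = (fa4 - trt) / math.sqrt(fa4_se**2 + trt_se**2)
print(round(z, 2))  # 0.21
```

So the ~0.4-point gap is far smaller than the combined standard error, consistent with FA4 being accuracy-neutral relative to the TRT-LLM reference.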

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify bot commented Feb 17, 2026

Documentation preview: https://vllm--34732.org.readthedocs.build/en/34732/

mergify bot added the documentation, ci/build, performance, nvidia, and v1 labels on Feb 17, 2026
mergify bot commented Feb 17, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MatthewBonanni.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Feb 17, 2026
@gemini-code-assist bot (Contributor) left a comment
Code Review

This pull request introduces support for FlashAttention-4 (FA4) for MLA prefill, which is a significant enhancement. The changes are extensive, touching the core attention logic, build system, and benchmarking infrastructure. The implementation is well-structured, including fallbacks for hardware limitations and clear configuration options. The benchmark suite has been notably improved to allow for detailed comparison of different prefill backends. I've identified one potential issue in the CMake configuration that could lead to incorrect files being installed.

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
MatthewBonanni added the ready label on Mar 6, 2026
@LopezCastroRoberto (Contributor) left a comment
LGTM, nice analysis!

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 12, 2026
@tlrmchlsmth tlrmchlsmth merged commit f444c05 into vllm-project:main Mar 12, 2026
117 of 118 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 12, 2026
@MatthewBonanni MatthewBonanni deleted the fa4_mla_prefill branch March 12, 2026 16:10
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Mar 16, 2026
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026

Labels

ci/build, documentation, nvidia, performance, ready, v1

Projects

Status: Done

3 participants