
[Attention] Use FA4 for MLA prefill#34732

Merged

tlrmchlsmth merged 18 commits into vllm-project:main from MatthewBonanni:fa4_mla_prefill on Mar 12, 2026

Conversation

@MatthewBonanni (Collaborator) commented Feb 17, 2026

Purpose

Depends on vllm-project/flash-attention#123 and vllm-project/flash-attention#126

Enable using FA4 as an MLA prefill (forward_mha) backend. This PR makes FA4 the default MLA prefill backend on Blackwell due to its improved performance over TRT-LLM, especially for the extend operations used in vLLM's chunked prefill.
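For context, the "extend" pattern this refers to is: during chunked prefill, queries for the newly arrived chunk attend to the already-cached prefix plus the new chunk itself, with causality enforced only inside the new chunk. A minimal NumPy sketch of that computation (illustrative only; names are hypothetical and this is not vLLM or FA4 code):

```python
import numpy as np

def extend_attention(q_new, k_cache, v_cache, k_new, v_new):
    """'Extend' step of chunked prefill: new-chunk queries attend to the
    cached prefix plus the new chunk, causal within the new chunk."""
    k = np.concatenate([k_cache, k_new], axis=0)  # (n_ctx + n_new, d)
    v = np.concatenate([v_cache, v_new], axis=0)
    n_ctx, n_new = k_cache.shape[0], q_new.shape[0]
    scores = q_new @ k.T / np.sqrt(q_new.shape[-1])
    # query i sits at global position n_ctx + i; it may not see later keys
    key_pos = np.arange(k.shape[0])[None, :]
    q_pos = (n_ctx + np.arange(n_new))[:, None]
    scores = np.where(key_pos > q_pos, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d, n_ctx, n_new = 8, 5, 3
out = extend_attention(
    rng.standard_normal((n_new, d)),
    rng.standard_normal((n_ctx, d)), rng.standard_normal((n_ctx, d)),
    rng.standard_normal((n_new, d)), rng.standard_normal((n_new, d)),
)
print(out.shape)  # (3, 8)
```

The rectangular (n_new × n_ctx+n_new) score matrix, rather than a square causal one, is what distinguishes extend from a from-scratch prefill, which is why kernel performance on this shape matters for chunked prefill.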

Test Plan

python benchmarks/attention_benchmarks/benchmark.py --config benchmarks/attention_benchmarks/configs/mla_prefill.yaml

Test Result

(Screenshot of benchmark results, 2026-03-06)

Accuracy

DeepSeek-V2-Lite-Chat

FA4 (PR)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6732|±  |0.0129|
|     |       |strict-match    |     5|exact_match|↑  |0.6657|±  |0.0130|

TRTLLM (Reference)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6694|±  | 0.013|
|     |       |strict-match    |     5|exact_match|↑  |0.6649|±  | 0.013|
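The FA4 and TRT-LLM scores agree within measurement noise. A quick sanity check on the flexible-extract numbers from the tables above (a simple two-sample z on the difference; this calculation is mine, not part of the PR):

```python
import math

# gsm8k flexible-extract, 5-shot exact_match (values from the tables above)
fa4, fa4_se = 0.6732, 0.0129
trt, trt_se = 0.6694, 0.0130

# |z| well below 2 means the two backends agree within stderr
z = (fa4 - trt) / math.sqrt(fa4_se**2 + trt_se**2)
print(round(z, 2))  # 0.21
```

So the ~0.4-point gap is far smaller than the combined standard error, consistent with FA4 being accuracy-neutral relative to the TRT-LLM reference.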

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify bot commented Feb 17, 2026

Documentation preview: https://vllm--34732.org.readthedocs.build/en/34732/

mergify bot added the documentation, ci/build, performance, nvidia, and v1 labels on Feb 17, 2026
mergify bot commented Feb 17, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MatthewBonanni.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Feb 17, 2026
@gemini-code-assist bot (Contributor) left a comment
Code Review

This pull request introduces support for FlashAttention-4 (FA4) for MLA prefill, which is a significant enhancement. The changes are extensive, touching the core attention logic, build system, and benchmarking infrastructure. The implementation is well-structured, including fallbacks for hardware limitations and clear configuration options. The benchmark suite has been notably improved to allow for detailed comparison of different prefill backends. I've identified one potential issue in the CMake configuration that could lead to incorrect files being installed.

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
MatthewBonanni added the ready label on Mar 6, 2026
@LopezCastroRoberto (Contributor) left a comment
LGTM, nice analysis!

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Mar 12, 2026
@tlrmchlsmth tlrmchlsmth merged commit f444c05 into vllm-project:main Mar 12, 2026
117 of 118 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 12, 2026
@MatthewBonanni MatthewBonanni deleted the fa4_mla_prefill branch March 12, 2026 16:10
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Mar 16, 2026
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026

Labels

ci/build, documentation, nvidia, performance, ready, v1

Projects

Status: Done

3 participants