
[Attention] Enable masked MHA for topk sparse attention using FA4#34744

Open
MatthewBonanni wants to merge 21 commits into vllm-project:main from MatthewBonanni:fa4_masked_mha

Conversation

@MatthewBonanni
Collaborator

@MatthewBonanni MatthewBonanni commented Feb 17, 2026

Requires vllm-project/flash-attention#127; merge that first.

Purpose

For pure prefills, there is a range of sequence lengths where a masked MHA pathway is faster than sparse MQA. The DeepSeek V3.2 paper notes this:

Note that for short-sequence prefilling, we specially implement a masked MHA mode to simulate DSA, which can achieve higher efficiency under short-context conditions.
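The masked-MHA trick can be illustrated with a small NumPy sketch (purely illustrative; this is not the FA4 kernel and omits causal masking): dense attention with a mask that discards everything outside each query's top-k keys produces the same output as true top-k sparse attention, while keeping the dense MHA compute pattern that is faster at short context.

```python
import numpy as np

def masked_mha_topk(q, k, v, topk):
    """Simulate top-k sparse attention with a dense masked MHA.

    Instead of gathering only the top-k keys per query (sparse MQA),
    run dense attention and mask out scores outside the top-k set.
    q: (Lq, d); k, v: (Lk, d). Illustrative sketch, not the FA4 kernel.
    """
    topk = min(topk, k.shape[0])
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (Lq, Lk)
    # Per-row indices of the top-k scores (the "index" stage of DSA).
    idx = np.argpartition(scores, -topk, axis=-1)[:, -topk:]
    keep = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(keep, idx, True, axis=-1)
    masked = np.where(keep, scores, -np.inf)           # drop non-top-k keys
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

The masked dense result matches an explicit gather-then-attend sparse computation over the same top-k indices, which is why the two pathways are interchangeable for correctness and only differ in speed.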

This PR implements the two optimizations mentioned in #31473, point 3:

  1. Use the un-absorbed dense MHA pathway for requests with seq_len <= topk
  2. Use the un-absorbed masked MHA pathway for requests with seq_len <= threshold (here, the threshold is taken as 8192, with a minimum of 128)
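The dispatch described above can be sketched as follows (names, constants, and structure are hypothetical and for illustration only; see the PR diff for the actual vLLM implementation):

```python
# Hypothetical sketch of the prefill-pathway selection described above.
MASKED_MHA_THRESHOLD = 8192  # masked MHA beats sparse MQA below this length
MASKED_MHA_MIN = 128         # floor applied to the threshold

def choose_prefill_pathway(seq_len: int, topk: int) -> str:
    if seq_len <= topk:
        # Top-k selection is a no-op when every query can attend to all
        # keys, so run the un-absorbed dense MHA pathway directly.
        return "dense_mha"
    threshold = max(MASKED_MHA_THRESHOLD, MASKED_MHA_MIN)
    if seq_len <= threshold:
        # Short-context sweet spot: dense MHA with a top-k mask
        # ("masked MHA") is faster than sparse MQA.
        return "masked_mha"
    return "sparse_mqa"
```

For example, with topk=2048, a 1k-token prefill takes the dense pathway, a 4k prefill takes the masked-MHA pathway, and a 16k prefill falls back to sparse MQA.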

Test Plan

Correctness

pytest tests/v1/attention/test_sparse_mla_backends.py

Performance

python benchmarks/attention_benchmarks/benchmark.py --config benchmarks/attention_benchmarks/configs/mla_sparse_mha_vs_mqa.yaml

Test Result

Correctness

Should pass in CI (V1 Attention)

Performance

(Figure: mha_vs_mqa benchmark results)

@mergify

mergify bot commented Feb 17, 2026

⚠️ The sha of the head commit of this PR conflicts with #34732. Mergify cannot evaluate rules on this PR. ⚠️

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables FlashAttention 4 for MLA (Multi-head Latent Attention), particularly for the prefill stage. It introduces a new abstraction layer for different FlashAttention versions, updates the build system to include FA4 components, and adds extensive benchmarking capabilities to compare various prefill backends. The changes also adjust the default prefill backend selection for MLA on Blackwell GPUs. My review focuses on the potential impact of these changes on performance and build stability.

@mergify

mergify bot commented Feb 17, 2026

Documentation preview: https://vllm--34744.org.readthedocs.build/en/34744/

@mergify mergify bot added documentation Improvements or additions to documentation ci/build performance Performance-related issues nvidia rocm Related to AMD ROCm labels Feb 17, 2026
@mergify mergify bot added the v1 label Feb 17, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 17, 2026
@mergify

mergify bot commented Feb 21, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MatthewBonanni.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify bot commented Mar 17, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MatthewBonanni.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 17, 2026
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>