
[Attention] Enable masked MHA for topk sparse attention using FA4#34744

Open
MatthewBonanni wants to merge 21 commits into vllm-project:main from MatthewBonanni:fa4_masked_mha

Conversation

@MatthewBonanni
Collaborator

@MatthewBonanni MatthewBonanni commented Feb 17, 2026

Requires vllm-project/flash-attention#127; merge that first.

Purpose

For pure prefills, there is a range of sequence lengths where a masked MHA pathway is faster than sparse MQA. The DeepSeek V3.2 paper notes this:

Note that for short-sequence prefilling, we specially implement a masked MHA mode to simulate DSA, which can achieve higher efficiency under short-context conditions.
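The masked-MHA trick can be illustrated with a small NumPy sketch (purely illustrative; this is not the FA4 kernel and omits causal masking): dense attention with a mask that discards everything outside each query's top-k keys produces the same output as true top-k sparse attention, while keeping the dense MHA compute pattern that is faster at short context.

```python
import numpy as np

def masked_mha_topk(q, k, v, topk):
    """Simulate top-k sparse attention with a dense masked MHA.

    Instead of gathering only the top-k keys per query (sparse MQA),
    run dense attention and mask out scores outside the top-k set.
    q: (Lq, d); k, v: (Lk, d). Illustrative sketch, not the FA4 kernel.
    """
    topk = min(topk, k.shape[0])
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (Lq, Lk)
    # Per-row indices of the top-k scores (the "index" stage of DSA).
    idx = np.argpartition(scores, -topk, axis=-1)[:, -topk:]
    keep = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(keep, idx, True, axis=-1)
    masked = np.where(keep, scores, -np.inf)           # drop non-top-k keys
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

The masked dense result matches an explicit gather-then-attend sparse computation over the same top-k indices, which is why the two pathways are interchangeable for correctness and only differ in speed.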

This PR implements the two optimizations mentioned in #31473, point 3:

  1. Use the un-absorbed dense MHA pathway for requests with seq_len <= topk
  2. Use the un-absorbed masked MHA pathway for requests with seq_len <= threshold (here, the threshold is taken as 8192, with a minimum of 128)
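The dispatch described above can be sketched as follows (names, constants, and structure are hypothetical and for illustration only; see the PR diff for the actual vLLM implementation):

```python
# Hypothetical sketch of the prefill-pathway selection described above.
MASKED_MHA_THRESHOLD = 8192  # masked MHA beats sparse MQA below this length
MASKED_MHA_MIN = 128         # floor applied to the threshold

def choose_prefill_pathway(seq_len: int, topk: int) -> str:
    if seq_len <= topk:
        # Top-k selection is a no-op when every query can attend to all
        # keys, so run the un-absorbed dense MHA pathway directly.
        return "dense_mha"
    threshold = max(MASKED_MHA_THRESHOLD, MASKED_MHA_MIN)
    if seq_len <= threshold:
        # Short-context sweet spot: dense MHA with a top-k mask
        # ("masked MHA") is faster than sparse MQA.
        return "masked_mha"
    return "sparse_mqa"
```

For example, with topk=2048, a 1k-token prefill takes the dense pathway, a 4k prefill takes the masked-MHA pathway, and a 16k prefill falls back to sparse MQA.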

Test Plan

Correctness

pytest tests/v1/attention/test_sparse_mla_backends.py

Performance

python benchmarks/attention_benchmarks/benchmark.py --config benchmarks/attention_benchmarks/configs/mla_sparse_mha_vs_mqa.yaml

Test Result

Correctness

Should pass in CI (V1 Attention)

Performance

(Figure: mha_vs_mqa benchmark results)

@mergify

mergify bot commented Feb 17, 2026

⚠️ The sha of the head commit of this PR conflicts with #34732. Mergify cannot evaluate rules on this PR. ⚠️

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables FlashAttention 4 for MLA (Multi-head Latent Attention), particularly for the prefill stage. It introduces a new abstraction layer for different FlashAttention versions, updates the build system to include FA4 components, and adds extensive benchmarking capabilities to compare various prefill backends. The changes also adjust the default prefill backend selection for MLA on Blackwell GPUs. My review focuses on the potential impact of these changes on performance and build stability.

@mergify

mergify bot commented Feb 17, 2026

Documentation preview: https://vllm--34744.org.readthedocs.build/en/34744/

@mergify mergify bot added documentation Improvements or additions to documentation ci/build performance Performance-related issues nvidia rocm Related to AMD ROCm labels Feb 17, 2026
@mergify mergify bot added the v1 label Feb 17, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 17, 2026
@mergify

mergify bot commented Feb 21, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MatthewBonanni.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify bot commented Mar 17, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MatthewBonanni.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 17, 2026
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>