[Attention] Enable masked MHA for topk sparse attention using FA4 #34744
MatthewBonanni wants to merge 21 commits into vllm-project:main
Conversation
Code Review
This pull request enables FlashAttention 4 for MLA (Multi-head Latent Attention), particularly for the prefill stage. It introduces a new abstraction layer for different FlashAttention versions, updates the build system to include FA4 components, and adds extensive benchmarking capabilities to compare various prefill backends. The changes also adjust the default prefill backend selection for MLA on Blackwell GPUs. My review focuses on the potential impact of these changes on performance and build stability.
Documentation preview: https://vllm--34744.org.readthedocs.build/en/34744/
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Requires vllm-project/flash-attention#127; merge that first.
Purpose
For pure prefills, there is a sweet spot of sequence lengths where a masked MHA pathway is faster than sparse MQA, as noted in the DeepSeek V3.2 paper.
This PR implements the two optimizations mentioned in #31473, point 3, enabling the masked MHA pathway when:

- `seq_len <= topk`
- `seq_len <= threshold` (here, the threshold is taken as 8192, and we also use a minimum of 128)

Test Plan
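The two conditions above can be sketched as follows. This is a hypothetical illustration only; `use_masked_mha` and its parameter names are not the actual vLLM implementation, and the exact handling of the 128 floor is an assumption based on the description above.

```python
def use_masked_mha(seq_len: int, topk: int,
                   threshold: int = 8192, min_threshold: int = 128) -> bool:
    """Decide whether the masked dense MHA prefill pathway should be used
    instead of sparse MQA (hypothetical sketch, not the vLLM code)."""
    # Condition 1: each query attends to at most `topk` keys, so when the
    # whole sequence fits within topk, the sparse selection is a no-op and
    # dense masked MHA computes the same result.
    if seq_len <= topk:
        return True
    # Condition 2: below a tuned length threshold (clamped to a floor of
    # 128), the masked MHA pathway is faster than sparse MQA.
    return seq_len <= max(threshold, min_threshold)
```

For example, with `topk = 2048`, a 4096-token prefill would take the masked MHA pathway (it is under the 8192 threshold), while a 100k-token prefill would fall back to sparse MQA.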
Correctness
Performance
Test Result
Correctness
Should pass in CI (V1 Attention)
Performance
Essential Elements of an Effective PR Description Checklist
- `supported_models.md` and `examples` for a new model.