[Attention] Use FA4 for MLA prefill#34732
Documentation preview: https://vllm--34732.org.readthedocs.build/en/34732/
Code Review
This pull request introduces support for FlashAttention-4 (FA4) for MLA prefill, which is a significant enhancement. The changes are extensive, touching the core attention logic, build system, and benchmarking infrastructure. The implementation is well-structured, including fallbacks for hardware limitations and clear configuration options. The benchmark suite has been notably improved to allow for detailed comparison of different prefill backends. I've identified one potential issue in the CMake configuration that could lead to incorrect files being installed.
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
LopezCastroRoberto
left a comment
LGTM, nice analysis!
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Purpose
Depends on vllm-project/flash-attention#123 and vllm-project/flash-attention#126
Enable using FA4 as an MLA prefill (forward_mha) backend. This PR makes FA4 the default MLA prefill backend on Blackwell due to its improved performance over TRT-LLM, especially for the extend operations relevant to the chunked prefill performed in vLLM.

Test Plan
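A minimal sketch of the default-selection behavior described above: prefer FA4 for MLA prefill on Blackwell, otherwise fall back to the previous TRT-LLM path. Function and backend names here are illustrative placeholders, not vLLM's actual internals.

```python
def select_mla_prefill_backend(compute_capability: tuple,
                               fa4_available: bool) -> str:
    """Hypothetical sketch: pick the MLA prefill (forward_mha) backend.

    FA4 becomes the default on Blackwell (SM 10.x) when it was built;
    all other hardware, or builds without FA4, keep the TRT-LLM path.
    """
    is_blackwell = compute_capability[0] == 10
    if is_blackwell and fa4_available:
        return "FA4"
    return "TRTLLM"

# Blackwell with FA4 available -> FA4 is the default
assert select_mla_prefill_backend((10, 0), True) == "FA4"
# Hardware/build fallback cases stay on TRT-LLM
assert select_mla_prefill_backend((10, 0), False) == "TRTLLM"
assert select_mla_prefill_backend((9, 0), True) == "TRTLLM"
```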
Test Result

Accuracy (DeepSeek-V2-Lite-Chat): FA4 (this PR) vs. TRTLLM (reference)
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.