[Attention] FA2 support more head sizes, ViT support, make default backend #28763
vllm-bot merged 11 commits into vllm-project:main
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Code Review
This pull request updates FlashAttention to support head sizes required for Vision Transformers (40, 72, 80). This is achieved by updating the dependency to a fork of flash-attention, generalizing the head size check in the FlashAttention backend, and updating tests. The logic for selecting the ViT attention backend is also refactored for clarity. My review has identified two main points. First, a critical issue in cmake/external_projects/vllm_flash_attn.cmake where the dependency points to a personal fork, which must be reverted before merging. Second, a high-severity issue in tests/kernels/attention/test_flash_attn.py where a test case for soft_cap has been removed, potentially hiding a feature regression. The other changes look good.
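The generalized head size check mentioned in the review can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `is_supported_head_size` and the rule itself (a multiple of 8, at most 256) are assumptions, chosen so that the ViT head sizes 40, 72, and 80 pass.

```python
# Hypothetical sketch: a rule-based head size check instead of a fixed allow-list.
# ASSUMPTION: the kernel accepts any head size that is a multiple of 8 and at
# most 256. Neither constant is taken from the PR itself.
MAX_HEAD_SIZE = 256       # assumed upper bound
HEAD_SIZE_MULTIPLE = 8    # assumed alignment requirement


def is_supported_head_size(head_size: int) -> bool:
    """Return True if head_size meets the assumed alignment and size limits."""
    return (
        0 < head_size <= MAX_HEAD_SIZE
        and head_size % HEAD_SIZE_MULTIPLE == 0
    )


# Head sizes the PR description lists as required by vision transformers.
VIT_HEAD_SIZES = [40, 72, 80]
vit_ok = all(is_supported_head_size(h) for h in VIT_HEAD_SIZES)
```

A rule like this avoids having to extend a hard-coded list every time a new model introduces another head size, at the cost of relying on the kernel actually handling every value the rule admits.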
Do you know if FA2 is supported too? Do you mind testing this on Ampere? I think it should be OK.
@LucasWilkinson
I didn't find any head size checks in https://github.com/vllm-project/flash-attention/blob/main/hopper/flash_api.cpp, so I guess this MR probably just extends the supported head_size list for FA2. In other words, FA3 already supports any head_size < 256 by default. @MatthewBonanni
@MoyanZitto Yes, good point. I've updated the title and description.
Purpose
This PR is paired with vllm-project/flash-attention#109, which enables FA2 to support the head sizes required for vision transformers (40, 72, and 80); FA3 supports these by default. Merge that PR first after CI passes, then I'll update the git tag here. This PR also updates the selector to make FlashAttention the default backend over xFormers.
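The change in backend priority could look roughly like the sketch below. The enum `VitBackend`, the function `select_vit_backend`, its boolean parameters, and the SDPA fallback are hypothetical names for illustration and do not reflect vLLM's actual selector API. The only behavior taken from the description is that FlashAttention is preferred over xFormers.

```python
from enum import Enum


class VitBackend(Enum):
    """Hypothetical set of ViT attention backends."""
    FLASH_ATTN = "flash_attn"
    XFORMERS = "xformers"
    TORCH_SDPA = "torch_sdpa"  # ASSUMPTION: a last-resort fallback


def select_vit_backend(flash_attn_usable: bool,
                       xformers_usable: bool) -> VitBackend:
    """Pick a backend in priority order, with FlashAttention first.

    flash_attn_usable: caller-supplied flag, e.g. the library is installed
        and the model's head size is supported (assumed criteria).
    xformers_usable: caller-supplied flag that xFormers can be used.
    """
    if flash_attn_usable:
        return VitBackend.FLASH_ATTN
    if xformers_usable:
        return VitBackend.XFORMERS
    return VitBackend.TORCH_SDPA


# FlashAttention wins whenever it is usable, even if xFormers is also present.
default_choice = select_vit_backend(flash_attn_usable=True, xformers_usable=True)
```

Keeping the priority order in one function makes the default easy to audit: swapping the first two branches would restore xFormers as the preferred backend.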
Test Plan
pytest tests/kernels/attention/test_flash_attn.py (updated with new head sizes)
Test Result
Passes