[Attention] FA3 Attention Sinks Perf Boost #22478
LucasWilkinson merged 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request updates the flash-attention dependency to a newer commit. Based on the PR description and the commit hash, this update is intended to bring in support for FlashAttention v3. The provided patch only reflects this dependency change in the CMake configuration. The full file contents suggest there are other related changes in the Python source code to support FA3 and also to introduce a new FlashMLA backend. Since I can only comment on the provided patch, my review is limited to the dependency update itself, which seems correct.
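For context, a dependency bump of this kind is typically a one-line change to the pinned commit in the CMake fetch configuration. A minimal sketch follows; the file path, target name, and surrounding arguments are assumptions, and the actual commit hash is not reproduced here:

```cmake
# Sketch only -- vLLM pins its flash-attention fork via FetchContent;
# the repository URL is from the PR, everything else is illustrative.
FetchContent_Declare(
  vllm-flash-attn
  GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
  GIT_TAG        <new-commit-hash>  # bumped to pick up the FA3 attention-sinks work
  GIT_PROGRESS   TRUE
)
```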
However, while reviewing the full code for context, I found a critical issue in vllm/attention/backends/flashmla.py that I believe is part of this change.
In vllm/attention/backends/flashmla.py, the __init__ method of FlashMLAImpl has the following assertion:
```python
assert is_flashmla_supported(), \
    "FlashMLA is not supported on this device"
```

The function `is_flashmla_supported()` returns a tuple `(bool, Optional[str])`. If FlashMLA is not supported, it returns `(False, "reason string")`. In Python, a non-empty tuple is truthy, so `assert (False, "reason")` passes silently. The failure is then deferred to a confusing runtime error later, which makes this a critical bug.
I suggest changing it to:

```python
supported, reason = is_flashmla_supported()
if not supported:
    raise NotImplementedError(
        f"FlashMLA is not supported on this device: {reason}")
```

This correctly checks for support and provides a helpful error message.
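The truthy-tuple pitfall described above is easy to demonstrate in isolation. The sketch below uses `mock_is_flashmla_supported` as a stand-in for the real `is_flashmla_supported()` (the return shape matches the review's description; the reason string is invented for illustration):

```python
from typing import Optional, Tuple


def mock_is_flashmla_supported() -> Tuple[bool, Optional[str]]:
    # Simulate an unsupported device, mirroring the (False, "reason") shape.
    return (False, "requires CUDA compute capability 9.0")


# BUG: a non-empty tuple is always truthy, so this assert never fires,
# even though support is absent.
assert mock_is_flashmla_supported(), "FlashMLA is not supported"  # passes silently

# FIX: unpack the tuple and check the boolean explicitly.
supported, reason = mock_is_flashmla_supported()
if not supported:
    print(f"FlashMLA is not supported on this device: {reason}")
```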
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Yiwen Chen <yiwen66@berkeley.edu>
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Purpose
vLLM side of vllm-project/flash-attention#78 (merge that first)
Shout-out to @jayhshah (the performance wizard 🪄) for the implementation
Co-authored-by: Jay Shah <jayhshah@gmail.com>