[Attention][FA3] Update FA3 to include new swizzle optimization #23465

vllm-bot merged 10 commits into vllm-project:main
Conversation
Code Review
This pull request updates the flash-attention dependency to a new commit hash, likely to incorporate the 'FA3 swizzle optimization' mentioned in the PR title. While pinning to a commit is good for reproducibility, using a commit hash that is not part of a tag or a long-lived branch can pose a risk for future builds and maintenance. I've added a comment suggesting the use of a git tag for better long-term stability.
Pinning dependencies to a specific commit hash is good for reproducibility. However, for long-term maintenance and release builds, it's better to use a git tag. Commit hashes that are not part of a branch or tag can be garbage collected by git, or become hard to track. Since this is a work-in-progress, it's acceptable for now, but it would be best to create a tag in the vllm-project/flash-attention repository for this commit before this PR is merged.
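The reviewer's point can be demonstrated with a minimal sketch: a commit that is reachable from a tag survives garbage collection, so tagging the pinned flash-attention commit protects future builds. The tag name `fa3-swizzle-v1` and the throwaway repository below are hypothetical, for illustration only.

```shell
set -e
# Throwaway repo standing in for vllm-project/flash-attention.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "FA3 swizzle optimization"
sha=$(git rev-parse HEAD)

# Tag the pinned commit; a tag makes it permanently reachable,
# so `git gc` will never prune it. (Tag name is hypothetical.)
git tag -a fa3-swizzle-v1 -m "pin FA3 swizzle commit" "$sha"

git gc --prune=now -q
git cat-file -e "$sha" && echo "pinned commit survives gc"
```

A consumer can then reference `fa3-swizzle-v1` instead of the bare hash, which is both human-readable and safe from pruning.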
force-pushed from ed1629a to 7a4e376
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
a75c6e0 to
916840d
Compare
Hi @LucasWilkinson, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
1 similar comment
force-pushed from 16aaf77 to 74296b9
force-pushed from 74296b9 to 9d203b4
This pull request has merge conflicts that must be resolved before it can be merged.
force-pushed from 8b8419d to a2aba4a
force-pushed from dfed34c to 0d884bb
Documentation preview: https://vllm--23465.org.readthedocs.build/en/23465/
force-pushed from d83aa79 to 74688d5
force-pushed from a7546ef to 34de5bb
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
The scheduler_metadata buffer was sized using max_num_seqs, but during CUDA graph capture the batch size can be up to max_cudagraph_size, which may be larger. This caused a RuntimeError when the scheduler returned more elements than the buffer could hold.

Example: with max_num_seqs=1 the buffer held 1*4+1 = 5 elements, but max_cudagraph_size=2 led to the scheduler returning 2*4+1 = 9 elements.

Fixes CI failures in test_simple_generation[FLASH_ATTN] and similar tests.

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
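The sizing mismatch described in the commit message above can be illustrated with a minimal sketch. The n*4+1 element-count formula and the parameter names come from the commit message; the helper function itself is hypothetical, not vLLM's actual code.

```python
def scheduler_metadata_size(batch_size: int) -> int:
    # Per the commit message, the FA3 scheduler returns
    # batch_size * 4 + 1 metadata elements.
    return batch_size * 4 + 1

# Before the fix: buffer sized only for max_num_seqs.
max_num_seqs = 1
buf_before = scheduler_metadata_size(max_num_seqs)

# During CUDA graph capture the batch can grow to max_cudagraph_size,
# so the scheduler can return more elements than the buffer holds.
max_cudagraph_size = 2
needed = scheduler_metadata_size(max_cudagraph_size)
assert needed > buf_before  # this overflow triggered the RuntimeError

# After the fix: size the buffer for the larger of the two limits.
buf_after = scheduler_metadata_size(max(max_num_seqs, max_cudagraph_size))
assert buf_after >= needed
```

With max_num_seqs=1 this reproduces the numbers from the commit message: a 5-element buffer versus 9 required elements during capture.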
force-pushed from b5edaac to 19ab5ba
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
…-project#23465) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Pai <416932041@qq.com>
…-project#23465) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: felix01.yu <felix01.yu@vipshop.com>
…-project#23465) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
vLLM side of vllm-project/flash-attention#82
meta-llama/Meta-Llama-3-8B-Instruct, 1xH100, 4k and 2k out