Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
@classmethod
def get_supported_head_sizes(cls) -> list[int]:
    # FIXME (zyongye): change this until FA4 support more head_dim
    if envs.VLLM_FLASH_ATTN_VERSION == 4:
        return [64, 96, 128]
    return [32, 64, 96, 128, 160, 192, 224, 256]


@staticmethod
def get_supported_kernel_block_size() -> list[int | MultipleOf]:
    if envs.VLLM_FLASH_ATTN_VERSION == 4:
        return [128]
```
Accept FA4 in flash attention version selection
These new branches rely on VLLM_FLASH_ATTN_VERSION == 4 to activate FA4-specific behavior, but get_flash_attn_version still asserts that the environment variable is only 2 or 3. Setting VLLM_FLASH_ATTN_VERSION=4 to reach this code path currently triggers an AssertionError during backend initialization, so the FA4 code here is unreachable and the feature cannot be enabled. The version-selection logic needs to be updated to admit 4 (and handle unsupported hardware) before these branches will ever execute.
LucasWilkinson left a comment:
Thanks for doing this! I think it's fine to import from flash_attn upstream for now since it's currently only activated by an env var.
Apologies for the delay but will update our fork soon (just need #24002)
```python
assert device_capability is not None

# 1. default version depending on platform
fa_version = (
```
should we update this if we are requiring the env var to be set to enable FA4?
you're right. I will revert it for now.
I also have this PR to add FA4 to the supported versions in the FA repo.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Closed due to #32974.
Ongoing integration for Flash Attention 4.
How to run
Accuracy bench
openai/gpt-oss-20b, GPQA:
Perf benchmark (FlashInfer is significantly better than FA4 for now, probably because split-KV hasn't been implemented yet):

Qwen3-0.6B, 1000:1000x256
- FA4:
- FlashInfer:

8000:1x256
- FA4:
- FlashInfer:
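For context on the split-KV remark above: split-KV partitions the KV cache across parallel workers during decode and merges the partial attention outputs using each partition's log-sum-exp statistic, which is what keeps long-context, small-batch decode (like the 8000:1 case) from underutilizing the GPU. A minimal pure-Python illustration of that merge (illustrative only; not FA4 or FlashInfer code):

```python
import math

def attend(q, k, v):
    # Plain attention for one query vector over one KV partition.
    # q: list[float]; k, v: lists of row vectors. Returns (output, lse).
    scores = [sum(qi * ki for qi, ki in zip(q, row)) for row in k]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    out = [sum(e * row[d] for e, row in zip(exps, v)) / denom
           for d in range(len(v[0]))]
    return out, m + math.log(denom)

def split_kv_attend(q, k, v, n_splits):
    # Attend over each KV partition independently, then combine the
    # partial outputs with weights derived from each partition's
    # log-sum-exp, reproducing the full softmax exactly.
    step = (len(k) + n_splits - 1) // n_splits
    parts = [attend(q, k[i:i + step], v[i:i + step])
             for i in range(0, len(k), step)]
    m = max(lse for _, lse in parts)
    weights = [math.exp(lse - m) for _, lse in parts]
    total = sum(weights)
    return [sum(w * o[d] for w, (o, _) in zip(weights, parts)) / total
            for d in range(len(v[0]))]
```

The merged result is bit-for-bit the same attention output as the unsplit computation (up to floating-point rounding), which is why kernels can split freely for parallelism.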