Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Isotr0py left a comment:
BTW, it would be best for this PR to include some benchmarks to show the performance improvement as well.
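As a rough starting point, a minimal offline comparison could look like the sketch below. This is not part of the PR; the model name, prompt count, and backend value are placeholders, and it simply times generation with the attention backend forced via the VLLM_ATTENTION_BACKEND environment variable (run once per backend and compare).

# Rough throughput comparison sketch (not part of this PR).
# Model, prompt count, and backend value are placeholders.
import os
import time

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # set before importing vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", max_model_len=4096)
prompts = ["Describe a sunset over the ocean in one sentence."] * 32
params = SamplingParams(max_tokens=64)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{elapsed:.1f}s total, {total_tokens / elapsed:.1f} tok/s")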
If flash_attn_varlen_func from vllm-flash-attn works well, we can remove the original FA implementation as well.
Should we keep the original FA implementation for other platforms like ROCm that do not support FA through vllm-flash-attn?
Suggested change:
elif self.attn_backend == _Backend.FLASH_ATTN_VLLM_V1:
elif self.attn_backend in (_Backend.FLASH_ATTN_VLLM_V1, _Backend.FLASH_ATTN):
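For readers following the thread, a simplified sketch of what this branch dispatches to might look like the following. It is an approximation based on the discussion above, not the exact code in this PR; q, k, v, cu_seqlens, and max_seqlen are assumed to be in scope, and the xFormers fallback is only indicative.

# Simplified dispatch sketch (approximation, not the exact diff).
# q, k, v: packed as (total_tokens, num_heads, head_dim); cu_seqlens and
# max_seqlen describe the variable-length image sequences.
from vllm.vllm_flash_attn import flash_attn_varlen_func

if self.attn_backend in (_Backend.FLASH_ATTN_VLLM_V1, _Backend.FLASH_ATTN):
    output = flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens,
        cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen,
        max_seqlen_k=max_seqlen,
    )
elif self.attn_backend == _Backend.XFORMERS:
    # Fall back to the xFormers path on platforms without vllm-flash-attn.
    ...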
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Force-pushed from 0486de7 to 37e7446
Looks like
The check seems to be happening here
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
It loads fine with the normal
Okay, so I did a little digging and it looks like this PR (#8910) was opened to fix that issue, which relied on this PR (vllm-project/flash-attention#21) to actually disable that flag and build it with the new head sizes. It seems the conversation went stale after @njhill and @WoosukKwon discussed this for a bit. Is there any reason why we can't go ahead with this and make sure we build for all head sizes? It seems more pragmatic than making users who rely on a single Docker image install both
Just wanted to bring this back up in case anyone can provide any direction on what the path forward could be. If we feel that enabling this in
Unfortunately we don't have much wheel size to spare with the recent Blackwell addition. I am going OOO for the next 2 weeks, but I can investigate the wheel size impact when I get back. It looks like this is mostly an FA2 problem, so we can potentially use
So, if I understood this issue correctly, I can use the flash_attn backend with a multimodal model using the original FlashAttention 2, correct?
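If that is the intent, a hedged usage sketch would be roughly the following; the model name, image path, and prompt template are placeholders, and the actual behavior depends on which flash-attn build is installed.

# Hypothetical usage: force the FLASH_ATTN backend for a multimodal model.
# Model name, image path, and prompt template are placeholders.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")
outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is in this image? ASSISTANT:",
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)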
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Hi @vrdn-23, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, the installed hooks will run these checks automatically.
Closing because this should be solved by #28763
FIX #18324