Implement FlashAttention for quantized models #73
Conversation
Currently we need f32 flash attention kernels for this to work.
```
Code Metrics Report
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines    Blanks  Comments      Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        43     14101       995       599     12507        733
───────────────────────────────────────────────────────────────────────────────
Total                       43     14101       995       599     12507        733
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 83,359
Estimated Schedule Effort 9.553981 months
Estimated People Required 3.564820
───────────────────────────────────────────────────────────────────────────────
Processed 480763 bytes, 0.481 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
```
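For context on what an "f32 flash attention kernel" has to compute, here is a minimal, self-contained sketch of the single-pass online-softmax attention that FlashAttention is built around, written for plain f32 slices. This is purely illustrative and not code from this repository or its CUDA kernels; the function name, shapes, and scalar loop are assumptions for readability. The point is that quantized weights are dequantized into f32 activations, so the kernel must accept f32 inputs rather than only f16/bf16.

```rust
// Illustrative sketch (not the mistral.rs kernel): one-pass "online softmax"
// attention for a single query vector over a sequence of keys/values, in f32.
fn online_softmax_attention(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>], scale: f32) -> Vec<f32> {
    let head_dim = q.len();
    let mut acc = vec![0.0f32; head_dim]; // running weighted sum of values
    let mut running_max = f32::NEG_INFINITY;
    let mut denom = 0.0f32; // running softmax denominator

    for (k, v) in keys.iter().zip(values.iter()) {
        // Scaled dot-product score for this key.
        let score: f32 = q.iter().zip(k.iter()).map(|(a, b)| a * b).sum::<f32>() * scale;

        // Update the running max and rescale previous partial sums so the
        // softmax stays numerically stable without a second pass over the keys.
        let new_max = running_max.max(score);
        let correction = (running_max - new_max).exp();
        let weight = (score - new_max).exp();

        denom = denom * correction + weight;
        for (a, &vi) in acc.iter_mut().zip(v.iter()) {
            *a = *a * correction + weight * vi;
        }
        running_max = new_max;
    }

    acc.iter().map(|x| x / denom).collect()
}

fn main() {
    let q = vec![0.1, 0.2, 0.3, 0.4];
    let keys = vec![vec![0.5, 0.1, 0.0, 0.2], vec![0.3, 0.7, 0.1, 0.0]];
    let values = vec![vec![1.0, 0.0, 0.0, 0.0], vec![0.0, 1.0, 0.0, 0.0]];
    let scale = 1.0 / (q.len() as f32).sqrt();
    println!("{:?}", online_softmax_attention(&q, &keys, &values, scale));
}
```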
@EricLBuehler I'm sorry to bother you, but what are the plans regarding paged attention? Is it difficult to implement in Candle, or in mistral.rs? I'm asking because I was considering porting AWQ to this repo (there is a Candle-based implementation here: https://github.com/yinqiwen/lmsf), but I don't know how feasible it would be to use the AWQ kernels directly if we use paged attention.
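To make the AWQ part of the question concrete, here is a rough CPU-side sketch of the kind of group-wise INT4 dequantization an AWQ kernel performs. The layout here is an assumption for illustration only: eight 4-bit weights packed into each u32 from the low nibble upward, with one scale and one zero point per group. The real vLLM/AWQ kernels use a specific interleaved packing and fuse dequantization into a GPU GEMM, so this only shows the data-layout problem, not the actual kernel.

```rust
// Illustrative sketch only: group-wise 4-bit dequantization in the style of AWQ.
// Assumed (not the real vLLM kernel layout): eight 4-bit weights per u32,
// packed from the low nibble upward, one scale and one zero point per group.
fn dequantize_int4_group(packed: &[u32], scale: f32, zero_point: u8) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 8);
    for &word in packed {
        for i in 0..8 {
            // Extract the i-th 4-bit value from the 32-bit word.
            let q = ((word >> (4 * i)) & 0xF) as i32;
            out.push((q - zero_point as i32) as f32 * scale);
        }
    }
    out
}

fn main() {
    // Two packed words = 16 weights, all belonging to the same quantization group.
    let packed = [0x7654_3210u32, 0xFEDC_BA98u32];
    let weights = dequantize_int4_group(&packed, 0.05, 8);
    println!("{weights:?}");
}
```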
We actually have a PR for PagedAttention: #47, which uproots much of the codebase. Unfortunately, it produces junk output right now. If you want to take a look, that would be great, although the state of the PR is a bit of a mess (it builds and runs, though, so perhaps you can figure out what is going on?).

One reason I discontinued work on PA is that we were/are faster! I did some benchmarking of vLLM and discovered that we were faster with non-quantized models on the same hardware at BS=1. Raising the batch size to BS=2 caused them to report a precisely doubled T/s, whereas ours is always normalized per sequence. That gives the impression that their throughput is skyrocketing, which I think makes little sense. Therefore, because we are already faster with non-quantized models, I decided to work on other pressing issues.

Regarding AWQ, that is something I would love to see and it would be very valuable for this project! I have my Candle fork here in case you want to submit a PR. I think a good place to start would be the vLLM AWQ kernels.

It may be a good idea to restart the PagedAttention pull request after I move the custom kernels to this repo (so that we can depend on the official Candle again), given the large number of changes to the codebase. Implementing PA would be a large undertaking. If you want to begin work on that, please feel free to remove as much code as you need to get it to work!
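To make the normalization point above concrete, here is a toy calculation with made-up numbers (not from any benchmark): reporting aggregate tokens/s scales roughly linearly with batch size even when each individual request sees the same speed, while the per-sequence convention stays flat.

```rust
// Toy illustration of the two throughput conventions discussed above;
// the numbers are invented and are not from any benchmark.
fn main() {
    let per_sequence_ts = 50.0_f64; // tokens/s generated for a single sequence
    for batch_size in [1u32, 2, 4] {
        // Aggregate convention: total tokens emitted per second across the batch.
        let aggregate = per_sequence_ts * batch_size as f64;
        // Per-sequence convention: what each request actually experiences.
        let normalized = aggregate / batch_size as f64;
        println!("BS={batch_size}: aggregate {aggregate} T/s, per-sequence {normalized} T/s");
    }
}
```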
Thanks. I meant to post this message on the other PR, sorry. Interesting. As far as I know, PA only helps reduce memory fragmentation; it's not going to make attention faster. In the best case, it doesn't make it slower.
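For readers unfamiliar with why PA reduces fragmentation, here is a minimal sketch of the block-table bookkeeping PagedAttention relies on. The names, block size, and structure are illustrative assumptions, not code from #47: the KV cache is carved into fixed-size blocks, each sequence maps logical token positions to physical blocks, and freeing a finished sequence returns whole blocks to the pool instead of leaving holes in one large contiguous buffer.

```rust
// Illustrative sketch of paged KV-cache block-table bookkeeping (assumed
// names/sizes, not the PagedAttention PR itself).
struct BlockAllocator {
    free_blocks: Vec<usize>, // indices of physical KV-cache blocks
    block_size: usize,       // tokens per block, e.g. 16
}

struct Sequence {
    block_table: Vec<usize>, // logical block -> physical block
    len: usize,              // tokens written so far
}

impl BlockAllocator {
    fn new(num_blocks: usize, block_size: usize) -> Self {
        Self { free_blocks: (0..num_blocks).collect(), block_size }
    }

    /// Append one token's KV entry for a sequence, grabbing a new block when
    /// the current one is full. Returns (physical_block, offset_in_block).
    fn append_token(&mut self, seq: &mut Sequence) -> Option<(usize, usize)> {
        if seq.len % self.block_size == 0 {
            let block = self.free_blocks.pop()?; // None = cache is full
            seq.block_table.push(block);
        }
        let physical = *seq.block_table.last().unwrap();
        let offset = seq.len % self.block_size;
        seq.len += 1;
        Some((physical, offset))
    }

    /// When a sequence finishes, its blocks go straight back to the pool.
    fn free(&mut self, seq: Sequence) {
        self.free_blocks.extend(seq.block_table);
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(4, 16);
    let mut seq = Sequence { block_table: Vec::new(), len: 0 };
    for _ in 0..20 {
        let _ = alloc.append_token(&mut seq);
    }
    println!("blocks used: {:?}", seq.block_table);
    alloc.free(seq);
}
```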
Yes, that is true. I'm not sure how necessary PA would be, but if you want to give it a try, please feel free!