
Implement FlashAttention for quantized models #73

Closed
EricLBuehler wants to merge 6 commits

Conversation

EricLBuehler
Owner

No description provided.

@EricLBuehler added the optimization and models (Additions to model or architectures) labels on Apr 4, 2024
@EricLBuehler
Owner Author

Currently, this needs f32 flash attention kernels to work.
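For context, a minimal sketch (not code from this PR) of why the dtype matters: the quantized pipeline produces f32 activations, while the existing candle-flash-attn kernels accept only f16/bf16, so without dedicated f32 kernels the inputs would have to be round-tripped through half precision, something like:

```rust
use candle_core::{DType, Result, Tensor};

/// Sketch only: run the existing half-precision FlashAttention kernel on f32
/// inputs by casting down and back up. Native f32 kernels would avoid the two
/// casts and the associated precision loss.
fn flash_attn_f32_fallback(
    q: &Tensor, // assumed layout: (batch, seq_len, n_heads, head_dim), f32
    k: &Tensor,
    v: &Tensor,
    softmax_scale: f32,
    causal: bool,
) -> Result<Tensor> {
    let q16 = q.to_dtype(DType::F16)?;
    let k16 = k.to_dtype(DType::F16)?;
    let v16 = v.to_dtype(DType::F16)?;
    // candle-flash-attn currently ships only f16/bf16 kernels.
    let out = candle_flash_attn::flash_attn(&q16, &k16, &v16, softmax_scale, causal)?;
    // Cast back to f32 for the rest of the quantized pipeline.
    out.to_dtype(DType::F32)
}
```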


github-actions bot commented Apr 6, 2024

Code Metrics Report
  ───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        43     14101      995       599    12507        733
───────────────────────────────────────────────────────────────────────────────
Total                       43     14101      995       599    12507        733
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 83,359
Estimated Schedule Effort 9.553981 months
Estimated People Required 3.564820
───────────────────────────────────────────────────────────────────────────────
Processed 480763 bytes, 0.481 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
  

@EricLBuehler added the not planned (This will not be worked on) label on Apr 9, 2024
@lucasavila00
Contributor

@EricLBuehler Sorry to bother you, but what are the plans regarding paged attention?

Is it difficult to implement it in candle? Or mistral.rs?

I'm asking because I was considering porting AWQ to this repo (there is a candle-based implementation here: https://github.com/yinqiwen/lmsf), but I don't know how feasible it'd be to use the kernels from AWQ directly if we use paged attention.
Or would it be possible to use the kernels from vLLM? 🤔

@EricLBuehler
Owner Author

We actually have a PR for PagedAttention: #47, which uproots much of the codebase. Unfortunately, it produces junk output right now. If you want to take a look, that would be great; the state of the PR is a bit of a mess, but it builds and runs, so perhaps you can figure out what is going on.

One reason I discontinued work on PA is that we were (and are) already faster. I benchmarked against vLLM and found that we were faster with non-quantized models on the same hardware at BS=1. At BS=2, vLLM reports a precisely doubled T/s because it aggregates across the batch, whereas ours is always normalized per sequence; that gives the impression that their throughput is skyrocketing, which I think makes little sense as a comparison. Because we are already faster with non-quantized models, I decided to work on other pressing issues.
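To make the normalization difference concrete (illustrative numbers only, not benchmark results):

```rust
// Sketch of how the two reporting conventions diverge at BS > 1.
fn main() {
    let batch_size = 2.0_f64;
    let tokens_per_seq = 256.0; // tokens decoded per sequence (made-up)
    let wall_time_s = 6.4;      // made-up wall-clock time for the whole batch

    let per_sequence_tps = tokens_per_seq / wall_time_s;           // 40 T/s, normalized per sequence
    let aggregate_tps = batch_size * tokens_per_seq / wall_time_s; // 80 T/s, summed over the batch

    println!("per sequence: {per_sequence_tps:.0} T/s, aggregate: {aggregate_tps:.0} T/s");
}
```

Neither number means a single request finishes any faster; the aggregate figure simply doubles with the batch size.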

Regarding AWQ, that is something I would love to see and it would be very valuable for this project! I have my Candle fork here in case you want to submit a PR. I think a good place to start would be the vLLM AWQ kernels.
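Not part of this PR, just to sketch what a port involves: AWQ stores 4-bit weights with per-group scales and zero points, so any ported kernel (CUDA or a CPU fallback) ultimately performs a dequantization like the toy version below; the real vLLM kernels use a different packing layout and fuse this into the matmul.

```rust
/// Toy CPU reference for AWQ-style 4-bit dequantization: w = (q - zero) * scale,
/// with one scale/zero pair per weight group. Packing layout here (two nibbles
/// per byte) is illustrative only.
fn dequantize_awq_group(packed: &[u8], scale: f32, zero: u8, out: &mut Vec<f32>) {
    for byte in packed {
        let lo = byte & 0x0F; // first 4-bit weight
        let hi = byte >> 4;   // second 4-bit weight
        out.push((lo as i32 - zero as i32) as f32 * scale);
        out.push((hi as i32 - zero as i32) as f32 * scale);
    }
}
```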

It may be a good idea to restart the PagedAttention pull request after I move the custom kernels into this repo (so that we can depend on official Candle again), given the large number of changes to the codebase. Implementing PA would be a large undertaking. If you want to begin work on that, please feel free to remove as much code as you need to get it to work!

@lucasavila00
Contributor

Thanks. I meant to post this message on the other PR, sorry.

Interesting. As far as I know, PA only helps reduce memory fragmentation; it's not going to make attention itself faster. Best case, it doesn't make it slower.
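As an aside, a toy sketch of what "reducing fragmentation" means here: PagedAttention splits the KV cache into fixed-size physical blocks and gives each sequence a block table, so memory is allocated block-by-block instead of as one contiguous max-length region. Block size and names below are illustrative.

```rust
// Toy sketch of the PagedAttention block-table idea.
const BLOCK_SIZE: usize = 16; // tokens per KV block (vLLM-style default)

struct BlockTable {
    blocks: Vec<usize>, // physical block ids, in logical order
}

impl BlockTable {
    /// Map a logical token position to (physical block id, offset within block).
    fn locate(&self, pos: usize) -> (usize, usize) {
        (self.blocks[pos / BLOCK_SIZE], pos % BLOCK_SIZE)
    }
}

fn main() {
    // A 40-token sequence needs only ceil(40 / 16) = 3 blocks, placed anywhere
    // in the physical cache, so no contiguous max-length allocation is wasted.
    let table = BlockTable { blocks: vec![7, 2, 19] };
    assert_eq!(table.locate(0), (7, 0));
    assert_eq!(table.locate(17), (2, 1));
    assert_eq!(table.locate(39), (19, 7));
}
```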

@EricLBuehler deleted the quantized_flash branch on April 16, 2024, 13:02
@EricLBuehler
Owner Author

Yes, that is true. I'm not sure how necessary PA would be, but if you want to give it a try, please feel free!
