Implement FlashAttention for quantized models #73
Conversation
Currently we need f32 flash attention kernels for this to work.
```
Code Metrics Report
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines    Blanks  Comments      Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        43     14101       995       599     12507        733
───────────────────────────────────────────────────────────────────────────────
Total                       43     14101       995       599     12507        733
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop 83,359
Estimated Schedule Effort 9.553981 months
Estimated People Required 3.564820
───────────────────────────────────────────────────────────────────────────────
Processed 480763 bytes, 0.481 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
```
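For context on what an "f32 flash attention kernel" has to compute, here is a minimal, self-contained sketch of the single-pass online-softmax attention that FlashAttention is built around, written for plain f32 slices. This is purely illustrative and not code from this repository or its CUDA kernels; the function name, shapes, and scalar loop are assumptions for readability. The point is that quantized weights are dequantized into f32 activations, so the kernel must accept f32 inputs rather than only f16/bf16.

```rust
// Illustrative sketch (not the mistral.rs kernel): one-pass "online softmax"
// attention for a single query vector over a sequence of keys/values, in f32.
fn online_softmax_attention(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>], scale: f32) -> Vec<f32> {
    let head_dim = q.len();
    let mut acc = vec![0.0f32; head_dim]; // running weighted sum of values
    let mut running_max = f32::NEG_INFINITY;
    let mut denom = 0.0f32; // running softmax denominator

    for (k, v) in keys.iter().zip(values.iter()) {
        // Scaled dot-product score for this key.
        let score: f32 = q.iter().zip(k.iter()).map(|(a, b)| a * b).sum::<f32>() * scale;

        // Update the running max and rescale previous partial sums so the
        // softmax stays numerically stable without a second pass over the keys.
        let new_max = running_max.max(score);
        let correction = (running_max - new_max).exp();
        let weight = (score - new_max).exp();

        denom = denom * correction + weight;
        for (a, &vi) in acc.iter_mut().zip(v.iter()) {
            *a = *a * correction + weight * vi;
        }
        running_max = new_max;
    }

    acc.iter().map(|x| x / denom).collect()
}

fn main() {
    let q = vec![0.1, 0.2, 0.3, 0.4];
    let keys = vec![vec![0.5, 0.1, 0.0, 0.2], vec![0.3, 0.7, 0.1, 0.0]];
    let values = vec![vec![1.0, 0.0, 0.0, 0.0], vec![0.0, 1.0, 0.0, 0.0]];
    let scale = 1.0 / (q.len() as f32).sqrt();
    println!("{:?}", online_softmax_attention(&q, &keys, &values, scale));
}
```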
@EricLBuehler I'm sorry to bother you, but what are the plans regarding paged attention? Is it difficult to implement in Candle, or in mistral.rs? I'm asking because I was considering porting AWQ to this repo (there is a Candle-based implementation here: https://github.com/yinqiwen/lmsf), but I don't know how feasible it would be to use the AWQ kernels directly if we use paged attention.
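To make the AWQ part of the question concrete, here is a rough CPU-side sketch of the kind of group-wise INT4 dequantization an AWQ kernel performs. The layout here is an assumption for illustration only: eight 4-bit weights packed into each u32 from the low nibble upward, with one scale and one zero point per group. The real vLLM/AWQ kernels use a specific interleaved packing and fuse dequantization into a GPU GEMM, so this only shows the data-layout problem, not the actual kernel.

```rust
// Illustrative sketch only: group-wise 4-bit dequantization in the style of AWQ.
// Assumed (not the real vLLM kernel layout): eight 4-bit weights per u32,
// packed from the low nibble upward, one scale and one zero point per group.
fn dequantize_int4_group(packed: &[u32], scale: f32, zero_point: u8) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 8);
    for &word in packed {
        for i in 0..8 {
            // Extract the i-th 4-bit value from the 32-bit word.
            let q = ((word >> (4 * i)) & 0xF) as i32;
            out.push((q - zero_point as i32) as f32 * scale);
        }
    }
    out
}

fn main() {
    // Two packed words = 16 weights, all belonging to the same quantization group.
    let packed = [0x7654_3210u32, 0xFEDC_BA98u32];
    let weights = dequantize_int4_group(&packed, 0.05, 8);
    println!("{weights:?}");
}
```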
We actually have a PR for PagedAttention: #47, which uproots much of the codebase. Unfortunately, it produces junk output right now. If you want to take a look, that would be great, although the state of the PR is a bit of a mess (it builds and runs, though, so perhaps you can figure out what is going on?).

One reason I discontinued work on PA is that we were/are faster! I did some benchmarking of vLLM and discovered that we were faster with non-quantized models on the same hardware at BS=1. Raising the batch size to BS=2 caused them to report a precisely doubled T/s, whereas ours is always normalized per sequence. That gives the impression that their throughput is skyrocketing, which I think makes little sense. Therefore, because we are already faster with non-quantized models, I decided to work on other pressing issues.

Regarding AWQ, that is something I would love to see and it would be very valuable for this project! I have my Candle fork here in case you want to submit a PR. I think a good place to start would be the vLLM AWQ kernels.

It may be a good idea to restart the PagedAttention pull request after I move the custom kernels to this repo (so that we can depend on the official Candle again), given the large number of changes to the codebase. Implementing PA would be a large undertaking. If you want to begin work on that, please feel free to remove as much code as you need to get it to work!
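To make the normalization point above concrete, here is a toy calculation with made-up numbers (not from any benchmark): reporting aggregate tokens/s scales roughly linearly with batch size even when each individual request sees the same speed, while the per-sequence convention stays flat.

```rust
// Toy illustration of the two throughput conventions discussed above;
// the numbers are invented and are not from any benchmark.
fn main() {
    let per_sequence_ts = 50.0_f64; // tokens/s generated for a single sequence
    for batch_size in [1u32, 2, 4] {
        // Aggregate convention: total tokens emitted per second across the batch.
        let aggregate = per_sequence_ts * batch_size as f64;
        // Per-sequence convention: what each request actually experiences.
        let normalized = aggregate / batch_size as f64;
        println!("BS={batch_size}: aggregate {aggregate} T/s, per-sequence {normalized} T/s");
    }
}
```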
Thanks. I meant to post this message on the other PR, sorry. Interesting. As far as I know, PA only helps reduce memory fragmentation; it's not going to make attention faster. In the best case, it doesn't make it slower.
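For readers unfamiliar with why PA reduces fragmentation, here is a minimal sketch of the block-table bookkeeping PagedAttention relies on. The names, block size, and structure are illustrative assumptions, not code from #47: the KV cache is carved into fixed-size blocks, each sequence maps logical token positions to physical blocks, and freeing a finished sequence returns whole blocks to the pool instead of leaving holes in one large contiguous buffer.

```rust
// Illustrative sketch of paged KV-cache block-table bookkeeping (assumed
// names/sizes, not the PagedAttention PR itself).
struct BlockAllocator {
    free_blocks: Vec<usize>, // indices of physical KV-cache blocks
    block_size: usize,       // tokens per block, e.g. 16
}

struct Sequence {
    block_table: Vec<usize>, // logical block -> physical block
    len: usize,              // tokens written so far
}

impl BlockAllocator {
    fn new(num_blocks: usize, block_size: usize) -> Self {
        Self { free_blocks: (0..num_blocks).collect(), block_size }
    }

    /// Append one token's KV entry for a sequence, grabbing a new block when
    /// the current one is full. Returns (physical_block, offset_in_block).
    fn append_token(&mut self, seq: &mut Sequence) -> Option<(usize, usize)> {
        if seq.len % self.block_size == 0 {
            let block = self.free_blocks.pop()?; // None = cache is full
            seq.block_table.push(block);
        }
        let physical = *seq.block_table.last().unwrap();
        let offset = seq.len % self.block_size;
        seq.len += 1;
        Some((physical, offset))
    }

    /// When a sequence finishes, its blocks go straight back to the pool.
    fn free(&mut self, seq: Sequence) {
        self.free_blocks.extend(seq.block_table);
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(4, 16);
    let mut seq = Sequence { block_table: Vec::new(), len: 0 };
    for _ in 0..20 {
        let _ = alloc.append_token(&mut seq);
    }
    println!("blocks used: {:?}", seq.block_table);
    alloc.free(seq);
}
```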
Yes, that is true. I'm not sure how necessary PA would be, but if you want to give it a try, please feel free!