Faster AVX2 matrix multiplications for MoE models #428

Merged · 3 commits into Mozilla-Ocho:main · May 21, 2024

Conversation

@ikawrakow (Contributor):

This PR is a follow-up to PRs #394 and #405: it enables the faster matrix multiplications for legacy and k-quants introduced there for MoE models as well.
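
A minimal conceptual sketch of why this helps (this is not llamafile's actual code; all names are made up, it uses plain floats instead of quantized blocks, and it assumes one expert per row for brevity): in an MoE layer, the rows routed to the same expert can be grouped so that a single tiled GEMM call serves all of them, which is where the tiled kernels from #394/#405 pay off.

```cpp
// Conceptual sketch only, not the actual llamafile implementation.
// Idea: group activation rows by expert, then run one tiled GEMM per expert
// instead of one vector-matrix product per row, so each weight tile the
// kernel unpacks is reused across several rows.
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for a tiled kernel (the real ones operate on quantized blocks with
// AVX2 intrinsics): C[m x n] = A[m x k] * B^T, with B stored row-major [n x k].
static void tiled_gemm(const float *A, const float *B, float *C,
                       std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t l = 0; l < k; ++l)
                acc += A[i * k + l] * B[j * k + l];
            C[i * n + j] = acc;
        }
}

// Simplified MoE matmul: each of the m activation rows goes to exactly one
// expert here; real MoE routes every token to its top-k experts.
void moe_matmul(const float *acts, std::size_t m, std::size_t k,
                const std::vector<const float *> &experts, std::size_t n,
                const std::vector<int> &row_to_expert, float *out) {
    std::vector<std::vector<std::size_t>> rows_of(experts.size());
    for (std::size_t r = 0; r < m; ++r)
        rows_of[row_to_expert[r]].push_back(r);

    for (std::size_t e = 0; e < experts.size(); ++e) {
        const auto &rows = rows_of[e];
        if (rows.empty()) continue;
        std::vector<float> packed(rows.size() * k), result(rows.size() * n);
        for (std::size_t i = 0; i < rows.size(); ++i)   // gather this expert's rows
            std::copy(acts + rows[i] * k, acts + (rows[i] + 1) * k,
                      packed.begin() + i * k);
        tiled_gemm(packed.data(), experts[e], result.data(), rows.size(), n, k);
        for (std::size_t i = 0; i < rows.size(); ++i)   // scatter results back
            std::copy(result.begin() + i * n, result.begin() + (i + 1) * n,
                      out + rows[i] * n);
    }
}
```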

The following table compares prompt processing speed (PP-512, in tokens per second) between the main branch and this PR for Mixtral 8x7B on a Ryzen 7950X CPU.

| Quantization | PP-512 (main) | PP-512 (PR) | Speedup |
|--------------|---------------|-------------|---------|
| Q4_0         | 59.2          | -           | -       |
| Q4_1         | 35.3          | 69.6        | 1.97    |
| Q5_0         | 30.6          | 65.4        | 2.14    |
| Q5_1         | 29.5          | 64.0        | 2.17    |
| Q2_K_S       | 66.8          | 88.9        | 1.33    |
| Q3_K_S       | 45.2          | 85.3        | 1.89    |
| Q4_K_S       | 53.4          | 81.8        | 1.53    |
| Q5_K_S       | 38.6          | 75.0        | 1.94    |
| Q6_K         | 41.8          | 85.6        | 2.05    |
| IQ4_XS       | 41.6          | 76.1        | 1.83    |

@jart self-requested a review · May 21, 2024 01:03
@jart (Collaborator) left a comment:

Another amazing change! Very excited to see this. Just one small ask.

Review comment on llamafile/sgemm.cpp (outdated, resolved).
@jart (Collaborator) left a comment:

Thank you!

@jart merged commit 938cf72 into Mozilla-Ocho:main on May 21, 2024.
@jart (Collaborator) commented May 21, 2024:

On AMD Ryzen Threadripper PRO 7995WX with Mixtral 8x7b I'm seeing speedups for Q5_K_M as high as 2.6x. Some quick measurements on my end, for a context size of 20900 and a prompt size of 1611 tokens:

| quant  | tok/sec before | tok/sec after | speedup |
|--------|----------------|---------------|---------|
| Q2_K   | 153.95         | 195.15        | 1.27x   |
| Q5_K_M | 121.16         | 314.70        | 2.60x   |

So once again, outstanding work!

P.S. I'm going to be looking into integrating the llama-bench command sometime soon.

@ikawrakow (Contributor, Author) commented:

@jart

Yes, having llama-bench available would be very useful. Thanks for the Ryzen 7995WX performance numbers. I'm curious to see how the latest version in PR #435 does on that CPU.

Btw, I have done an ARM_NEON implementation as well, out of curiosity to see what is possible. I'm getting roughly a 2x improvement compared to mainline llama.cpp on my M2 Max, but this is still not better than just using the Accelerate framework, so I'm wondering whether there is any benefit to adding it to llamafile.
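
As a rough illustration of the kind of kernel meant here (this is not the actual implementation; the block layout and names are assumptions made for the sketch), the core building block on ARM is an int8 dot product over quantization blocks using the ARMv8.2 dot-product instructions, compiled with something like `-march=armv8.2-a+dotprod`:

```cpp
// Illustrative only: a NEON int8 dot product over 32-element quantization
// blocks, the inner loop of a quantized matrix multiplication on ARM.
// The block layout here is an assumption for this sketch, not the real format.
#include <arm_neon.h>
#include <cstdint>

struct BlockQ8 {
    float scale;    // per-block scale
    int8_t qs[32];  // 32 signed 8-bit quants
};

float dot_q8(const BlockQ8 *x, const BlockQ8 *y, int nblocks) {
    float sum = 0.0f;
    for (int i = 0; i < nblocks; ++i) {
        int32x4_t acc = vdupq_n_s32(0);
        // Two vdotq_s32 calls (16 int8 lanes each) cover the 32-value block.
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs),      vld1q_s8(y[i].qs));
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs + 16), vld1q_s8(y[i].qs + 16));
        sum += x[i].scale * y[i].scale * (float)vaddvq_s32(acc);
    }
    return sum;
}
```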

@jart (Collaborator) commented May 23, 2024:

We now have a llamafile-bench program. I've been running it with this script:

```sh
#!/bin/sh
# Build llama-bench, then run it against every TinyLlama and Mixtral GGUF
# under /weights (largest files first), passing each one with -m and
# forwarding any extra flags ("$@") straight through to llama-bench.
cd ~/llamafile
make -j16 o//llama.cpp/llama-bench/llama-bench || exit
o//llama.cpp/llama-bench/llama-bench \
  $(for f in $(ls -S /weights/TinyLlama-1.1B-Chat-v1.0.*.gguf \
                     /weights/mixtral-8x7b-instruct-v0.1.*.gguf); do
      echo -m $f
    done) \
  "$@"
```

I also wrote a script you can use to zip up two text file reports into a single report. https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/llama-bench/bench-llamafile-zip.py

Super exciting to hear about ARM NEON too! Raspberry Pi and Asahi Linux users will certainly thank you. The offer is open for me to mail you a Raspberry Pi 5 off Amazon if it'll help you with development.

@ikawrakow (Contributor, Author) commented:

> Super exciting to hear about ARM NEON too! Raspberry Pi and Asahi Linux users will certainly thank you. The offer is open for me to mail you a Raspberry Pi 5 off Amazon if it'll help you with development.

Thanks for the offer, but it would be easier for me to just do it on my M2 laptop. I assume it would run on the Pi once it builds and runs successfully on my laptop with ARMv8.2 settings?
