Faster AVX2 matrix multiplications for MoE models #428
Conversation
Another amazing change! Very excited to see this. Just one small ask.
Thank you!
On AMD Ryzen Threadripper PRO 7995WX with Mixtral 8x7b I'm seeing speedups for Q5_K_M as high as 2.6x. Some quick measurements on my end, for a context size of 20900 and a prompt size of 1611 tokens:
So once again, outstanding work! P.S. I'm going to be looking into integrating the …
Yes, having … Btw, I have done an implementation for ARM_NEON as well. I did it out of curiosity to see what is possible. I'm getting in the range of 2X improvement compared to mainline.
We now have a llamafile-bench program. I've been running it with this script.

```sh
#!/bin/sh
cd ~/llamafile
make -j16 o//llama.cpp/llama-bench/llama-bench || exit
o//llama.cpp/llama-bench/llama-bench \
  $(for f in $(ls -S /weights/TinyLlama-1.1B-Chat-v1.0.*.gguf \
                     /weights/mixtral-8x7b-instruct-v0.1.*.gguf); do
      echo -m $f
    done) \
  "$@"
```

I also wrote a script you can use to zip up two text file reports into a single report: https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/llama-bench/bench-llamafile-zip.py

Super exciting to hear about ARM NEON too! Raspberry Pi and Asahi Linux users will certainly thank you. The offer is open for me to mail you a Raspberry Pi 5 off Amazon if it'll help you with development.
Thanks for the offer, but it would be easier for me to just do it on my M2 laptop. I assume it would run on the Pi once it builds and runs successfully on my laptop with Arm 8.2 settings?
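For context on why the Armv8.2 build settings matter here, below is a minimal sketch (not this PR's actual kernel) of the kind of int8 dot product that quantized matmul kernels rely on. The function name `dot_q8` and the loop structure are made up for illustration; the point is that compiling with the dotprod extension (available on both the M2 and the Pi 5's Cortex-A76) exposes `vdotq_s32`, while older targets need the wider fallback path.

```cpp
// Sketch only: int8 dot product, the inner loop of a quantized matvec.
#include <arm_neon.h>
#include <stdint.h>

// Dot product of n int8 values (n assumed to be a multiple of 16 here).
int32_t dot_q8(const int8_t *x, const int8_t *y, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        int8x16_t a = vld1q_s8(x + i);
        int8x16_t b = vld1q_s8(y + i);
#if defined(__ARM_FEATURE_DOTPROD)
        // One instruction: 16 int8 multiplies accumulated into 4 int32 lanes.
        acc = vdotq_s32(acc, a, b);
#else
        // Pre-dotprod fallback: widen to int16, then pairwise-add into int32.
        int16x8_t lo = vmull_s8(vget_low_s8(a),  vget_low_s8(b));
        int16x8_t hi = vmull_s8(vget_high_s8(a), vget_high_s8(b));
        acc = vpadalq_s16(acc, lo);
        acc = vpadalq_s16(acc, hi);
#endif
    }
    return vaddvq_s32(acc);  // horizontal sum of the 4 accumulator lanes
}
```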
This PR is a follow-up to PRs #394 and #405 and enables the faster matrix multiplications for legacy and k-quants introduced there for MoE models as well.
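For readers unfamiliar with why MoE prompt processing benefits, here is a rough C++ sketch of the general idea, not this PR's implementation: during prompt processing, the tokens routed to the same expert can be gathered into one contiguous block so that each expert runs a single matrix-matrix multiply (where the optimized AVX2/NEON kernels pay off) instead of many matrix-vector multiplies. The `matmul` stand-in, the routing table, and the dimension names are assumptions made for this illustration, and top-k gating weights are omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Naive stand-in for the optimized GEMM: y[r, o] = sum_i W[o, i] * x[r, i].
static void matmul(const float *W, const float *x, float *y,
                   int rows, int d_in, int d_out) {
    for (int r = 0; r < rows; ++r)
        for (int o = 0; o < d_out; ++o) {
            float s = 0.f;
            for (int i = 0; i < d_in; ++i) s += W[o * d_in + i] * x[r * d_in + i];
            y[r * d_out + o] = s;
        }
}

void moe_forward(const float *x,                      // [n_tokens, d_in] activations
                 const std::vector<int> &expert_of,   // routing: token -> expert id
                 int n_experts, int d_in, int d_out,
                 const std::vector<const float *> &W, // per-expert weights [d_out, d_in]
                 float *y) {                          // [n_tokens, d_out] output
    // Bucket token indices by the expert they were routed to.
    std::vector<std::vector<int>> bucket(n_experts);
    for (std::size_t t = 0; t < expert_of.size(); ++t)
        bucket[expert_of[t]].push_back((int)t);

    for (int e = 0; e < n_experts; ++e) {
        if (bucket[e].empty()) continue;
        // Gather this expert's token rows into one contiguous block.
        std::vector<float> xb(bucket[e].size() * d_in);
        for (std::size_t r = 0; r < bucket[e].size(); ++r)
            std::copy(x + bucket[e][r] * (std::size_t)d_in,
                      x + (bucket[e][r] + 1) * (std::size_t)d_in,
                      xb.data() + r * d_in);
        // One batched GEMM per expert instead of per-token matvecs.
        std::vector<float> yb(bucket[e].size() * d_out);
        matmul(W[e], xb.data(), yb.data(), (int)bucket[e].size(), d_in, d_out);
        // Scatter results back to each token's output row.
        for (std::size_t r = 0; r < bucket[e].size(); ++r)
            std::copy(yb.data() + r * d_out, yb.data() + (r + 1) * d_out,
                      y + bucket[e][r] * (std::size_t)d_out);
    }
}
```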
The following table shows a prompt processing speed comparison between the main branch and this PR for Mixtral-8x7B on a Ryzen-7950X CPU.