Faster AVX2 matrix multiplications for MoE models #428

Merged · 3 commits into Mozilla-Ocho:main · May 21, 2024

Conversation

@ikawrakow (Contributor):

This PR is a follow-up to PRs #394 and #405: it enables the faster matrix multiplications for legacy and k-quants introduced there for MoE models as well.
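
A minimal conceptual sketch of why this helps (this is not llamafile's actual code; all names are made up, it uses plain floats instead of quantized blocks, and it assumes one expert per row for brevity): in an MoE layer, the rows routed to the same expert can be grouped so that a single tiled GEMM call serves all of them, which is where the tiled kernels from #394/#405 pay off.

```cpp
// Conceptual sketch only, not the actual llamafile implementation.
// Idea: group activation rows by expert, then run one tiled GEMM per expert
// instead of one vector-matrix product per row, so each weight tile the
// kernel unpacks is reused across several rows.
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for a tiled kernel (the real ones operate on quantized blocks with
// AVX2 intrinsics): C[m x n] = A[m x k] * B^T, with B stored row-major [n x k].
static void tiled_gemm(const float *A, const float *B, float *C,
                       std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t l = 0; l < k; ++l)
                acc += A[i * k + l] * B[j * k + l];
            C[i * n + j] = acc;
        }
}

// Simplified MoE matmul: each of the m activation rows goes to exactly one
// expert here; real MoE routes every token to its top-k experts.
void moe_matmul(const float *acts, std::size_t m, std::size_t k,
                const std::vector<const float *> &experts, std::size_t n,
                const std::vector<int> &row_to_expert, float *out) {
    std::vector<std::vector<std::size_t>> rows_of(experts.size());
    for (std::size_t r = 0; r < m; ++r)
        rows_of[row_to_expert[r]].push_back(r);

    for (std::size_t e = 0; e < experts.size(); ++e) {
        const auto &rows = rows_of[e];
        if (rows.empty()) continue;
        std::vector<float> packed(rows.size() * k), result(rows.size() * n);
        for (std::size_t i = 0; i < rows.size(); ++i)   // gather this expert's rows
            std::copy(acts + rows[i] * k, acts + (rows[i] + 1) * k,
                      packed.begin() + i * k);
        tiled_gemm(packed.data(), experts[e], result.data(), rows.size(), n, k);
        for (std::size_t i = 0; i < rows.size(); ++i)   // scatter results back
            std::copy(result.begin() + i * n, result.begin() + (i + 1) * n,
                      out + rows[i] * n);
    }
}
```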

The following table compares prompt processing speed (PP-512, in tokens per second) between the main branch and this PR for Mixtral 8x7B on a Ryzen 7950X CPU.

| Quantization | PP-512 (main) | PP-512 (PR) | Speedup |
|--------------|---------------|-------------|---------|
| Q4_0         | 59.2          | -           | -       |
| Q4_1         | 35.3          | 69.6        | 1.97    |
| Q5_0         | 30.6          | 65.4        | 2.14    |
| Q5_1         | 29.5          | 64.0        | 2.17    |
| Q2_K_S       | 66.8          | 88.9        | 1.33    |
| Q3_K_S       | 45.2          | 85.3        | 1.89    |
| Q4_K_S       | 53.4          | 81.8        | 1.53    |
| Q5_K_S       | 38.6          | 75.0        | 1.94    |
| Q6_K         | 41.8          | 85.6        | 2.05    |
| IQ4_XS       | 41.6          | 76.1        | 1.83    |

@jart self-requested a review · May 21, 2024 01:03
@jart (Collaborator) left a comment:

Another amazing change! Very excited to see this. Just one small ask.

Review comment on llamafile/sgemm.cpp (outdated, resolved).
@jart (Collaborator) left a comment:

Thank you!

@jart merged commit 938cf72 into Mozilla-Ocho:main on May 21, 2024.
@jart (Collaborator) commented May 21, 2024:

On AMD Ryzen Threadripper PRO 7995WX with Mixtral 8x7b I'm seeing speedups for Q5_K_M as high as 2.6x. Some quick measurements on my end, for a context size of 20900 and a prompt size of 1611 tokens:

| quant  | tok/sec before | tok/sec after | speedup |
|--------|----------------|---------------|---------|
| Q2_K   | 153.95         | 195.15        | 1.27x   |
| Q5_K_M | 121.16         | 314.70        | 2.60x   |

So once again, outstanding work!

P.S. I'm going to be looking into integrating the llama-bench command sometime soon.

@ikawrakow (Contributor, Author) commented:

@jart

Yes, having llama-bench available would be very useful. Thanks for the Ryzen 7995WX performance numbers. I'm curious to see how the latest version in PR #435 does on that CPU.

Btw, I have done an ARM_NEON implementation as well, out of curiosity to see what is possible. I'm getting roughly a 2x improvement compared to mainline llama.cpp on my M2 Max, but this is still not better than just using the Accelerate framework, so I'm wondering whether there is any benefit to adding it to llamafile.
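
As a rough illustration of the kind of kernel meant here (this is not the actual implementation; the block layout and names are assumptions made for the sketch), the core building block on ARM is an int8 dot product over quantization blocks using the ARMv8.2 dot-product instructions, compiled with something like `-march=armv8.2-a+dotprod`:

```cpp
// Illustrative only: a NEON int8 dot product over 32-element quantization
// blocks, the inner loop of a quantized matrix multiplication on ARM.
// The block layout here is an assumption for this sketch, not the real format.
#include <arm_neon.h>
#include <cstdint>

struct BlockQ8 {
    float scale;    // per-block scale
    int8_t qs[32];  // 32 signed 8-bit quants
};

float dot_q8(const BlockQ8 *x, const BlockQ8 *y, int nblocks) {
    float sum = 0.0f;
    for (int i = 0; i < nblocks; ++i) {
        int32x4_t acc = vdupq_n_s32(0);
        // Two vdotq_s32 calls (16 int8 lanes each) cover the 32-value block.
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs),      vld1q_s8(y[i].qs));
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs + 16), vld1q_s8(y[i].qs + 16));
        sum += x[i].scale * y[i].scale * (float)vaddvq_s32(acc);
    }
    return sum;
}
```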

@jart (Collaborator) commented May 23, 2024:

We now have a llamafile-bench program. I've been running it with this script:

```sh
#!/bin/sh
# Build llama-bench, then run it against every TinyLlama and Mixtral GGUF
# under /weights (largest files first), passing each one with -m and
# forwarding any extra flags ("$@") straight through to llama-bench.
cd ~/llamafile
make -j16 o//llama.cpp/llama-bench/llama-bench || exit
o//llama.cpp/llama-bench/llama-bench \
  $(for f in $(ls -S /weights/TinyLlama-1.1B-Chat-v1.0.*.gguf \
                     /weights/mixtral-8x7b-instruct-v0.1.*.gguf); do
      echo -m $f
    done) \
  "$@"
```

I also wrote a script you can use to zip up two text file reports into a single report. https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/llama-bench/bench-llamafile-zip.py

Super exciting to hear about ARM NEON too! Raspberry Pi and Asahi Linux users will certainly thank you. The offer is open for me to mail you a Raspberry Pi 5 off Amazon if it'll help you with development.

@ikawrakow (Contributor, Author) commented:

> Super exciting to hear about ARM NEON too! Raspberry Pi and Asahi Linux users will certainly thank you. The offer is open for me to mail you a Raspberry Pi 5 off Amazon if it'll help you with development.

Thanks for the offer, but it would be easier for me to just do it on my M2 laptop. I assume it would run on the Pi once it builds and runs successfully on my laptop with ARMv8.2 settings?
