
Another performance optimization for Zen4 + refactoring #435

Merged. 3 commits merged into Mozilla-Ocho:main on May 23, 2024.

Conversation

ikawrakow
Contributor

This PR adds the following changes:

  • Improved k-quants prompt processing performance on Zen4 (where AVX512F, AVX512VNNI, AVX512VL, AVX512BW and AVX512DQ are available). Improvements are in the 15-30% range on my Ryzen-7950X CPU (see Table 1 and Table 2 below; a sketch of the core VNNI primitive follows this list).
  • A much nicer implementation: compared to the previous version, code size has increased by just ~150 LOC despite there now being two completely separate implementations, one for Zen4 and one for vanilla AVX2.
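
For readers curious about the Zen4 path: the central building block on this hardware is the AVX512-VNNI instruction that fuses an unsigned-by-signed 8-bit multiply with a 32-bit accumulate. A minimal sketch of that primitive (a generic VNNI dot product, not the actual kernels in this PR):

```c
#include <immintrin.h>

// Multiply-accumulate 64 unsigned 8-bit quants against 64 signed 8-bit
// activations in one instruction: each int32 lane of acc receives the sum
// of four adjacent u8*s8 products. Requires -mavx512vnni.
static inline __m512i dot_u8s8(__m512i acc, __m512i quants_u8, __m512i activations_s8) {
    return _mm512_dpbusd_epi32(acc, quants_u8, activations_s8);
}
```

The unpacking code surrounding this call is what differs per quantization type; the closer its cost gets to zero, the closer the quants converge to the same PP-512 speed.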

Table 1 PP-512 performance for a LLaMA-7B model on a Ryzen-7950X CPU

| Quantization | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|
| Q2_K_S | 152.8 | 177.8 | 1.164 |
| Q3_K_S | 165.7 | 194.7 | 1.175 |
| Q4_K_S | 160.0 | 200.0 | 1.250 |
| Q5_K_S | 147.1 | 192.5 | 1.308 |
| Q6_K | 168.4 | 195.4 | 1.160 |
| IQ4_XS | 150.6 | 193.2 | 1.283 |

Table 2 PP-512 performance for Mixtral-8x7B on a Ryzen-7950X CPU

| Quantization | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|
| Q2_K_S | 84.5 | 102.4 | 1.212 |
| Q3_K_S | 81.6 | 95.5 | 1.170 |
| Q4_K_S | 77.3 | 97.0 | 1.254 |
| Q5_K_S | 70.0 | 92.8 | 1.325 |
| Q6_K | 81.3 | 93.9 | 1.155 |
| IQ4_XS | 74.1 | 93.8 | 1.265 |

If the cost of unpacking the quantized values for the subsequent multiply-add operations with the activations were fully amortized, we would expect performance to be independent of the quantization type. I'm pleased to observe that this is now nearly the case, with Q2_K being the exception. I'm not sure why Q2_K performance is lower for the 7B model. My guess is that the compiler fails to find the best ordering of memory loads into SIMD registers and SIMD operations: Q2_K is the only quant that requires a single memory load for a block of 256 weights, while all others need 2 or 3. On the other hand, the fact that Q2_K performs better than the others for Mixtral-8x7B may indicate that memory throughput plays a role even for prompt processing of long prompts.
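
To make the load-count remark concrete: with 512-bit registers, the packed quants of one 256-weight super-block fit in one, two, or three loads depending on the bit width. A sketch (the byte counts follow directly from the bit widths; the helper name is illustrative):

```c
#include <immintrin.h>
#include <stdint.h>

// Bytes of packed quants per 256-weight super-block (scales/mins are extra):
//   Q2_K: 256 * 2 bits = 64 bytes  -> 1 x 64-byte (512-bit) load
//   Q4_K: 256 * 4 bits = 128 bytes -> 2 loads
//   Q6_K: 256 * 6 bits = 192 bytes -> 3 loads
static inline __m512i load_q2_K_quants(const uint8_t * qs) {
    return _mm512_loadu_si512((const void *)qs); // the single Q2_K load
}
```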

I also did a comparison with current mainline llama.cpp (commit hash 95fb0aef) to see the combined effect of all optimizations. The following table shows the results for LLaMA-v2-7B and Mixtral-8x7B on my Ryzen-7950X CPU:

| Model | Quantization | t/s (llama.cpp) | t/s (PR) | Speedup |
|---|---|---|---|---|
| LLaMA-v2-7B | Q2_K_S | 103.8 | 177.8 | 1.713 |
| LLaMA-v2-7B | Q3_K_S | 80.1 | 194.7 | 2.430 |
| LLaMA-v2-7B | Q4_K_S | 102.4 | 200.0 | 1.953 |
| LLaMA-v2-7B | Q5_K_S | 72.8 | 192.5 | 2.643 |
| LLaMA-v2-7B | Q6_K | 79.9 | 195.4 | 2.446 |
| LLaMA-v2-7B | IQ4_XS | 72.2 | 193.2 | 2.675 |
| Mixtral-8x7B | Q2_K_S | 61.4 | 102.4 | 1.668 |
| Mixtral-8x7B | Q3_K_S | 42.6 | 95.5 | 2.240 |
| Mixtral-8x7B | Q4_K_S | 53.2 | 97.0 | 1.824 |
| Mixtral-8x7B | Q5_K_S | 38.5 | 92.8 | 2.407 |
| Mixtral-8x7B | Q6_K | 43.0 | 93.9 | 2.184 |
| Mixtral-8x7B | IQ4_XS | 38.6 | 93.8 | 2.432 |

@ikawrakow
Contributor Author

Btw, I'm noticing that this PR results in a small performance benefit for plain AVX2 as well. On a Ryzen-5975WX I measure the following for a 7B LLaMA model:

| Quantization | t/s (llamafile main) | t/s (llama.cpp master) | t/s (this PR) | Speedup vs llamafile | Speedup vs llama.cpp |
|---|---|---|---|---|---|
| Q2_K_S | 204.0 | 141.0 | 209.8 | 1.029 | 1.488 |
| Q3_K_S | 195.4 | 108.5 | 208.1 | 1.065 | 1.917 |
| Q4_K_S | 188.9 | 131.3 | 203.2 | 1.075 | 1.548 |
| Q5_K_S | 173.5 | 99.4 | 193.9 | 1.117 | 1.951 |
| Q6_K | 196.2 | 95.9 | 204.8 | 1.044 | 2.136 |
| IQ4_XS | 186.9 | 105.6 | 202.4 | 1.083 | 1.917 |

I noticed that my AVX2 implementation of Q8_K quantization (needed by k- and i-quants) has been lost. jart has compensated for this by parallelizing quantization, but only in ggml_compute_forward_mul_mat. Adding the exact same technique to ggml_compute_forward_mul_mat_id results in a 5-6% performance improvement for Mixtral-8x7B, on top of the improvement due to the better matrix multiplication implementation.
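
A minimal sketch of the parallelization idea (quantize_row_q8_K is ggml's quantizer; the wrapper, buffer names and strides are illustrative, not the actual llamafile code):

```c
#include <stddef.h>
#include <stdint.h>

// ggml's quantizer for one row of floats into Q8_K blocks (row_len % 256 == 0).
void quantize_row_q8_K(const float * x, void * y, int64_t k);

// Each of nth threads (this one is ith) quantizes a strided subset of the
// activation rows into a shared scratch buffer before the dot-product
// kernels run, instead of a single thread doing every row in the INIT phase.
static void quantize_activations_q8_K(const float * src, char * wdata,
                                      int64_t nrows, int64_t row_len,
                                      size_t q8_row_size, int ith, int nth) {
    for (int64_t row = ith; row < nrows; row += nth) {
        quantize_row_q8_K(src + row*row_len, wdata + row*q8_row_size, row_len);
    }
}
```

The same loop structure works in both ggml_compute_forward_mul_mat and ggml_compute_forward_mul_mat_id, which is what produces the 5-6% Mixtral-8x7B improvement mentioned above.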
@jart (Collaborator) left a comment

Here are some measurements. The change that speeds up the MoE initialization phase is very impactful, sometimes by a 2x factor for Q quants on MoE. It looks like prompt processing in general is sped up too, by around 10%, although my TinyLlama measurements on an overclocked Threadripper might have more than a 10% margin of error due to temperature issues I'm working on resolving with our new benchmark tool.

Awesome change!

All measurements below were taken on an AMD Ryzen Threadripper PRO 7995WX 96-Cores.

| model_filename | test | t/s before | t/s after | speedup |
|---|---|---|---|---|
| mixtral-8x7b-instruct-v0.1.BF16.gguf | pp512 | 473.18 | 476.30 | 1.01x |
| mixtral-8x7b-instruct-v0.1.BF16.gguf | tg16 | 11.48 | 11.44 | 1.00x |
| mixtral-8x7b-instruct-v0.1.F16.gguf | pp512 | 329.84 | 324.31 | 0.98x |
| mixtral-8x7b-instruct-v0.1.F16.gguf | tg16 | 10.53 | 10.54 | 1.00x |
| mixtral-8x7b-instruct-v0.1.Q8_0.gguf | pp512 | 286.68 | 293.53 | 1.02x |
| mixtral-8x7b-instruct-v0.1.Q8_0.gguf | tg16 | 16.21 | 16.20 | 1.00x |
| mixtral-8x7b-instruct-v0.1.Q6_K.gguf | pp512 | 265.06 | 419.84 | 1.58x |
| mixtral-8x7b-instruct-v0.1.Q6_K.gguf | tg16 | 23.54 | 23.55 | 1.00x |
| mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf | pp512 | 238.58 | 416.88 | 1.75x |
| mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf | tg16 | 25.36 | 25.82 | 1.02x |
| mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf | pp512 | 244.96 | 438.29 | 1.79x |
| mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf | tg16 | 28.11 | 28.39 | 1.01x |
| mixtral-8x7b-instruct-v0.1.Q4_0.gguf | pp512 | 282.37 | 274.77 | 0.97x |
| mixtral-8x7b-instruct-v0.1.Q4_0.gguf | tg16 | 19.90 | 19.92 | 1.00x |
| mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf | pp512 | 248.50 | 421.22 | 1.70x |
| mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf | tg16 | 31.78 | 31.58 | 0.99x |
| mixtral-8x7b-instruct-v0.1.Q3_K_S.gguf | pp512 | 251.07 | 420.90 | 1.68x |
| mixtral-8x7b-instruct-v0.1.Q3_K_S.gguf | tg16 | 32.98 | 32.92 | 1.00x |
| mixtral-8x7b-instruct-v0.1.Q2_K.gguf | pp512 | 254.88 | 442.63 | 1.74x |
| mixtral-8x7b-instruct-v0.1.Q2_K.gguf | tg16 | 36.12 | 36.30 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.F32.gguf | pp512 | 1698.69 | 2069.30 | 1.22x |
| TinyLlama-1.1B-Chat-v1.0.F32.gguf | tg16 | 58.50 | 58.92 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.BF16.gguf | pp512 | 2641.60 | 2649.28 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.BF16.gguf | tg16 | 81.59 | 80.77 | 0.99x |
| TinyLlama-1.1B-Chat-v1.0.F16.gguf | pp512 | 2189.05 | 2197.90 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.F16.gguf | tg16 | 83.46 | 83.13 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q8_0.gguf | pp512 | 2129.20 | 2168.69 | 1.02x |
| TinyLlama-1.1B-Chat-v1.0.Q8_0.gguf | tg16 | 104.66 | 103.43 | 0.99x |
| TinyLlama-1.1B-Chat-v1.0.Q6_K.gguf | pp512 | 2672.45 | 2794.55 | 1.05x |
| TinyLlama-1.1B-Chat-v1.0.Q6_K.gguf | tg16 | 136.47 | 138.28 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q5_1.gguf | pp512 | 2348.72 | 2355.37 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q5_1.gguf | tg16 | 147.15 | 143.40 | 0.97x |
| TinyLlama-1.1B-Chat-v1.0.Q5_K_M.gguf | pp512 | 2557.59 | 2732.35 | 1.07x |
| TinyLlama-1.1B-Chat-v1.0.Q5_K_M.gguf | tg16 | 148.82 | 148.44 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q5_0.gguf | pp512 | 2304.01 | 2383.25 | 1.03x |
| TinyLlama-1.1B-Chat-v1.0.Q5_0.gguf | tg16 | 152.97 | 151.87 | 0.99x |
| TinyLlama-1.1B-Chat-v1.0.Q5_K_S.gguf | pp512 | 2496.16 | 2772.70 | 1.11x |
| TinyLlama-1.1B-Chat-v1.0.Q5_K_S.gguf | tg16 | 148.52 | 148.18 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q4_1.gguf | pp512 | 2476.31 | 2408.42 | 0.97x |
| TinyLlama-1.1B-Chat-v1.0.Q4_1.gguf | tg16 | 153.88 | 154.73 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf | pp512 | 2598.21 | 2794.64 | 1.08x |
| TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf | tg16 | 156.07 | 156.29 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q4_K_S.gguf | pp512 | 2622.38 | 2841.10 | 1.08x |
| TinyLlama-1.1B-Chat-v1.0.Q4_K_S.gguf | tg16 | 158.89 | 159.41 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf | pp512 | 2440.09 | 2449.48 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf | tg16 | 160.57 | 162.58 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_L.gguf | pp512 | 2555.94 | 2804.29 | 1.10x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_L.gguf | tg16 | 159.89 | 161.74 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_M.gguf | pp512 | 2595.54 | 2768.61 | 1.07x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_M.gguf | tg16 | 165.25 | 166.58 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_S.gguf | pp512 | 2581.57 | 2579.76 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_S.gguf | tg16 | 169.06 | 170.54 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q2_K.gguf | pp512 | 2616.44 | 2814.73 | 1.08x |
| TinyLlama-1.1B-Chat-v1.0.Q2_K.gguf | tg16 | 176.68 | 175.68 | 0.99x |

@jart
Collaborator

jart commented May 23, 2024

One thing that would help illuminate the memory-latency questions around these benchmarks is Intel's Memory Latency Checker: https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html. For example, I get these measurements with my current 512GB v-color RAM setup:

jart@luna:~/llamafile$ doas mlc
Intel(R) Memory Latency Checker - v3.11
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0          85.8

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      336068.2
3:1 Reads-Writes :      175669.5
2:1 Reads-Writes :      133351.9
1:1 Reads-Writes :      132625.7
Stream-triad like:      137481.7

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0
       0        141583.5

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  877.91   141653.7
 00002  877.48   141622.3
 00008  1017.20  140993.1
 00015  1233.81  140606.0
 00050  1177.21  141207.9
 00100  1112.67  141484.8
 00200  773.23   141676.5

@jart
Collaborator

jart commented May 23, 2024

> I noticed that my AVX2 implementation of Q8_K quantization

Could you go into more detail on this? Did I accidentally remove that in the last sync? The last sync was 30k lines (a lot of upstream development for 2 weeks!) so sometimes things get lost in translation. The [jart] comment markers help me avoid doing that by mistake.
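
(For readers unfamiliar with the convention: the markers are ordinary source comments tagged with the maintainer's name, so local changes stand out during an upstream sync. A hypothetical example, not copied from the tree:)

```c
// [jart] keep: llamafile-specific AVX2 path, not present in upstream ggml
```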

@jart merged commit 7cb15c6 into Mozilla-Ocho:main on May 23, 2024
@ikawrakow
Contributor Author

> > I noticed that my AVX2 implementation of Q8_K quantization
>
> Could you go into more detail on this? Did I accidentally remove that in the last sync? The last sync was 30k lines (a lot of upstream development for 2 weeks!) so sometimes things get lost in translation. The [jart] comment markers help me avoid doing that by mistake.

I had added an AVX2 implementation for quantizing to Q8_K in the initial PR, see quantize_row_q8_K in https://github.com/Mozilla-Ocho/llamafile/pull/394/files. I did it that way because I didn't want to fool around with Georgi's single-threaded GGML_TASK_TYPE_INIT. But I actually like what you have done better: once GGML_TASK_TYPE_INIT is multi-threaded, there is no performance benefit from vectorizing the quantization to Q8_K (I measured with and without the Q8_K AVX2 implementation and it made no measurable difference on my computer).
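
For context, here is a scalar sketch of what quantizing a row to Q8_K involves, following ggml's block_q8_K layout (a reference illustration of the algorithm, not the llamafile code; the AVX2 version simply vectorized these loops):

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

#define QK_K 256

typedef struct {
    float   d;              // scale: dequantized value = d * qs[j]
    int8_t  qs[QK_K];       // 8-bit quants
    int16_t bsums[QK_K/16]; // per-16 sums, used by the k-quant dot products
} block_q8_K;

// Quantize k floats (k a multiple of QK_K) into k/QK_K Q8_K blocks.
static void quantize_row_q8_K_sketch(const float * x, block_q8_K * y, int k) {
    for (int ib = 0; ib < k/QK_K; ++ib) {
        // Find the element with the largest magnitude; it will map to -127.
        float amax = 0, max = 0;
        for (int j = 0; j < QK_K; ++j) {
            const float ax = fabsf(x[j]);
            if (ax > amax) { amax = ax; max = x[j]; }
        }
        if (amax == 0) { memset(&y[ib], 0, sizeof(block_q8_K)); x += QK_K; continue; }
        const float iscale = -127.f/max;
        for (int j = 0; j < QK_K; ++j) {
            const int v = (int)nearbyintf(iscale*x[j]);
            y[ib].qs[j] = (int8_t)(v > 127 ? 127 : v);
        }
        // Precompute per-16 sums so dot products with quants that carry block
        // minimums (e.g. Q2_K, Q4_K) can fold in the minimum term cheaply.
        for (int j = 0; j < QK_K/16; ++j) {
            int sum = 0;
            for (int l = 0; l < 16; ++l) sum += y[ib].qs[16*j + l];
            y[ib].bsums[j] = (int16_t)sum;
        }
        y[ib].d = 1/iscale;
        x += QK_K;
    }
}
```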
