
Another performance optimization for Zen4 + refactoring #435

Merged. 3 commits merged into Mozilla-Ocho:main on May 23, 2024.

Conversation

ikawrakow
Contributor

This PR adds the following changes:

  • Improved k-quants prompt processing performance on Zen4 (where AVX512F, AVX512VNNI, AVX512VL, AVX512BW and AVX512DQ are available). Improvements are in the 15-30% range on my Ryzen-7950X CPU (see Table 1 and Table 2 below; a sketch of the core VNNI primitive follows this list).
  • A much nicer implementation: compared to the previous version, code size has increased by just ~150 LOC despite there now being two completely separate implementations, one for Zen4 and one for vanilla AVX2.
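
For readers curious about the Zen4 path: the central building block on this hardware is the AVX512-VNNI instruction that fuses an unsigned-by-signed 8-bit multiply with a 32-bit accumulate. A minimal sketch of that primitive (a generic VNNI dot product, not the actual kernels in this PR):

```c
#include <immintrin.h>

// Multiply-accumulate 64 unsigned 8-bit quants against 64 signed 8-bit
// activations in one instruction: each int32 lane of acc receives the sum
// of four adjacent u8*s8 products. Requires -mavx512vnni.
static inline __m512i dot_u8s8(__m512i acc, __m512i quants_u8, __m512i activations_s8) {
    return _mm512_dpbusd_epi32(acc, quants_u8, activations_s8);
}
```

The unpacking code surrounding this call is what differs per quantization type; the closer its cost gets to zero, the closer the quants converge to the same PP-512 speed.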

Table 1 PP-512 performance for a LLaMA-7B model on a Ryzen-7950X CPU

| Quantization | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|
| Q2_K_S | 152.8 | 177.8 | 1.164 |
| Q3_K_S | 165.7 | 194.7 | 1.175 |
| Q4_K_S | 160.0 | 200.0 | 1.250 |
| Q5_K_S | 147.1 | 192.5 | 1.308 |
| Q6_K | 168.4 | 195.4 | 1.160 |
| IQ4_XS | 150.6 | 193.2 | 1.283 |

Table 2 PP-512 performance for Mixtral-8x7B on a Ryzen-7950X CPU

| Quantization | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|
| Q2_K_S | 84.5 | 102.4 | 1.212 |
| Q3_K_S | 81.6 | 95.5 | 1.170 |
| Q4_K_S | 77.3 | 97.0 | 1.254 |
| Q5_K_S | 70.0 | 92.8 | 1.325 |
| Q6_K | 81.3 | 93.9 | 1.155 |
| IQ4_XS | 74.1 | 93.8 | 1.265 |

If the cost of unpacking the quantized values for the subsequent multiply-add operations with the activations were fully amortized, we would expect performance to be independent of the quantization type. I'm pleased to observe that this is now nearly the case, with Q2_K being the exception. I'm not sure why Q2_K performance is lower for the 7B model. My guess is that the compiler fails to find the best ordering of memory loads into SIMD registers and SIMD operations: Q2_K is the only quant that requires a single memory load for a block of 256 weights, while all others need 2 or 3. On the other hand, the fact that Q2_K performs better than the others for Mixtral-8x7B may indicate that memory throughput plays a role even for prompt processing of long prompts.
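
To make the load-count remark concrete: with 512-bit registers, the packed quants of one 256-weight super-block fit in one, two, or three loads depending on the bit width. A sketch (the byte counts follow directly from the bit widths; the helper name is illustrative):

```c
#include <immintrin.h>
#include <stdint.h>

// Bytes of packed quants per 256-weight super-block (scales/mins are extra):
//   Q2_K: 256 * 2 bits = 64 bytes  -> 1 x 64-byte (512-bit) load
//   Q4_K: 256 * 4 bits = 128 bytes -> 2 loads
//   Q6_K: 256 * 6 bits = 192 bytes -> 3 loads
static inline __m512i load_q2_K_quants(const uint8_t * qs) {
    return _mm512_loadu_si512((const void *)qs); // the single Q2_K load
}
```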

I also did a comparison with current mainline llama.cpp (commit hash 95fb0aef) to see the combined effect of all optimizations. The following table shows the results for LLaMA-v2-7B and Mixtral-8x7B on my Ryzen-7950X CPU:

| Model | Quantization | t/s (llama.cpp) | t/s (PR) | Speedup |
|---|---|---|---|---|
| LLaMA-v2-7B | Q2_K_S | 103.8 | 177.8 | 1.713 |
| LLaMA-v2-7B | Q3_K_S | 80.1 | 194.7 | 2.430 |
| LLaMA-v2-7B | Q4_K_S | 102.4 | 200.0 | 1.953 |
| LLaMA-v2-7B | Q5_K_S | 72.8 | 192.5 | 2.643 |
| LLaMA-v2-7B | Q6_K | 79.9 | 195.4 | 2.446 |
| LLaMA-v2-7B | IQ4_XS | 72.2 | 193.2 | 2.675 |
| Mixtral-8x7B | Q2_K_S | 61.4 | 102.4 | 1.668 |
| Mixtral-8x7B | Q3_K_S | 42.6 | 95.5 | 2.240 |
| Mixtral-8x7B | Q4_K_S | 53.2 | 97.0 | 1.824 |
| Mixtral-8x7B | Q5_K_S | 38.5 | 92.8 | 2.407 |
| Mixtral-8x7B | Q6_K | 43.0 | 93.9 | 2.184 |
| Mixtral-8x7B | IQ4_XS | 38.6 | 93.8 | 2.432 |

@ikawrakow
Contributor Author

Btw, I'm noticing that this PR results in a small performance benefit for plain AVX2 as well. On a Ryzen-5975WX I measure the following for a 7B LLaMA model:

| Quantization | t/s (llamafile main) | t/s (llama.cpp master) | t/s (this PR) | Speedup vs llamafile | Speedup vs llama.cpp |
|---|---|---|---|---|---|
| Q2_K_S | 204.0 | 141.0 | 209.8 | 1.029 | 1.488 |
| Q3_K_S | 195.4 | 108.5 | 208.1 | 1.065 | 1.917 |
| Q4_K_S | 188.9 | 131.3 | 203.2 | 1.075 | 1.548 |
| Q5_K_S | 173.5 | 99.4 | 193.9 | 1.117 | 1.951 |
| Q6_K | 196.2 | 95.9 | 204.8 | 1.044 | 2.136 |
| IQ4_XS | 186.9 | 105.6 | 202.4 | 1.083 | 1.917 |

I noticed that my AVX2 implementation of Q8_K quantization (needed by k- and i-quants) has been lost. jart has compensated for this by parallelizing quantization, but only in ggml_compute_forward_mul_mat. Adding the exact same technique to ggml_compute_forward_mul_mat_id results in a 5-6% performance improvement for Mixtral-8x7B, on top of the improvement due to the better matrix multiplication implementation.
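
A minimal sketch of the parallelization idea (quantize_row_q8_K is ggml's quantizer; the wrapper, buffer names and strides are illustrative, not the actual llamafile code):

```c
#include <stddef.h>
#include <stdint.h>

// ggml's quantizer for one row of floats into Q8_K blocks (row_len % 256 == 0).
void quantize_row_q8_K(const float * x, void * y, int64_t k);

// Each of nth threads (this one is ith) quantizes a strided subset of the
// activation rows into a shared scratch buffer before the dot-product
// kernels run, instead of a single thread doing every row in the INIT phase.
static void quantize_activations_q8_K(const float * src, char * wdata,
                                      int64_t nrows, int64_t row_len,
                                      size_t q8_row_size, int ith, int nth) {
    for (int64_t row = ith; row < nrows; row += nth) {
        quantize_row_q8_K(src + row*row_len, wdata + row*q8_row_size, row_len);
    }
}
```

The same loop structure works in both ggml_compute_forward_mul_mat and ggml_compute_forward_mul_mat_id, which is what produces the 5-6% Mixtral-8x7B improvement mentioned above.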
@jart (Collaborator) left a comment

Here are some measurements. The change that speeds up the MoE initialization phase is very impactful, sometimes by a 2x factor for Q quants on MoE. It looks like prompt processing in general is sped up too, by around 10%, although my TinyLlama measurements on an overclocked Threadripper might have more than a 10% margin of error due to temperature issues I'm working on resolving with our new benchmark tool.

Awesome change!

All measurements below were taken on an AMD Ryzen Threadripper PRO 7995WX 96-Cores.

| model_filename | test | t/s before | t/s after | speedup |
|---|---|---|---|---|
| mixtral-8x7b-instruct-v0.1.BF16.gguf | pp512 | 473.18 | 476.30 | 1.01x |
| mixtral-8x7b-instruct-v0.1.BF16.gguf | tg16 | 11.48 | 11.44 | 1.00x |
| mixtral-8x7b-instruct-v0.1.F16.gguf | pp512 | 329.84 | 324.31 | 0.98x |
| mixtral-8x7b-instruct-v0.1.F16.gguf | tg16 | 10.53 | 10.54 | 1.00x |
| mixtral-8x7b-instruct-v0.1.Q8_0.gguf | pp512 | 286.68 | 293.53 | 1.02x |
| mixtral-8x7b-instruct-v0.1.Q8_0.gguf | tg16 | 16.21 | 16.20 | 1.00x |
| mixtral-8x7b-instruct-v0.1.Q6_K.gguf | pp512 | 265.06 | 419.84 | 1.58x |
| mixtral-8x7b-instruct-v0.1.Q6_K.gguf | tg16 | 23.54 | 23.55 | 1.00x |
| mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf | pp512 | 238.58 | 416.88 | 1.75x |
| mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf | tg16 | 25.36 | 25.82 | 1.02x |
| mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf | pp512 | 244.96 | 438.29 | 1.79x |
| mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf | tg16 | 28.11 | 28.39 | 1.01x |
| mixtral-8x7b-instruct-v0.1.Q4_0.gguf | pp512 | 282.37 | 274.77 | 0.97x |
| mixtral-8x7b-instruct-v0.1.Q4_0.gguf | tg16 | 19.90 | 19.92 | 1.00x |
| mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf | pp512 | 248.50 | 421.22 | 1.70x |
| mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf | tg16 | 31.78 | 31.58 | 0.99x |
| mixtral-8x7b-instruct-v0.1.Q3_K_S.gguf | pp512 | 251.07 | 420.90 | 1.68x |
| mixtral-8x7b-instruct-v0.1.Q3_K_S.gguf | tg16 | 32.98 | 32.92 | 1.00x |
| mixtral-8x7b-instruct-v0.1.Q2_K.gguf | pp512 | 254.88 | 442.63 | 1.74x |
| mixtral-8x7b-instruct-v0.1.Q2_K.gguf | tg16 | 36.12 | 36.30 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.F32.gguf | pp512 | 1698.69 | 2069.30 | 1.22x |
| TinyLlama-1.1B-Chat-v1.0.F32.gguf | tg16 | 58.50 | 58.92 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.BF16.gguf | pp512 | 2641.60 | 2649.28 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.BF16.gguf | tg16 | 81.59 | 80.77 | 0.99x |
| TinyLlama-1.1B-Chat-v1.0.F16.gguf | pp512 | 2189.05 | 2197.90 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.F16.gguf | tg16 | 83.46 | 83.13 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q8_0.gguf | pp512 | 2129.20 | 2168.69 | 1.02x |
| TinyLlama-1.1B-Chat-v1.0.Q8_0.gguf | tg16 | 104.66 | 103.43 | 0.99x |
| TinyLlama-1.1B-Chat-v1.0.Q6_K.gguf | pp512 | 2672.45 | 2794.55 | 1.05x |
| TinyLlama-1.1B-Chat-v1.0.Q6_K.gguf | tg16 | 136.47 | 138.28 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q5_1.gguf | pp512 | 2348.72 | 2355.37 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q5_1.gguf | tg16 | 147.15 | 143.40 | 0.97x |
| TinyLlama-1.1B-Chat-v1.0.Q5_K_M.gguf | pp512 | 2557.59 | 2732.35 | 1.07x |
| TinyLlama-1.1B-Chat-v1.0.Q5_K_M.gguf | tg16 | 148.82 | 148.44 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q5_0.gguf | pp512 | 2304.01 | 2383.25 | 1.03x |
| TinyLlama-1.1B-Chat-v1.0.Q5_0.gguf | tg16 | 152.97 | 151.87 | 0.99x |
| TinyLlama-1.1B-Chat-v1.0.Q5_K_S.gguf | pp512 | 2496.16 | 2772.70 | 1.11x |
| TinyLlama-1.1B-Chat-v1.0.Q5_K_S.gguf | tg16 | 148.52 | 148.18 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q4_1.gguf | pp512 | 2476.31 | 2408.42 | 0.97x |
| TinyLlama-1.1B-Chat-v1.0.Q4_1.gguf | tg16 | 153.88 | 154.73 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf | pp512 | 2598.21 | 2794.64 | 1.08x |
| TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf | tg16 | 156.07 | 156.29 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q4_K_S.gguf | pp512 | 2622.38 | 2841.10 | 1.08x |
| TinyLlama-1.1B-Chat-v1.0.Q4_K_S.gguf | tg16 | 158.89 | 159.41 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf | pp512 | 2440.09 | 2449.48 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf | tg16 | 160.57 | 162.58 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_L.gguf | pp512 | 2555.94 | 2804.29 | 1.10x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_L.gguf | tg16 | 159.89 | 161.74 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_M.gguf | pp512 | 2595.54 | 2768.61 | 1.07x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_M.gguf | tg16 | 165.25 | 166.58 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_S.gguf | pp512 | 2581.57 | 2579.76 | 1.00x |
| TinyLlama-1.1B-Chat-v1.0.Q3_K_S.gguf | tg16 | 169.06 | 170.54 | 1.01x |
| TinyLlama-1.1B-Chat-v1.0.Q2_K.gguf | pp512 | 2616.44 | 2814.73 | 1.08x |
| TinyLlama-1.1B-Chat-v1.0.Q2_K.gguf | tg16 | 176.68 | 175.68 | 0.99x |

@jart
Collaborator

jart commented May 23, 2024

One thing that would help illuminate the memory-latency questions around these benchmarks is Intel's Memory Latency Checker: https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html. For example, I get these measurements with my current 512GB v-color RAM setup:

jart@luna:~/llamafile$ doas mlc
Intel(R) Memory Latency Checker - v3.11
Measuring idle latencies for random access (in ns)...
                Numa node
Numa node            0
       0          85.8

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      336068.2
3:1 Reads-Writes :      175669.5
2:1 Reads-Writes :      133351.9
1:1 Reads-Writes :      132625.7
Stream-triad like:      137481.7

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0
       0        141583.5

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  877.91   141653.7
 00002  877.48   141622.3
 00008  1017.20  140993.1
 00015  1233.81  140606.0
 00050  1177.21  141207.9
 00100  1112.67  141484.8
 00200  773.23   141676.5

@jart
Collaborator

jart commented May 23, 2024

> I noticed that my AVX2 implementation of Q8_K quantization

Could you go into more detail on this? Did I accidentally remove that in the last sync? The last sync was 30k lines (a lot of upstream development for 2 weeks!) so sometimes things get lost in translation. The [jart] comment markers help me avoid doing that by mistake.
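
(For readers unfamiliar with the convention: the markers are ordinary source comments tagged with the maintainer's name, so local changes stand out during an upstream sync. A hypothetical example, not copied from the tree:)

```c
// [jart] keep: llamafile-specific AVX2 path, not present in upstream ggml
```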

@jart merged commit 7cb15c6 into Mozilla-Ocho:main on May 23, 2024
@ikawrakow
Contributor Author

> > I noticed that my AVX2 implementation of Q8_K quantization
>
> Could you go into more detail on this? Did I accidentally remove that in the last sync? The last sync was 30k lines (a lot of upstream development for 2 weeks!) so sometimes things get lost in translation. The [jart] comment markers help me avoid doing that by mistake.

I had added an AVX2 implementation for quantizing to Q8_K in the initial PR, see quantize_row_q8_K in https://github.com/Mozilla-Ocho/llamafile/pull/394/files. I did it that way because I didn't want to fool around with Georgi's single-threaded GGML_TASK_TYPE_INIT. But I actually like what you have done better: once GGML_TASK_TYPE_INIT is multi-threaded, there is no performance benefit from vectorizing the quantization to Q8_K (I measured with and without the Q8_K AVX2 implementation and it made no measurable difference on my computer).
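
For context, here is a scalar sketch of what quantizing a row to Q8_K involves, following ggml's block_q8_K layout (a reference illustration of the algorithm, not the llamafile code; the AVX2 version simply vectorized these loops):

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

#define QK_K 256

typedef struct {
    float   d;              // scale: dequantized value = d * qs[j]
    int8_t  qs[QK_K];       // 8-bit quants
    int16_t bsums[QK_K/16]; // per-16 sums, used by the k-quant dot products
} block_q8_K;

// Quantize k floats (k a multiple of QK_K) into k/QK_K Q8_K blocks.
static void quantize_row_q8_K_sketch(const float * x, block_q8_K * y, int k) {
    for (int ib = 0; ib < k/QK_K; ++ib) {
        // Find the element with the largest magnitude; it will map to -127.
        float amax = 0, max = 0;
        for (int j = 0; j < QK_K; ++j) {
            const float ax = fabsf(x[j]);
            if (ax > amax) { amax = ax; max = x[j]; }
        }
        if (amax == 0) { memset(&y[ib], 0, sizeof(block_q8_K)); x += QK_K; continue; }
        const float iscale = -127.f/max;
        for (int j = 0; j < QK_K; ++j) {
            const int v = (int)nearbyintf(iscale*x[j]);
            y[ib].qs[j] = (int8_t)(v > 127 ? 127 : v);
        }
        // Precompute per-16 sums so dot products with quants that carry block
        // minimums (e.g. Q2_K, Q4_K) can fold in the minimum term cheaply.
        for (int j = 0; j < QK_K/16; ++j) {
            int sum = 0;
            for (int l = 0; l < 16; ++l) sum += y[ib].qs[16*j + l];
            y[ib].bsums[j] = (int16_t)sum;
        }
        y[ib].d = 1/iscale;
        x += QK_K;
    }
}
```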
