Another performance optimization for Zen4 + refactoring #435
Btw, I'm noticing that this PR results in a small performance benefit for plain AVX2 as well. On a Ryzen-5975WX I measure for a 7B LLaMA model
I noticed that my AVX2 implementation of Q8_K quantization (needed by k- and i-quants) has been lost. jart has counteracted this by parallelizing quantization, but only in ggml_compute_forward_mul_mat. Adding the exact same technique to ggml_compute_forward_mul_mat_id results in a 5-6% performance improvement for Mixtral8x7B. This is on top of the improvement due to the better matrix multiplication implementation.
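To make the idea of parallelized quantization concrete, here is a minimal sketch in Python rather than the project's C/C++. It is not the code from this PR. The block size, the absmax-to-int8 scheme, and the names `quantize_row` and `quantize_rows_parallel` are all illustrative assumptions, and the real Q8_K block layout is not reproduced. The only point is the work split: thread `ith` out of `n_threads` quantizes rows `ith, ith + n_threads, ...`, so no two threads touch the same row.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 256  # assumed block size, for illustration only


def quantize_row(row):
    """Quantize one row block by block into (scale, int8-range values) pairs."""
    out = []
    for start in range(0, len(row), BLOCK):
        block = row[start:start + BLOCK]
        amax = max((abs(x) for x in block), default=0.0)
        scale = amax / 127.0 if amax > 0 else 0.0
        # An all-zero block keeps scale 0 and all-zero quants.
        quants = [round(x / scale) if scale else 0 for x in block]
        out.append((scale, quants))
    return out


def quantize_rows_parallel(rows, n_threads):
    """Split rows across threads: thread ith handles rows ith, ith + n_threads, ..."""
    result = [None] * len(rows)

    def worker(ith):
        for i in range(ith, len(rows), n_threads):
            result[i] = quantize_row(rows[i])  # each row has exactly one writer

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(worker, range(n_threads)))  # wait for all workers
    return result
```

Because of Python's global interpreter lock this sketch will not run faster than the serial loop. It only shows that a strided row split gives the same result as quantizing every row on one thread.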
Here are some measurements. The need-for-speed change to the MOE initialization phase is very impactful, sometimes by a 2x factor for Q quants on MOE. Prompt processing in general looks like it is sped up too, by around 10%, although my TinyLlama measurements on an overclocked Threadripper might have more than a 10% margin of error due to temperature issues that I'm working on resolving with our new benchmark tool.
Awesome change!
cpu_info | model_filename | test | t/s before | t/s after | t/s speedup |
---|---|---|---|---|---|
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.BF16.gguf | pp512 | 473.18 | 476.30 | 1.01x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.BF16.gguf | tg16 | 11.48 | 11.44 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.F16.gguf | pp512 | 329.84 | 324.31 | 0.98x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.F16.gguf | tg16 | 10.53 | 10.54 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q8_0.gguf | pp512 | 286.68 | 293.53 | 1.02x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q8_0.gguf | tg16 | 16.21 | 16.20 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q6_K.gguf | pp512 | 265.06 | 419.84 | 1.58x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q6_K.gguf | tg16 | 23.54 | 23.55 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf | pp512 | 238.58 | 416.88 | 1.75x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf | tg16 | 25.36 | 25.82 | 1.02x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf | pp512 | 244.96 | 438.29 | 1.79x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf | tg16 | 28.11 | 28.39 | 1.01x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q4_0.gguf | pp512 | 282.37 | 274.77 | 0.97x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q4_0.gguf | tg16 | 19.90 | 19.92 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf | pp512 | 248.50 | 421.22 | 1.70x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf | tg16 | 31.78 | 31.58 | 0.99x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q3_K_S.gguf | pp512 | 251.07 | 420.90 | 1.68x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q3_K_S.gguf | tg16 | 32.98 | 32.92 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q2_K.gguf | pp512 | 254.88 | 442.63 | 1.74x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | mixtral-8x7b-instruct-v0.1.Q2_K.gguf | tg16 | 36.12 | 36.30 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.F32.gguf | pp512 | 1698.69 | 2069.30 | 1.22x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.F32.gguf | tg16 | 58.50 | 58.92 | 1.01x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.BF16.gguf | pp512 | 2641.60 | 2649.28 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.BF16.gguf | tg16 | 81.59 | 80.77 | 0.99x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.F16.gguf | pp512 | 2189.05 | 2197.90 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.F16.gguf | tg16 | 83.46 | 83.13 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q8_0.gguf | pp512 | 2129.20 | 2168.69 | 1.02x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q8_0.gguf | tg16 | 104.66 | 103.43 | 0.99x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q6_K.gguf | pp512 | 2672.45 | 2794.55 | 1.05x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q6_K.gguf | tg16 | 136.47 | 138.28 | 1.01x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q5_1.gguf | pp512 | 2348.72 | 2355.37 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q5_1.gguf | tg16 | 147.15 | 143.40 | 0.97x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q5_K_M.gguf | pp512 | 2557.59 | 2732.35 | 1.07x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q5_K_M.gguf | tg16 | 148.82 | 148.44 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q5_0.gguf | pp512 | 2304.01 | 2383.25 | 1.03x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q5_0.gguf | tg16 | 152.97 | 151.87 | 0.99x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q5_K_S.gguf | pp512 | 2496.16 | 2772.70 | 1.11x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q5_K_S.gguf | tg16 | 148.52 | 148.18 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q4_1.gguf | pp512 | 2476.31 | 2408.42 | 0.97x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q4_1.gguf | tg16 | 153.88 | 154.73 | 1.01x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf | pp512 | 2598.21 | 2794.64 | 1.08x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf | tg16 | 156.07 | 156.29 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q4_K_S.gguf | pp512 | 2622.38 | 2841.10 | 1.08x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q4_K_S.gguf | tg16 | 158.89 | 159.41 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf | pp512 | 2440.09 | 2449.48 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf | tg16 | 160.57 | 162.58 | 1.01x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q3_K_L.gguf | pp512 | 2555.94 | 2804.29 | 1.10x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q3_K_L.gguf | tg16 | 159.89 | 161.74 | 1.01x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q3_K_M.gguf | pp512 | 2595.54 | 2768.61 | 1.07x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q3_K_M.gguf | tg16 | 165.25 | 166.58 | 1.01x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q3_K_S.gguf | pp512 | 2581.57 | 2579.76 | 1.00x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q3_K_S.gguf | tg16 | 169.06 | 170.54 | 1.01x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q2_K.gguf | pp512 | 2616.44 | 2814.73 | 1.08x |
AMD Ryzen Threadripper PRO 7995WX 96-Cores | TinyLlama-1.1B-Chat-v1.0.Q2_K.gguf | tg16 | 176.68 | 175.68 | 0.99x |
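For readers who want to check the last column, the speedup is simply the ratio of the two throughput columns, `t/s after / t/s before`. The short sketch below recomputes it for three of the Mixtral pp512 rows above. The helper names `speedup` and `within_noise` are hypothetical, and the 10% noise threshold is only an assumption borrowed from the margin of error mentioned in the comment.

```python
def speedup(before_tps, after_tps):
    """Ratio of tokens/second after the change to tokens/second before it."""
    return after_tps / before_tps


def within_noise(before_tps, after_tps, margin=0.10):
    """True if the change is smaller than the assumed measurement margin."""
    return abs(speedup(before_tps, after_tps) - 1.0) <= margin


# (quant, t/s before, t/s after) copied from the Mixtral pp512 rows above
mixtral_pp512 = [
    ("Q6_K", 265.06, 419.84),
    ("Q5_K_M", 238.58, 416.88),
    ("Q4_K_M", 244.96, 438.29),
]

for quant, before, after in mixtral_pp512:
    flag = "noise" if within_noise(before, after) else "real"
    print(f"{quant}: {speedup(before, after):.2f}x ({flag})")
```

Under that assumed threshold, the k-quant prompt-processing gains on Mixtral are well outside the noise band, while most tg16 rows and the non-k quants fall inside it.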
One thing that would help illuminate the benchmarks with regard to memory latency questions would be the Intel Memory Latency Checker: https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html For example, I get these measurements with my current 512 GB V-Color RAM setup.
Could you go into more detail on this? Did I accidentally remove that in the last sync? The last sync was 30k lines (a lot of upstream development for 2 weeks!) so sometimes things get lost in translation. The |
I had added |
This PR adds the following changes
AVX512F, AVX512VNNI, AVX512VL, AVX512BW and AVX512DQ are available). Improvements are in the 15-30% range on my Ryzen-7950X CPU (see Table 1 and Table 2 below)
Table 1 PP-512 performance for a LLaMA-7B model on a Ryzen-7950X CPU
Table 2 PP-512 performance for Mixtral-8x7B on a Ryzen-7950X CPU
If the cost associated with unpacking the quantized values for subsequent multiply-add operations with the activations is fully amortized, we would expect to have performance independent of the quantization type. Hence, I'm now pleased to observe that this is nearly the case except for Q2_K. I'm not sure why Q2_K performance is lower for the 7B model (my guess is that the compiler fails to achieve the best ordering of memory loads into SIMD registers and SIMD operations: Q2_K is the only quant that requires a single memory load for a block of 256 weights, all others need 2 or 3), but the fact that Q2_K performs better than the others for Mixtral8x7B may indicate that memory throughput may be playing a role even for prompt processing of long prompts.
I also did a comparison with current mainline llama.cpp (commit hash 95fb0aef) to see the combined effect of all optimizations. The following table shows the results for LLaMA-v2-7B and Mixtral8x7B on my Ryzen-7950X CPU