vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints #23056
|
I get a nice boost for q3_k on RTX 5090. q6_k is still slower with mmvq, which matches my expectations: mmvq generally helps for small quants, where we're not quite bandwidth limited, and hurts for larger quants, where we are bandwidth limited and it adds overhead to quantize activations. Would you mind updating the PR to also force on q3_k for NVIDIA? |
c0432e8 to
917fd19
|
Yes I'll try this for Intel now |
Tested on BMG B580 (Windows)
I don't see any model-level perf improvement, but I do see improvement on the MUL_MATs in test-backend-ops for both the Q3_K and Q6_K cases.
Before (Q3_K):
llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 1
build: 1e5ad35 (9093)
llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 0
build: 1e5ad35 (9093)
After (Q3_K):
llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 1
build: 917fd19 (9094)
llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 0
build: 917fd19 (9094)
Before (Q6_K):
llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 0
build: 1e5ad35 (9093)
llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 1
build: 1e5ad35 (9093)
After (Q6_K):
llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 0
build: 917fd19 (9094)
llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 1
build: 917fd19 (9094) |
|
I'll test on Xe3 also now |
Tested on Panther Lake Xe3 (Windows)
Similar case: generally no change, with a minor improvement for Q3_K, but the MUL_MAT shaders in test-backend-ops get better overall.
Before (Q3_K):
C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_Pre_BlockLoad_Q3_Q6\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q3_K_M.gguf -fa 1,0
build: 1e5ad35 (9093)
After (Q3_K):
C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_BlockLoad_Q3_Q6\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q3_K_M.gguf -fa 1,0
build: 917fd19 (9094)
Before (Q6_K):
C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_Pre_BlockLoad_Q3_Q6\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q6_K.gguf -fa 1,0
build: 1e5ad35 (9093)
After (Q6_K):
C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_BlockLoad_Q3_Q6\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q6_K.gguf -fa 1,0
build: 917fd19 (9094) |
|
This PR is fine by me; if it improves Ubuntu perf with Mesa, that would be great. @rillomas could you please take a look at this PR? |
9e0a417 to
35a592d
|
Pushed an update for NVIDIA, and also to take the MMVQ path for small k on Q2/Q3/Q6 on Xe2, which seems like a win (similar to NVIDIA).
diff since last push:
$ git diff-tree -U1 917fd1934 35a592de8
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index ee918349f..9531c4ee8 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -7867,3 +7867,3 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
     case VK_VENDOR_ID_NVIDIA:
-        if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_IQ1_S || src0_type == GGML_TYPE_IQ1_M) {
+        if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_IQ1_S || src0_type == GGML_TYPE_IQ1_M) {
             return true;
@@ -7900,2 +7900,8 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
+        if (device->architecture == vk_device_architecture::INTEL_XE2) {
+            if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_Q6_K) {
+                return true;
+            }
+        }
+
         if (k < 2048) {
|
You may not have MMVQ enabled correctly in your benchmarks due to this block so I think we should check and benchmark again to see if there are any benefits for Xe2+ even on Windows. |
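For readers following the two hunks above, here is a minimal standalone sketch of the type overrides they add. It is not the real `ggml_vk_should_use_mmvq`: the `Vendor`, `Arch`, `QuantType`, `Device`, and `Decision` names are hypothetical stand-ins, the body of the `k < 2048` check is assumed to reject MMVQ (the diff only shows its first line), and every case the diff does not show is reported as `NotModelled` rather than guessed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the ggml/Vulkan types named in the diff.
enum class Vendor { Nvidia, Intel, Other };
enum class Arch { IntelXe2, Other };
enum class QuantType { Q2_K, Q3_K, Q6_K, IQ1_S, IQ1_M, Other };

struct Device {
    Vendor vendor;
    Arch   arch;
};

// NotModelled marks paths whose real behaviour is not visible in the diff.
enum class Decision { UseMmvq, SkipMmvq, NotModelled };

static Decision decide_mmvq_sketch(const Device& dev, QuantType type, uint32_t k) {
    switch (dev.vendor) {
    case Vendor::Nvidia:
        // First hunk: Q3_K joins the types that always take MMVQ on NVIDIA.
        if (type == QuantType::Q2_K || type == QuantType::Q3_K ||
            type == QuantType::IQ1_S || type == QuantType::IQ1_M) {
            return Decision::UseMmvq;
        }
        return Decision::NotModelled;
    case Vendor::Intel:
        // Second hunk: on Xe2 these K-quants take MMVQ regardless of k.
        if (dev.arch == Arch::IntelXe2 &&
            (type == QuantType::Q2_K || type == QuantType::Q3_K ||
             type == QuantType::Q6_K)) {
            return Decision::UseMmvq;
        }
        // Assumption: the (unshown) body of this check returns false.
        if (k < 2048) {
            return Decision::SkipMmvq;
        }
        return Decision::NotModelled;
    default:
        return Decision::NotModelled;
    }
}
```

With this sketch, `decide_mmvq_sketch(Device{Vendor::Intel, Arch::IntelXe2}, QuantType::Q6_K, 512)` yields `UseMmvq`, while the same call with `Arch::Other` yields `SkipMmvq`, which is the difference the benchmarks need to account for.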
|
@TheBlueMatt @rillomas
llama-bench.exe -p 0 -r 3 -n 128
Xe2 B580:
qwen35 9B Q3_K - Medium (before): 56.70 ± 0.12
llama 1B Q6_K (before): 235.47 ± 1.41
qwen35 9B Q6_K (before): 47.37 ± 0.03
@rillomas helped point me to the original issue (#17628), which got MMVQ disabled because of the A770. I think that if we re-enable MMVQ for Intel on Windows, it should be for Xe2+ only. I'll try Xe3 very soon. |
|
@TheBlueMatt @rillomas
llama-bench.exe -p 0 -r 3 -n 128
Xe3 PTL:
llama 1B Q3_K - Medium (before): 115.28 ± 1.41
qwen35 9B Q3_K - Medium (before): 21.40 ± 0.42
llama 1B Q6_K (before): 87.98 ± 2.01
qwen35 9B Q6_K (before): 14.93 ± 0.08 |
|
Personally I think the gains from the data look good. I'm ok with re-enabling MMVQ for Q3_K and Q6_K for Intel Xe2+ specifically (Win + Ubuntu). I think @rillomas will be back Monday if he has any feedback :) Thanks @TheBlueMatt for your great efforts!! |
Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned, and Q3_K still wins on NVIDIA as well.
mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, we can do subtraction directly on a full int32_t rather than an i8vec4 with bit twiddling because the high bit is always free to start.
On Intel BMG on mesa, the switch to MMVQ provides an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
The further switch to block loads leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA override for K quants on Xe2 as well.
35a592d to
642aeaf
|
Okay, disabled the Q3K block on AMD and switched to accepting Q6K for all Intel/Linux. Also moved to accepting Q2K/Q3K/Q6K on Intel Windows. Thanks for all the benchmarking. @virajwad note that I just assumed here that Q2K is also faster on Intel Windows like it is for mesa, happy to change if not, though Q2K isn't all that popular anyway. |
|
@ggml-org/maintainers Another approval needed. |
|
Does Alchemist, e.g. an A770, also profit from this change? |
|
Yes |
|
@LuxKeiwoker It's likely due to the |
|
Hi @0cc4m Is your A770 on Ubuntu / Linux, or Windows? |
Linux.
No, Q3_K and Q6_K will also go through the MMVQ path on Linux now, as long as k > 2048. I don't use Windows, so I can't tune for or test that. There Xe1 is excluded, yes. |
I'm on Windows, so that's probably why it doesn't have any effect. Is there a way to enable it under Windows? |
|
You can just take out the Xe2 limit, e.g. like this:
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index e7d04634b..bff53dbf3 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -8432,10 +8432,8 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
             return true;
         }
     case VK_VENDOR_ID_INTEL:
-        if (device->architecture == vk_device_architecture::INTEL_XE2) {
-            if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_Q6_K) {
-                return true;
-            }
+        if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_Q6_K) {
+            return true;
         }
         if (device->driver_id == vk::DriverId::eIntelProprietaryWindows) {
That would get Xe1 to run with the same configuration as Xe2 on Windows. |

This is the non-padding part of #22951.
Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned.
mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, we can do subtraction directly on a full int32_t rather than an i8vec4 with bit twiddling because the high bit is always free to start.
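To illustrate why a free high bit lets the subtraction run on a whole 32-bit integer, here is a host-side C++ sketch of the general packed-byte trick. It is an analogy, not the shader code from this PR: `sub_bias_packed` and `lane` are hypothetical helpers, and the example bias of 32 is chosen only because 6-bit values fit in 0..63.

```cpp
#include <cassert>
#include <cstdint>

// Subtract `bias` (0..127) from each of the four bytes packed in `packed`.
// Precondition: every byte of `packed` is in 0..127, i.e. bit 7 is free.
static inline uint32_t sub_bias_packed(uint32_t packed, uint32_t bias) {
    const uint32_t high  = 0x80808080u;        // bit 7 of every lane
    const uint32_t bias4 = bias * 0x01010101u; // bias replicated per lane
    // With bit 7 set, each lane holds 128 + v >= 128 > bias, so the
    // subtraction can never borrow from the neighbouring lane.
    const uint32_t diff = (packed | high) - bias4;
    // Flipping bit 7 adds 128 modulo 256, leaving (v - bias) as a
    // two's-complement int8 in every lane.
    return diff ^ high;
}

// Extract lane i (0 = least significant byte) as a signed 8-bit value.
static inline int8_t lane(uint32_t packed, int i) {
    return static_cast<int8_t>((packed >> (8 * i)) & 0xFFu);
}
```

For example, the packed bytes 0, 31, 32, 63 (lowest lane first) with a bias of 32 become -32, -1, 0, 31, using one OR, one subtract, and one XOR instead of four per-component operations.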
Obviously this only impacts Q3_K and Q6_K which aren't all that commonly used AFAICT, but some mixed-quant Q4_K and Q5_K models have some Q6_K tensors in them, so there's still a win for some BMG users here.
On Intel BMG on mesa, the switch to MMVQ provides an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
The further switch to block loads leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
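The shader changes themselves are not shown in this thread, so the following is only a host-side C++ analogy of what "block-loading" a 2-byte-aligned block can look like: copy the whole block as 16-bit words in one pass, then assemble 32-bit values from adjacent words. It assumes the Q6_K layout of 128 + 64 + 16 + 2 = 210 bytes per 256 weights and a little-endian host; `BlockQ6K`, `load_block_words`, and `word32` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr int QK_K = 256;

// Assumed Q6_K layout; the half-precision scale is kept as raw 16 bits.
struct BlockQ6K {
    uint8_t  ql[QK_K / 2];      // low 4 bits of each quant
    uint8_t  qh[QK_K / 4];      // high 2 bits of each quant
    int8_t   scales[QK_K / 16]; // per-group scales
    uint16_t d;                 // super-block scale (fp16 bits)
};
static_assert(sizeof(BlockQ6K) == 210, "unexpected padding");
// 210 % 4 == 2: every other block in an array is only 2-byte aligned,
// so 32-bit loads cannot be assumed to be aligned.
static_assert(sizeof(BlockQ6K) % 4 == 2, "blocks alternate alignment");

using BlockWords = std::array<uint16_t, sizeof(BlockQ6K) / 2>;

// Load one whole block as 16-bit words in a single contiguous copy.
static BlockWords load_block_words(const BlockQ6K* blocks, std::size_t i) {
    BlockWords w;
    std::memcpy(w.data(), &blocks[i], sizeof(BlockQ6K));
    return w;
}

// Assemble a 32-bit value from two adjacent words.
// `byte_offset` must be even; byte order assumes a little-endian host.
static uint32_t word32(const BlockWords& w, std::size_t byte_offset) {
    const std::size_t i = byte_offset / 2;
    return static_cast<uint32_t>(w[i]) |
           (static_cast<uint32_t>(w[i + 1]) << 16);
}
```

Here `word32(w, 0)` covers the first four `ql` bytes and `word32(w, 128)` the first four `qh` bytes, so all later field reads come from the single up-front copy rather than from interleaved loads of separate arrays.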
Stole the subtraction trick from #22066.
@virajwad any chance you could test this on Windows?