vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints by TheBlueMatt · Pull Request #23056 · ggml-org/llama.cpp

TheBlueMatt · 2026-05-14T14:26:33Z

This is the non-padding part of #22951.

Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned.

Mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, we can do the subtraction directly on a full int32_t rather than on an i8vec4 with bit twiddling, because the high bit of each byte is always free to start.
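As a rough illustration of the subtraction idea (not the shader code from this PR), here is a host-side C++ sketch of one common way to subtract a constant from four bytes packed in a 32-bit word when every byte is known to have its high bit clear. The function names are made up for this example, and the bias values used in the usage note (4 for 3-bit values, 32 for 6-bit values) are assumptions about how the quants are centered.

```cpp
#include <cassert>
#include <cstdint>

// Subtract `bias` from each of the four bytes packed in `packed` using a
// single 32-bit subtraction. Precondition: every byte of `packed` has its
// high bit clear and bias <= 128, so no borrow can cross a byte boundary.
inline uint32_t sub_bytes_packed(uint32_t packed, uint8_t bias) {
    const uint32_t high = 0x80808080u;
    assert((packed & high) == 0u && bias <= 128);
    const uint32_t bias4 = 0x01010101u * bias;  // bias replicated into each byte
    // Set the free high bits as borrow guards, subtract, then flip them back.
    // Flipping bit 7 of a byte equals adding 128 mod 256, which cancels the guard.
    return ((packed | high) - bias4) ^ high;
}

// Straightforward per-byte reference used only to check the packed version.
inline uint32_t sub_bytes_reference(uint32_t packed, uint8_t bias) {
    uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        const uint8_t v = static_cast<uint8_t>(packed >> (8 * i));
        const uint8_t r = static_cast<uint8_t>(v - bias);  // wraps like int8 two's complement
        out |= static_cast<uint32_t>(r) << (8 * i);
    }
    return out;
}
```

For example, `sub_bytes_packed(0x3F200100u, 32)` yields `0x1F00E1E0`, i.e. the signed bytes 31, 0, -31, -32 from high to low. The result can then be fed to a packed int8 dot product without unpacking into a vector first.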

Obviously this only impacts Q3_K and Q6_K which aren't all that commonly used AFAICT, but some mixed-quant Q4_K and Q5_K models have some Q6_K tensors in them, so there's still a win for some BMG users here.

On Intel BMG on mesa, the switch to MMVQ provides an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

The further switch to block loads leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.
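To make the alignment issue concrete: when a buffer is only guaranteed to be 2-byte aligned, a 32-bit word has to be assembled from two 16-bit reads. The sketch below is plain host-side C++ rather than GLSL, and both helper names are hypothetical. It only shows the general shape of reading a run of words up front into a local array instead of interleaving many small loads; the PR's shader may do this differently.

```cpp
#include <cstddef>
#include <cstdint>

// Read a little-endian 32-bit word starting at 16-bit index `i` of a buffer
// that is only guaranteed to be 2-byte aligned.
inline uint32_t load_u32_2byte_aligned(const uint16_t* data, size_t i) {
    return static_cast<uint32_t>(data[i]) |
           (static_cast<uint32_t>(data[i + 1]) << 16);
}

// Hypothetical "block load": fetch N consecutive 32-bit words in one pass
// into a local array, so later math works on registers instead of issuing
// scattered loads. `start` is a 16-bit index into `data`.
template <size_t N>
inline void block_load(const uint16_t* data, size_t start, uint32_t (&out)[N]) {
    for (size_t w = 0; w < N; ++w) {
        out[w] = load_u32_2byte_aligned(data, start + 2 * w);
    }
}
```

Writing the loads as one contiguous loop like this is a way of making the access pattern explicit rather than relying on the compiler to merge neighbouring loads.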

Stole the subtraction trick from #22066.

@virajwad any chance you could test this on Windows?

I have read and agree with the contributing guidelines
AI usage disclosure: NO

jeffbolznv · 2026-05-14T15:16:43Z

I get a nice boost for q3_k on RTX 5090:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3.5-9B-Q3_K_S.gguf

master:
default
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        201.31 ± 5.32 |

q3_k allowed
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        201.81 ± 3.79 |

q3_k forced
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        200.83 ± 5.80 |

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                50268 runs -    20.19 us/run - 117.44 MFLOP/run -   5.82 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                40470 runs -    24.80 us/run - 234.88 MFLOP/run -   9.47 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                34932 runs -    28.68 us/run - 352.32 MFLOP/run -  12.29 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                30033 runs -    33.31 us/run - 469.76 MFLOP/run -  14.10 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                25137 runs -    39.85 us/run - 587.20 MFLOP/run -  14.74 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                15087 runs -    66.55 us/run - 939.52 MFLOP/run -  14.12 TFLOPS
  
pr:
default
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        200.45 ± 6.09 |

q3_k allowed
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        205.01 ± 7.03 |

q3_k forced
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        210.46 ± 6.44 |

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                57936 runs -    17.27 us/run - 117.44 MFLOP/run -   6.80 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                45582 runs -    22.13 us/run - 234.88 MFLOP/run -  10.61 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                36636 runs -    27.33 us/run - 352.32 MFLOP/run -  12.89 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                30672 runs -    32.70 us/run - 469.76 MFLOP/run -  14.36 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                22572 runs -    44.47 us/run - 587.20 MFLOP/run -  13.20 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                16157 runs -    62.16 us/run - 939.52 MFLOP/run -  15.12 TFLOPS

q6_k is still slower with mmvq, which matches my expectations (mmvq generally helps for small quants because we're not quite bandwidth limited, and hurts for larger quants where we are bandwidth limited and it adds overhead to quantize activations).

Would you mind updating the PR to also force on q3_k for NVIDIA?

virajwad · 2026-05-14T17:29:56Z

Yes I'll try this for Intel now

virajwad · 2026-05-14T18:23:56Z

Tested on BMG B580 (Windows)

I don't see any model level perf improvement, but I do see improvement on the MUL_MATs in test-backend-ops for both Q3_K and Q6_K cases.

Before (Q3_K):

llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 1

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 55.94 ± 0.15 |

build: 1e5ad35 (9093)

llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 0

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | tg128 | 56.20 ± 0.27 |

build: 1e5ad35 (9093)

MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6816 runs -   159.66 us/run - 117.44 MFLOP/run - 735.58 GFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5964 runs -   170.97 us/run - 234.88 MFLOP/run -   1.37 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4828 runs -   207.68 us/run - 352.32 MFLOP/run -   1.70 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4473 runs -   229.22 us/run - 469.76 MFLOP/run -   2.05 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3591 runs -   291.77 us/run - 587.20 MFLOP/run -   2.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2033 runs -   501.40 us/run - 939.52 MFLOP/run -   1.87 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                386 runs -  2599.23 us/run -  60.13 GFLOP/run -  23.13 TFLOPS
Backend Vulkan0: OK

After (Q3_K):

llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 1

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 55.96 ± 0.11 |

build: 917fd19 (9094)

llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 0

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | tg128 | 56.14 ± 0.16 |

build: 917fd19 (9094)

MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6816 runs -   164.56 us/run - 117.44 MFLOP/run - 713.68 GFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 8520 runs -   117.76 us/run - 234.88 MFLOP/run -   1.99 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 7952 runs -   127.82 us/run - 352.32 MFLOP/run -   2.76 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 7242 runs -   140.58 us/run - 469.76 MFLOP/run -   3.34 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6669 runs -   152.93 us/run - 587.20 MFLOP/run -   3.84 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5350 runs -   188.67 us/run - 939.52 MFLOP/run -   4.98 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                388 runs -  2589.46 us/run -  60.13 GFLOP/run -  23.22 TFLOPS
Backend Vulkan0: OK

Before (Q6_K):

llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 0

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | tg128 | 47.37 ± 0.06 |

build: 1e5ad35 (9093)

llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 1

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 46.99 ± 0.04 |

build: 1e5ad35 (9093)

MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6816 runs -   154.94 us/run - 117.44 MFLOP/run - 757.98 GFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6390 runs -   161.90 us/run - 234.88 MFLOP/run -   1.45 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5396 runs -   186.37 us/run - 352.32 MFLOP/run -   1.89 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4260 runs -   242.81 us/run - 469.76 MFLOP/run -   1.93 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3591 runs -   285.34 us/run - 587.20 MFLOP/run -   2.06 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2461 runs -   422.37 us/run - 939.52 MFLOP/run -   2.22 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                402 runs -  2490.52 us/run -  60.13 GFLOP/run -  24.14 TFLOPS
Backend Vulkan0: OK

After (Q6_K):

llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 0

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | tg128 | 47.39 ± 0.05 |

build: 917fd19 (9094)

llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 1

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 47.00 ± 0.06 |

build: 917fd19 (9094)

MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6816 runs -   155.60 us/run - 117.44 MFLOP/run - 754.77 GFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5964 runs -   177.74 us/run - 234.88 MFLOP/run -   1.32 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5680 runs -   178.69 us/run - 352.32 MFLOP/run -   1.97 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5112 runs -   197.85 us/run - 469.76 MFLOP/run -   2.37 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4617 runs -   216.94 us/run - 587.20 MFLOP/run -   2.71 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3959 runs -   254.18 us/run - 939.52 MFLOP/run -   3.70 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                400 runs -  2505.26 us/run -  60.13 GFLOP/run -  24.00 TFLOPS
Backend Vulkan0: OK

virajwad · 2026-05-14T18:26:03Z

I'll test on Xe3 also now

virajwad · 2026-05-14T19:08:29Z

Tested on Panther Lake Xe3 (Windows)

Similar case - generally no change, minor improvement for Q3_K, but the MUL_MAT shaders in test-backend-ops overall get better.

Before (Q3_K):

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_Pre_BlockLoad_Q3_Q6\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q3_K_M.gguf -fa 1,0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

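| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 19.86 ± 1.02 |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 0 | tg128 | 19.35 ± 0.16 |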

build: 1e5ad35 (9093)

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_Pre_BlockLoad_Q3_Q6\TheBlueMatt_Pre_BlockLoad_Q3_Q6>test-backend-ops.exe perf -o MUL_MAT -p "type_a=q3_K"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) B390 GPU
  Device memory: 37099 MB (36330 MB free)

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4260 runs -   250.84 us/run - 117.44 MFLOP/run - 468.18 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3834 runs -   292.31 us/run - 234.88 MFLOP/run - 803.53 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3124 runs -   333.51 us/run - 352.32 MFLOP/run -   1.06 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2556 runs -   399.76 us/run - 469.76 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2223 runs -   465.99 us/run - 587.20 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  963 runs -  1160.71 us/run - 939.52 MFLOP/run - 809.44 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                202 runs -  4965.41 us/run -  60.13 GFLOP/run -  12.11 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

After (Q3_K):

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_BlockLoad_Q3_Q6\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q3_K_M.gguf -fa 1,0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

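| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 20.56 ± 1.27 |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 0 | tg128 | 19.42 ± 0.07 |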

build: 917fd19 (9094)

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_BlockLoad_Q3_Q6\TheBlueMatt_BlockLoad_Q3_Q6>test-backend-ops.exe perf -o MUL_MAT -p "type_a=q3_K"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) B390 GPU
  Device memory: 37099 MB (36330 MB free)

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4260 runs -   252.54 us/run - 117.44 MFLOP/run - 465.04 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4686 runs -   224.95 us/run - 234.88 MFLOP/run -   1.04 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4260 runs -   246.38 us/run - 352.32 MFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3834 runs -   263.70 us/run - 469.76 MFLOP/run -   1.78 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3249 runs -   327.90 us/run - 587.20 MFLOP/run -   1.79 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2568 runs -   405.41 us/run - 939.52 MFLOP/run -   2.32 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                196 runs -  5119.98 us/run -  60.13 GFLOP/run -   11.74 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

Before (Q6_K):

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_Pre_BlockLoad_Q3_Q6\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q6_K.gguf -fa 1,0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

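| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 14.72 ± 0.18 |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 0 | tg128 | 14.65 ± 0.01 |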

build: 1e5ad35 (9093)

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_Pre_BlockLoad_Q3_Q6\TheBlueMatt_Pre_BlockLoad_Q3_Q6>test-backend-ops.exe perf -o MUL_MAT -p "type_a=q6_K"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) B390 GPU
  Device memory: 37099 MB (36330 MB free)

  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2556 runs -   449.64 us/run - 117.44 MFLOP/run - 261.19 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2556 runs -   450.78 us/run - 234.88 MFLOP/run - 521.06 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2272 runs -   448.51 us/run - 352.32 MFLOP/run - 785.54 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2130 runs -   507.78 us/run - 469.76 MFLOP/run - 925.13 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 1881 runs -   570.52 us/run - 587.20 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 1284 runs -   811.44 us/run - 939.52 MFLOP/run -   1.16 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                170 runs -  5963.36 us/run -  60.13 GFLOP/run -  10.08 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

After (Q6_K):

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_BlockLoad_Q3_Q6\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q6_K.gguf -fa 1,0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

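| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 14.74 ± 0.11 |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 0 | tg128 | 14.67 ± 0.01 |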

build: 917fd19 (9094)

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_BlockLoad_Q3_Q6\TheBlueMatt_BlockLoad_Q3_Q6>test-backend-ops.exe perf -o MUL_MAT -p "type_a=q6_K"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) B390 GPU
  Device memory: 37099 MB (36330 MB free)

  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2556 runs -   451.21 us/run - 117.44 MFLOP/run - 260.28 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2556 runs -   433.68 us/run - 234.88 MFLOP/run - 541.60 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2272 runs -   441.27 us/run - 352.32 MFLOP/run - 798.42 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2343 runs -   448.37 us/run - 469.76 MFLOP/run -   1.05 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2223 runs -   456.46 us/run - 587.20 MFLOP/run -   1.29 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2247 runs -   467.04 us/run - 939.52 MFLOP/run -   2.01 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                210 runs -  4772.29 us/run -  60.13 GFLOP/run -  12.60 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

virajwad · 2026-05-14T19:10:35Z

This PR is fine by me, especially if it improves Ubuntu perf w/ Mesa that would be great. @rillomas could you please take a look at this PR?

TheBlueMatt · 2026-05-14T19:35:03Z

Pushed an update for NVIDIA, and also to take the MMVQ path for small k on Q2_K/Q3_K/Q6_K on Intel Xe2, which seems like a win (similar to NVIDIA).

diff since last push:

$ git diff-tree -U1 917fd1934 35a592de8
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index ee918349f..9531c4ee8 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -7867,3 +7867,3 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
     case VK_VENDOR_ID_NVIDIA:
-        if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_IQ1_S || src0_type == GGML_TYPE_IQ1_M) {
+        if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_IQ1_S || src0_type == GGML_TYPE_IQ1_M) {
             return true;
@@ -7900,2 +7900,8 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_

+        if (device->architecture == vk_device_architecture::INTEL_XE2) {
+            if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_Q6_K) {
+                return true;
+            }
+        }
+
         if (k < 2048) {
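Read in isolation, the two hunks above amount to a pair of early "always use MMVQ" overrides. The standalone sketch below restates just that part with made-up enum and function names (none of these are the real ggml types), and returns a separate `FallThrough` value wherever the real function would continue into the rest of its heuristic, which is not shown in the diff.

```cpp
// Hypothetical stand-ins for the vendor, architecture and tensor type checks.
enum class Vendor { Nvidia, Intel, Other };
enum class IntelArch { Xe2, Other };
enum class QuantType { Q2_K, Q3_K, Q6_K, IQ1_S, IQ1_M, Other };

// UseMmvq: the override applies. FallThrough: defer to the remaining
// heuristic (k thresholds etc.), which this sketch does not model.
enum class Decision { UseMmvq, FallThrough };

inline Decision mmvq_override(Vendor vendor, IntelArch arch, QuantType t) {
    switch (vendor) {
    case Vendor::Nvidia:
        // Q3_K joins the types that are always sent down the MMVQ path.
        if (t == QuantType::Q2_K || t == QuantType::Q3_K ||
            t == QuantType::IQ1_S || t == QuantType::IQ1_M) {
            return Decision::UseMmvq;
        }
        break;
    case Vendor::Intel:
        // On Xe2 these K-quants take MMVQ before any small-k check runs.
        if (arch == IntelArch::Xe2 &&
            (t == QuantType::Q2_K || t == QuantType::Q3_K || t == QuantType::Q6_K)) {
            return Decision::UseMmvq;
        }
        break;
    default:
        break;
    }
    return Decision::FallThrough;
}
```

Note that Q6_K is only forced on for Xe2 here, not for NVIDIA.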

rillomas · 2026-05-15T01:50:08Z

> This PR is fine by me, especially if it improves Ubuntu perf w/ Mesa that would be great. @rillomas could you please take a look at this PR?

You may not have MMVQ enabled correctly in your benchmarks due to this block so I think we should check and benchmark again to see if there are any benefits for Xe2+ even on Windows.

virajwad · 2026-05-22T18:32:00Z

@TheBlueMatt @rillomas
Sorry for the delay on this, and thanks for pointing out my mistake. I fixed the build to turn MMVQ=ON for Intel on Windows; here are some sample numbers:

llama-bench.exe -p 0 -r 3 -n 128

Xe2 B580:
llama 1B Q3_K - Medium (before) 253.58 ± 2.38
llama 1B Q3_K - Medium (after) 287.41 ± 5.71

qwen35 9B Q3_K - Medium (before) 56.70 ± 0.12
qwen35 9B Q3_K - Medium (after) 57.58 ± 0.06

llama 1B Q6_K (before) 235.47 ± 1.41
llama 1B Q6_K (after) 250.71 ± 2.18

qwen35 9B Q6_K (before) 47.37 ± 0.03
qwen35 9B Q6_K (after) 45.72 ± 0.03 --> tried multiple reruns of this data and got same results

@rillomas helped point me to the original issue (#17628), which got MMVQ disabled because of the A770. I think that if we re-enable MMVQ for Intel on Windows, it should be for Xe2+ only.

I'll try Xe3 very soon.

virajwad · 2026-05-22T18:47:53Z

@TheBlueMatt @rillomas
Here's Xe3 Win:

llama-bench.exe -p 0 -r 3 -n 128

Xe3 PTL:

llama 1B Q3_K - Medium (before) 115.28 ± 1.41
llama 1B Q3_K - Medium (after) 122.32 ± 3.02

qwen35 9B Q3_K - Medium (before) 21.40 ± 0.42
qwen35 9B Q3_K - Medium (after) 21.90 ± 0.46

llama 1B Q6_K (before) 87.98 ± 2.01
llama 1B Q6_K (after) 90.46 ± 0.27

qwen35 9B Q6_K (before) 14.93 ± 0.08
qwen35 9B Q6_K (after) 14.86 ± 0.18

virajwad · 2026-05-22T20:52:50Z

Personally I think the gains from the data look good. I'm ok with re-enabling MMVQ for Q3_K and Q6_K for Intel Xe2+ specifically (Win + Ubuntu). I think @rillomas will be back Monday if he has any feedback :)

Thanks @TheBlueMatt for your great efforts!!

Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned, and Q3_K still wins on NVIDIA as well.

Mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, we can do subtraction directly on a full int32_t rather than an i8vec4 with bit twiddling because the high bit is always free to start.

On Intel BMG on Mesa, the switch to MMVQ provides an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K. The further switch to block loads leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA override for K quants on Xe2 as well.

TheBlueMatt · 2026-05-28T07:07:28Z

Okay, disabled the Q3K block on AMD and switched to accepting Q6K for all Intel/Linux. Also moved to accepting Q2K/Q3K/Q6K on Intel Windows. Thanks for all the benchmarking. @virajwad note that I just assumed here that Q2K is also faster on Intel Windows like it is for mesa, happy to change if not, though Q2K isn't all that popular anyway.

0cc4m · 2026-06-01T09:38:45Z

@ggml-org/maintainers Another approval needed.

LuxKeiwoker · 2026-06-01T14:28:44Z

Does Alchemist, e.g. an A770, also profit from this change?

0cc4m · 2026-06-01T14:51:36Z

Yes

LuxKeiwoker · 2026-06-01T21:09:02Z

Just did a benchmark. It doesn't seem to have any effect on performance on an A770 (Win 11 VM with the latest Arc drivers installed).

virajwad · 2026-06-01T21:27:35Z

@LuxKeiwoker It's likely due to the `if (device->architecture == vk_device_architecture::INTEL_XE2) {` check. If A770 (Xe1 or "Xe") perf also improves with this change, we can enable it for that too, but we probably need to create a code flag like `INTEL_XE1` similar to the existing `INTEL_XE2` flag, because I don't know if we would see perf upside on any older platforms than Xe for this PR.

virajwad · 2026-06-01T21:35:17Z

Hi @0cc4m Is your A770 on Ubuntu / Linux, or Windows?

0cc4m · 2026-06-02T06:45:59Z

> Hi @0cc4m Is your A770 on Ubuntu / Linux, or Windows?

Linux.

> It's likely due to the `if (device->architecture == vk_device_architecture::INTEL_XE2) {` check. If A770 (Xe1 or "Xe") perf also improves with this change, we can enable it for that too, but we probably need to create a code flag like `INTEL_XE1` similar to the existing `INTEL_XE2` flag, because I don't know if we would see perf upside on any older platforms than Xe for this PR.

No, Q3_K and Q6_K will also go through the MMVQ path on Linux now, as long as k > 2048. I don't use Windows, so I can't tune for or test that. There Xe1 is excluded, yes.

LuxKeiwoker · 2026-06-02T08:02:37Z

> There Xe1 is excluded, yes.

I'm on Windows, so that's probably why it doesn't have any effect. Is there a way to enable it under Windows?

0cc4m · 2026-06-02T09:21:37Z

You can just take out the Xe2 limit, e.g. like this:

diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index e7d04634b..bff53dbf3 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -8432,10 +8432,8 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
             return true;
         }
     case VK_VENDOR_ID_INTEL:
-        if (device->architecture == vk_device_architecture::INTEL_XE2) {
-            if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_Q6_K) {
-                return true;
-            }
+        if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_Q6_K) {
+            return true;
         }
 
         if (device->driver_id == vk::DriverId::eIntelProprietaryWindows) {

That would get Xe1 to run with the same configuration as Xe2 on Windows.

TheBlueMatt requested a review from a team as a code owner May 14, 2026 14:26

TheBlueMatt mentioned this pull request May 14, 2026

vulkan: Pad Q3_K/Q6_K tensors out to 32-bit alignment #22951

Draft

TheBlueMatt force-pushed the 2026-05-intel-q3-q6-mmvq branch from c0432e8 to 917fd19 Compare May 14, 2026 15:41

jeffbolznv reviewed May 14, 2026

Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated

github-actions Bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels May 14, 2026

TheBlueMatt force-pushed the 2026-05-intel-q3-q6-mmvq branch 2 times, most recently from 9e0a417 to 35a592d Compare May 14, 2026 19:33

0cc4m reviewed May 27, 2026

Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp

Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated

Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated

TheBlueMatt force-pushed the 2026-05-intel-q3-q6-mmvq branch from 35a592d to 642aeaf Compare May 28, 2026 07:05

0cc4m approved these changes May 28, 2026

ServeurpersoCom approved these changes Jun 1, 2026

0cc4m merged commit 1962000 into ggml-org:master Jun 1, 2026
30 checks passed

7 participants
