vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints#23056

Merged: 0cc4m merged 1 commit into ggml-org:master from TheBlueMatt:2026-05-intel-q3-q6-mmvq on Jun 1, 2026

@TheBlueMatt
Contributor

This is the non-padding part of #22951.

Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned.
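
To make "only 2-byte aligned" concrete, here is a minimal sketch. It assumes ggml's usual K-quant block layouts with `QK_K = 256`; the `*_sketch` structs are illustrative re-declarations written for this note (with the fp16 scale modeled as a raw `uint16_t`), not code from this PR. Under that assumption a Q3_K block is 110 bytes and a Q6_K block is 210 bytes, so every odd-numbered block in a tightly packed array starts at an offset that is 2 mod 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: super-block size used by ggml's K-quants.
constexpr int QK_K = 256;

// Illustrative re-declaration of a Q3_K block (not copied from the PR).
struct block_q3_K_sketch {
    std::uint8_t  hmask[QK_K / 8];  // high bit of each 3-bit quant
    std::uint8_t  qs[QK_K / 4];     // low 2 bits of each quant
    std::uint8_t  scales[12];       // packed 6-bit scales
    std::uint16_t d;                // fp16 super-block scale, stored raw
};

// Illustrative re-declaration of a Q6_K block (not copied from the PR).
struct block_q6_K_sketch {
    std::uint8_t  ql[QK_K / 2];       // low 4 bits of each 6-bit quant
    std::uint8_t  qh[QK_K / 4];       // high 2 bits of each quant
    std::int8_t   scales[QK_K / 16];  // 8-bit scales
    std::uint16_t d;                  // fp16 super-block scale, stored raw
};

// Byte offset (mod 4) of block `index` in a tightly packed array.
constexpr std::size_t block_offset_mod4(std::size_t block_size, std::size_t index) {
    return (index * block_size) % 4;
}

static_assert(sizeof(block_q3_K_sketch) == 110, "Q3_K block is 110 bytes");
static_assert(sizeof(block_q6_K_sketch) == 210, "Q6_K block is 210 bytes");
static_assert(block_offset_mod4(sizeof(block_q3_K_sketch), 1) == 2, "odd Q3_K blocks are only 2-byte aligned");
static_assert(block_offset_mod4(sizeof(block_q6_K_sketch), 1) == 2, "odd Q6_K blocks are only 2-byte aligned");
```

Because the block start alternates between 0 and 2 mod 4, a shader cannot assume that 32-bit loads from the block are naturally aligned, which is why the loads need extra care.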

mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, we can do subtraction directly on a full int32_t rather than an i8vec4 with bit twiddling because the high bit is always free to start.
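
The subtraction idea can be illustrated in plain C++. This is a minimal sketch of one well-known way to subtract a constant from four packed bytes in a single 32-bit operation, assuming bit 7 of every byte is clear on input (for example, 6-bit Q6_K quants before the offset of 32 is removed). `sub_bias_packed` and `sub_bias_per_byte` are hypothetical helpers written for this note; the shader in this PR may do it differently.

```cpp
#include <cassert>
#include <cstdint>

// Subtract `bias` (1..127) from each of the four byte lanes of `packed`.
// Precondition (assumption): every lane holds a value in [0, 127].
// Setting bit 7 in each lane first guarantees that no lane borrows from
// its neighbour; the final XOR turns each lane into the two's-complement
// int8 result of (lane - bias).
inline std::uint32_t sub_bias_packed(std::uint32_t packed, std::uint8_t bias) {
    const std::uint32_t high  = 0x80808080u;
    const std::uint32_t bias4 = 0x01010101u * bias;
    return ((packed | high) - bias4) ^ high;
}

// Reference implementation: unpack, subtract per byte, repack.
inline std::uint32_t sub_bias_per_byte(std::uint32_t packed, std::uint8_t bias) {
    std::uint32_t out = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const std::uint8_t v = static_cast<std::uint8_t>(packed >> (8 * lane));
        const std::uint8_t r = static_cast<std::uint8_t>(v - bias);
        out |= static_cast<std::uint32_t>(r) << (8 * lane);
    }
    return out;
}
```

For instance, with lanes 0, 1, 32 and 63 packed as `0x3F200100` and a bias of 32, both helpers return `0x1F00E1E0`, i.e. the int8 values -32, -31, 0 and 31.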

Obviously this only impacts Q3_K and Q6_K which aren't all that commonly used AFAICT, but some mixed-quant Q4_K and Q5_K models have some Q6_K tensors in them, so there's still a win for some BMG users here.

On Intel BMG on mesa, the switch to MMVQ provides an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

The further switch to block loads leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Stole the subtraction trick from #22066.

@virajwad any chance you could test this on windows?

@jeffbolznv
Contributor

I get a nice boost for q3_k on RTX 5090:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 20 --prio 1 -m c:\models\Qwen3.5-9B-Q3_K_S.gguf

master:
default
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        201.31 ± 5.32 |

q3_k allowed
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        201.81 ± 3.79 |

q3_k forced
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        200.83 ± 5.80 |

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                50268 runs -    20.19 us/run - 117.44 MFLOP/run -   5.82 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                40470 runs -    24.80 us/run - 234.88 MFLOP/run -   9.47 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                34932 runs -    28.68 us/run - 352.32 MFLOP/run -  12.29 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                30033 runs -    33.31 us/run - 469.76 MFLOP/run -  14.10 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                25137 runs -    39.85 us/run - 587.20 MFLOP/run -  14.74 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                15087 runs -    66.55 us/run - 939.52 MFLOP/run -  14.12 TFLOPS
  
pr:
default
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        200.45 ± 6.09 |

q3_k allowed
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        205.01 ± 7.03 |

q3_k forced
| qwen35 9B Q3_K - Small         |   4.01 GiB |     8.95 B | Vulkan     |  99 |  1 |           tg128 |        210.46 ± 6.44 |

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                57936 runs -    17.27 us/run - 117.44 MFLOP/run -   6.80 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                45582 runs -    22.13 us/run - 234.88 MFLOP/run -  10.61 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                36636 runs -    27.33 us/run - 352.32 MFLOP/run -  12.89 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                30672 runs -    32.70 us/run - 469.76 MFLOP/run -  14.36 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                22572 runs -    44.47 us/run - 587.20 MFLOP/run -  13.20 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                16157 runs -    62.16 us/run - 939.52 MFLOP/run -  15.12 TFLOPS

q6_k is still slower with mmvq, which matches my expectations (mmvq generally helps for small quants because we're not quite bandwidth limited, and hurts for larger quants where we are bandwidth limited and it adds overhead to quantize activations).
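
As a rough way to see how close a MUL_MAT run is to being bandwidth limited, the weight traffic can be turned into an effective bandwidth. This is a back-of-the-envelope sketch, assuming 110 bytes per 256 weights for Q3_K (210 for Q6_K) and counting only the quantized weight matrix, not the activations or the output; `weight_bytes` and `effective_gb_per_s` are hypothetical helpers written for this note.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of quantized weights read by one MUL_MAT over an m x k matrix.
// Assumption: `block_bytes` bytes per 256-weight super-block.
constexpr std::uint64_t weight_bytes(std::uint64_t m, std::uint64_t k, std::uint64_t block_bytes) {
    return m * k / 256 * block_bytes;
}

// Effective weight-read bandwidth in GB/s (1 GB = 1e9 bytes) for a run
// that takes `us_per_run` microseconds.
constexpr double effective_gb_per_s(std::uint64_t bytes, double us_per_run) {
    return static_cast<double>(bytes) / (us_per_run * 1e-6) / 1e9;
}
```

With the n=1 rows above (m=4096, k=14336), the Q3_K matrix is 25,231,360 bytes, so 20.19 us/run on master corresponds to roughly 1250 GB/s and 17.27 us/run on the PR to roughly 1461 GB/s. Comparing such figures with the card's peak memory bandwidth (not stated in this thread) indicates how much headroom is left.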

Would you mind updating the PR to also force on q3_k for NVIDIA?

@TheBlueMatt force-pushed the 2026-05-intel-q3-q6-mmvq branch from c0432e8 to 917fd19 on May 14, 2026 15:41
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp (outdated)
@github-actions bot added the Vulkan and ggml labels on May 14, 2026
@virajwad
Contributor

Yes I'll try this for Intel now

@virajwad
Contributor

virajwad commented May 14, 2026

Tested on BMG B580 (Windows)

I don't see any model level perf improvement, but I do see improvement on the MUL_MATs in test-backend-ops for both Q3_K and Q6_K cases.

Before (Q3_K):

llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 1

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 55.94 ± 0.15 |

build: 1e5ad35 (9093)

llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 0

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | tg128 | 56.20 ± 0.27 |

build: 1e5ad35 (9093)

MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6816 runs -   159.66 us/run - 117.44 MFLOP/run - 735.58 GFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5964 runs -   170.97 us/run - 234.88 MFLOP/run -   1.37 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4828 runs -   207.68 us/run - 352.32 MFLOP/run -   1.70 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4473 runs -   229.22 us/run - 469.76 MFLOP/run -   2.05 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3591 runs -   291.77 us/run - 587.20 MFLOP/run -   2.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2033 runs -   501.40 us/run - 939.52 MFLOP/run -   1.87 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                386 runs -  2599.23 us/run -  60.13 GFLOP/run -  23.13 TFLOPS
Backend Vulkan0: OK

After (Q3_K):

llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 1

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 55.96 ± 0.11 |

build: 917fd19 (9094)

llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q3_K_M.gguf -fa 0

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | tg128 | 56.14 ± 0.16 |

build: 917fd19 (9094)

MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6816 runs -   164.56 us/run - 117.44 MFLOP/run - 713.68 GFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 8520 runs -   117.76 us/run - 234.88 MFLOP/run -   1.99 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 7952 runs -   127.82 us/run - 352.32 MFLOP/run -   2.76 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 7242 runs -   140.58 us/run - 469.76 MFLOP/run -   3.34 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6669 runs -   152.93 us/run - 587.20 MFLOP/run -   3.84 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5350 runs -   188.67 us/run - 939.52 MFLOP/run -   4.98 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                388 runs -  2589.46 us/run -  60.13 GFLOP/run -  23.22 TFLOPS
Backend Vulkan0: OK

Before (Q6_K):

llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 0

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | tg128 | 47.37 ± 0.06 |

build: 1e5ad35 (9093)

llamacpp_test_builds\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 1

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 46.99 ± 0.04 |

build: 1e5ad35 (9093)

MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6816 runs -   154.94 us/run - 117.44 MFLOP/run - 757.98 GFLOPS
 MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6390 runs -   161.90 us/run - 234.88 MFLOP/run -   1.45 TFLOPS
 MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5396 runs -   186.37 us/run - 352.32 MFLOP/run -   1.89 TFLOPS
 MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4260 runs -   242.81 us/run - 469.76 MFLOP/run -   1.93 TFLOPS
 MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3591 runs -   285.34 us/run - 587.20 MFLOP/run -   2.06 TFLOPS
 MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2461 runs -   422.37 us/run - 939.52 MFLOP/run -   2.22 TFLOPS
 MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                402 runs -  2490.52 us/run -  60.13 GFLOP/run -  24.14 TFLOPS
 Backend Vulkan0: OK

After (Q6_K):

llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 0

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | tg128 | 47.39 ± 0.05 |

build: 917fd19 (9094)

llamacpp_test_builds\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m Qwen3.5-9B-Q6_K.gguf -fa 1

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 47.00 ± 0.06 |

build: 917fd19 (9094)

MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 6816 runs -   155.60 us/run - 117.44 MFLOP/run - 754.77 GFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5964 runs -   177.74 us/run - 234.88 MFLOP/run -   1.32 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5680 runs -   178.69 us/run - 352.32 MFLOP/run -   1.97 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 5112 runs -   197.85 us/run - 469.76 MFLOP/run -   2.37 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4617 runs -   216.94 us/run - 587.20 MFLOP/run -   2.71 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3959 runs -   254.18 us/run - 939.52 MFLOP/run -   3.70 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                400 runs -  2505.26 us/run -  60.13 GFLOP/run -  24.00 TFLOPS
Backend Vulkan0: OK
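
The before/after `us/run` columns above are easier to compare as ratios. A small sketch, with the B580 Q3_K rows copied from the test-backend-ops output in this comment; `MulMatTiming`, `speedup` and `kB580Q3K` are names invented for this note.

```cpp
#include <cassert>

// One MUL_MAT shape (identified by its n) with time per run before and
// after the change, in microseconds.
struct MulMatTiming {
    int    n;
    double before_us;
    double after_us;
};

// Values above 1.0 mean the "after" build is faster.
constexpr double speedup(const MulMatTiming& t) {
    return t.before_us / t.after_us;
}

// B580 (Windows) Q3_K rows, m=4096, k=14336, copied from the logs above.
constexpr MulMatTiming kB580Q3K[] = {
    {1,   159.66,  164.56},
    {2,   170.97,  117.76},
    {3,   207.68,  127.82},
    {4,   229.22,  140.58},
    {5,   291.77,  152.93},
    {8,   501.40,  188.67},
    {512, 2599.23, 2589.46},
};
```

On these numbers the n=2 through n=8 shapes speed up by roughly 1.45x to 2.66x, while n=1 and n=512 are essentially unchanged. Single-sequence token generation is normally a matrix-vector product (n=1), which is at least consistent with tg128 not moving in this run.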

@virajwad
Contributor

I'll test on Xe3 also now

@virajwad
Contributor

virajwad commented May 14, 2026

Tested on Panther Lake Xe3 (Windows)

Similar case - generally no change, minor improvement for Q3_K, but the MUL_MAT shaders in test-backend-ops overall get better.

Before (Q3_K):

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_Pre_BlockLoad_Q3_Q6\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q3_K_M.gguf -fa 1,0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

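| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 19.86 ± 1.02 |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 0 | tg128 | 19.35 ± 0.16 |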

build: 1e5ad35 (9093)

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_Pre_BlockLoad_Q3_Q6\TheBlueMatt_Pre_BlockLoad_Q3_Q6>test-backend-ops.exe perf -o MUL_MAT -p "type_a=q3_K"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) B390 GPU
  Device memory: 37099 MB (36330 MB free)

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4260 runs -   250.84 us/run - 117.44 MFLOP/run - 468.18 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3834 runs -   292.31 us/run - 234.88 MFLOP/run - 803.53 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3124 runs -   333.51 us/run - 352.32 MFLOP/run -   1.06 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2556 runs -   399.76 us/run - 469.76 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2223 runs -   465.99 us/run - 587.20 MFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                  963 runs -  1160.71 us/run - 939.52 MFLOP/run - 809.44 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                202 runs -  4965.41 us/run -  60.13 GFLOP/run -  12.11 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

After (Q3_K):

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_BlockLoad_Q3_Q6\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q3_K_M.gguf -fa 1,0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

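| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 20.56 ± 1.27 |
| qwen35 9B Q3_K - Medium | 4.34 GiB | 8.95 B | Vulkan | 99 | 0 | tg128 | 19.42 ± 0.07 |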

build: 917fd19 (9094)

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_BlockLoad_Q3_Q6\TheBlueMatt_BlockLoad_Q3_Q6>test-backend-ops.exe perf -o MUL_MAT -p "type_a=q3_K"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) B390 GPU
  Device memory: 37099 MB (36330 MB free)

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4260 runs -   252.54 us/run - 117.44 MFLOP/run - 465.04 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4686 runs -   224.95 us/run - 234.88 MFLOP/run -   1.04 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 4260 runs -   246.38 us/run - 352.32 MFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3834 runs -   263.70 us/run - 469.76 MFLOP/run -   1.78 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 3249 runs -   327.90 us/run - 587.20 MFLOP/run -   1.79 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2568 runs -   405.41 us/run - 939.52 MFLOP/run -   2.32 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                196 runs -  5119.98 us/run -  60.13 GFLOP/run -   11.74 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

Before (Q6_K):

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_Pre_BlockLoad_Q3_Q6\TheBlueMatt_Pre_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q6_K.gguf -fa 1,0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

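| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 14.72 ± 0.18 |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 0 | tg128 | 14.65 ± 0.01 |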

build: 1e5ad35 (9093)

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_Pre_BlockLoad_Q3_Q6\TheBlueMatt_Pre_BlockLoad_Q3_Q6>test-backend-ops.exe perf -o MUL_MAT -p "type_a=q6_K"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) B390 GPU
  Device memory: 37099 MB (36330 MB free)

  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2556 runs -   449.64 us/run - 117.44 MFLOP/run - 261.19 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2556 runs -   450.78 us/run - 234.88 MFLOP/run - 521.06 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2272 runs -   448.51 us/run - 352.32 MFLOP/run - 785.54 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2130 runs -   507.78 us/run - 469.76 MFLOP/run - 925.13 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 1881 runs -   570.52 us/run - 587.20 MFLOP/run -   1.03 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 1284 runs -   811.44 us/run - 939.52 MFLOP/run -   1.16 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                170 runs -  5963.36 us/run -  60.13 GFLOP/run -  10.08 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

After (Q6_K):

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_BlockLoad_Q3_Q6\TheBlueMatt_BlockLoad_Q3_Q6>llama-bench.exe -p 0 -r 5 -n 128 -m C:\Users\dungeon\Desktop\models_other\Qwen3.5-9B-Q6_K.gguf -fa 1,0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

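| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 1 | tg128 | 14.74 ± 0.11 |
| qwen35 9B Q6_K | 6.94 GiB | 8.95 B | Vulkan | 99 | 0 | tg128 | 14.67 ± 0.01 |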

build: 917fd19 (9094)

C:\Users\dungeon\Desktop\llamacpp\TheBlueMatt_BlockLoad_Q3_Q6\TheBlueMatt_BlockLoad_Q3_Q6>test-backend-ops.exe perf -o MUL_MAT -p "type_a=q6_K"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) B390 GPU (Intel Corporation) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: Intel(R) Arc(TM) B390 GPU
  Device memory: 37099 MB (36330 MB free)

  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2556 runs -   451.21 us/run - 117.44 MFLOP/run - 260.28 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2556 runs -   433.68 us/run - 234.88 MFLOP/run - 541.60 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2272 runs -   441.27 us/run - 352.32 MFLOP/run - 798.42 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2343 runs -   448.37 us/run - 469.76 MFLOP/run -   1.05 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2223 runs -   456.46 us/run - 587.20 MFLOP/run -   1.29 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 2247 runs -   467.04 us/run - 939.52 MFLOP/run -   2.01 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                210 runs -  4772.29 us/run -  60.13 GFLOP/run -  12.60 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

@virajwad
Contributor

virajwad commented May 14, 2026

This PR is fine by me, especially if it improves Ubuntu perf w/ Mesa that would be great. @rillomas could you please take a look at this PR?

@TheBlueMatt force-pushed the 2026-05-intel-q3-q6-mmvq branch 2 times, most recently from 9e0a417 to 35a592d on May 14, 2026 19:33
@TheBlueMatt
Contributor Author

TheBlueMatt commented May 14, 2026

Pushed an update to force Q3_K on for NVIDIA, and also to take the MMVQ path for small k on Q2_K/Q3_K/Q6_K on Intel Xe2, which seems like a win (similar to NVIDIA).

diff since last push:

$ git diff-tree -U1 917fd1934 35a592de8
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index ee918349f..9531c4ee8 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -7867,3 +7867,3 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
     case VK_VENDOR_ID_NVIDIA:
-        if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_IQ1_S || src0_type == GGML_TYPE_IQ1_M) {
+        if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_IQ1_S || src0_type == GGML_TYPE_IQ1_M) {
             return true;
@@ -7900,2 +7900,8 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_

+        if (device->architecture == vk_device_architecture::INTEL_XE2) {
+            if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_Q6_K) {
+                return true;
+            }
+        }
+
         if (k < 2048) {

@rillomas
Contributor

rillomas commented May 15, 2026

> This PR is fine by me, especially if it improves Ubuntu perf w/ Mesa that would be great. @rillomas could you please take a look at this PR?

You may not have MMVQ enabled correctly in your benchmarks due to this block so I think we should check and benchmark again to see if there are any benefits for Xe2+ even on Windows.

@virajwad
Contributor

@TheBlueMatt @rillomas
Sorry for the delay on this, and thanks for pointing out my mistake. I fixed the build to turn MMVQ=ON for Intel Win, here's some sample numbers:

llama-bench.exe -p 0 -r 3 -n 128

Xe2 B580 (tg128, t/s):

| model | before | after |
| --- | --- | --- |
| llama 1B Q3_K - Medium | 253.58 ± 2.38 | 287.41 ± 5.71 |
| qwen35 9B Q3_K - Medium | 56.70 ± 0.12 | 57.58 ± 0.06 |
| llama 1B Q6_K | 235.47 ± 1.41 | 250.71 ± 2.18 |
| qwen35 9B Q6_K | 47.37 ± 0.03 | 45.72 ± 0.03 |

For the qwen35 9B Q6_K "after" number, I tried multiple reruns and got the same results.

@rillomas helped point me to the original issue (#17628), which got MMVQ disabled because of the A770. I think if we re-enable MMVQ for Intel Windows, it should be for Xe2+ only.

I'll try Xe3 very soon.

@virajwad
Contributor

@TheBlueMatt @rillomas
Here's Xe3 Win:

llama-bench.exe -p 0 -r 3 -n 128

Xe3 PTL (tg128, t/s):

| model | before | after |
| --- | --- | --- |
| llama 1B Q3_K - Medium | 115.28 ± 1.41 | 122.32 ± 3.02 |
| qwen35 9B Q3_K - Medium | 21.40 ± 0.42 | 21.90 ± 0.46 |
| llama 1B Q6_K | 87.98 ± 2.01 | 90.46 ± 0.27 |
| qwen35 9B Q6_K | 14.93 ± 0.08 | 14.86 ± 0.18 |

@virajwad
Contributor

virajwad commented May 22, 2026

Personally I think the gains from the data look good. I'm ok with re-enabling MMVQ for Q3_K and Q6_K for Intel Xe2+ specifically (Win + Ubuntu). I think @rillomas will be back Monday if he has any feedback :)

Thanks @TheBlueMatt for your great efforts!!

Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated
Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even
though they're only 2-byte aligned, and Q3_K still wins on
NVIDIA as well.

mesa isn't all that great at coalescing back-to-back loads from
alternating arrays, so we force it instead. Further, we can do
subtraction directly on a full int32_t rather than an i8vec4
with bit twiddling because the high bit is always free to start.

On Intel BMG on mesa, the switch to MMVQ provides an immediate
~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and
~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

The further switch to block loads leads to a ~24% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA
override for K quants on Xe2 as well.
@TheBlueMatt force-pushed the 2026-05-intel-q3-q6-mmvq branch from 35a592d to 642aeaf on May 28, 2026 07:05
@TheBlueMatt
Contributor Author

Okay, disabled the Q3K block on AMD and switched to accepting Q6K for all Intel/Linux. Also moved to accepting Q2K/Q3K/Q6K on Intel Windows. Thanks for all the benchmarking. @virajwad note that I just assumed here that Q2K is also faster on Intel Windows like it is for mesa, happy to change if not, though Q2K isn't all that popular anyway.

@0cc4m
Contributor

0cc4m commented Jun 1, 2026

@ggml-org/maintainers Another approval needed.

@0cc4m merged commit 1962000 into ggml-org:master on Jun 1, 2026
30 checks passed
@LuxKeiwoker

Does Alchemist, e.g. an A770, also profit from this change?

@0cc4m
Contributor

0cc4m commented Jun 1, 2026

Yes

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 1, 2026
@LuxKeiwoker

[image]

Just did a benchmark. It doesn't seem to have any effect on performance on an A770 (Win 11 VM with the latest Arc drivers installed).

@virajwad
Contributor

virajwad commented Jun 1, 2026

@LuxKeiwoker It's likely due to the `if (device->architecture == vk_device_architecture::INTEL_XE2) {` check. If A770 (Xe1 or "Xe") perf also improves with this change, we can enable it for that too, but we probably need to create a code flag like `INTEL_XE1` similar to the existing `INTEL_XE2` flag, because I don't know if we would see perf upside on any older platforms than Xe for this PR.

@virajwad
Contributor

virajwad commented Jun 1, 2026

Hi @0cc4m Is your A770 on Ubuntu / Linux, or Windows?

@0cc4m
Contributor

0cc4m commented Jun 2, 2026

> Hi @0cc4m Is your A770 on Ubuntu / Linux, or Windows?

Linux.

> It's likely due to the `if (device->architecture == vk_device_architecture::INTEL_XE2) {` check. If A770 (Xe1 or "Xe") perf also improves with this change, we can enable it for that too, but we probably need to create a code flag like `INTEL_XE1` similar to the existing `INTEL_XE2` flag, because I don't know if we would see perf upside on any older platforms than Xe for this PR.

No, Q3_K and Q6_K will also go through the MMVQ path on Linux now, as long as k > 2048. I don't use Windows, so I can't tune for or test that. There Xe1 is excluded, yes.
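
The behaviour described across this thread can be summarized as a small decision function. This is a deliberately simplified model, not the real `ggml_vk_should_use_mmvq`: every type and name below is a hypothetical stand-in, only the cases discussed in this conversation are modeled, and everything else is collapsed to `false` even though the real function applies further heuristics. The Windows and `k > 2048` branches follow the comments above rather than the source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the real ggml-vulkan types.
enum class Vendor    { Nvidia, Intel, Other };
enum class IntelArch { Xe1, Xe2, NotIntel };
enum class Os        { Linux, Windows };
enum class Quant     { Q2_K, Q3_K, Q6_K, Other };

struct DeviceSketch {
    Vendor    vendor;
    IntelArch arch;
    Os        os;
};

inline bool is_q2_q3_q6(Quant q) {
    return q == Quant::Q2_K || q == Quant::Q3_K || q == Quant::Q6_K;
}

// Simplified model of the cases discussed in this thread only.
inline bool should_use_mmvq_sketch(const DeviceSketch& dev, Quant q, std::uint32_t k) {
    switch (dev.vendor) {
    case Vendor::Nvidia:
        // Per the diff in this thread: Q2_K and Q3_K (plus IQ1_S/IQ1_M,
        // not modeled here) are forced onto MMVQ.
        return q == Quant::Q2_K || q == Quant::Q3_K;
    case Vendor::Intel:
        // Xe2 takes MMVQ for Q2_K/Q3_K/Q6_K regardless of k.
        if (dev.arch == IntelArch::Xe2 && is_q2_q3_q6(q)) {
            return true;
        }
        // Assumption: older Intel parts stay off MMVQ on Windows.
        if (dev.os == Os::Windows) {
            return false;
        }
        // Per the comment above: on Linux, Q3_K/Q6_K use MMVQ when k > 2048.
        return k > 2048 && (q == Quant::Q3_K || q == Quant::Q6_K);
    default:
        return false;
    }
}
```

Under this model an A770 (Xe1) on Windows never takes the MMVQ path for these quants, which matches the benchmark report above, while the same card on Linux does for large k.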

turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
…l-org#23056)

@LuxKeiwoker

> There Xe1 is excluded, yes.

I'm on Windows, so that's probably why it doesn't have any effect. Is there a way to enable it under Windows?

@0cc4m
Contributor

0cc4m commented Jun 2, 2026

You can just take out the Xe2 limit, e.g. like this:

diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index e7d04634b..bff53dbf3 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -8432,10 +8432,8 @@ static bool ggml_vk_should_use_mmvq(const vk_device& device, uint32_t m, uint32_
             return true;
         }
     case VK_VENDOR_ID_INTEL:
-        if (device->architecture == vk_device_architecture::INTEL_XE2) {
-            if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_Q6_K) {
-                return true;
-            }
+        if (src0_type == GGML_TYPE_Q2_K || src0_type == GGML_TYPE_Q3_K || src0_type == GGML_TYPE_Q6_K) {
+            return true;
         }
 
         if (device->driver_id == vk::DriverId::eIntelProprietaryWindows) {

That would get Xe1 to run with the same configuration as Xe2 on Windows.
