Vulkan Repack PoC by 0cc4m · Pull Request #21024 · ggml-org/llama.cpp

0cc4m · 2026-03-26T12:03:17Z

This is a basic PoC to see how much of a difference quant alignment makes across GPU vendors. It's not complete and performance is better in many cases, but not universally so. I'll post benchmarks later. I assume the alignment moved the deltas into other memory pages, which in some cases is worse than the previous unaligned state. I'll gather some more data and see if it can be improved.
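For readers unfamiliar with the layout problem, here is a minimal sketch of what repacking means for Q4_0. It relies on the ggml `block_q4_0` definition (an fp16 scale `d` followed by 16 bytes of packed 4-bit quants, 18 bytes per block for 32 weights). The split into a separate quants array and deltas array is an assumption for illustration; the PR's actual layout may differ, and `repack_q4_0` and `unaligned_quant_offsets` are hypothetical helper names.

```python
import struct

QK4_0 = 32                      # weights per Q4_0 block (ggml constant)
BLOCK_BYTES = 2 + QK4_0 // 2    # fp16 scale + 16 bytes of 4-bit quants = 18


def repack_q4_0(raw):
    """Split interleaved 18-byte Q4_0 blocks into (quants, deltas) byte arrays.

    Hypothetical layout for illustration only, not the PR's implementation.
    """
    if len(raw) % BLOCK_BYTES:
        raise ValueError("input is not a whole number of Q4_0 blocks")
    quants = bytearray()
    deltas = bytearray()
    for start in range(0, len(raw), BLOCK_BYTES):
        block = raw[start:start + BLOCK_BYTES]
        deltas += block[:2]     # fp16 scale of this block
        quants += block[2:]     # 16 quant bytes, now at a multiple-of-16 offset
    return bytes(quants), bytes(deltas)


def unaligned_quant_offsets(n_blocks, alignment=4):
    """Count blocks whose quants start off-alignment in the interleaved layout."""
    return sum(1 for i in range(n_blocks) if (i * BLOCK_BYTES + 2) % alignment)


# Two example blocks with scales 0.5 and 0.25 and dummy quant bytes.
raw = b"".join(struct.pack("<e", d) + bytes(range(16)) for d in (0.5, 0.25))
quants, deltas = repack_q4_0(raw)
print(len(quants), struct.unpack("<2e", deltas), unaligned_quant_offsets(4))
```

In the interleaved layout every other block's quants start at an offset that is not a multiple of 4, so a shader cannot use 32-bit loads uniformly. After the split, the quants are contiguous and aligned, but each block's scale now lives in a different region of memory, which is the trade-off the description above speculates about.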

Claude was used for assistance, but code was written by me.

savvadesogle · 2026-03-27T11:16:15Z

Hi, Ruben.
Is there a possibility that coopmat might work on the Intel Alchemist GPUs (A770)?

0cc4m · 2026-03-27T12:26:43Z

Off topic, but please look here: #20776

inforithmics · 2026-05-03T08:57:17Z

I was curious about the performance improvements:

ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 1 | warp size: 64 | shared

main:

model	size	params	backend	ngl	test	t/s
qwen35 2B Q4_0	1.12 GiB	1.88 B	Vulkan	99	pp512	1411.77 ± 9.45
qwen35 2B Q4_0	1.12 GiB	1.88 B	Vulkan	99	tg128	48.18 ± 1.44

pr:

model	size	params	backend	ngl	test	t/s
qwen35 2B Q4_0	1.12 GiB	1.88 B	Vulkan	99	pp512	1176.18 ± 1.39
qwen35 2B Q4_0	1.12 GiB	1.88 B	Vulkan	99	tg128	49.03 ± 0.10

(I updated the values to compare against the correct main branch build, with the same build tag as the pull request.)
So a slight improvement in tg and a reduction in pp (for this model and hardware).

inforithmics · 2026-05-15T13:17:39Z

main:

model	size	params	backend	ngl	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	pp512	511.76 ± 2.53
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	tg128	25.00 ± 0.01
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	pp512	325.63 ± 0.15
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	tg128	13.62 ± 0.01
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	pp512	2355.66 ± 5.87
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	tg128	70.40 ± 0.10

pr:

model	size	params	backend	ngl	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	pp512	492.51 ± 2.89
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	tg128	24.35 ± 0.02
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	pp512	218.67 ± 1.92
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	tg128	13.54 ± 0.15
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	pp512	2130.99 ± 5.78
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	tg128	69.36 ± 0.20

It seems that performance is now always slower after repacking for these quants and models on this hardware.

0cc4m · 2026-05-15T14:33:59Z

The current state on some of my systems.

RTX 3090

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	4753.45 ± 23.50	4735.27 ± 45.99	-0.4%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	146.73 ± 0.54	145.78 ± 0.50	-0.6%
llama 8B Q4_1	4.77 GiB	8.03 B	99	1	pp512	4397.53 ± 41.71	4457.75 ± 6.10	+1.4%
llama 8B Q4_1	4.77 GiB	8.03 B	99	1	tg128	138.34 ± 0.25	136.14 ± 0.61	-1.6%
llama 8B Q8_0	7.95 GiB	8.03 B	99	1	pp512	4830.91 ± 17.72	2178.92 ± 2.51	-54.9%
llama 8B Q8_0	7.95 GiB	8.03 B	99	1	tg128	95.05 ± 0.04	95.18 ± 0.06	+0.1%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	4616.91 ± 47.95	5286.86 ± 87.38	+14.5%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	181.32 ± 0.65	195.27 ± 1.24	+7.7%

AMD RX 9070 XT

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	4922.26 ± 105.86	4122.39 ± 15.29	-16.3%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	115.98 ± 0.03	117.38 ± 0.12	+1.2%
llama 8B Q4_1	4.77 GiB	8.03 B	99	1	pp512	4956.14 ± 105.61	3982.23 ± 22.04	-19.7%
llama 8B Q4_1	4.77 GiB	8.03 B	99	1	tg128	108.52 ± 0.05	108.07 ± 0.05	-0.4%
llama 8B Q8_0	7.95 GiB	8.03 B	99	1	pp512	4409.39 ± 74.03	3848.66 ± 23.74	-12.7%
llama 8B Q8_0	7.95 GiB	8.03 B	99	1	tg128	71.17 ± 0.00	71.44 ± 0.02	+0.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	4545.10 ± 38.54	6299.91 ± 147.95	+38.6%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	177.89 ± 0.10	202.37 ± 0.35	+13.8%

AMD Radeon Pro VII

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	838.02 ± 4.51	860.31 ± 0.99	+2.7%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	102.59 ± 0.36	106.93 ± 0.55	+4.2%
llama 8B Q4_1	4.77 GiB	8.03 B	99	1	pp512	823.77 ± 1.02	825.67 ± 0.74	+0.2%
llama 8B Q4_1	4.77 GiB	8.03 B	99	1	tg128	106.33 ± 0.24	87.98 ± 0.12	-17.3%
llama 8B Q8_0	7.95 GiB	8.03 B	99	1	pp512	748.96 ± 0.96	760.86 ± 0.52	+1.6%
llama 8B Q8_0	7.95 GiB	8.03 B	99	1	tg128	73.47 ± 0.05	64.73 ± 0.05	-11.9%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	1331.76 ± 13.33	1792.74 ± 1.84	+34.6%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	137.98 ± 0.10	137.84 ± 0.15	-0.1%

Intel A770

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	1263.32 ± 1.24	1394.83 ± 2.20	+10.4%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	38.03 ± 0.02	38.01 ± 0.04	-0.1%
llama 8B Q4_1	4.77 GiB	8.03 B	99	1	pp512	1291.10 ± 1.90	1326.23 ± 3.00	+2.7%
llama 8B Q4_1	4.77 GiB	8.03 B	99	1	tg128	47.15 ± 0.04	44.61 ± 0.03	-5.4%
llama 8B Q8_0	7.95 GiB	8.03 B	99	1	pp512	720.82 ± 0.43	709.86 ± 0.39	-1.5%
llama 8B Q8_0	7.95 GiB	8.03 B	99	1	tg128	34.67 ± 0.03	34.37 ± 0.03	-0.9%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	pp512	1170.56 ± 7.70	1584.87 ± 4.62	+35.4%
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	99	1	tg128	41.36 ± 0.01	45.55 ± 0.03	+10.1%
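Some of the differences in these tables are smaller than the run-to-run spread, so a rough noise check can help when reading them. The sketch below assumes the `±` values are standard deviations over repetitions and combines them in quadrature; that is a simple heuristic chosen for illustration, not something the benchmark tool reports, and `change_vs_noise` is a hypothetical helper.

```python
import math


def change_vs_noise(before, before_sd, after, after_sd):
    """Return (percent change, |difference| in units of the combined std dev)."""
    diff = after - before
    combined_sd = math.hypot(before_sd, after_sd)  # sqrt(sd1^2 + sd2^2)
    return 100.0 * diff / before, abs(diff) / combined_sd


# RTX 3090, llama 8B Q4_0 pp512: small change, well inside the spread.
pct_small, z_small = change_vs_noise(4753.45, 23.50, 4735.27, 45.99)

# RTX 3090, llama 8B Q8_0 pp512: large regression, far outside the spread.
pct_large, z_large = change_vs_noise(4830.91, 17.72, 2178.92, 2.51)

print(f"{pct_small:+.1f}% at {z_small:.1f} sd, {pct_large:+.1f}% at {z_large:.0f} sd")
```

A ratio below roughly 1 to 2 suggests the difference is indistinguishable from noise at this repetition count, while the Q8_0 pp512 drop is clearly real.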

0cc4m · 2026-05-15T14:34:31Z

@inforithmics I think the commit you tested was broken, so your result might not be valid. Not sure how it even worked for you, I just got a segfault.

inforithmics · 2026-05-15T15:32:07Z

Strange. I ran the benchmarks again (with the updated PR) and the results are similar, but I ran them on Windows.

I also ran the same benchmarks on Windows on a Radeon VII, and there were some improvements.

data

main:

model	size	params	backend	ngl	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	pp512	241.74 ± 2.66
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	tg128	63.31 ± 0.03
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	pp512	173.00 ± 0.27
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	tg128	43.27 ± 0.04
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	pp512	1489.29 ± 0.87
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	tg128	140.49 ± 0.21

pr:

model	size	params	backend	ngl	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	pp512	245.68 ± 2.78
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	tg128	69.17 ± 0.04
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	pp512	172.65 ± 0.50
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	tg128	43.62 ± 0.04
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	pp512	1504.59 ± 2.50
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	tg128	152.94 ± 0.31

model	test	t/s (before)	t/s (after)	diff	diff %
gpt-oss 20B MXFP4 MoE	pp512	241.7	245.7	3.9	1.6%
gpt-oss 20B MXFP4 MoE	tg128	63.3	69.2	5.9	9.3%
qwen35 9B Q4_0	pp512	173.0	172.7	-0.3	-0.2%
qwen35 9B Q4_0	tg128	43.3	43.6	0.3	0.8%
qwen35 0.8B Q8_0	pp512	1489.3	1504.6	15.3	1.0%
qwen35 0.8B Q8_0	tg128	140.5	152.9	12.5	8.9%

data

main:

model	size	params	backend	ngl	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	pp512	492.18 ± 21.47
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	tg128	24.42 ± 0.91
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	pp512	318.52 ± 7.66
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	tg128	13.63 ± 0.01
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	pp512	2356.06 ± 7.86
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	tg128	69.27 ± 0.15

pr:

model	size	params	backend	ngl	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	pp512	495.85 ± 3.06
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	tg128	24.42 ± 0.02
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	pp512	219.02 ± 0.50
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	tg128	13.73 ± 0.01
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	pp512	2173.36 ± 6.31
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	tg128	69.12 ± 0.04

model	test	t/s (before)	t/s (after)	diff	diff %
gpt-oss 20B MXFP4 MoE	pp512	492.2	495.9	3.7	0.7%
gpt-oss 20B MXFP4 MoE	tg128	24.4	24.4	0.0	0.0%
qwen35 9B Q4_0	pp512	318.5	219.0	-99.5	-31.2%
qwen35 9B Q4_0	tg128	13.6	13.7	0.1	0.7%
qwen35 0.8B Q8_0	pp512	2356.1	2173.4	-182.7	-7.8%
qwen35 0.8B Q8_0	tg128	69.3	69.1	-0.1	-0.2%

I formatted and updated the data for the 780M; on this chip the PR sometimes reduces pp.

I saw that the other results were with mmap 0, so I reran the tests with mmap off.

data

main:

model	size	params	backend	ngl	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	pp512	512.40 ± 3.71
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	tg128	25.19 ± 0.02
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	pp512	325.36 ± 1.13
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	tg128	13.72 ± 0.01
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	pp512	2378.04 ± 4.92
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	tg128	70.48 ± 0.08

pr:

model	size	params	backend	ngl	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	pp512	605.82 ± 0.76
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	tg128	25.20 ± 0.02
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	pp512	220.96 ± 0.54
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	tg128	13.82 ± 0.01
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	pp512	2187.84 ± 4.04
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	tg128	69.42 ± 0.16

model	test	t/s (before)	t/s (after)	diff	diff %
gpt-oss 20B MXFP4 MoE	pp512	512.4	605.8	93.4	18.2%
gpt-oss 20B MXFP4 MoE	tg128	25.2	25.2	0.0	0.0%
qwen35 9B Q4_0	pp512	325.4	221.0	-104.4	-32.1%
qwen35 9B Q4_0	tg128	13.7	13.8	0.1	0.7%
qwen35 0.8B Q8_0	pp512	2378.0	2187.8	-190.2	-8.0%
qwen35 0.8B Q8_0	tg128	70.5	69.4	-1.1	-1.5%

Then I tested again with mmap off and flash attention on.

data

main:

model	size	params	backend	ngl	fa	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	1	pp512	580.59 ± 3.63
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	1	tg128	25.54 ± 0.06
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	1	pp512	329.59 ± 0.57
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	1	tg128	13.69 ± 0.01
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	1	pp512	2421.20 ± 11.48
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	1	tg128	71.01 ± 0.48

pr:

model	size	params	backend	ngl	fa	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	1	pp512	698.99 ± 0.90
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	Vulkan	99	1	tg128	25.53 ± 0.03
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	1	pp512	222.43 ± 0.62
qwen35 9B Q4_0	5.00 GiB	8.95 B	Vulkan	99	1	tg128	13.79 ± 0.01
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	1	pp512	2232.75 ± 2.68
qwen35 0.8B Q8_0	763.78 MiB	752.39 M	Vulkan	99	1	tg128	69.58 ± 0.26

model	test	t/s (before)	t/s (after)	diff	diff %
gpt-oss 20B MXFP4 MoE	pp512	580.6	699.0	118.4	20.4%
gpt-oss 20B MXFP4 MoE	tg128	25.5	25.5	0.0	0.0%
qwen35 9B Q4_0	pp512	329.6	222.4	-107.2	-32.5%
qwen35 9B Q4_0	tg128	13.7	13.8	0.1	0.7%
qwen35 0.8B Q8_0	pp512	2421.2	2232.8	-188.5	-7.8%
qwen35 0.8B Q8_0	tg128	71.0	69.6	-1.4	-2.0%
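The before/after summaries above can be derived mechanically from the two raw tables. The following is a minimal sketch, assuming tab-separated rows where the last column is `mean ± sd` and the second-to-last column is the test name; `parse_rows` and `diff_table` are hypothetical helpers, not part of llama.cpp.

```python
def parse_rows(table):
    """Map (model, test) -> mean t/s from tab-separated benchmark rows."""
    rows = {}
    for line in table.strip().splitlines():
        cols = line.split("\t")
        if len(cols) < 3 or cols[0] == "model":   # skip header or blank lines
            continue
        rows[(cols[0], cols[-2])] = float(cols[-1].split("±")[0])
    return rows


def diff_table(main, pr):
    """Join two tables on (model, test) and compute absolute and percent change."""
    before, after = parse_rows(main), parse_rows(pr)
    out = []
    for key, b in before.items():
        if key in after:
            a = after[key]
            out.append((key[0], key[1], b, a, a - b, 100.0 * (a - b) / b))
    return out


HEADER = "model\tsize\tparams\tbackend\tngl\tfa\ttest\tt/s\n"
MAIN = HEADER + (
    "gpt-oss 20B MXFP4 MoE\t11.27 GiB\t20.91 B\tVulkan\t99\t1\tpp512\t580.59 ± 3.63\n"
    "qwen35 9B Q4_0\t5.00 GiB\t8.95 B\tVulkan\t99\t1\tpp512\t329.59 ± 0.57\n"
)
PR = HEADER + (
    "gpt-oss 20B MXFP4 MoE\t11.27 GiB\t20.91 B\tVulkan\t99\t1\tpp512\t698.99 ± 0.90\n"
    "qwen35 9B Q4_0\t5.00 GiB\t8.95 B\tVulkan\t99\t1\tpp512\t222.43 ± 0.62\n"
)

table = diff_table(MAIN, PR)
for model, test, b, a, d, pct in table:
    print(f"{model}\t{test}\t{b:.1f}\t{a:.1f}\t{d:+.1f}\t{pct:+.1f}%")
```

Computing the percentage from the unrounded means, and rounding only for display, avoids small mismatches between the diff column and the rounded before/after columns.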

AndreNijman · 2026-06-10T14:00:30Z

Datapoint from a bandwidth-bound iGPU: AMD Radeon 780M (RDNA3/gfx1103, RADV PHOENIX, Mesa 26.0.6), Ryzen 7 PRO 250 laptop with single-channel DDR5-5600 (~44.8 GB/s pin rate), Linux 7.0.11. Device caps: uma: 1 | fp16: 1 | int dot: 1 | matrix cores: KHR_coopmat.

Model: gpt-oss-20b MXFP4 (ggml-org GGUF — experts MXFP4, attention + output Q8_0, so this exercises both the mxfp4 and q8_0 repack paths). -ngl 99 -fa 1, idle machine, interleaved A/B (master → PR → master → PR), -r 3.

test	master `49f3542` (round 1 / 2)	PR `938872e` (round 1 / 2)	Δ
pp512	314.0 / 313.5	309.7 / 310.6	−1.2%
tg64	14.06 / 14.10	14.36 / 14.39	+2.2%
tg128	14.03 / 14.08	14.38 / 14.40	+2.4%

Quality check: perplexity identical to 4 decimal places on both builds (9.8898 ± 0.4984, same corpus and chunk count). Greedy outputs diverge after a few hundred tokens (near-tie token flips from FP reordering), which matches the expectation that repack changes summation order but not quality.
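The summation-order effect is easy to reproduce outside the GPU. This toy sketch uses made-up values (not real logits) to show that regrouping the same floating-point terms changes the last bit of the result, and that this is enough to flip a greedy pick when two candidates are nearly tied. The `rival` value and the `argmax` helper are assumptions for illustration.

```python
terms = [0.1, 0.2, 0.3]

forward = (terms[0] + terms[1]) + terms[2]    # one accumulation order
backward = terms[0] + (terms[1] + terms[2])   # same terms, different grouping


def argmax(logits):
    """Index of the first maximal logit, as a greedy sampler would pick."""
    return max(range(len(logits)), key=logits.__getitem__)


rival = 0.6  # hypothetical competing token whose logit is a near tie

print(forward == backward)           # False: the sums differ in the last bit
print(argmax([rival, forward]))      # 1
print(argmax([rival, backward]))     # 0
```

Once a single token differs, the rest of a greedy generation follows a different path, while an averaged metric such as perplexity is essentially unaffected.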

Context for why +2.4% is meaningful here: decode on this setup is hard against the memory wall — per-token weight traffic is ~2.56 GB, and test-backend-ops shows the mul_mat_vec / mul_mat_id kernels already sustaining ~38–39 GB/s of the ~44.8 GB/s pin rate. So the gain reads as real bandwidth efficiency recovered, and it was consistent across rounds with tight stddev (±0.02–0.05 t/s).
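The arithmetic behind these figures can be checked in a few lines. This sketch assumes a 64-bit (8-byte) data path per DDR5 channel and treats weight reads as the only memory traffic during decode; both are simplifications, and the helper names are hypothetical.

```python
def ddr_peak_gbs(mega_transfers_per_s, bus_bytes=8, channels=1):
    """Theoretical pin rate in GB/s for a DDR interface."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


def decode_bandwidth_gbs(tokens_per_s, gb_per_token):
    """Bandwidth implied by decode speed if every token streams the weights once."""
    return tokens_per_s * gb_per_token


peak = ddr_peak_gbs(5600)                          # single-channel DDR5-5600
master_bw = decode_bandwidth_gbs(14.08, 2.56)      # master, tg128 round 2
pr_bw = decode_bandwidth_gbs(14.40, 2.56)          # PR, tg128 round 2

print(f"peak {peak:.1f} GB/s")
print(f"master {master_bw:.1f} GB/s ({100 * master_bw / peak:.0f}% of peak)")
print(f"PR     {pr_bw:.1f} GB/s ({100 * pr_bw / peak:.0f}% of peak)")
```

Under these assumptions end-to-end decode already uses roughly four fifths of the pin rate, so only a few percent of headroom is left for any kernel change to recover.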

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Mar 26, 2026

jeffbolznv mentioned this pull request May 11, 2026

vulkan: Pad Q3_K/Q6_K tensors out to 32-bit alignment #22951

Draft

TheBlueMatt mentioned this pull request May 12, 2026

vulkan : transpose A-matrix data layout for K-quant mul_mat performance #22970

Open

0cc4m added 8 commits May 15, 2026 11:20

vulkan: repack q4_0 into aligned arrays

c285bb9

add coopmat2 support

5c1e95c

fix double semicolon

b1243aa

add mxfp4 repacking

b4e2621

add missing repacking functions

b64f294

replace malloc/free with thread_local memory

6906f78

add q4_1, q8_0, iq4_nl repacking

57fb74f

deduplicate repacking code

13a55c8

0cc4m force-pushed the 0cc4m/vulkan-repack branch from df488da to 13a55c8 Compare May 15, 2026 11:38

0cc4m added 2 commits May 15, 2026 15:22

wider loads

ff6ad60

fix partial writes

938872e

0cc4m changed the title ~~Vulkan Q4_0 Repack PoC~~ Vulkan Repack PoC May 19, 2026

0cc4m wants to merge 10 commits into master from 0cc4m/vulkan-repack

4 participants
