
Conversation

@PeterEngler

@PeterEngler PeterEngler commented Dec 30, 2025

Optimized the runtime of the functions ggml_gemv_q4_K_8x8_q8_K() and ggml_gemm_q4_K_8x8_q8_K() in the x86 CPU backend.

For the performance improvement, see:
[screenshot: llama-bench results]

Perplexity:
[screenshot: perplexity run]

test-backend-ops:
[screenshot: test-backend-ops run]
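For readers less familiar with these kernels: the hot path they accelerate is a Q4 × Q8 integer dot product. Below is a minimal, illustrative AVX2 sketch of that pattern, not the PR's actual code; the Q4_K super-block structure (scales, mins, the 8x8 repacked tiles) and the exact nibble ordering are simplified away.

```c
#include <immintrin.h>
#include <stdint.h>

// Dot 64 unsigned 4-bit weights (packed two per byte in 32 bytes) with 64
// signed 8-bit activations; returns eight 32-bit partial sums in one register.
// The nibble ordering (low nibbles = weights 0..31) is an assumption for this sketch.
static inline __m256i q4_q8_dot_avx2(const uint8_t *q4, const int8_t *q8) {
    const __m256i low_mask = _mm256_set1_epi8(0x0F);
    const __m256i ones16   = _mm256_set1_epi16(1);

    const __m256i packed = _mm256_loadu_si256((const __m256i *) q4);
    const __m256i w_lo = _mm256_and_si256(packed, low_mask);                        // weights  0..31
    const __m256i w_hi = _mm256_and_si256(_mm256_srli_epi16(packed, 4), low_mask);  // weights 32..63

    const __m256i a_lo = _mm256_loadu_si256((const __m256i *) q8);
    const __m256i a_hi = _mm256_loadu_si256((const __m256i *)(q8 + 32));

    // u8 x s8 -> adjacent-pair s16 sums, then s16 pairs -> s32 lanes.
    const __m256i p_lo = _mm256_madd_epi16(_mm256_maddubs_epi16(w_lo, a_lo), ones16);
    const __m256i p_hi = _mm256_madd_epi16(_mm256_maddubs_epi16(w_hi, a_hi), ones16);
    return _mm256_add_epi32(p_lo, p_hi);
}
```

Gains in this kind of kernel typically come from keeping more of these accumulations in registers and spending fewer shuffles and loads per block, which is what the 8x8 repacked layout is there to enable.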

@taronaeo
Collaborator

Can you provide some benchmark differences between upstream/master and your PR?

@PeterEngler
Author

Regarding the benchmarks: I added some screenshots to my first post.

@PeterEngler
Author

What can I do to fix this? It seems to be a server problem.

[screenshot of the failing CI job]

@taronaeo
Collaborator

taronaeo commented Jan 3, 2026

> What can I do to fix this? It seems to be a server problem.
> [screenshot of the failing CI job]

I'll re-trigger the CI again and see if it's an ephemeral problem.

@taronaeo
Collaborator

taronaeo commented Jan 3, 2026

I did a test on your PR and there seems to be a regression.

upstream/master

| model | size | params | backend | threads | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | 128 | pp512 | 24.76 ± 0.07 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | 128 | tg128 | 9.18 ± 0.07 |

pr/18495

| model | size | params | backend | threads | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | 128 | pp512 | 23.91 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | 128 | tg128 | 8.42 ± 0.03 |

Speedup Diff

| Model | Test | t/s master | t/s 18495 | Speedup |
| --- | --- | --- | --- | --- |
| llama 7B Q4_K_M | pp512 | 24.76 | 23.91 | 0.97 |
| llama 7B Q4_K_M | tg128 | 9.18 | 8.42 | 0.92 |
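For reference, these numbers should correspond to a llama-bench run along these lines (the model path is a placeholder):

```sh
# 8 threads, batch size 128, 512-token prompt processing, 128-token generation
./llama-bench -m ./llama-7b-q4_k_m.gguf -t 8 -b 128 -p 512 -n 128
```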

@taronaeo
Collaborator

taronaeo commented Jan 3, 2026

Likewise, using a command similar to the one provided in the PR description:

upstream/master

| model | size | params | backend | threads | cpu_mask | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 2 | 0x5 | 128 | pp512 | 6.45 ± 0.00 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 2 | 0x5 | 128 | tg128 | 4.44 ± 0.05 |

pr/18495

| model | size | params | backend | threads | cpu_mask | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 2 | 0x5 | 128 | pp512 | 6.17 ± 0.01 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 2 | 0x5 | 128 | tg128 | 4.22 ± 0.00 |

Diff

| Model | Test | t/s master | t/s 18495 | Speedup |
| --- | --- | --- | --- | --- |
| llama 7B Q4_K_M | pp512 | 6.45 | 6.17 | 0.96 |
| llama 7B Q4_K_M | tg128 | 4.44 | 4.22 | 0.95 |
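If I read the cpu_mask column right, that corresponds to something like the following (placeholder model path; `--cpu-mask` is the llama-bench option that produces that column):

```sh
# 2 threads pinned via mask 0x5 (cores 0 and 2, assuming bit i maps to logical CPU i)
./llama-bench -m ./llama-7b-q4_k_m.gguf -t 2 --cpu-mask 0x5 -b 128 -p 512 -n 128
```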

@PeterEngler
Author

PeterEngler commented Jan 4, 2026

I did the optimization for the AVX2 instruction set. The reason you don't see better performance could be that llama-bench automatically picks e.g. AVX512 instead of AVX2, and I didn't optimize the AVX512 path.

Second, it appears to me that you ran upstream/master first and pr/18495 afterwards.
The decrease in performance you see would then be caused by the CPU's thermal throttling after the first run.

And third, I am well aware that this AVX2 optimization isn't of great value on its own, because AVX2 is not used that often, but I have seen that similar optimizations could be made to the AVX512 code. And I am quite sure that there is also optimization potential in other backends such as CUDA or ARM.
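A quick way to check which instruction sets the CPU backend actually reports is to query ggml's feature flags (a minimal sketch; ggml_cpu_has_avx2() and ggml_cpu_has_avx512() are existing query functions, though the header that declares them differs between older and newer trees):

```c
#include <stdio.h>
#include "ggml-cpu.h"   // feature-query declarations; older trees expose them via ggml.h

int main(void) {
    // Each query returns non-zero if the CPU backend was built with,
    // and the host CPU supports, the corresponding instruction set.
    printf("AVX2   : %d\n", ggml_cpu_has_avx2());
    printf("AVX512 : %d\n", ggml_cpu_has_avx512());
    return 0;
}
```

llama-cli also prints these flags in its system_info line at startup, which may be the easier way to confirm which path the benchmark builds end up using.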

Collaborator

@Alcpz Alcpz left a comment


👋 I've checked the PR out of curiosity, as I've been working on optimizing q4_K for ARM and wanted to see if there were changes that could benefit the other arch.

In previous PRs I submitted, I found it a bit challenging to verify that changes are correct, because perplexity doesn't exercise GEMV, only GEMM (#17494 (comment)), and test-backend-ops doesn't traverse the REPACK code path (see also how you tested the CUDA backend, not the CPU backend). There was work to improve that, but I think it's still in progress.

While the PR looks good, I would run some form of text generation to make sure no new issues were introduced.

Take my comments with a grain of salt; your performance improvement is quite good!
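One quick way to do that check (a sketch; model path, prompt and seed are arbitrary) is to run greedy generation on both builds and diff the outputs:

```sh
# --temp 0 forces greedy sampling, so master and the PR build should produce identical text
./llama-cli -m ./llama-7b-q4_k_m.gguf -t 8 -s 42 --temp 0 -n 128 \
    -p "Write a short story about a robot." > out_pr.txt
```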

Collaborator


Out of curiosity: kmask_3 was added above while kmask3 still uses the original mask. Was there any reason not to extend the optimizations from the block above down here?

NIT: if it's intended, I'd declare kmask3 closer to where it is used.

Author


Yes, the reason was that this loop (from line 3065 to 3393) is not executed often enough to contribute to the performance improvement.
But you are right, I should do it the same way as above.

kmask3 is also used in the AVX512 implementation, and I am not able to test AVX512 on my CPU, so I think the best option is to rename kmask_3 to kmask4 (to keep the naming scheme consistent).

uint32_t utmp_0[4], utmp_1[4];

// Scales and Mins of corresponding sub blocks from different Q4_K structures are stored together
// Scales and Mins of corresponding sub blocks from different Q8_K structures are stored together
Collaborator


Q8_K should be Q4_K, I think.

Author


That's correct.

Collaborator


My bad then!

Author


No, I meant you were correct ;-)

Collaborator


Ah... my bad then :)

@PeterEngler
Author

In test-backend-ops I compared CUDA to the CPU backend, so I thought my changes would be checked against CUDA.
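For context, a hedged example of that kind of invocation (the -b and -o filters exist in test-backend-ops; the exact backend name can vary per machine):

```sh
# Runs MUL_MAT tests on the CUDA backend, using the CPU backend's results as the reference.
./test-backend-ops test -b CUDA0 -o MUL_MAT
```

Note that, as mentioned above, this still may not exercise the REPACK code path on the CPU side.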
