
Conversation

@PeterEngler

@PeterEngler PeterEngler commented Dec 30, 2025

Optimized the runtime of the functions ggml_gemv_q4_K_8x8_q8_K() and ggml_gemm_q4_K_8x8_q8_K() in the x86 CPU backend.

For the performance improvement, see:
[screenshot: llama-bench results]

Perplexity:
[screenshot: perplexity run]

test-backend-ops:
[screenshot: test-backend-ops run]
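For readers less familiar with these kernels: the hot path they accelerate is a Q4 × Q8 integer dot product. Below is a minimal, illustrative AVX2 sketch of that pattern, not the PR's actual code; the Q4_K super-block structure (scales, mins, the 8x8 repacked tiles) and the exact nibble ordering are simplified away.

```c
#include <immintrin.h>
#include <stdint.h>

// Dot 64 unsigned 4-bit weights (packed two per byte in 32 bytes) with 64
// signed 8-bit activations; returns eight 32-bit partial sums in one register.
// The nibble ordering (low nibbles = weights 0..31) is an assumption for this sketch.
static inline __m256i q4_q8_dot_avx2(const uint8_t *q4, const int8_t *q8) {
    const __m256i low_mask = _mm256_set1_epi8(0x0F);
    const __m256i ones16   = _mm256_set1_epi16(1);

    const __m256i packed = _mm256_loadu_si256((const __m256i *) q4);
    const __m256i w_lo = _mm256_and_si256(packed, low_mask);                        // weights  0..31
    const __m256i w_hi = _mm256_and_si256(_mm256_srli_epi16(packed, 4), low_mask);  // weights 32..63

    const __m256i a_lo = _mm256_loadu_si256((const __m256i *) q8);
    const __m256i a_hi = _mm256_loadu_si256((const __m256i *)(q8 + 32));

    // u8 x s8 -> adjacent-pair s16 sums, then s16 pairs -> s32 lanes.
    const __m256i p_lo = _mm256_madd_epi16(_mm256_maddubs_epi16(w_lo, a_lo), ones16);
    const __m256i p_hi = _mm256_madd_epi16(_mm256_maddubs_epi16(w_hi, a_hi), ones16);
    return _mm256_add_epi32(p_lo, p_hi);
}
```

Gains in this kind of kernel typically come from keeping more of these accumulations in registers and spending fewer shuffles and loads per block, which is what the 8x8 repacked layout is there to enable.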

@taronaeo
Collaborator

Can you provide some benchmark differences between upstream/master and your PR?

@PeterEngler
Author

Regarding the benchmarks: I added some screenshots to my first post.

@PeterEngler
Author

What can I do to fix this? It seems to be a server problem.

[screenshot of the failing CI job]

@taronaeo
Collaborator

taronaeo commented Jan 3, 2026

> What can I do to fix this? It seems to be a server problem.
> [screenshot of the failing CI job]

I'll re-trigger the CI again and see if it's an ephemeral problem.

@taronaeo
Collaborator

taronaeo commented Jan 3, 2026

I did a test on your PR and there seems to be a regression.

upstream/master

| model | size | params | backend | threads | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | 128 | pp512 | 24.76 ± 0.07 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | 128 | tg128 | 9.18 ± 0.07 |

pr/18495

| model | size | params | backend | threads | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | 128 | pp512 | 23.91 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 8 | 128 | tg128 | 8.42 ± 0.03 |

Speedup Diff

| Model | Test | t/s master | t/s 18495 | Speedup |
| --- | --- | --- | --- | --- |
| llama 7B Q4_K_M | pp512 | 24.76 | 23.91 | 0.97 |
| llama 7B Q4_K_M | tg128 | 9.18 | 8.42 | 0.92 |
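For reference, these numbers should correspond to a llama-bench run along these lines (the model path is a placeholder):

```sh
# 8 threads, batch size 128, 512-token prompt processing, 128-token generation
./llama-bench -m ./llama-7b-q4_k_m.gguf -t 8 -b 128 -p 512 -n 128
```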

@taronaeo
Collaborator

taronaeo commented Jan 3, 2026

Likewise, using a command similar to the one provided in the PR description:

upstream/master

| model | size | params | backend | threads | cpu_mask | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 2 | 0x5 | 128 | pp512 | 6.45 ± 0.00 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 2 | 0x5 | 128 | tg128 | 4.44 ± 0.05 |

pr/18495

| model | size | params | backend | threads | cpu_mask | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 2 | 0x5 | 128 | pp512 | 6.17 ± 0.01 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | CPU | 2 | 0x5 | 128 | tg128 | 4.22 ± 0.00 |

Diff

| Model | Test | t/s master | t/s 18495 | Speedup |
| --- | --- | --- | --- | --- |
| llama 7B Q4_K_M | pp512 | 6.45 | 6.17 | 0.96 |
| llama 7B Q4_K_M | tg128 | 4.44 | 4.22 | 0.95 |
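If I read the cpu_mask column right, that corresponds to something like the following (placeholder model path; `--cpu-mask` is the llama-bench option that produces that column):

```sh
# 2 threads pinned via mask 0x5 (cores 0 and 2, assuming bit i maps to logical CPU i)
./llama-bench -m ./llama-7b-q4_k_m.gguf -t 2 --cpu-mask 0x5 -b 128 -p 512 -n 128
```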

@PeterEngler
Author

PeterEngler commented Jan 4, 2026

I did the optimization for the AVX2 instruction set. The reason you don't see better performance could be that llama-bench automatically picks e.g. AVX512 instead of AVX2, and I didn't optimize the AVX512 path.

Second, it appears to me that you ran upstream/master first and pr/18495 afterwards.
The decrease in performance you see would then be caused by the CPU's thermal throttling after the first run.

And third, I am well aware that this AVX2 optimization isn't of great value on its own, because AVX2 is not used that often, but I have seen that similar optimizations could be made to the AVX512 code. And I am quite sure that there is also optimization potential in other backends such as CUDA or ARM.
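A quick way to check which instruction sets the CPU backend actually reports is to query ggml's feature flags (a minimal sketch; ggml_cpu_has_avx2() and ggml_cpu_has_avx512() are existing query functions, though the header that declares them differs between older and newer trees):

```c
#include <stdio.h>
#include "ggml-cpu.h"   // feature-query declarations; older trees expose them via ggml.h

int main(void) {
    // Each query returns non-zero if the CPU backend was built with,
    // and the host CPU supports, the corresponding instruction set.
    printf("AVX2   : %d\n", ggml_cpu_has_avx2());
    printf("AVX512 : %d\n", ggml_cpu_has_avx512());
    return 0;
}
```

llama-cli also prints these flags in its system_info line at startup, which may be the easier way to confirm which path the benchmark builds end up using.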

Collaborator

@Alcpz Alcpz left a comment


👋 I've checked the PR out of curiosity, as I've been working on optimizing q4_K for ARM and wanted to see if there were changes that could benefit the other arch.

In previous PRs I submitted, I found it a bit challenging to verify that changes are correct, because perplexity doesn't exercise GEMV, only GEMM (#17494 (comment)), and test-backend-ops doesn't traverse the REPACK code path (see also how you tested the CUDA backend, not the CPU backend). There was work to improve that, but I think it's still in progress.

While the PR looks good, I would run some form of text generation to make sure no new issues were introduced.

Take my comments with a grain of salt; your performance improvement is quite good!
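One quick way to do that check (a sketch; model path, prompt and seed are arbitrary) is to run greedy generation on both builds and diff the outputs:

```sh
# --temp 0 forces greedy sampling, so master and the PR build should produce identical text
./llama-cli -m ./llama-7b-q4_k_m.gguf -t 8 -s 42 --temp 0 -n 128 \
    -p "Write a short story about a robot." > out_pr.txt
```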

Collaborator


Out of curiosity: kmask_3 was added above while kmask3 still uses the original mask. Was there any reason not to extend the optimizations from the block above down here?

NIT: if it's intended, I'd declare kmask3 closer to where it is used.

Author


Yes, the reason was that this loop (from line 3065 to 3393) is not executed often enough to contribute to the performance improvement.
But you are right, I should do it the same way as above.

kmask3 is also used in the AVX512 implementation, and I am not able to test AVX512 on my CPU, so I think the best option is to rename kmask_3 to kmask4 (to keep the naming scheme consistent).

uint32_t utmp_0[4], utmp_1[4];

// Scales and Mins of corresponding sub blocks from different Q4_K structures are stored together
// Scales and Mins of corresponding sub blocks from different Q8_K structures are stored together
Collaborator


Q8_K should be Q4_K, I think.

Author


That's correct.

Collaborator


My bad then!

Author


No, I meant you were correct ;-)

Collaborator


Ah... my bad then :)

@PeterEngler
Author

In test-backend-ops I compared CUDA to the CPU backend, so I thought my changes would be checked against CUDA.
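For context, a hedged example of that kind of invocation (the -b and -o filters exist in test-backend-ops; the exact backend name can vary per machine):

```sh
# Runs MUL_MAT tests on the CUDA backend, using the CPU backend's results as the reference.
./test-backend-ops test -b CUDA0 -o MUL_MAT
```

Note that, as mentioned above, this still may not exercise the REPACK code path on the CPU side.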
