ggml: optimized runtime for x86 cpu backend and Q4_K quantized weights paired with Q8_K activations #18495
base: master
Conversation
Can you provide some benchmark differences between

Regarding the benchmarks, I added some screenshots to my first post.
I did a test on your PR and there seems to be a regression.
[benchmark results table: speedup / diff]
Likewise, using a command similar to the one provided in the PR description:
[benchmark results table: diff]
I did the optimization for the AVX2 instruction set. The reason you cannot see better performance could be that llama-bench automatically chooses e.g. AVX512 instead of AVX2, and I didn't optimize AVX512. Second, it appears to me that you ran upstream/master first and pr/18495 after that. And third, I am well aware that this AVX2 optimization isn't of great value on its own, because AVX2 is not used often enough, but I have seen that similar optimizations could be done for the AVX512 instruction set. And I am quite sure that there is also optimization potential in other architectures like CUDA or ARM.
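To illustrate the point about instruction-set selection: the widest ISA enabled at build time typically wins, so an AVX512-capable build may never reach an AVX2-only code path. A minimal sketch of that general pattern, using hypothetical kernel names (this is not the actual ggml dispatch logic):

```c
// Minimal sketch (not the actual ggml code) of why an AVX512 build may never
// exercise an AVX2-only optimization: the widest instruction set available at
// compile time is usually selected via predefined feature macros.
// gemm_q4_K_avx512(), gemm_q4_K_avx2() and gemm_q4_K_scalar() are hypothetical
// names used purely for illustration.
#include <stdio.h>

static void gemm_q4_K_avx512(void) { puts("AVX512 path"); }
static void gemm_q4_K_avx2  (void) { puts("AVX2 path");   }
static void gemm_q4_K_scalar(void) { puts("scalar path"); }

int main(void) {
#if defined(__AVX512F__)
    gemm_q4_K_avx512();   // taken on an AVX512 build, even if the AVX2 path was optimized
#elif defined(__AVX2__)
    gemm_q4_K_avx2();     // only reached when the binary was built without AVX512
#else
    gemm_q4_K_scalar();
#endif
    return 0;
}
```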
Alcpz left a comment
👋 I've checked the PR out of curiosity as I've been working to optimize q4_K for ARM and wanted to see if there were changes that could benefit the other arch.
In previous PRs I submitted, I found it a bit challenging to verify that changes are correct, because Perplexity doesn't check GEMV, only GEMM (#17494 (comment)), and test-backend-ops doesn't traverse the REPACK codepath (also note that you tested against the CUDA backend, not the CPU backend). There was work to improve that, but I think it's still in progress.
While the PR looks good, I would check some form of text generation to make sure there aren't any newly introduced issues.
Take my comments with a grain of salt; your performance improvement is quite good!
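If text-generation output is hard to eyeball, a numeric comparison of the repacked kernel's output against the reference path can also catch regressions, similar in spirit to the NMSE metric that test-backend-ops reports. A generic sketch, not ggml's actual test harness (the 1e-7 threshold is illustrative):

```c
// Generic output-comparison sketch, similar in spirit to the NMSE check used
// by test-backend-ops; not ggml's actual test harness. `ref` holds the result
// of the reference (non-repacked) kernel, `opt` the result of the optimized one.
#include <stddef.h>
#include <stdbool.h>

static double nmse(const float * ref, const float * opt, size_t n) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) ref[i] - (double) opt[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    return norm > 0.0 ? err / norm : err;
}

static bool outputs_match(const float * ref, const float * opt, size_t n) {
    return nmse(ref, opt, n) < 1e-7; // illustrative threshold, not ggml's
}
```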
Out of curiosity: kmask_3 was added above while kmask3 is still using the original mask. Was there any reason to not extend the optimizations from the block above down here?
NIT: if it's intended, I'd declare kmask3 closer to where it is used.
Yes, the reason was that this loop (from line 3065 to 3393) is not executed often enough to contribute to the performance improvement.
But you are right, I should do it the same way as above.
And kmask3 is also used in the AVX512 implementation, and I am not able to test AVX512 on my CPU, so I think the best option is to rename kmask_3 to kmask4 (to keep the same naming scheme).
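For context on the masks being discussed: to my understanding, they are used to unpack the 12-byte scales field of a Q4_K super-block into eight 6-bit scales and eight 6-bit mins. A sketch of the common unpacking pattern (shown for illustration, not necessarily the exact code this PR touches):

```c
// Sketch of the usual Q4_K scale/min unpacking (a common pattern in ggml's
// vec_dot kernels; shown for context, not the exact code changed in this PR).
// A Q4_K super-block packs 8 six-bit scales and 8 six-bit mins into 12 bytes;
// the masks below pick the 6-bit fields apart.
#include <stdint.h>
#include <string.h>

static const uint32_t kmask1 = 0x3f3f3f3f;
static const uint32_t kmask2 = 0x0f0f0f0f;
static const uint32_t kmask3 = 0x03030303;

static void unpack_scales_mins(const uint8_t scales[12], uint32_t utmp[4]) {
    memcpy(utmp, scales, 12);
    // After these shuffles, utmp[0..1] hold the 8 scales and utmp[2..3] the
    // 8 mins, each 6-bit value stored in its own byte.
    utmp[3] = ((utmp[2] >> 4) & kmask2) | (((utmp[1] >> 6) & kmask3) << 4);
    const uint32_t uaux = utmp[1] & kmask1;
    utmp[1] = (utmp[2] & kmask2) | (((utmp[0] >> 6) & kmask3) << 4);
    utmp[2] = uaux;
    utmp[0] &= kmask1;
}
```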
uint32_t utmp_0[4], utmp_1[4];
// Scales and Mins of corresponding sub blocks from different Q4_K structures are stored together
// Scales and Mins of corresponding sub blocks from different Q8_K structures are stored together
Q8_K should be Q4_K I think
that's correct
My bad then!
No, I meant you were correct ;-)
Ah... my bad then :)
In test-backend-ops I compared CUDA to the CPU backend, so I thought my changes would be checked against CUDA.


optimized runtime of the functions ggml_gemv_q4_K_8x8_q8_K() and ggml_gemm_q4_K_8x8_q8_K() for the x86 cpu backend
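For readers who want context on the data these kernels consume, here is a hedged sketch of the Q4_K and Q8_K block layouts as I understand them (paraphrased; see ggml-common.h in the repository for the authoritative definitions, including the exact half-precision type and union layout):

```c
// Hedged sketch of the block layouts involved, paraphrased from my reading of
// ggml's quantization formats; field names and sizes are approximate and the
// authoritative definitions live in ggml-common.h.
#include <stdint.h>

#define QK_K 256               // values per super-block

typedef uint16_t ggml_half;     // fp16 storage (simplified here)

typedef struct {
    ggml_half d;                // super-block scale for the 6-bit sub-block scales
    ggml_half dmin;             // super-block scale for the 6-bit sub-block mins
    uint8_t   scales[12];       // 8 scales + 8 mins, 6 bits each, packed
    uint8_t   qs[QK_K / 2];     // 256 4-bit weight quants, two per byte
} block_q4_K;                   // Q4_K quantized weights

typedef struct {
    float   d;                  // activation scale
    int8_t  qs[QK_K];           // 256 8-bit activation quants
    int16_t bsums[QK_K / 16];   // partial sums of quants, used for the min term
} block_q8_K;                   // Q8_K quantized activations
```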
For performance enhancement, see:

Perplexity:

test-backend-ops:
