ggml : block repack support for Q4_K quants for AArch64 architecture #15719
base: master
Conversation
* new quant: block_q4_kx4 with offline repack impl
* new quantize path: neon impl for ggml_quantize_mat_q8_K_4x8
* new gemv kernel: ggml_gemv_q4_K_4x8_q8_K based on dotprod
* new gemm kernel: ggml_gemm_q4_K_4x8_q8_K based on i8mm
* performance boost for both S_PP and S_TG

Co-authored-by: yuanjia111 <yuan.jia@sanechips.com.cn>
Force-pushed from 543e8eb to 126ce2c
@ggerganov Hi, I just updated this PR to fix some CI issues from the old version back in September and synced it with the latest master; the performance data is also updated. But the workflow seems blocked now.
I'm not yet sure it is worth adding this additional implementation. Can it not be consolidated with the rest of the repacking schemes? cc @Alcpz for opinions
@ggerganov I understand this concern. The 4x8 scheme in this PR mainly targets ne[1] % 4 == 0, a more general scenario than 8x4/8x8. Each of the three schemes covers a different scenario. BTW, the only failed CI check seems to be a JSON parsing issue, which I don't think is related to this PR.
@yuanjia11 I've been away for a bit, I'll give it a look.
Thank you, @Alcpz! 😊
My 2 cents; be aware that I'm quite biased towards PP performance over TG, as TG is decent enough for what I've seen and need. Due to the current logic for choosing this version of the repack, if it's only chosen as a backup, a lot of models won't use this PR:
```c
if (cur->ne[1] % 8 == 0) {
    return &q4_K_8x8_q8_K;
}
if (cur->ne[1] % 4 == 0) {
    return &q4_K_4x8_q8_K;
}
```
As it stands, it's hard to justify the complexity of a new Q4_K variant with very little net benefit (not because of performance, but because it won't be used by the models I've tested). If this version is made the default, then we would need to either add a runtime flag to avoid the PP regression (even more complexity to maintain and for users to run, so I don't think this is great) or remove the 8x8 i8mm version.
So we simply need to discuss what we prefer: ~+10% / ~14 t/s TG, or ~5% / ~12 t/s PP. (That's why I'm being upfront about my PP bias; PP is what I'd prefer.)
@ggerganov: consolidating into the existing implementation is quite tricky, since the novelty here is both a different bsum layout for q8_K and pre-decoded block scales in q4_K.
Unless proven otherwise, we should assume these are most likely the source of the performance changes we see. Any consolidation would need to handle both formats, so we would end up with almost the same code.
I've left a couple of comments in the PR to help with the review in case you decide it's worth merging. Happy to help with whatever decision you end up taking.
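For reference, q4_K packs its eight sub-block scales and mins into 12 bytes at 6-bit precision, and the kernels normally unpack them on the fly. A minimal sketch of decoding them to 8 bits up front, reusing the bit layout of ggml's get_scale_min_k4 (the helper names and output layout here are illustrative, not this PR's actual repack code):

```c
#include <stdint.h>

// unpack the j-th 6-bit scale/min pair from the 12-byte q4_K scales field
// (same bit layout as ggml's get_scale_min_k4)
static inline void unpack_scale_min_k4(int j, const uint8_t * q, uint8_t * d, uint8_t * m) {
    if (j < 4) {
        *d = q[j] & 63;
        *m = q[j + 4] & 63;
    } else {
        *d = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        *m = (q[j + 4] >>  4) | ((q[j - 0] >> 6) << 4);
    }
}

// decode all 8 sub-block scales and mins of one q4_K block into plain 8-bit
// arrays, so a GEMM/GEMV inner loop no longer has to do the bit fiddling
static void predecode_q4_K_scales(const uint8_t scales_packed[12],
                                  uint8_t scales_out[8], uint8_t mins_out[8]) {
    for (int j = 0; j < 8; ++j) {
        unpack_scale_min_k4(j, scales_packed, &scales_out[j], &mins_out[j]);
    }
}
```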
```diff
 // we don't support permuted src0 or src1
-GGML_ASSERT(nb00 == ggml_type_size(src0->type));
+//GGML_ASSERT(nb00 == ggml_type_size(src0->type));
```
The 6-bit to 8-bit unpacking made this assert invalid? This removes a safety check with no replacement, though.
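If a guard is still wanted here, one option (just a suggestion, not something this PR currently does) would be to widen the assert so it keeps rejecting permuted src0 while accepting the repacked per-row stride:

```c
// hypothetical replacement for the commented-out check: still catch permuted
// src0, but also accept the per-row stride of the repacked 4-row block,
// which is what makes the original assert fire
GGML_ASSERT(nb00 == ggml_type_size(src0->type) ||
            nb00 == sizeof(block_q4_Kx4) / 4);
```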
```diff
 void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
 void ggml_quantize_mat_q8_K_4x4(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
-void ggml_quantize_mat_q8_K_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
+void ggml_quantize_mat_q8_K_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k, int64_t nc);
```
I understand the inclusion of nc, as you essentially have a new block layout, but this creates a really inconsistent signature across the ggml_quantize ops. Also, the generic version receives the parameter while only the ARM version uses it. It looks very fragile, as it hides the fact that nc is used to determine the block layout.
Since this is by all means needed by your PR, it would need to be properly documented somewhere in the code if the changes end up being merged in this state. I would probably try to think of a different mechanism, but without spending time on it I'm not sure how to address this, since the block type is decided at runtime by looking at the layout of the weights tensor.
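One possible direction (a hypothetical sketch, not an existing ggml API and not what this PR does) would be to make the layout explicit at the call site instead of inferring it from an extra nc argument inside the generic path:

```c
#include <stdint.h>

// hypothetical: name the two q8_K x4 layouts instead of deriving the choice
// from an nc parameter buried in the quantize signature
enum ggml_q8_Kx4_layout {
    GGML_Q8_KX4_LAYOUT_DEFAULT, // existing bsum ordering used by the 8x4/8x8 kernels
    GGML_Q8_KX4_LAYOUT_4X8,     // reordered bsums expected by the 4x8 kernels
};

// hypothetical wrapper: the original 3-argument entry point stays untouched
// for every other caller, and only the 4x8 path opts into the new layout
void ggml_quantize_mat_q8_K_4x8_layout(const float * x, void * vy, int64_t k,
                                       enum ggml_q8_Kx4_layout layout);
```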
```c
// change tensor shape as block_q4_kx4 brings space size change
//t->nb[0] = ggml_type_size(type);
```
There are some comments from development still there.
```c
// change tensor shape as block_q4_kx4 brings space size change
//t->nb[0] = ggml_type_size(type);
t->nb[0] = sizeof(block_q4_Kx4) / 4;
```
I'm assuming this is correct because the q4_Kx8 block is repacked using the generic implementation.
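To spell out the arithmetic (the struct below is only a guess at the general shape of an interleaved block, not the PR's actual definition): one block_q4_Kx4 covers the same 256-value span of 4 consecutive rows, so the bytes attributable to a single row are a quarter of the interleaved block size.

```c
#include <stdint.h>

// rough sketch of a 4-row interleaved q4_K block; the field sizes are
// assumptions (e.g. scales pre-decoded to 8 bits), not this PR's layout
typedef struct {
    uint16_t d[4];         // per-row super-block scale (fp16 in ggml)
    uint16_t dmin[4];      // per-row super-block min   (fp16 in ggml)
    uint8_t  scales[4*8];  // 8 sub-block scales per row, pre-decoded to 8 bits
    uint8_t  mins[4*8];    // 8 sub-block mins per row, pre-decoded to 8 bits
    uint8_t  qs[4*128];    // 4-bit quants, QK_K/2 = 128 bytes per row, interleaved
} block_q4_Kx4;

// each block_q4_Kx4 packs 4 rows, so the per-row byte stride stored in
// t->nb[0] is a quarter of the interleaved block:
//   t->nb[0] = sizeof(block_q4_Kx4) / 4;
```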
@Alcpz Really appreciate your review and the summary of your insights. After a comprehensive evaluation, we acknowledge that this PR has the following limitations:

We will place greater emphasis on code maintainability in our future work. Hi @ggerganov, I believe we can cease further discussion on this PR and proceed to close it.
@hongyang-7 I think your PR has really interesting contributions nonetheless. I've tried expanding the scales from 6 bits to 8 bits offline, but I end up with the same performance, since the computational gains are traded off against the extra memory accesses. I'm very interested in the Q8_K bsums custom ordering. If you think this could yield performance on ARM devices, I would be really interested in discussing and contributing to make the make_block functions architecture-based (similar to how gemm and gemv have generics to fall back to). I'd be happy to contribute to or help review any PRs regarding that if you decide to continue contributing to llama.cpp.
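As a rough illustration of that idea (hypothetical naming, not how the existing repack code is organized), the per-block repack function could be selected per architecture with a generic fallback, much like the gemm/gemv kernels fall back to their generic implementations:

```c
// hypothetical sketch: pick an architecture-specific block repacker at compile
// time and fall back to a portable one, mirroring the gemm/gemv fallback idea
typedef struct block_q4_K   block_q4_K;   // forward declarations for the sketch
typedef struct block_q4_Kx4 block_q4_Kx4;

static void make_block_q4_Kx4_generic(block_q4_Kx4 * dst, const block_q4_K * src[4]) {
    // portable reference repack would go here
    (void) dst; (void) src;
}

#if defined(__ARM_FEATURE_MATMUL_INT8)
static void make_block_q4_Kx4_i8mm(block_q4_Kx4 * dst, const block_q4_K * src[4]) {
    // i8mm-friendly layout would go here; reuse the generic version as a placeholder
    make_block_q4_Kx4_generic(dst, src);
}
#define make_block_q4_Kx4 make_block_q4_Kx4_i8mm
#else
#define make_block_q4_Kx4 make_block_q4_Kx4_generic
#endif
```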
If you like, I would be glad to help test this PR. I can build on an NVIDIA Jetson AGX Orin and I am interested in improvements related to Q4_K quants. Could you please tell me the build options, if any specific ones are needed? Also, I have very limited free RAM (~16 GB) due to the embedded Linux environment. Thank you so much.

This PR improves the q4_k_q8_k kernel with repacking support for the AArch64 platform.
It has 2 enabling conditions:
The following structures and functions are implemented:
* `block_q4_kx4` based on 4 q4_k blocks, along with an offline repacking function
* `block_q8_Kx4` in `ggml_quantize_mat_q8_K_4x8()`
* `ggml_gemv_q4_K_4x8_q8_K()` dotprod kernel
* `ggml_gemm_q4_K_4x8_q8_K()` i8mm kernel

For now, q4_k_q8_k repacking has 3 implementations including this PR; here's a brief summary:
Each implementation suits a different scenario.

For this PR, I temporarily put the priority of 4x8 after 8x8.
However, for the performance tests, comprehensive comparisons were conducted among the 3 implementations on the same models. This was done by uncommenting different if branches when choosing tensor_traits for the q4_k tensor type, as sketched below.
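For example, forcing the 4x8 path on tensors that would normally pick 8x8 means temporarily disabling the higher-priority branch in the selection code (sketch based on the snippet quoted in the review above):

```c
// benchmark-only tweak: comment out the 8x8 branch so Q4_K tensors with
// ne[1] % 8 == 0 also take the 4x8 path introduced by this PR
//if (cur->ne[1] % 8 == 0) {
//    return &q4_K_8x8_q8_K;
//}
if (cur->ne[1] % 4 == 0) {
    return &q4_K_4x8_q8_K;
}
```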
Test environment
Bench results
(1) meta-llama-3-8b-instruct.Q4_K_M.gguf
S_PP (t/s): base (no repack) vs. this PR
S_TG (t/s): base (no repack) vs. this PR
Compared to 8x4/8x8, 4x8 performs better at TG and worse at PP. All three outperform the base (no repack).
(2) DeepSeek-V3-Q4_k_M.gguf
S_PP (t/s): base (no repack) vs. this PR
S_TG (t/s): base (no repack) vs. this PR
Compared to 8x4/8x8, 4x8 performs better at TG and worse at PP. All three outperform the base (no repack).
Perplexity
Notes