
Conversation


@hongyang-7 hongyang-7 commented Sep 1, 2025

This PR improves the q4_k_q8_k kernel with repacking support for the AArch64 platform.

It has 2 enabling conditions:

  1. i8mm support.
  2. tensor.ne[1] % 4 == 0

The following structures and functions are implemented:

  • new quant type: block_q4_Kx4, built from 4 q4_K blocks, along with an offline repacking function (a layout sketch follows this list)
  • new quantize path: NEON implementation for block_q8_Kx4 in ggml_quantize_mat_q8_K_4x8()
  • new GEMV kernel: ggml_gemv_q4_K_4x8_q8_K(), based on dotprod
  • new GEMM kernel: ggml_gemm_q4_K_4x8_q8_K(), based on i8mm
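
A rough sketch of what such a 4-row interleaved block could look like, based only on the description above (field names, ordering, and interleaving are illustrative; the actual struct in the PR may differ):

```c
#include <stdint.h>

typedef uint16_t ggml_half;     // fp16 bit pattern, as in ggml-common.h
#define QK_K 256                // q4_K super-block size

// Illustrative 4-row interleaved q4_K block: four block_q4_K super-blocks with
// their 6-bit sub-block scales/mins already widened to 8 bits at repack time.
typedef struct {
    ggml_half d[4];             // per-row super-block scale for the scales
    ggml_half dmin[4];          // per-row super-block scale for the mins
    uint8_t   scales[4 * 8];    // 8 sub-block scales per row, pre-decoded to 8 bits
    uint8_t   mins[4 * 8];      // 8 sub-block mins per row, pre-decoded to 8 bits
    uint8_t   qs[4 * QK_K / 2]; // 4-bit quants of the 4 rows, interleaved for the kernels
} block_q4_Kx4_sketch;          // 592 bytes vs 4 * 144 = 576 bytes of plain block_q4_K,
                                // which is why the repack buffer has to grow (see Notes)
```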

As of this PR, q4_k_q8_k repacking has 3 implementations (including this one); here is a brief summary:

| mode | packing column number | ISA dependency | tensor shape | PR |
| --- | --- | --- | --- | --- |
| q4_K_8x4_q8_K | 8 | dotprod | ne[1]%8==0 | #17494 |
| q4_K_8x8_q8_K | 8 | i8mm | ne[1]%8==0 | #16739 |
| q4_K_4x8_q8_K | 4 | i8mm | ne[1]%4==0 | this PR |

Each implementation suits a specific scenario.
For this PR, I temporarily put the priority of 4x8 after 8x8:
[screenshot of the repack selection branch]
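
In code, that priority corresponds to a branch roughly like the one below (the same snippet is quoted later in the review; the wrapper function and the i8mm guard are assumed here from the enabling conditions):

```c
// Sketch of the q4_K repack-traits selection order; only the two inner
// branches mirror the PR, the surrounding function shape is assumed.
static const void * choose_q4_K_repack_traits(const struct ggml_tensor * cur) {
    if (ggml_cpu_has_matmul_int8()) {      // i8mm is required by both kernels
        if (cur->ne[1] % 8 == 0) {
            return &q4_K_8x8_q8_K;         // existing 8x8 i8mm repack keeps priority
        }
        if (cur->ne[1] % 4 == 0) {
            return &q4_K_4x8_q8_K;         // this PR: used when ne[1] % 4 == 0 only
        }
    }
    return NULL;                           // no repack: fall back to the base q4_K path
}
```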

However, for the performance tests, comprehensive comparisons were conducted among the 3 implementations on the same models. This was done by uncommenting different if branches when choosing the tensor_traits for the q4_K tensor type.

Test environment

  • Server: Neoverse-N2
  • System_info: n_threads = 64 (n_threads_batch = 64) / 256 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
  • Competitors: ① base (no q4_k repack) ② 8x4 impl ③ 8x8 impl ④ 4x8 impl
  • Models: 2 models of different scales:

| models | storage size | param size | quant type |
| --- | --- | --- | --- |
| meta-llama-3-8b-instruct.Q4_K_M.gguf | 4.6G | 8.03B | Q4_K_M |
| DeepSeek-V3-Q4_k_M.gguf | 377G | 671B | Q4_K_M |

Bench results

(1) meta-llama-3-8b-instruct.Q4_K_M.gguf

./bin/llama-batched-bench -m /mnt/models/meta-llama-3-8b-instruct.Q4_K_M.gguf -c 8192 -b 2048 -ub 512  -npp 128 -ntg 128 -npl 1,4,8,16 -t 64 --no-mmap

S_PP (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 171.58 | 234.90 | 246.05 | 241.49 | 140.7% | 102.8% | 98.1% |
| 4 | 179.66 | 246.19 | 258.52 | 245.70 | 136.8% | 99.8% | 95.0% |
| 8 | 180.63 | 247.54 | 259.75 | 247.18 | 136.8% | 99.9% | 95.2% |
| 16 | 180.32 | 247.65 | 259.84 | 247.59 | 137.3% | 100.0% | 95.3% |

S_TG (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 39.27 | 36.62 | 36.69 | 36.22 | 92.2% | 98.9% | 98.7% |
| 4 | 76.93 | 82.85 | 83.22 | 94.90 | 123.4% | 114.5% | 114.0% |
| 8 | 103.94 | 116.79 | 118.76 | 129.74 | 124.8% | 111.1% | 109.2% |
| 16 | 126.75 | 149.26 | 153.59 | 167.59 | 132.2% | 112.3% | 109.1% |

Compared to 8x4/8x8, 4x8 performs better at TG and worse at PP. All three outperform the base (no repack).

(2) DeepSeek-V3-Q4_k_M.gguf

./bin/llama-batched-bench -m /mnt/models/DeepSeek-V3-Q4_k_M.gguf -c 8192 -b 2048 -ub 512  -npp 128 -ntg 128 -npl 1,4,8,16 -t 64 --no-mmap

S_PP (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 21.92 | 29.67 | 28.78 | 27.94 | 127.5% | 94.2% | 97.1% |
| 4 | 25.20 | 33.62 | 32.3 | 31.03 | 123.1% | 92.3% | 96.1% |
| 8 | 25.56 | 33.78 | 32.42 | 31.12 | 121.8% | 92.1% | 96.0% |
| 16 | 25.55 | 33.79 | 32.39 | 31.11 | 121.8% | 92.1% | 96.0% |

S_TG (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7.28 | 7.15 | 7.17 | 7.12 | 97.8% | 99.6% | 99.3% |
| 4 | 12.61 | 12.46 | 11.52 | 13.19 | 104.6% | 105.9% | 114.5% |
| 8 | 15.10 | 15.45 | 14.74 | 15.91 | 105.4% | 103.0% | 107.9% |
| 16 | 17.05 | 18.16 | 17.54 | 18.06 | 105.9% | 99.4% | 103.0% |

The same pattern holds here: compared to 8x4/8x8, 4x8 performs better at TG and worse at PP, and all three outperform the base (no repack).

Perplexity

| models | ① base | ② 8x4 | ③ 8x8 | ④ 4x8 |
| --- | --- | --- | --- | --- |
| meta-llama-3-8b-instruct | 3.7482 +/- 0.14273 | 3.7615 +/- 0.14341 | 3.7562 +/- 0.14319 | 3.7576 +/- 0.14305 |
| DeepSeek-V3 | 1.0382 +/- 0.00630 | 1.0382 +/- 0.00626 | 1.0380 +/- 0.00623 | 1.0400 +/- 0.00642 |

Notes

  1. This PR defines the block_q4_Kx4 structure, which is not simply four block_q4_K structures: the scale dequant step (recovering the 6-bit scales to 8 bits) is moved ahead to the offline repacking stage (see the sketch after these notes), which makes the repacked storage slightly larger than before. To avoid possible memory-allocation problems, this PR also introduces a mechanism to expand the memory space reserved for repacking.
  2. This PR uses a different q8_K online-repacking layout compared to the generic C version.
  3. This PR's repacking idea originally comes from q4_0 on Arm: Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization #5780
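
To make note 1 concrete: stock q4_K packs 8 sub-block scales and 8 mins into 12 bytes of 6-bit fields, which the kernels normally decode on the fly with ggml's get_scale_min_k4 helper (reproduced below in simplified form). Moving that decode to the offline repack step means storing the already-widened 8-bit values, which is what makes the repacked tensor slightly larger; the expand_scales_q4_K wrapper is only an illustrative sketch, not the PR's actual repack code.

```c
#include <stdint.h>

// Simplified copy of ggml's 6-bit scale/min decode for q4_K (get_scale_min_k4):
// sub-blocks 0..3 keep their low 6 bits in q[0..3] (scales) and q[4..7] (mins);
// sub-blocks 4..7 store their low nibbles in q[8..11] and their high 2 bits in
// the top bits of q[0..7].
static inline void get_scale_min_k4(int j, const uint8_t * q, uint8_t * d, uint8_t * m) {
    if (j < 4) {
        *d = q[j] & 63;
        *m = q[j + 4] & 63;
    } else {
        *d = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        *m = (q[j + 4] >>  4) | ((q[j - 0] >> 6) << 4);
    }
}

// Hypothetical offline expansion of one q4_K super-block's 12 packed scale
// bytes into 8 + 8 plain bytes, run once at repack time instead of per matmul.
static void expand_scales_q4_K(const uint8_t packed[12], uint8_t scales[8], uint8_t mins[8]) {
    for (int j = 0; j < 8; ++j) {
        get_scale_min_k4(j, packed, &scales[j], &mins[j]);
    }
}
```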

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Sep 1, 2025
hongyang-7 and others added 6 commits December 18, 2025 14:14
* new quanti: block_q4_kx4 with offline repack impl

* new quantize path: neon impl for ggml_quantize_mat_q8_K_4x8

* new gemv kernel: ggml_gemv_q4_K_4x8_q8_K based on dotprod

* new gemm kernel: ggml_gemm_q4_K_4x8_q8_K based on i8mm

* performance boost for both S_PP and S_TG

---------

Co-authored-by: yuanjia111 <yuan.jia@sanechips.com.cn>
@hongyang-7
Author

hongyang-7 commented Dec 19, 2025

@ggerganov Hi, I just updated this PR to fix some CI issues in the old September version and synced it with the latest master; the performance data is also updated. But the workflow seems blocked now.
The hint says some approval is needed; I thought the CI would be executed automatically:
[screenshot of the CI approval prompt]

@ggerganov
Member

I'm not yet sure it is worth adding this additional implementation. Can it not be consolidated with the rest of the repacking schemes?

cc @Alcpz for opinions

@hongyang-7
Author

hongyang-7 commented Dec 19, 2025

> I'm not yet sure it is worth adding this additional implementation. Can it not be consolidated with the rest of the repacking schemes?
>
> cc @Alcpz for opinions

@ggerganov I understand this concern. The 4x8 scheme of this PR mainly targets ne[1]%4==0, a more general condition compared to 8x4/8x8. Each of the 3 schemes has its own suitable scenario.

BTW, the only failed CI check seems to be a JSON parsing issue, which I think is not related to this PR.

@yuanjia111

> I'm not yet sure it is worth adding this additional implementation. Can it not be consolidated with the rest of the repacking schemes?
>
> cc @Alcpz for opinions

Hi @Alcpz, I'd really appreciate it if you could take some time to discuss this issue. Thank you very much!

@Alcpz
Collaborator

Alcpz commented Jan 7, 2026

@yuanjia11 I've been away for a bit, I'll give it a look.

@yuanjia111

> @yuanjia11 I've been away for a bit, I'll give it a look.

Thank you, @Alcpz! 😊
I completely understand you’ve been away — please take your time.
I just wanted to gently follow up since your input on this design direction would mean a lot.
Looking forward to your feedback whenever you’re ready!

Collaborator

@Alcpz Alcpz left a comment


My 2 cents; be aware that I'm quite biased towards PP performance over TG, as TG is decent enough from what I've seen and need. Due to the current logic for choosing this version of the repack, if it's chosen as a backup, a lot of the models won't use this PR.

            if (cur->ne[1] % 8 == 0) {
                return &q4_K_8x8_q8_K;
            }
            if (cur->ne[1] % 4 == 0) {
                return &q4_K_4x8_q8_K;
            }

As it stands, it's hard to justify the complexity of a new Q4_K variant with very little net benefit (not because of performance, but because it won't be used by the models I've tested). If this version is set as the default, then we would need to either add a runtime flag to avoid the PP regression (even more complexity to maintain and for users to deal with, so I don't think this is great) or remove the 8x8 i8mm version.

So we simply need to discuss what we prefer: ~+10% / ~14 t/s TG, or ~5% / ~12 t/s PP. (That's why I'm being upfront about my bias towards PP, which is the option I prefer.)

@ggerganov: consolidating this into the existing implementation is quite tricky, since the novelty here is both a different bsum layout for q8_K and pre-decoded block scales in q4_K.
Unless proven otherwise, we should assume these are most likely the source of the performance changes we see. Any consolidation would need to handle both formats, so we would end up with almost the same code anyway.

I've left a couple of comments in the PR to help with the review in case you decide it's worth merging. Happy to help with whatever decision you end up taking.


// we don't support permuted src0 or src1
GGML_ASSERT(nb00 == ggml_type_size(src0->type));
//GGML_ASSERT(nb00 == ggml_type_size(src0->type));
Collaborator


The 6-bit to 8-bit unpacking made this assert invalid? This removes a safety check with no replacement, though.

void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_K_4x4(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_K_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_K_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k, int64_t nc);
Collaborator


I understand the inclusion of nc, as you essentially have a new block layout, but this creates a really inconsistent signature across the ggml_quantize ops. Also, the generic version receives the parameter while only the ARM version uses it. It looks very fragile, as it hides the fact that nc is used to determine the block layout.

Collaborator


Since this is by all means needed by your PR, it would need to be properly documented somewhere in the code if the changes end up merged in this state. I would probably try to think of a different mechanism, but without spending more time on it I'm not sure how to address this, since the block type is decided at runtime by looking at the layout of the weights tensor.

}

// change tensor shape as block_q4_kx4 brings space size change
//t->nb[0] = ggml_type_size(type);
Collaborator


There are some comments from development still there.


// change tensor shape as block_q4_kx4 brings space size change
//t->nb[0] = ggml_type_size(type);
t->nb[0] = sizeof(block_q4_Kx4) / 4;
Collaborator


I'm assuming this is correct because the q4_Kx8 block is repacked using the generic implementation

@hongyang-7
Author

@Alcpz I really appreciate your review and the summary of your insights.

After a comprehensive evaluation, we acknowledge that this PR has the following limitations:

  1. Performance aspect: While we may hold different views on the relative importance of TG and PP, I agree that the performance data is not yet sufficiently convincing.
  2. Scenario aspect: It seems that most of the models tested to date would not use this PR if 8x8 and 4x8 coexist. The existing 8x8 scheme has more stringent requirements than our 4x8 scheme (a multiple of 8 is certainly a multiple of 4), so if nearly all cases fall into the 8x8 category, the 4x8 scheme will have little practical applicability. Counterexamples are hard to find, making it difficult to justify this as a high-value fallback implementation.
  3. Code maintainability aspect: Even if issue 2 were resolved and there were scenarios where the 4x8 scheme needed to coexist with the 8x8 scheme, maintenance would be rather cumbersome. The current 8x8/8x4 implementations are based on the packed layout of the original q4_K and q8_K from the x86 implementation, with no changes to the data structures and layouts. In contrast, our implementation involves quite extensive modifications: for instance, the expansion of the scales from 6-bit to 8-bit is moved forward to the offline packing stage, and the bsums layout of q8_K is modified. These changes require adding architecture-specific parameters for the arm64 implementation to the common framework code shared across architectures, which is far from ideal for maintainers.

We will place greater emphasis on code maintainability in our future work.

Hi @ggerganov, I believe we can cease further discussions on this PR and proceed to close it.

@Alcpz
Collaborator

Alcpz commented Jan 21, 2026

@hongyang-7 I think your PR has really interesting contributions nonetheless. I've tried expanding the scales from 6 bits to 8 bits offline, but I end up with the same performance, since the computational gains are traded off against extra memory accesses. I'm very interested in the custom ordering of the Q8_K bsums. If you think this could yield performance gains on ARM devices, I would be really interested in discussing and contributing to make the make_block functions architecture-based (similar to how gemm and gemv have generics to fall back to). I'd be happy to contribute to or help review any PRs regarding that if you decide to continue contributing to llama.cpp.
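
For anyone following along, a minimal sketch of the direction suggested above, assuming a function-pointer dispatch with a generic fallback (all names below are hypothetical; none of these symbols exist in ggml today):

```c
#include <stddef.h>

// Hypothetical arch-dispatched repacking ("make_block") with a generic
// fallback, mirroring how the repack gemm/gemv kernels fall back to generics.
typedef void (*make_block_q4_Kx8_fn)(void * dst, const void * src, size_t nrows, size_t ncols);

static void make_block_q4_Kx8_generic(void * dst, const void * src, size_t nrows, size_t ncols) {
    // portable C repack (interleave 8 rows of block_q4_K) would live here
    (void) dst; (void) src; (void) nrows; (void) ncols;
}

static make_block_q4_Kx8_fn select_make_block_q4_Kx8(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    // an AArch64 override could pre-decode the 6-bit scales and reorder the
    // q8_K bsums here, as explored in this PR
    // return make_block_q4_Kx8_aarch64;   // hypothetical override
#endif
    return make_block_q4_Kx8_generic;      // every other target keeps the generic layout
}
```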

@elfarolab

elfarolab commented Jan 21, 2026

@hongyang-7

If you like, I would be happy to help test this PR.

I can build on an NVIDIA Jetson AGX Orin and I am interested in improvements related to Q4_K quants.

Could you please tell me the build options, if any specific ones are needed?

Also, I have very limited free RAM (~16 GB) due to the embedded Linux environment.
What model would you like me to test?

Thank you so much.
