
Conversation


@hongyang-7 hongyang-7 commented Sep 1, 2025

This PR improves the q4_k_q8_k kernel with repacking support for the AArch64 platform.

It has 2 enabling conditions:

  1. i8mm support.
  2. tensor.ne[1] % 4 == 0

The following structures and functions are implemented:

  • new quant type: block_q4_Kx4, built from 4 q4_K blocks, along with an offline repacking function (a layout sketch follows this list)
  • new quantize path: NEON implementation for block_q8_Kx4 in ggml_quantize_mat_q8_K_4x8()
  • new GEMV kernel: ggml_gemv_q4_K_4x8_q8_K(), based on dotprod
  • new GEMM kernel: ggml_gemm_q4_K_4x8_q8_K(), based on i8mm
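
A rough sketch of what such a 4-row interleaved block could look like, based only on the description above (field names, ordering, and interleaving are illustrative; the actual struct in the PR may differ):

```c
#include <stdint.h>

typedef uint16_t ggml_half;     // fp16 bit pattern, as in ggml-common.h
#define QK_K 256                // q4_K super-block size

// Illustrative 4-row interleaved q4_K block: four block_q4_K super-blocks with
// their 6-bit sub-block scales/mins already widened to 8 bits at repack time.
typedef struct {
    ggml_half d[4];             // per-row super-block scale for the scales
    ggml_half dmin[4];          // per-row super-block scale for the mins
    uint8_t   scales[4 * 8];    // 8 sub-block scales per row, pre-decoded to 8 bits
    uint8_t   mins[4 * 8];      // 8 sub-block mins per row, pre-decoded to 8 bits
    uint8_t   qs[4 * QK_K / 2]; // 4-bit quants of the 4 rows, interleaved for the kernels
} block_q4_Kx4_sketch;          // 592 bytes vs 4 * 144 = 576 bytes of plain block_q4_K,
                                // which is why the repack buffer has to grow (see Notes)
```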

As of this PR, q4_k_q8_k repacking has 3 implementations (including this one); here is a brief summary:

| mode | packing column number | ISA dependency | tensor shape | PR |
| --- | --- | --- | --- | --- |
| q4_K_8x4_q8_K | 8 | dotprod | ne[1]%8==0 | #17494 |
| q4_K_8x8_q8_K | 8 | i8mm | ne[1]%8==0 | #16739 |
| q4_K_4x8_q8_K | 4 | i8mm | ne[1]%4==0 | this PR |

Each implementation suits a specific scenario.
For this PR, I temporarily put the priority of 4x8 after 8x8:
[screenshot of the repack selection branch]
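
In code, that priority corresponds to a branch roughly like the one below (the same snippet is quoted later in the review; the wrapper function and the i8mm guard are assumed here from the enabling conditions):

```c
// Sketch of the q4_K repack-traits selection order; only the two inner
// branches mirror the PR, the surrounding function shape is assumed.
static const void * choose_q4_K_repack_traits(const struct ggml_tensor * cur) {
    if (ggml_cpu_has_matmul_int8()) {      // i8mm is required by both kernels
        if (cur->ne[1] % 8 == 0) {
            return &q4_K_8x8_q8_K;         // existing 8x8 i8mm repack keeps priority
        }
        if (cur->ne[1] % 4 == 0) {
            return &q4_K_4x8_q8_K;         // this PR: used when ne[1] % 4 == 0 only
        }
    }
    return NULL;                           // no repack: fall back to the base q4_K path
}
```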

However, for the performance tests, comprehensive comparisons were conducted among the 3 implementations on the same models. This was done by uncommenting different if branches when choosing the tensor_traits for the q4_K tensor type.

Test environment

  • Server: Neoverse-N2
  • System_info: n_threads = 64 (n_threads_batch = 64) / 256 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
  • Competitors: ① base (no q4_k repack) ② 8x4 impl ③ 8x8 impl ④ 4x8 impl
  • Models: 2 models of different scales:

| models | storage size | param size | quant type |
| --- | --- | --- | --- |
| meta-llama-3-8b-instruct.Q4_K_M.gguf | 4.6G | 8.03B | Q4_K_M |
| DeepSeek-V3-Q4_k_M.gguf | 377G | 671B | Q4_K_M |

Bench results

(1) meta-llama-3-8b-instruct.Q4_K_M.gguf

./bin/llama-batched-bench -m /mnt/models/meta-llama-3-8b-instruct.Q4_K_M.gguf -c 8192 -b 2048 -ub 512  -npp 128 -ntg 128 -npl 1,4,8,16 -t 64 --no-mmap

S_PP (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 171.58 | 234.90 | 246.05 | 241.49 | 140.7% | 102.8% | 98.1% |
| 4 | 179.66 | 246.19 | 258.52 | 245.70 | 136.8% | 99.8% | 95.0% |
| 8 | 180.63 | 247.54 | 259.75 | 247.18 | 136.8% | 99.9% | 95.2% |
| 16 | 180.32 | 247.65 | 259.84 | 247.59 | 137.3% | 100.0% | 95.3% |

S_TG (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 39.27 | 36.62 | 36.69 | 36.22 | 92.2% | 98.9% | 98.7% |
| 4 | 76.93 | 82.85 | 83.22 | 94.90 | 123.4% | 114.5% | 114.0% |
| 8 | 103.94 | 116.79 | 118.76 | 129.74 | 124.8% | 111.1% | 109.2% |
| 16 | 126.75 | 149.26 | 153.59 | 167.59 | 132.2% | 112.3% | 109.1% |

Compared to 8x4/8x8, 4x8 performs better at TG and worse at PP. All three outperform the base (no repack).

(2) DeepSeek-V3-Q4_k_M.gguf

./bin/llama-batched-bench -m /mnt/models/DeepSeek-V3-Q4_k_M.gguf -c 8192 -b 2048 -ub 512  -npp 128 -ntg 128 -npl 1,4,8,16 -t 64 --no-mmap

S_PP (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 21.92 | 29.67 | 28.78 | 27.94 | 127.5% | 94.2% | 97.1% |
| 4 | 25.20 | 33.62 | 32.3 | 31.03 | 123.1% | 92.3% | 96.1% |
| 8 | 25.56 | 33.78 | 32.42 | 31.12 | 121.8% | 92.1% | 96.0% |
| 16 | 25.55 | 33.79 | 32.39 | 31.11 | 121.8% | 92.1% | 96.0% |

S_TG (t/s)

| B | ① base (no repack) | ② 8x4 | ③ 8x8 | ④ 4x8 (this PR) | ④ vs ① | ④ vs ② | ④ vs ③ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7.28 | 7.15 | 7.17 | 7.12 | 97.8% | 99.6% | 99.3% |
| 4 | 12.61 | 12.46 | 11.52 | 13.19 | 104.6% | 105.9% | 114.5% |
| 8 | 15.10 | 15.45 | 14.74 | 15.91 | 105.4% | 103.0% | 107.9% |
| 16 | 17.05 | 18.16 | 17.54 | 18.06 | 105.9% | 99.4% | 103.0% |

The same pattern holds here: compared to 8x4/8x8, 4x8 performs better at TG and worse at PP, and all three outperform the base (no repack).

Perplexity

| models | ① base | ② 8x4 | ③ 8x8 | ④ 4x8 |
| --- | --- | --- | --- | --- |
| meta-llama-3-8b-instruct | 3.7482 +/- 0.14273 | 3.7615 +/- 0.14341 | 3.7562 +/- 0.14319 | 3.7576 +/- 0.14305 |
| DeepSeek-V3 | 1.0382 +/- 0.00630 | 1.0382 +/- 0.00626 | 1.0380 +/- 0.00623 | 1.0400 +/- 0.00642 |

Notes

  1. This PR defines the block_q4_Kx4 structure, which is not simply four block_q4_K structures: the scale dequant step (recovering the 6-bit scales to 8 bits) is moved ahead to the offline repacking stage (see the sketch after these notes), which makes the repacked storage slightly larger than before. To avoid possible memory-allocation problems, this PR also introduces a mechanism to expand the memory space reserved for repacking.
  2. This PR uses a different q8_K online-repacking layout compared to the generic C version.
  3. This PR's repacking idea originally comes from q4_0 on Arm: Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0, and q8_0_q8_0 quantization #5780
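
To make note 1 concrete: stock q4_K packs 8 sub-block scales and 8 mins into 12 bytes of 6-bit fields, which the kernels normally decode on the fly with ggml's get_scale_min_k4 helper (reproduced below in simplified form). Moving that decode to the offline repack step means storing the already-widened 8-bit values, which is what makes the repacked tensor slightly larger; the expand_scales_q4_K wrapper is only an illustrative sketch, not the PR's actual repack code.

```c
#include <stdint.h>

// Simplified copy of ggml's 6-bit scale/min decode for q4_K (get_scale_min_k4):
// sub-blocks 0..3 keep their low 6 bits in q[0..3] (scales) and q[4..7] (mins);
// sub-blocks 4..7 store their low nibbles in q[8..11] and their high 2 bits in
// the top bits of q[0..7].
static inline void get_scale_min_k4(int j, const uint8_t * q, uint8_t * d, uint8_t * m) {
    if (j < 4) {
        *d = q[j] & 63;
        *m = q[j + 4] & 63;
    } else {
        *d = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        *m = (q[j + 4] >>  4) | ((q[j - 0] >> 6) << 4);
    }
}

// Hypothetical offline expansion of one q4_K super-block's 12 packed scale
// bytes into 8 + 8 plain bytes, run once at repack time instead of per matmul.
static void expand_scales_q4_K(const uint8_t packed[12], uint8_t scales[8], uint8_t mins[8]) {
    for (int j = 0; j < 8; ++j) {
        get_scale_min_k4(j, packed, &scales[j], &mins[j]);
    }
}
```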

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Sep 1, 2025
hongyang-7 and others added 6 commits December 18, 2025 14:14
* new quanti: block_q4_kx4 with offline repack impl

* new quantize path: neon impl for ggml_quantize_mat_q8_K_4x8

* new gemv kernel: ggml_gemv_q4_K_4x8_q8_K based on dotprod

* new gemm kernel: ggml_gemm_q4_K_4x8_q8_K based on i8mm

* performance boost for both S_PP and S_TG

---------

Co-authored-by: yuanjia111 <yuan.jia@sanechips.com.cn>
@hongyang-7
Author

hongyang-7 commented Dec 19, 2025

@ggerganov Hi, I just updated this PR to fix some CI issues in the old September version and synced it with the latest master; the performance data is also updated. But the workflow seems blocked now.
The hint says some approval is needed; I thought the CI would be executed automatically:
[screenshot of the CI approval prompt]

@ggerganov
Member

I'm not yet sure it is worth adding this additional implementation. Can it not be consolidated with the rest of the repacking schemes?

cc @Alcpz for opinions

@hongyang-7
Author

hongyang-7 commented Dec 19, 2025

> I'm not yet sure it is worth adding this additional implementation. Can it not be consolidated with the rest of the repacking schemes?
>
> cc @Alcpz for opinions

@ggerganov I understand this concern. The 4x8 scheme of this PR mainly targets ne[1]%4==0, a more general condition compared to 8x4/8x8. Each of the 3 schemes has its own suitable scenario.

BTW, the only failed CI check seems to be a JSON parsing issue, which I think is not related to this PR.

@yuanjia111

> I'm not yet sure it is worth adding this additional implementation. Can it not be consolidated with the rest of the repacking schemes?
>
> cc @Alcpz for opinions

Hi @Alcpz, I'd really appreciate it if you could take some time to discuss this issue. Thank you very much!

@Alcpz
Collaborator

Alcpz commented Jan 7, 2026

@yuanjia11 I've been away for a bit, I'll give it a look.

@yuanjia111

> @yuanjia11 I've been away for a bit, I'll give it a look.

Thank you, @Alcpz! 😊
I completely understand you’ve been away — please take your time.
I just wanted to gently follow up since your input on this design direction would mean a lot.
Looking forward to your feedback whenever you’re ready!

Collaborator

@Alcpz Alcpz left a comment


My 2 cents; be aware that I'm quite biased towards PP performance over TG, as TG is decent enough from what I've seen and need. Due to the current logic for choosing this version of the repack, if it's chosen as a backup, a lot of the models won't use this PR.

            if (cur->ne[1] % 8 == 0) {
                return &q4_K_8x8_q8_K;
            }
            if (cur->ne[1] % 4 == 0) {
                return &q4_K_4x8_q8_K;
            }

As it stands, it's hard to justify the complexity of a new Q4_K variant with very little net benefit (not because of performance, but because it won't be used by the models I've tested). If this version is set as the default, then we would need to either add a runtime flag to avoid the PP regression (even more complexity to maintain and for users to deal with, so I don't think this is great) or remove the 8x8 i8mm version.

So we simply need to discuss what we prefer: ~+10% / ~14 t/s TG, or ~5% / ~12 t/s PP. (That's why I'm being upfront about my bias towards PP, which is the option I prefer.)

@ggerganov: consolidating this into the existing implementation is quite tricky, since the novelty here is both a different bsum layout for q8_K and pre-decoded block scales in q4_K.
Unless proven otherwise, we should assume these are most likely the source of the performance changes we see. Any consolidation would need to handle both formats, so we would end up with almost the same code anyway.

I've left a couple of comments in the PR to help with the review in case you decide it's worth merging. Happy to help with whatever decision you end up taking.


// we don't support permuted src0 or src1
GGML_ASSERT(nb00 == ggml_type_size(src0->type));
//GGML_ASSERT(nb00 == ggml_type_size(src0->type));
Collaborator


The 6-bit to 8-bit unpacking made this assert invalid? This removes a safety check with no replacement, though.

void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_K_4x4(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_K_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k);
void ggml_quantize_mat_q8_K_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k, int64_t nc);
Collaborator


I understand the inclusion of nc, as you essentially have a new block layout, but this creates a really inconsistent signature across the ggml_quantize ops. Also, the generic version receives the parameter while only the ARM version uses it. It looks very fragile, as it hides the fact that nc is used to determine the block layout.

Collaborator


Since this is by all means needed by your PR, it would need to be properly documented somewhere in the code if the changes end up merged in this state. I would probably try to think of a different mechanism, but without spending more time on it I'm not sure how to address this, since the block type is decided at runtime by looking at the layout of the weights tensor.

}

// change tensor shape as block_q4_kx4 brings space size change
//t->nb[0] = ggml_type_size(type);
Collaborator


There are some comments from development still there.


// change tensor shape as block_q4_kx4 brings space size change
//t->nb[0] = ggml_type_size(type);
t->nb[0] = sizeof(block_q4_Kx4) / 4;
Collaborator


I'm assuming this is correct because the q4_Kx8 block is repacked using the generic implementation

@hongyang-7
Author

@Alcpz I really appreciate your review and the summary of your insights.

After a comprehensive evaluation, we acknowledge that this PR has the following limitations:

  1. Performance aspect: While we may hold different views on the relative importance of TG and PP, I agree that the performance data is not yet sufficiently convincing.
  2. Scenario aspect: It seems that most of the models tested to date would not use this PR if 8x8 and 4x8 coexist. The existing 8x8 scheme has more stringent requirements than our 4x8 scheme (a multiple of 8 is certainly a multiple of 4), so if nearly all cases fall into the 8x8 category, the 4x8 scheme will have little practical applicability. Counterexamples are hard to find, making it difficult to justify this as a high-value fallback implementation.
  3. Code maintainability aspect: Even if issue 2 were resolved and there were scenarios where the 4x8 scheme needed to coexist with the 8x8 scheme, maintenance would be rather cumbersome. The current 8x8/8x4 implementations are based on the packed layout of the original q4_K and q8_K from the x86 implementation, with no changes to the data structures and layouts. In contrast, our implementation involves quite extensive modifications: for instance, the expansion of the scales from 6-bit to 8-bit is moved forward to the offline packing stage, and the bsums layout of q8_K is modified. These changes require adding architecture-specific parameters for the arm64 implementation to the common framework code shared across architectures, which is far from ideal for maintainers.

We will place greater emphasis on code maintainability in our future work.

Hi @ggerganov, I believe we can cease further discussions on this PR and proceed to close it.

@Alcpz
Collaborator

Alcpz commented Jan 21, 2026

@hongyang-7 I think your PR has really interesting contributions nonetheless. I've tried expanding the scales from 6 bits to 8 bits offline, but I end up with the same performance, since the computational gains are traded off against extra memory accesses. I'm very interested in the custom ordering of the Q8_K bsums. If you think this could yield performance gains on ARM devices, I would be really interested in discussing and contributing to make the make_block functions architecture-based (similar to how gemm and gemv have generics to fall back to). I'd be happy to contribute to or help review any PRs regarding that if you decide to continue contributing to llama.cpp.
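
For anyone following along, a minimal sketch of the direction suggested above, assuming a function-pointer dispatch with a generic fallback (all names below are hypothetical; none of these symbols exist in ggml today):

```c
#include <stddef.h>

// Hypothetical arch-dispatched repacking ("make_block") with a generic
// fallback, mirroring how the repack gemm/gemv kernels fall back to generics.
typedef void (*make_block_q4_Kx8_fn)(void * dst, const void * src, size_t nrows, size_t ncols);

static void make_block_q4_Kx8_generic(void * dst, const void * src, size_t nrows, size_t ncols) {
    // portable C repack (interleave 8 rows of block_q4_K) would live here
    (void) dst; (void) src; (void) nrows; (void) ncols;
}

static make_block_q4_Kx8_fn select_make_block_q4_Kx8(void) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    // an AArch64 override could pre-decode the 6-bit scales and reorder the
    // q8_K bsums here, as explored in this PR
    // return make_block_q4_Kx8_aarch64;   // hypothetical override
#endif
    return make_block_q4_Kx8_generic;      // every other target keeps the generic layout
}
```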

@elfarolab

elfarolab commented Jan 21, 2026

@hongyang-7

If you like, I would be happy to help test this PR.

I can build on an NVIDIA Jetson AGX Orin and I am interested in improvements related to Q4_K quants.

Could you please tell me the build options, if any specific ones are needed?

Also, I have very limited free RAM (~16 GB) due to the embedded Linux environment.
What model would you like me to test?

Thank you so much.
