
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#15719

This PR improves the q4_K_q8_K kernel with block repacking support for the AArch64 architecture, based on NEON.

The following structures and functions are implemented (a layout sketch follows the list):

  • new quant type: block_q4_Kx4, built from four q4_K blocks, along with an offline repacking function
  • new quantize path: NEON implementation for block_q8_Kx4 in ggml_quantize_mat_q8_K_4x8()
  • new GEMV kernel: ggml_gemv_q4_K_4x8_q8_K(), a NEON kernel for the GGML_OP_MUL_MAT_ID/GGML_OP_MUL_MAT ops
  • new GEMM kernel: ggml_gemm_q4_K_4x8_q8_K(), a NEON kernel for the GGML_OP_MUL_MAT_ID/GGML_OP_MUL_MAT ops
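
As a rough illustration of the repacked layout, here is a minimal sketch of what a four-way interleaved q4_K block could look like. Field sizes follow the standard block_q4_K definition (QK_K = 256, 12 packed scale bytes per block); the exact field order and interleaving granularity in this PR may differ:

```c
#include <stdint.h>

#define QK_K         256   // elements per q4_K super-block
#define K_SCALE_SIZE 12    // packed 6-bit sub-block scales/mins per block

// Hedged sketch of a four-way repacked q4_K block: the metadata and
// quants of four consecutive q4_K blocks are stored side by side, so a
// kernel can stream them with interleaved 8-byte vector loads. The
// field layout is illustrative, not the PR's exact definition.
typedef struct {
    uint16_t d[4];                      // fp16 super-block scales, one per source block
    uint16_t dmin[4];                   // fp16 super-block mins, one per source block
    uint8_t  scales[4 * K_SCALE_SIZE];  // sub-block scale/min bytes of all four blocks
    uint8_t  qs[4 * QK_K / 2];          // 4-bit quants, interleaved in 8-byte groups
} block_q4_Kx4;
```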

Test environment

  • Server: Neoverse-N2
  • System_info: n_threads = 64 (n_threads_batch = 64) / 256 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
  • Models: two models of different scales

| model | storage size | param size | quant |
|---|---|---|---|
| meta-llama-3-8b-instruct.Q4_K_M.gguf | 4.6G | 8.03B | Q4_K_M |
| DeepSeek-V3-Q4_k_M.gguf | 377G | 671B | Q4_K_M |

Bench results

Good gains were observed with this PR for both S_PP (prompt processing) and S_TG (token generation), with S_TG at batch size 1 essentially unchanged:

(1) meta-llama-3-8b-instruct.Q4_K_M.gguf

./bin/llama-batched-bench -m /mnt/models/meta-llama-3-8b-instruct.Q4_K_M.gguf -c 8192 -b 2048 -ub 512  -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 64 --no-mmap
| B | S_PP t/s (original) | S_PP t/s (this PR) | S_PP speedup | S_TG t/s (original) | S_TG t/s (this PR) | S_TG speedup |
|---|---|---|---|---|---|---|
| 1 | 168.99 | 258.42 | 152.9% | 36.34 | 35.91 | 98.8% |
| 4 | 178.88 | 273.85 | 153.1% | 76.84 | 95.93 | 124.8% |
| 8 | 180.94 | 280.88 | 155.2% | 102.88 | 125.94 | 122.4% |
| 16 | 180.77 | 280.69 | 155.3% | 127.70 | 174.44 | 136.6% |
| 32 | 180.65 | 280.71 | 155.4% | 139.46 | 194.32 | 139.3% |
| geomean | | | 154.4% | | | 123.5% |

(2) DeepSeek-V3-Q4_k_M.gguf

./bin/llama-batched-bench -m /mnt/models/DeepSeek-V3-Q4_k_M.gguf -c 8192 -b 2048 -ub 512  -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 64 --no-mmap
| B | S_PP t/s (original) | S_PP t/s (this PR) | S_PP speedup | S_TG t/s (original) | S_TG t/s (this PR) | S_TG speedup |
|---|---|---|---|---|---|---|
| 1 | 24.17 | 30.13 | 124.7% | 6.52 | 6.46 | 99.1% |
| 4 | 25.36 | 33.13 | 130.6% | 12.18 | 12.65 | 103.9% |
| 8 | 25.43 | 33.15 | 130.4% | 14.85 | 15.41 | 103.8% |
| 16 | 25.41 | 33.12 | 130.3% | 16.76 | 17.72 | 105.7% |
| 32 | 25.40 | 33.10 | 130.3% | 18.19 | 19.82 | 109.0% |
| geomean | | | 129.2% | | | 104.2% |

Perplexity

(1) meta-llama-3-8b-instruct.Q4_K_M.gguf

| model | perplexity (final estimate PPL) | commit id |
|---|---|---|
| original | 3.7533 +/- 0.14294 | 77dee9d |
| this PR | 3.7589 +/- 0.14312 | 543e8eb |

(2) DeepSeek-V3-Q4_k_M.gguf

| model | perplexity (final estimate PPL) | commit id |
|---|---|---|
| original | 1.0396 +/- 0.00654 | 77dee9d |
| this PR | 1.0370 +/- 0.00611 | 543e8eb |

Reference

  1. Similar repack patch for q4_K on x86: Block interleaving support for Q4_K quantization for x86 AVX2 architecture, ggml-org/llama.cpp#12332
    Note: the x86 patch shares the block_q8_Kx4 structure with this patch, but the detailed layout is different.
  2. Similar repack idea for q4_0 on Arm: Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0 and q8_0_q8_0 quantization, ggml-org/llama.cpp#5780

hongyang-7 and others added 6 commits December 18, 2025 14:14
* new quant type: block_q4_Kx4 with offline repack impl

* new quantize path: NEON impl for ggml_quantize_mat_q8_K_4x8

* new gemv kernel: ggml_gemv_q4_K_4x8_q8_K based on dotprod

* new gemm kernel: ggml_gemm_q4_K_4x8_q8_K based on i8mm

* performance boost for both S_PP and S_TG

---------

Co-authored-by: yuanjia111 <yuan.jia@sanechips.com.cn>
@loci-review

loci-review bot commented Dec 18, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #612

Overview

This PR introduces Q4_K block repacking optimization for AArch64 architecture using NEON SIMD instructions. The implementation adds a new 4x8 interleaving pattern (block_q4_Kx4) with three new kernel functions: ggml_quantize_mat_q8_K_4x8, ggml_gemv_q4_K_4x8_q8_K, and ggml_gemm_q4_K_4x8_q8_K. Changes span 4 files with 936 additions and 17 deletions, primarily affecting the GGML CPU backend's quantization and matrix multiplication paths.

Key Findings

Performance-Critical Function Changes

Matrix Quantization (ggml_quantize_mat_q8_K_4x8)

  • Self-execution time reduced by 2085 ns (from 2150 ns to 65 ns)
  • Response time increased by 79 ns (from 2332 ns to 2411 ns)
  • The function now delegates quantization work to specialized NEON kernels, reducing self-execution time while slightly increasing total path time due to function call overhead (see the dispatch sketch below)
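
A minimal sketch of that delegation pattern, assuming placeholder kernel names (the *_neon/*_generic symbols below are illustrative, not the PR's exact functions):

```c
#include <stdint.h>

// Illustrative kernel prototypes (placeholder names).
void ggml_quantize_mat_q8_K_4x8_neon   (const float * x, void * vy, int64_t k);
void ggml_quantize_mat_q8_K_4x8_generic(const float * x, void * vy, int64_t k);

// The entry point forwards to a NEON kernel on AArch64 and to a scalar
// path elsewhere; its own self-time drops while the total path gains a
// small function-call overhead, matching the numbers above.
void ggml_quantize_mat_q8_K_4x8(const float * x, void * vy, int64_t k) {
#if defined(__ARM_NEON)
    ggml_quantize_mat_q8_K_4x8_neon(x, vy, k);
#else
    ggml_quantize_mat_q8_K_4x8_generic(x, vy, k);
#endif
}
```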

Matrix-Vector Multiply (ggml_gemv_q4_K_4x8_q8_K)

  • New NEON-optimized implementation using DOTPROD instructions
  • Processes 4 columns simultaneously with 8-byte interleaving
  • Replaces the generic scalar implementation with vectorized operations (see the DOTPROD sketch below)
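
A minimal sketch of the DOTPROD building block this kernel is built around (pointer arguments are illustrative; build for AArch64 with, e.g., -march=armv8.2-a+dotprod):

```c
#include <arm_neon.h>

// vdotq_s32 computes four independent 4-element int8 dot products and
// accumulates them into the four int32 lanes of acc; the GEMV kernel
// repeats this step across the 8-byte-interleaved q4 nibbles (expanded
// to int8) and the q8 values.
static inline int32x4_t dot_step(int32x4_t acc, const int8_t * a, const int8_t * b) {
    int8x16_t va = vld1q_s8(a);   // 16 int8 values from the q4 side
    int8x16_t vb = vld1q_s8(b);   // 16 int8 values from the q8 side
    return vdotq_s32(acc, va, vb);
}
```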

Matrix-Matrix Multiply (ggml_gemm_q4_K_4x8_q8_K)

  • New NEON-optimized implementation using MATMUL_INT8 extension
  • Processes 16 rows × 4 columns simultaneously
  • Utilizes the vmmlaq_s32 intrinsic for 2×8 by 8×2 int8 matrix multiply operations (see the sketch below)
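
And a matching sketch of the MATMUL_INT8 building block (pointers are illustrative; build with, e.g., -march=armv8.6-a+i8mm):

```c
#include <arm_neon.h>

// vmmlaq_s32 multiplies a 2x8 int8 matrix (va) by an 8x2 int8 matrix
// (vb) and accumulates the 2x2 int32 product into acc; the GEMM kernel
// tiles this step to cover 16 rows x 4 columns per iteration.
static inline int32x4_t mmla_step(int32x4_t acc, const int8_t * a, const int8_t * b) {
    int8x16_t va = vld1q_s8(a);   // two rows of 8 int8 values (row-major)
    int8x16_t vb = vld1q_s8(b);   // two columns of 8 int8 values
    return vmmlaq_s32(acc, va, vb);
}
```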

Impact on Inference Performance

The modified functions operate at the quantization and low-level matrix operation layer, not the tokenization or inference orchestration layer. Functions like llama_decode, llama_encode, and llama_tokenize remain unchanged in this PR. Therefore, tokens per second metrics are not directly impacted by these changes. The optimizations affect the underlying computational kernels that these higher-level functions call, potentially improving their execution efficiency on AArch64 hardware with NEON support, but the inference API layer itself is unmodified.

Power Consumption Analysis

Binary: libggml-cpu.so

  • Power consumption increased by 2517 nJ (from 119985 nJ to 122502 nJ, +2.10%)
  • The increase stems from cumulative overhead in parameter-setting utility functions that are called frequently during tensor operations
  • All other binaries (libllama.so, libggml-base.so, llama-run, etc.) show no measurable power consumption change

The power increase is concentrated in the CPU backend library where the new repacking logic executes. The 2.10% increase represents the energy cost of pre-processing (repacking) operations that enable faster runtime execution on NEON-capable hardware.
