
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#15719

This PR improves the q4_K_q8_K kernel with block repacking support for the AArch64 architecture, based on NEON.

The following structures and functions are implemented (a layout sketch follows the list):

  • new quant type: block_q4_Kx4, built from four q4_K blocks, along with an offline repacking function
  • new quantize path: NEON implementation for block_q8_Kx4 in ggml_quantize_mat_q8_K_4x8()
  • new GEMV kernel: ggml_gemv_q4_K_4x8_q8_K(), a NEON kernel for the GGML_OP_MUL_MAT_ID/GGML_OP_MUL_MAT ops
  • new GEMM kernel: ggml_gemm_q4_K_4x8_q8_K(), a NEON kernel for the GGML_OP_MUL_MAT_ID/GGML_OP_MUL_MAT ops
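
As a rough illustration of the repacked layout, here is a minimal sketch of what a four-way interleaved q4_K block could look like. Field sizes follow the standard block_q4_K definition (QK_K = 256, 12 packed scale bytes per block); the exact field order and interleaving granularity in this PR may differ:

```c
#include <stdint.h>

#define QK_K         256   // elements per q4_K super-block
#define K_SCALE_SIZE 12    // packed 6-bit sub-block scales/mins per block

// Hedged sketch of a four-way repacked q4_K block: the metadata and
// quants of four consecutive q4_K blocks are stored side by side, so a
// kernel can stream them with interleaved 8-byte vector loads. The
// field layout is illustrative, not the PR's exact definition.
typedef struct {
    uint16_t d[4];                      // fp16 super-block scales, one per source block
    uint16_t dmin[4];                   // fp16 super-block mins, one per source block
    uint8_t  scales[4 * K_SCALE_SIZE];  // sub-block scale/min bytes of all four blocks
    uint8_t  qs[4 * QK_K / 2];          // 4-bit quants, interleaved in 8-byte groups
} block_q4_Kx4;
```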

Test environment

  • Server: Neoverse-N2
  • System_info: n_threads = 64 (n_threads_batch = 64) / 256 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | REPACK = 1 |
  • Models: two models of different scales

| model | storage size | param size | quant |
|---|---|---|---|
| meta-llama-3-8b-instruct.Q4_K_M.gguf | 4.6G | 8.03B | Q4_K_M |
| DeepSeek-V3-Q4_k_M.gguf | 377G | 671B | Q4_K_M |

Bench results

Good gains were observed with this PR for both S_PP (prompt processing) and S_TG (token generation), with S_TG at batch size 1 essentially unchanged:

(1) meta-llama-3-8b-instruct.Q4_K_M.gguf

./bin/llama-batched-bench -m /mnt/models/meta-llama-3-8b-instruct.Q4_K_M.gguf -c 8192 -b 2048 -ub 512  -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 64 --no-mmap
| B | S_PP t/s (original) | S_PP t/s (this PR) | S_PP speedup | S_TG t/s (original) | S_TG t/s (this PR) | S_TG speedup |
|---|---|---|---|---|---|---|
| 1 | 168.99 | 258.42 | 152.9% | 36.34 | 35.91 | 98.8% |
| 4 | 178.88 | 273.85 | 153.1% | 76.84 | 95.93 | 124.8% |
| 8 | 180.94 | 280.88 | 155.2% | 102.88 | 125.94 | 122.4% |
| 16 | 180.77 | 280.69 | 155.3% | 127.70 | 174.44 | 136.6% |
| 32 | 180.65 | 280.71 | 155.4% | 139.46 | 194.32 | 139.3% |
| geomean | | | 154.4% | | | 123.5% |

(2) DeepSeek-V3-Q4_k_M.gguf

./bin/llama-batched-bench -m /mnt/models/DeepSeek-V3-Q4_k_M.gguf -c 8192 -b 2048 -ub 512  -npp 128 -ntg 128 -npl 1,4,8,16,32 -t 64 --no-mmap
| B | S_PP t/s (original) | S_PP t/s (this PR) | S_PP speedup | S_TG t/s (original) | S_TG t/s (this PR) | S_TG speedup |
|---|---|---|---|---|---|---|
| 1 | 24.17 | 30.13 | 124.7% | 6.52 | 6.46 | 99.1% |
| 4 | 25.36 | 33.13 | 130.6% | 12.18 | 12.65 | 103.9% |
| 8 | 25.43 | 33.15 | 130.4% | 14.85 | 15.41 | 103.8% |
| 16 | 25.41 | 33.12 | 130.3% | 16.76 | 17.72 | 105.7% |
| 32 | 25.40 | 33.10 | 130.3% | 18.19 | 19.82 | 109.0% |
| geomean | | | 129.2% | | | 104.2% |

Perplexity

(1) meta-llama-3-8b-instruct.Q4_K_M.gguf

| model | perplexity (final estimate PPL) | commit id |
|---|---|---|
| original | 3.7533 +/- 0.14294 | 77dee9d |
| this PR | 3.7589 +/- 0.14312 | 543e8eb |

(2) DeepSeek-V3-Q4_k_M.gguf

| model | perplexity (final estimate PPL) | commit id |
|---|---|---|
| original | 1.0396 +/- 0.00654 | 77dee9d |
| this PR | 1.0370 +/- 0.00611 | 543e8eb |

Reference

  1. Similar repack patch for q4_K on x86: Block interleaving support for Q4_K quantization for x86 AVX2 architecture, ggml-org/llama.cpp#12332
    Note: the x86 patch shares the block_q8_Kx4 structure with this patch, but the detailed layout is different.
  2. Similar repack idea for q4_0 on Arm: Arm AArch64: optimized GEMV and GEMM kernels for q4_0_q8_0 and q8_0_q8_0 quantization, ggml-org/llama.cpp#5780

hongyang-7 and others added 6 commits December 18, 2025 14:14
* new quant type: block_q4_Kx4 with offline repack impl

* new quantize path: NEON impl for ggml_quantize_mat_q8_K_4x8

* new gemv kernel: ggml_gemv_q4_K_4x8_q8_K based on dotprod

* new gemm kernel: ggml_gemm_q4_K_4x8_q8_K based on i8mm

* performance boost for both S_PP and S_TG

---------

Co-authored-by: yuanjia111 <yuan.jia@sanechips.com.cn>
@loci-review

loci-review bot commented Dec 18, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #612

Overview

This PR introduces Q4_K block repacking optimization for AArch64 architecture using NEON SIMD instructions. The implementation adds a new 4x8 interleaving pattern (block_q4_Kx4) with three new kernel functions: ggml_quantize_mat_q8_K_4x8, ggml_gemv_q4_K_4x8_q8_K, and ggml_gemm_q4_K_4x8_q8_K. Changes span 4 files with 936 additions and 17 deletions, primarily affecting the GGML CPU backend's quantization and matrix multiplication paths.

Key Findings

Performance-Critical Function Changes

Matrix Quantization (ggml_quantize_mat_q8_K_4x8)

  • Self-execution time reduced by 2085 ns (from 2150 ns to 65 ns)
  • Response time increased by 79 ns (from 2332 ns to 2411 ns)
  • The function now delegates quantization work to specialized NEON kernels, reducing self-execution time while slightly increasing total path time due to function call overhead (see the dispatch sketch below)
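
A minimal sketch of that delegation pattern, assuming placeholder kernel names (the *_neon/*_generic symbols below are illustrative, not the PR's exact functions):

```c
#include <stdint.h>

// Illustrative kernel prototypes (placeholder names).
void ggml_quantize_mat_q8_K_4x8_neon   (const float * x, void * vy, int64_t k);
void ggml_quantize_mat_q8_K_4x8_generic(const float * x, void * vy, int64_t k);

// The entry point forwards to a NEON kernel on AArch64 and to a scalar
// path elsewhere; its own self-time drops while the total path gains a
// small function-call overhead, matching the numbers above.
void ggml_quantize_mat_q8_K_4x8(const float * x, void * vy, int64_t k) {
#if defined(__ARM_NEON)
    ggml_quantize_mat_q8_K_4x8_neon(x, vy, k);
#else
    ggml_quantize_mat_q8_K_4x8_generic(x, vy, k);
#endif
}
```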

Matrix-Vector Multiply (ggml_gemv_q4_K_4x8_q8_K)

  • New NEON-optimized implementation using DOTPROD instructions
  • Processes 4 columns simultaneously with 8-byte interleaving
  • Replaces the generic scalar implementation with vectorized operations (see the DOTPROD sketch below)
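
A minimal sketch of the DOTPROD building block this kernel is built around (pointer arguments are illustrative; build for AArch64 with, e.g., -march=armv8.2-a+dotprod):

```c
#include <arm_neon.h>

// vdotq_s32 computes four independent 4-element int8 dot products and
// accumulates them into the four int32 lanes of acc; the GEMV kernel
// repeats this step across the 8-byte-interleaved q4 nibbles (expanded
// to int8) and the q8 values.
static inline int32x4_t dot_step(int32x4_t acc, const int8_t * a, const int8_t * b) {
    int8x16_t va = vld1q_s8(a);   // 16 int8 values from the q4 side
    int8x16_t vb = vld1q_s8(b);   // 16 int8 values from the q8 side
    return vdotq_s32(acc, va, vb);
}
```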

Matrix-Matrix Multiply (ggml_gemm_q4_K_4x8_q8_K)

  • New NEON-optimized implementation using MATMUL_INT8 extension
  • Processes 16 rows × 4 columns simultaneously
  • Utilizes the vmmlaq_s32 intrinsic for 2×8 by 8×2 int8 matrix multiply operations (see the sketch below)
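
And a matching sketch of the MATMUL_INT8 building block (pointers are illustrative; build with, e.g., -march=armv8.6-a+i8mm):

```c
#include <arm_neon.h>

// vmmlaq_s32 multiplies a 2x8 int8 matrix (va) by an 8x2 int8 matrix
// (vb) and accumulates the 2x2 int32 product into acc; the GEMM kernel
// tiles this step to cover 16 rows x 4 columns per iteration.
static inline int32x4_t mmla_step(int32x4_t acc, const int8_t * a, const int8_t * b) {
    int8x16_t va = vld1q_s8(a);   // two rows of 8 int8 values (row-major)
    int8x16_t vb = vld1q_s8(b);   // two columns of 8 int8 values
    return vmmlaq_s32(acc, va, vb);
}
```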

Impact on Inference Performance

The modified functions operate at the quantization and low-level matrix operation layer, not the tokenization or inference orchestration layer. Functions like llama_decode, llama_encode, and llama_tokenize remain unchanged in this PR. Therefore, tokens per second metrics are not directly impacted by these changes. The optimizations affect the underlying computational kernels that these higher-level functions call, potentially improving their execution efficiency on AArch64 hardware with NEON support, but the inference API layer itself is unmodified.

Power Consumption Analysis

Binary: libggml-cpu.so

  • Power consumption increased by 2517 nJ (from 119985 nJ to 122502 nJ, +2.10%)
  • The increase stems from cumulative overhead in parameter-setting utility functions that are called frequently during tensor operations
  • All other binaries (libllama.so, libggml-base.so, llama-run, etc.) show no measurable power consumption change

The power increase is concentrated in the CPU backend library where the new repacking logic executes. The 2.10% increase represents the energy cost of pre-processing (repacking) operations that enable faster runtime execution on NEON-capable hardware.
