@loci-dev
Mirrored from ggml-org/llama.cpp#18096

This is similar work to #17494 and #16739.
The primary motivation for the q8_0 repack is to improve prefill/prompt-processing performance, which can be useful for, e.g., audio models.
Token generation/decode is mixed: slightly better on M4 and slightly worse on RPi5, so it is probably device dependent.

This PR implements:

  • ARM-specific GEMV and GEMM kernels for q8_0 (DOTPROD and I8MM)
  • A generic fallback implementation
  • Repack functions to prepare weight matrices in interleaved format (see the sketch after this list)
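The interleaved layout groups four consecutive rows of q8_0 blocks so the GEMM kernels can stream all four rows with sequential loads. A minimal sketch, modeled on the existing q4_0 repack helpers in ggml (names and layout here are illustrative, not necessarily the PR's literal code):

```c
#include <stdint.h>
#include <string.h>

#define QK8_0 32
typedef uint16_t ggml_half;  // fp16 bits, treated opaquely here

typedef struct { ggml_half d;    int8_t qs[QK8_0];     } block_q8_0;
typedef struct { ggml_half d[4]; int8_t qs[QK8_0 * 4]; } block_q8_0x4;

// blck_size_interleave: 4 bytes for the DOTPROD path, 8 for I8MM.
static block_q8_0x4 make_block_q8_0x4(const block_q8_0 *in,
                                      unsigned int blck_size_interleave) {
    block_q8_0x4 out;
    for (int i = 0; i < 4; i++) {
        out.d[i] = in[i].d;  // keep each source row's scale
    }
    const int end = QK8_0 * 4 / blck_size_interleave;
    for (int i = 0; i < end; i++) {
        // round-robin over the 4 source rows, one interleave chunk at a time
        int src_id     = i % 4;
        int src_offset = (i / 4) * blck_size_interleave;
        int dst_offset = i * blck_size_interleave;
        memcpy(&out.qs[dst_offset], &in[src_id].qs[src_offset],
               blck_size_interleave);
    }
    return out;
}
```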

M4 Max - I8MM path (GGML_NATIVE=ON native+dotprod+i8mm+nosve+sme)

| Model | Test | t/s (repack) | t/s (no-repack) | Speedup |
|---|---|---|---|---|
| lfm2 1.2B Q8_0 | pp256 | 1085.44 | 303.67 | 3.57x |
| lfm2 1.2B Q8_0 | tg128 | 177.58 | 166.51 | 1.07x |
| lfm2 700M Q8_0 | pp256 | 1732.73 | 495.01 | 3.50x |
| lfm2 700M Q8_0 | tg128 | 262.94 | 247.08 | 1.06x |
| qwen3 8B Q8_0 | pp256 | 160.03 | 43.03 | 3.72x |
| qwen3 8B Q8_0 | tg128 | 28.75 | 28.71 | 1.00x |

build: 26ef677 (7365)

RPi5 - DOTPROD path (GGML_NATIVE=ON cortex-a76+crypto+dotprod+noi8mm+nosve)

| Model | Test | t/s (repack) | t/s (no-repack) | Speedup |
|---|---|---|---|---|
| lfm2 350M Q8_0 | pp256 | 360.45 | 182.33 | 1.98x |
| lfm2 350M Q8_0 | tg128 | 29.77 | 30.15 | 0.99x |
| lfm2 700M Q8_0 | pp256 | 105.62 | 85.60 | 1.23x |
| lfm2 700M Q8_0 | tg128 | 14.51 | 14.96 | 0.97x |

Perplexity

Perplexity differs slightly across the generic implementations, but I didn't find any algorithmic differences except the order of operations. The generic I8MM path (4x8) gets better perplexity than the original implementation, while the generic DOTPROD path (4x4) is slightly worse. Repack and no-repack give essentially the same values. (A toy illustration of the order-of-operations effect follows the tables below.)

| Model | i8mm Repack PPL | i8mm No-repack PPL | i8mm Generic PPL |
|---|---|---|---|
| Llama 3.1 8B | 8.6577 | 8.6645 | 8.6645 |
| Qwen3 8B | 11.1794 | 11.1909 | 10.8830 |
| LFM2 1.2B | 16.7665 | 16.7623 | 15.1945 |

| Model | dotprod Repack PPL | dotprod No-repack PPL | dotprod Generic PPL |
|---|---|---|---|
| Llama 3.1 8B | 8.2089 | 8.2050 | 8.6645 |
| Qwen3 8B | 10.8830 | 10.8729 | 11.1909 |
| LFM2 1.2B | 15.1945 | 15.1895 | 16.7623 |
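To make the order-of-operations point concrete, here is a toy C program (not from the PR) showing that float accumulation is not associative: a four-accumulator loop, like a 4-wide kernel would keep, rounds differently from a sequential loop on the same data.

```c
#include <stdio.h>

int main(void) {
    float v[8] = {1e8f, -1e8f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};

    float seq = 0.0f;             // sequential order: the large terms cancel
    for (int i = 0; i < 8; i++) seq += v[i];

    float acc[4] = {0, 0, 0, 0};  // four interleaved accumulators
    for (int i = 0; i < 8; i++) acc[i % 4] += v[i];
    float lanes = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    printf("sequential=%g lanes=%g\n", seq, lanes);  // prints 6 vs 4
    return 0;
}
```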

Decode

Following the advice from previous PRs, I checked the outputs of the GEMVs; all are under the threshold defined in test-backend-ops (the same one used here: ggml-org/llama.cpp#17494 (comment)), although there are some individual weight differences. Looking directly at llama-completion and llama-server, I didn't find artifacts or bad answers.
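For reference, a sketch of the kind of check involved: test-backend-ops compares kernel output to a reference matmul via normalized mean squared error (the threshold value below is illustrative, not quoted from the test).

```c
#include <stddef.h>

static double nmse(const float *ref, const float *out, size_t n) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double) out[i] - (double) ref[i];
        err  += d * d;                              // squared error
        norm += (double) ref[i] * (double) ref[i];  // reference energy
    }
    return err / norm;  // accept if this stays below a small bound, e.g. 5e-4
}
```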

CI passes locally when routing through the generic path.


Though these are not exactly rigorous tests:

Some llama-cli queries

NO REPACK

[llama.cpp ASCII banner]

build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text

available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file

Give me a sentence about Paris

Paris, the enchanting city of lights, is renowned for its iconic landmarks such as the Eiffel Tower and the historic Louvre Museum, attracting millions of visitors from around the globe each year to experience its timeless charm, rich history, and vibrant culture.

[ Prompt: 356.3 t/s | Generation: 140.9 t/s ]

=================================================

i8mm

[llama.cpp ASCII banner]

build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text

available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file

Give me a sentence about Paris

Paris, the City of Light, is renowned for its stunning architecture, artistic heritage, and romantic ambiance, making it a quintessential destination for travelers worldwide.

[ Prompt: 551.6 t/s | Generation: 118.8 t/s ]

Paris is the capital of?

Yes, Paris is indeed the capital of France. It serves as the political, cultural, and economic heart of the country.

[ Prompt: 562.1 t/s | Generation: 131.3 t/s ]

===============================================================

DOTPROD

[llama.cpp ASCII banner]

build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text

available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file

Give me a sentence about Paris

Paris, the City of Light, is renowned globally for its rich history, stunning architecture, and cultural treasures like the Eiffel Tower and the Louvre Museum.

[ Prompt: 540.8 t/s | Generation: 119.9 t/s ]

Paris is the capital of?

Paris is the capital of France.

[ Prompt: 552.4 t/s | Generation: 133.3 t/s ]

===============================================================

Generic

[llama.cpp ASCII banner]

build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text

available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file

Give me a sentence about Paris

Paris, the enchanting capital city of France, is renowned for its timeless charm, iconic landmarks such as the Eiffel Tower, and romantic atmosphere that attracts millions of visitors each year.

[ Prompt: 139.7 t/s | Generation: 240.1 t/s ]

@loci-review

loci-review bot commented Dec 16, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #589

Overview

PR #589 introduces ARM64-optimized Q8_0 quantized matrix operations using tensor repacking with NEON DOTPROD and I8MM instructions. The implementation adds 599 lines across 4 files, implementing specialized GEMV and GEMM kernels for Q8_0 quantization format. Analysis shows mixed performance characteristics with significant improvements in specific ARM64 operations but regressions in utility functions.

Key Findings

Performance-Critical Functions Impact

Matrix Operations (Performance-Critical Area):

The new Q8_0 repack implementation targets ggml_mul_mat() operations through specialized kernels. While the PR adds ARM64-specific optimizations, the performance analysis reveals regressions in related infrastructure functions:

  • ggml_repack_get_optimal_repack_type: Response time increased 317 ns (1694 ns → 2011 ns). This function determines the optimal tensor repacking strategy and shows increased complexity from the additional Q8_0 format checks and validation logic. The 91% callee overhead indicates the regression stems from tensor property queries rather than core logic. (A simplified sketch of this dispatch follows the list.)

  • ggml_gemv_q4_0_4x8_q8_0: Response time increased 27 ns (146 ns → 173 ns). This ARM-specific quantized matrix-vector kernel shows a modest regression despite being in the same optimization family as the new Q8_0 kernels.
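A hypothetical, simplified version of the extra dispatch branch (names here are illustrative, not the PR's literal code): the interleaved layout is chosen from runtime CPU features, falling back to no repack.

```c
enum q8_0_repack_layout { REPACK_NONE, REPACK_Q8_0_4X4, REPACK_Q8_0_4X8 };

static enum q8_0_repack_layout
pick_q8_0_repack(int has_i8mm, int has_dotprod, long long n_rows) {
    if (n_rows % 4 != 0) return REPACK_NONE;  // need whole 4-row groups
    if (has_i8mm)    return REPACK_Q8_0_4X8;  // 8-byte interleave for I8MM
    if (has_dotprod) return REPACK_Q8_0_4X4;  // 4-byte interleave for DOTPROD
    return REPACK_NONE;                       // generic path, no repack
}
```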

Parameter Infrastructure Functions:

Multiple parameter-setting functions show consistent regressions:

  • ggml_set_op_params_i32 variants: Response time increased 13-20 ns across implementations (traits.cpp: +13 ns, binary-ops.cpp: +13 ns)
  • ggml_set_op_params_f32 variants: Response time increased 11-13 ns (binary-ops.cpp: +13 ns, mmq.cpp: +11 ns)
  • ggml_set_op_params: Response time increased 20 ns (128 ns → 148 ns)

These functions exhibit 25-65% callee overhead, suggesting shared infrastructure changes affecting parameter validation or storage logic. The consistent pattern across multiple implementations indicates systematic modification to parameter handling.

Optimized Functions:

  • ggml_vec_argmax_f32: Response time decreased 74 ns (381 ns → 307 ns). Improved SIMD vectorization reduces computation time for maximum element search operations.
  • quantize_row_q5_K: Response time decreased 12 ns (90 ns → 78 ns). Enhanced quantization kernel shows improved efficiency.

Inference Performance Impact

Token Generation Functions:

No direct modifications to llama_decode, llama_encode, or llama_tokenize functions. The PR targets lower-level GGML matrix operations that these functions call. Based on the reference metric (7% tokens/second reduction per 2 ms llama_decode slowdown), the observed nanosecond-level changes in matrix operation utilities would translate to negligible inference impact.

Estimated Impact:
The cumulative effect of parameter-setting regressions (approximately 60-80 ns per operation across multiple calls) and the 317 ns increase in repack type selection would contribute sub-millisecond overhead per inference pass. For typical inference workloads calling these functions thousands of times, the aggregate impact remains under 0.1 ms per token, translating to less than 0.35% tokens/second reduction.
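As a sanity check on that scaling, treating the reference metric as linear: (0.1 ms / 2 ms) × 7% = 0.35%, which is where the stated upper bound comes from.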

Affected Functions:

  • ggml_repack_get_optimal_repack_type (called during model loading and tensor preparation)
  • ggml_set_op_params_* family (called during operation graph construction)
  • Matrix multiplication kernels (called during forward pass computation)

Power Consumption Analysis

Binary-Level Impact:

Power consumption analysis shows a 3.09% increase in libggml-cpu.so (116,386 nJ → 119,985 nJ, +3,599 nJ). This binary contains the core GGML CPU operations including the modified parameter-setting infrastructure and repack logic.

Breakdown:

  • The ~3,600 nJ increase correlates with the cumulative throughput increases across the parameter-setting functions and the repack type selection function
  • The power consumption model attributes this to increased instruction count and execution time in frequently called utility functions
  • All other binaries (libggml-base.so, libggml.so, libllama.so, executables) show no measurable power consumption changes

Context:
The power increase is confined to CPU backend operations and represents the energy cost of additional validation and type dispatch logic introduced for Q8_0 repack support. The 3.09% increase in a single binary translates to minimal overall system power impact given that libggml-cpu.so represents one component of the inference pipeline.

Code Change Analysis

Implementation Scope:

The PR implements a well-structured optimization following established patterns from previous quantization repack work (Q4_0, Q4_K, IQ4_NL). Key additions include:

  • 4 generic fallback implementations for cross-platform compatibility
  • 4 ARM64 NEON-optimized kernels using DOTPROD and I8MM instructions (the DOTPROD inner step is sketched after this list)
  • Tensor repacking infrastructure (make_block_q8_0x4, repack_q8_0_to_q8_0_4_bl)
  • Runtime dispatch logic in ggml_repack_get_optimal_repack_type
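As a hedged sketch of the DOTPROD inner step (not the PR's exact kernel): vdotq_s32 accumulates four 4-way int8 dot products per lane, so one pair of q8_0 blocks needs only two such instructions plus a horizontal sum.

```c
#include <arm_neon.h>  // requires __ARM_FEATURE_DOTPROD

static float q8_0_block_dot(const int8_t x[32], const int8_t y[32],
                            float dx, float dy) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < 32; i += 16) {  // QK8_0 = 32 quants per block
        int8x16_t vx = vld1q_s8(x + i);
        int8x16_t vy = vld1q_s8(y + i);
        acc = vdotq_s32(acc, vx, vy);   // 16 int8 MACs per instruction
    }
    return dx * dy * (float) vaddvq_s32(acc);  // apply both block scales
}
```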

Performance Trade-offs:

The implementation prioritizes ARM64 prompt processing performance (2-3.7x speedup reported in PR description) at the cost of increased complexity in repack type selection. The 317 ns regression in ggml_repack_get_optimal_repack_type reflects additional branching for Q8_0 format detection and tensor dimension validation. This one-time cost during model loading is amortized across subsequent inference operations.

The parameter-setting function regressions (13-20 ns) appear to stem from shared infrastructure modifications, possibly related to validation logic or parameter buffer handling that affects all quantization formats, not just Q8_0.

@loci-dev force-pushed the main branch 20 times, most recently from eda9f43 to 26e5d36 on December 18, 2025 at 11:09
@loci-dev force-pushed the main branch 30 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 at 13:32