@loci-dev
Mirrored from ggml-org/llama.cpp#18096

This is similar work to #17494 and #16739.
The primary motivation for the q8_0 repack is to improve prefill/prompt-processing performance, which can be useful for, e.g., audio models.
Token generation/decode is mixed: slightly better on M4 and slightly worse on RPi5, so it is probably device dependent.

This PR implements:

  • ARM-specific GEMV and GEMM kernels for q8_0 (DOTPROD and I8MM)
  • A generic fallback implementation
  • Repack functions to prepare weight matrices in interleaved format (see the sketch after this list)
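The interleaved layout groups four consecutive rows of q8_0 blocks so the GEMM kernels can stream all four rows with sequential loads. A minimal sketch, modeled on the existing q4_0 repack helpers in ggml (names and layout here are illustrative, not necessarily the PR's literal code):

```c
#include <stdint.h>
#include <string.h>

#define QK8_0 32
typedef uint16_t ggml_half;  // fp16 bits, treated opaquely here

typedef struct { ggml_half d;    int8_t qs[QK8_0];     } block_q8_0;
typedef struct { ggml_half d[4]; int8_t qs[QK8_0 * 4]; } block_q8_0x4;

// blck_size_interleave: 4 bytes for the DOTPROD path, 8 for I8MM.
static block_q8_0x4 make_block_q8_0x4(const block_q8_0 *in,
                                      unsigned int blck_size_interleave) {
    block_q8_0x4 out;
    for (int i = 0; i < 4; i++) {
        out.d[i] = in[i].d;  // keep each source row's scale
    }
    const int end = QK8_0 * 4 / blck_size_interleave;
    for (int i = 0; i < end; i++) {
        // round-robin over the 4 source rows, one interleave chunk at a time
        int src_id     = i % 4;
        int src_offset = (i / 4) * blck_size_interleave;
        int dst_offset = i * blck_size_interleave;
        memcpy(&out.qs[dst_offset], &in[src_id].qs[src_offset],
               blck_size_interleave);
    }
    return out;
}
```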

M4 Max - I8MM path (GGML_NATIVE=ON native+dotprod+i8mm+nosve+sme)

| Model | Test | t/s (repack) | t/s (no-repack) | Speedup |
|---|---|---|---|---|
| lfm2 1.2B Q8_0 | pp256 | 1085.44 | 303.67 | 3.57x |
| lfm2 1.2B Q8_0 | tg128 | 177.58 | 166.51 | 1.07x |
| lfm2 700M Q8_0 | pp256 | 1732.73 | 495.01 | 3.50x |
| lfm2 700M Q8_0 | tg128 | 262.94 | 247.08 | 1.06x |
| qwen3 8B Q8_0 | pp256 | 160.03 | 43.03 | 3.72x |
| qwen3 8B Q8_0 | tg128 | 28.75 | 28.71 | 1.00x |

build: 26ef677 (7365)

RPi5 - DOTPROD path (GGML_NATIVE=ON cortex-a76+crypto+dotprod+noi8mm+nosve)

| Model | Test | t/s (repack) | t/s (no-repack) | Speedup |
|---|---|---|---|---|
| lfm2 350M Q8_0 | pp256 | 360.45 | 182.33 | 1.98x |
| lfm2 350M Q8_0 | tg128 | 29.77 | 30.15 | 0.99x |
| lfm2 700M Q8_0 | pp256 | 105.62 | 85.60 | 1.23x |
| lfm2 700M Q8_0 | tg128 | 14.51 | 14.96 | 0.97x |

Perplexity

Perplexity differs slightly across the generic implementations, but I didn't find any algorithmic differences except the order of operations. The generic I8MM path (4x8) gets better perplexity than the original implementation, while the generic DOTPROD path (4x4) is slightly worse. Repack and no-repack give essentially the same values. (A toy illustration of the order-of-operations effect follows the tables below.)

| Model | i8mm Repack PPL | i8mm No-repack PPL | i8mm Generic PPL |
|---|---|---|---|
| Llama 3.1 8B | 8.6577 | 8.6645 | 8.6645 |
| Qwen3 8B | 11.1794 | 11.1909 | 10.8830 |
| LFM2 1.2B | 16.7665 | 16.7623 | 15.1945 |

| Model | dotprod Repack PPL | dotprod No-repack PPL | dotprod Generic PPL |
|---|---|---|---|
| Llama 3.1 8B | 8.2089 | 8.2050 | 8.6645 |
| Qwen3 8B | 10.8830 | 10.8729 | 11.1909 |
| LFM2 1.2B | 15.1945 | 15.1895 | 16.7623 |
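To make the order-of-operations point concrete, here is a toy C program (not from the PR) showing that float accumulation is not associative: a four-accumulator loop, like a 4-wide kernel would keep, rounds differently from a sequential loop on the same data.

```c
#include <stdio.h>

int main(void) {
    float v[8] = {1e8f, -1e8f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};

    float seq = 0.0f;             // sequential order: the large terms cancel
    for (int i = 0; i < 8; i++) seq += v[i];

    float acc[4] = {0, 0, 0, 0};  // four interleaved accumulators
    for (int i = 0; i < 8; i++) acc[i % 4] += v[i];
    float lanes = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    printf("sequential=%g lanes=%g\n", seq, lanes);  // prints 6 vs 4
    return 0;
}
```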

Decode

Following the advice from previous PRs, I checked the outputs of the GEMVs; all are under the threshold defined in test-backend-ops (the same one used here: ggml-org/llama.cpp#17494 (comment)), although there are some individual weight differences. Looking directly at llama-completion and llama-server, I didn't find artifacts or bad answers.
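For reference, a sketch of the kind of check involved: test-backend-ops compares kernel output to a reference matmul via normalized mean squared error (the threshold value below is illustrative, not quoted from the test).

```c
#include <stddef.h>

static double nmse(const float *ref, const float *out, size_t n) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double) out[i] - (double) ref[i];
        err  += d * d;                              // squared error
        norm += (double) ref[i] * (double) ref[i];  // reference energy
    }
    return err / norm;  // accept if this stays below a small bound, e.g. 5e-4
}
```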

CI passes locally when routing through the generic path.


Though these are not exactly rigorous tests:

Some llama-cli queries

NO REPACK

[llama.cpp ASCII banner]

build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text

available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file

Give me a sentence about Paris

Paris, the enchanting city of lights, is renowned for its iconic landmarks such as the Eiffel Tower and the historic Louvre Museum, attracting millions of visitors from around the globe each year to experience its timeless charm, rich history, and vibrant culture.

[ Prompt: 356.3 t/s | Generation: 140.9 t/s ]

=================================================

i8mm

[llama.cpp ASCII banner]

build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text

available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file

Give me a sentence about Paris

Paris, the City of Light, is renowned for its stunning architecture, artistic heritage, and romantic ambiance, making it a quintessential destination for travelers worldwide.

[ Prompt: 551.6 t/s | Generation: 118.8 t/s ]

Paris is the capital of?

Yes, Paris is indeed the capital of France. It serves as the political, cultural, and economic heart of the country.

[ Prompt: 562.1 t/s | Generation: 131.3 t/s ]

===============================================================

DOTPROD

[llama.cpp ASCII banner]

build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text

available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file

Give me a sentence about Paris

Paris, the City of Light, is renowned globally for its rich history, stunning architecture, and cultural treasures like the Eiffel Tower and the Louvre Museum.

[ Prompt: 540.8 t/s | Generation: 119.9 t/s ]

Paris is the capital of?

Paris is the capital of France.

[ Prompt: 552.4 t/s | Generation: 133.3 t/s ]

===============================================================

Generic

[llama.cpp ASCII banner]

build : b7365-26ef6770d
model : LiquidAI/LFM2-700M-GGUF:Q8_0
modalities : text

available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file

Give me a sentence about Paris

Paris, the enchanting capital city of France, is renowned for its timeless charm, iconic landmarks such as the Eiffel Tower, and romantic atmosphere that attracts millions of visitors each year.

[ Prompt: 139.7 t/s | Generation: 240.1 t/s ]

@loci-review

loci-review bot commented Dec 16, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #589

Overview

PR #589 introduces ARM64-optimized Q8_0 quantized matrix operations using tensor repacking with NEON DOTPROD and I8MM instructions. The implementation adds 599 lines across 4 files, implementing specialized GEMV and GEMM kernels for Q8_0 quantization format. Analysis shows mixed performance characteristics with significant improvements in specific ARM64 operations but regressions in utility functions.

Key Findings

Performance-Critical Functions Impact

Matrix Operations (Performance-Critical Area):

The new Q8_0 repack implementation targets ggml_mul_mat() operations through specialized kernels. While the PR adds ARM64-specific optimizations, the performance analysis reveals regressions in related infrastructure functions:

  • ggml_repack_get_optimal_repack_type: Response time increased 317 ns (1694 ns → 2011 ns). This function determines the optimal tensor repacking strategy and shows increased complexity from the additional Q8_0 format checks and validation logic. The 91% callee overhead indicates the regression stems from tensor property queries rather than core logic. (A simplified sketch of this dispatch follows the list.)

  • ggml_gemv_q4_0_4x8_q8_0: Response time increased 27 ns (146 ns → 173 ns). This ARM-specific quantized matrix-vector kernel shows a modest regression despite being in the same optimization family as the new Q8_0 kernels.
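A hypothetical, simplified version of the extra dispatch branch (names here are illustrative, not the PR's literal code): the interleaved layout is chosen from runtime CPU features, falling back to no repack.

```c
enum q8_0_repack_layout { REPACK_NONE, REPACK_Q8_0_4X4, REPACK_Q8_0_4X8 };

static enum q8_0_repack_layout
pick_q8_0_repack(int has_i8mm, int has_dotprod, long long n_rows) {
    if (n_rows % 4 != 0) return REPACK_NONE;  // need whole 4-row groups
    if (has_i8mm)    return REPACK_Q8_0_4X8;  // 8-byte interleave for I8MM
    if (has_dotprod) return REPACK_Q8_0_4X4;  // 4-byte interleave for DOTPROD
    return REPACK_NONE;                       // generic path, no repack
}
```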

Parameter Infrastructure Functions:

Multiple parameter-setting functions show consistent regressions:

  • ggml_set_op_params_i32 variants: Response time increased 13-20 ns across implementations (traits.cpp: +13 ns, binary-ops.cpp: +13 ns)
  • ggml_set_op_params_f32 variants: Response time increased 11-13 ns (binary-ops.cpp: +13 ns, mmq.cpp: +11 ns)
  • ggml_set_op_params: Response time increased 20 ns (128 ns → 148 ns)

These functions exhibit 25-65% callee overhead, suggesting shared infrastructure changes affecting parameter validation or storage logic. The consistent pattern across multiple implementations indicates systematic modification to parameter handling.

Optimized Functions:

  • ggml_vec_argmax_f32: Response time decreased 74 ns (381 ns → 307 ns). Improved SIMD vectorization reduces computation time for maximum element search operations.
  • quantize_row_q5_K: Response time decreased 12 ns (90 ns → 78 ns). Enhanced quantization kernel shows improved efficiency.

Inference Performance Impact

Token Generation Functions:

No direct modifications to llama_decode, llama_encode, or llama_tokenize functions. The PR targets lower-level GGML matrix operations that these functions call. Based on the reference metric (7% tokens/second reduction per 2 ms llama_decode slowdown), the observed nanosecond-level changes in matrix operation utilities would translate to negligible inference impact.

Estimated Impact:
The cumulative effect of parameter-setting regressions (approximately 60-80 ns per operation across multiple calls) and the 317 ns increase in repack type selection would contribute sub-millisecond overhead per inference pass. For typical inference workloads calling these functions thousands of times, the aggregate impact remains under 0.1 ms per token, translating to less than 0.35% tokens/second reduction.
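As a sanity check on that scaling, treating the reference metric as linear: (0.1 ms / 2 ms) × 7% = 0.35%, which is where the stated upper bound comes from.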

Affected Functions:

  • ggml_repack_get_optimal_repack_type (called during model loading and tensor preparation)
  • ggml_set_op_params_* family (called during operation graph construction)
  • Matrix multiplication kernels (called during forward pass computation)

Power Consumption Analysis

Binary-Level Impact:

Power consumption analysis shows a 3.09% increase in libggml-cpu.so (116,386 nJ → 119,985 nJ, +3,599 nJ). This binary contains the core GGML CPU operations including the modified parameter-setting infrastructure and repack logic.

Breakdown:

  • The ~3,600 nJ increase correlates with the cumulative throughput increases across the parameter-setting functions and the repack type selection function
  • The power consumption model attributes this to increased instruction count and execution time in frequently called utility functions
  • All other binaries (libggml-base.so, libggml.so, libllama.so, executables) show no measurable power consumption changes

Context:
The power increase is confined to CPU backend operations and represents the energy cost of additional validation and type dispatch logic introduced for Q8_0 repack support. The 3.09% increase in a single binary translates to minimal overall system power impact given that libggml-cpu.so represents one component of the inference pipeline.

Code Change Analysis

Implementation Scope:

The PR implements a well-structured optimization following established patterns from previous quantization repack work (Q4_0, Q4_K, IQ4_NL). Key additions include:

  • 4 generic fallback implementations for cross-platform compatibility
  • 4 ARM64 NEON-optimized kernels using DOTPROD and I8MM instructions (the DOTPROD inner step is sketched after this list)
  • Tensor repacking infrastructure (make_block_q8_0x4, repack_q8_0_to_q8_0_4_bl)
  • Runtime dispatch logic in ggml_repack_get_optimal_repack_type
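As a hedged sketch of the DOTPROD inner step (not the PR's exact kernel): vdotq_s32 accumulates four 4-way int8 dot products per lane, so one pair of q8_0 blocks needs only two such instructions plus a horizontal sum.

```c
#include <arm_neon.h>  // requires __ARM_FEATURE_DOTPROD

static float q8_0_block_dot(const int8_t x[32], const int8_t y[32],
                            float dx, float dy) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < 32; i += 16) {  // QK8_0 = 32 quants per block
        int8x16_t vx = vld1q_s8(x + i);
        int8x16_t vy = vld1q_s8(y + i);
        acc = vdotq_s32(acc, vx, vy);   // 16 int8 MACs per instruction
    }
    return dx * dy * (float) vaddvq_s32(acc);  // apply both block scales
}
```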

Performance Trade-offs:

The implementation prioritizes ARM64 prompt processing performance (2-3.7x speedup reported in PR description) at the cost of increased complexity in repack type selection. The 317 ns regression in ggml_repack_get_optimal_repack_type reflects additional branching for Q8_0 format detection and tensor dimension validation. This one-time cost during model loading is amortized across subsequent inference operations.

The parameter-setting function regressions (13-20 ns) appear to stem from shared infrastructure modifications, possibly related to validation logic or parameter buffer handling that affects all quantization formats, not just Q8_0.

@loci-dev force-pushed the main branch 20 times, most recently from eda9f43 to 26e5d36 on December 18, 2025 at 11:09
@loci-dev force-pushed the main branch 30 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 at 13:32