UPSTREAM PR #18860: ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm) by loci-dev · Pull Request #931 · auroralabs-loci/llama.cpp

loci-dev · 2026-01-15T11:36:32Z

This PR implements the REPACK version of Q5_K. It follows most of the existing Q4_K design, since Q5_K differs from Q4_K only in having the qh field, which holds the additional (fifth) bit.
Most of the logic is the same for both types, but I couldn't find a way to abstract the common patterns without creating a convoluted mess of functions. Since only Q4_K and Q5_K share the same 6-bit scales and mins decode, I opted to duplicate the code.
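
To illustrate the "additional bit" mentioned above, here is a minimal sketch of how a 5-bit quant can be rebuilt from a 4-bit low nibble plus one bit taken from a qh byte. The helper names (`q5_from_parts`, `q5_unpack`) and the bit positions used are hypothetical and chosen for illustration only; they do not reflect the actual interleaved layout or the kernels in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the PR): combine a 4-bit low nibble with the
// single extra bit stored in qh to form a 5-bit value in the range [0, 31].
inline uint8_t q5_from_parts(uint8_t low_nibble, uint8_t high_bit) {
    assert(low_nibble < 16 && high_bit < 2);
    return static_cast<uint8_t>(low_nibble | (high_bit << 4));
}

// Hypothetical helper: take nibble `idx` (0 = low, 1 = high) from a byte that
// packs two 4-bit quants, and use bit `bit_pos` of `qh` as the fifth bit.
inline uint8_t q5_unpack(uint8_t packed_ql, uint8_t qh, int idx, int bit_pos) {
    assert((idx == 0 || idx == 1) && bit_pos >= 0 && bit_pos < 8);
    const uint8_t lo = (idx == 0) ? (packed_ql & 0x0F) : (packed_ql >> 4);
    const uint8_t hi = (qh >> bit_pos) & 1;
    return q5_from_parts(lo, hi);
}
```

With the fifth bit cleared, the result is the same value a Q4_K-style nibble decode would give, which is why the two code paths end up looking so similar.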

I also moved around some declarations for Q2_K because the structure seemed weird (it's inverted relative to what I've seen in quants.c). The Q2_K function declarations were left where they were to avoid polluting the diff and messing up the blame. If you want me to revert it, just say so.

I've followed the testing practices from #17494 and #18096 and compared the GEMM and GEMV outputs of both the repack and generic implementations against the current vec_dot version with repack disabled.
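
The exact procedure from the referenced PRs is not reproduced here. As a rough sketch of what such a comparison can look like, the snippet below checks one output buffer against a reference buffer using the largest absolute difference. The function names and the tolerance parameter are assumptions made for illustration, not part of the ggml test code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest absolute element-wise difference between a
// reference result (e.g. the vec_dot path) and a candidate result.
inline float max_abs_diff(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        worst = std::max(worst, std::fabs(ref[i] - out[i]));
    }
    return worst;
}

// Hypothetical helper: true when every element is within `tol` of the reference.
inline bool outputs_match(const std::vector<float> & ref, const std::vector<float> & out, float tol) {
    return max_abs_diff(ref, out) <= tol;
}
```

The tolerance is left to the caller because the acceptable error depends on the quantization type and the accumulation order of each kernel.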

Performance

M4 Max (-DGGML_BLAS=OFF -DGGML_METAL=OFF)

model	threads	test	REPACKOFF t/s	REPACK t/s	speedup
qwen3 8B Q5_K - Medium	8	pp256	39.67	82.01	2.07x
qwen3 8B Q5_K - Medium	8	tg128	29.08	32.28	1.11x
llama 8B Q5_K - Medium	8	pp256	37.43	78.86	2.11x
llama 8B Q5_K - Medium	8	tg128	29.46	31.21	1.06x
lfm2 350M Q5_K - Medium	8	pp256	1011.04	1824.21	1.80x
lfm2 350M Q5_K - Medium	8	tg128	490.27	472.60	0.96x
lfm2 1.2B Q5_K - Medium	8	pp256	299.96	556.11	1.85x
lfm2 1.2B Q5_K - Medium	8	tg128	190.83	188.26	0.99x

Exynos 2400

model	threads	test	REPACKOFF t/s	REPACK t/s	speedup
lfm2 350M Q5_K - Medium	3	pp512	101.99	224.94	2.21x
lfm2 350M Q5_K - Medium	3	tg256	60.74	67.41	1.11x
lfm2 1.2B Q5_K - Medium	3	pp512	29.61	67.77	2.29x
lfm2 1.2B Q5_K - Medium	3	tg256	21.74	22.56	1.04x
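
The speedup column in both tables matches the ratio of the REPACK throughput to the REPACKOFF throughput, rounded to two decimals. A small sketch of that arithmetic (the function names are illustrative only):

```cpp
#include <cassert>
#include <cmath>

// Ratio of tokens per second with repack enabled to tokens per second with
// repack disabled. Values above 1.0 mean the repack path is faster.
inline double speedup(double repack_off_tps, double repack_tps) {
    assert(repack_off_tps > 0.0);
    return repack_tps / repack_off_tps;
}

// Round to two decimal places, matching the precision used in the tables.
inline double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

For example, the first M4 Max row gives 82.01 / 39.67, which rounds to 2.07, while the lfm2 350M tg128 row gives 472.60 / 490.27, which is below 1.0 and so indicates a slight slowdown for that case.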

Perplexity

model	Repack ON	Generic	Repack OFF
LFM2 1.2B Q5_K_M	16.8450 ± 0.96470	16.8693 ± 0.96647	16.8795 ± 0.96706
Meta Llama 3.1 8B Instruct Q5_K_M	8.7533 ± 0.42939	8.7704 ± 0.43083	8.7506 ± 0.42931
Qwen3 8B 128K Q5_K_M	11.2690 ± 0.68575	11.2525 ± 0.68551	11.2632 ± 0.68616

Some outputs of llama-cli to test token generation:

llama-cli using repack

build      : b7722-9ba0fe61d
model      : LFM2-1.2B-Q5_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is Rome. Rome is the most historically significant city in Italy and is also the birthplace of the Renaissance. The city has a rich history dating back to ancient times and is home to numerous important cultural and architectural landmarks such as the Colosseum, the Pantheon, and Vatican City.

[ Prompt: 343.8 t/s | Generation: 200.2 t/s ]

> Can I visit the Eiffel Tower there?

Yes, you can visit the Eiffel Tower in Paris, France, which is the closest major city to Rome and within reach of European travelers visiting Italy. However, it's important to note that the Eiffel Tower is located in Paris, not Rome. You would need to plan your trip carefully to ensure that you can visit both Paris and Rome within your travel dates.

Here are a few points to consider:

1. **Transportation**: You can take a train from major Italian cities (like Milan, Turin, or Florence) to Paris. The journey takes about 3.5 to 4 hours, depending on the type of train and layovers. You might want to book a high-speed train to save time.

2. **Visiting the Eiffel Tower**: Once you arrive in Paris, you can easily take a metro ride (lines 4 and 6) to the Eiffel Tower area (Les Halles or Champ de Mars). The tower itself is open to the public several times a day, and ticket prices can vary based on the time of day and whether you prefer a guided tour.

3. **Alternatives**: If you prefer to visit Rome first, you might consider other nearby attractions like the Colosseum, the Vatican City, or Pompeii, which are also within easy reach by train.

Planning ahead and booking your tickets in advance will help ensure a smooth experience. Enjoy your trip!

[ Prompt: 397.7 t/s | Generation: 196.1 t/s ]

llama-cli using generic

Patch for routing through the generic implementation

  diff --git a/ggml/src/ggml-cpu/arch/arm/repack.cpp b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  index ed2f9c668..440168e18 100644
  --- a/ggml/src/ggml-cpu/arch/arm/repack.cpp
  +++ b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  @@ -805,7 +805,7 @@ void ggml_gemv_q5_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
       constexpr int    col_pairs = ncols_interleaved / 2;
       const uint8x16_t m4b       = vdupq_n_u8(0x0f);
       const uint8x16_t mone = vdupq_n_u8(1);
  @@ -3017,7 +3017,7 @@ void ggml_gemm_q5_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
       constexpr int    q8_k_blocklen = 4;
       constexpr int    col_pairs     = ncols_interleaved / 2;
       const uint8x16_t m4b           = vdupq_n_u8(0x0f);
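
The patch above forces the generic path by prefixing the feature check with `0 &&`, so the preprocessor condition is always false and the fallback code is used. The sketch below shows the same pattern on a toy int8 dot product. The `DEMO_FORCE_GENERIC` macro and the function names are hypothetical; the patch itself edits the condition in place rather than adding a macro.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toggle: 1 routes every call through the generic code.
#ifndef DEMO_FORCE_GENERIC
#define DEMO_FORCE_GENERIC 1
#endif

#if !DEMO_FORCE_GENERIC && defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define DEMO_USE_NEON 1
#else
#define DEMO_USE_NEON 0
#endif

// Portable reference implementation.
inline int32_t dot_generic(const int8_t * a, const int8_t * b, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += int32_t(a[i]) * int32_t(b[i]);
    }
    return sum;
}

// Dispatches at compile time: NEON when allowed and available, else generic.
inline int32_t dot_i8(const int8_t * a, const int8_t * b, int n) {
    assert(n >= 0);
#if DEMO_USE_NEON
    int32x4_t acc = vdupq_n_s32(0);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        // Widening multiply of 8 int8 pairs, then pairwise accumulate to int32.
        const int16x8_t prod = vmull_s8(vld1_s8(a + i), vld1_s8(b + i));
        acc = vpadalq_s16(acc, prod);
    }
    // Horizontal sum of the vector lanes plus the scalar tail.
    return vaddvq_s32(acc) + dot_generic(a + i, b + i, n - i);
#else
    return dot_generic(a, b, n);
#endif
}
```

Because the choice is made by the preprocessor, switching paths needs a rebuild, which is consistent with the separate llama-cli runs shown for the repack and generic builds.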

build      : b7722-9ba0fe61d
model      : LFM2-1.2B-Q5_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is **Rome**.

It's also the country's most populous city, steeped in history and Renaissance art and architecture.


Other capital cities in Italy include Florence, Venice, and Milan.


[ Prompt: 32.9 t/s | Generation: 31.2 t/s ]

> Can I visit the Eiffel tower there?

No, you **cannot** visit the Eiffel Tower in Rome, Italy.

The Eiffel Tower is located in Paris, France, and is not part of Italy.

Visiting Rome and the Eiffel Tower are two very different experiences, each with its own unique history and attractions.

While Rome has its own historical sites and landmarks, there isn't anything like the Eiffel Tower within its borders.


[ Prompt: 33.2 t/s | Generation: 30.9 t/s ]

loci-review · 2026-01-15T12:25:42Z

Explore the complete analysis inside the Version Insights


Performance Review Report

Summary

This code change introduces Q5_K quantization optimizations for ARM architectures with minimal performance impact. The target version shows a 3.6% power consumption increase in the CPU backend (5.3kJ increase) while improving a critical matrix multiplication kernel by 21ns. Only two functions show measurable changes, both with negligible absolute impact.

Change Context

The 10 commits by Alberto Cabrera implement Q5_K quantization support with ARM-specific GEMM/GEMV optimizations (i8mm instructions). Changes span 119 modified files, 44 additions, and 23 deletions, primarily affecting quantization kernels in ggml-cpu/ and repack implementations for 5-bit quantized matrix operations.

Performance Impact Analysis

Power Consumption: The libggml-cpu.so binary increased from 148.3kJ to 153.6kJ (+5.3kJ, +3.6%). All other binaries show zero change, indicating the impact is isolated to CPU backend quantization code.

Function-Level Changes:

gemm_bloc<4,6> (ARM NEON kernel): Improved from 547ns to 526ns (-21ns, -3.8%). This 4×6 tile matrix multiplication kernel using FP16/FP32 mixed precision shows better instruction scheduling or binary layout. As a compute-critical function called thousands of times during inference, the cumulative benefit is meaningful despite small absolute improvement.
std::vector::end() (STL accessor): Increased from 80ns to 263ns (+183ns). This initialization-only function shows compiler instrumentation differences (likely debug symbols or security checks) with no runtime impact on inference paths.

Code Change Justification

The commits implement new Q5_K quantization capabilities—a 5-bit weight compression format that reduces memory bandwidth while maintaining model accuracy. The added functionality (repack operations, ARM i8mm optimizations, GEMM/GEMV implementations) justifies the modest power increase. The 21ns improvement in the existing GEMM kernel suggests compiler optimizations benefited from code reorganization.

The 3.6% power increase reflects additional quantization code paths rather than performance regression. Q5_K support enables users to run larger models in memory-constrained environments, trading minimal compute overhead for significant memory savings.

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

loci-review · 2026-01-16T09:57:08Z

Analysis didn’t complete successfully. Explore Version Insights for details.

loci-dev temporarily deployed to PROD__AL_DEMO January 15, 2026 11:36 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 4 times, most recently from bbbac3d to 5194aba Compare January 15, 2026 20:10

Alcpz added 10 commits January 16, 2026 08:52

Boilerplate for q5_Kx8 REPACK on ARM and fallback

63fe191

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

Implements make_block_q5_Kx8 by extending make_block_q4_Kx8

f623c1a

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

q5_K repack gemm and gemv generics

c3ed67b

Gemm and Gemv ARM implementations (i8mm)

d729b73

Improved qh manipulation looking at non-repack vec_dot implementation

9d03854

Full unroll

ee77548

Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.

2c1e322

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

Fix wrong fallback definitions of Q5_K

7e7223f

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

Fixed comments. Reverted unnecessary formatting

72cdc9a

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

Fixed typo in generic definitions

f4a7a91

loci-dev force-pushed the main branch from 5194aba to ad54807 Compare January 16, 2026 09:12

loci-dev force-pushed the upstream-PR18860-branch_Alcpz-Alcpz/arm_q5_K_repack branch from b517a1d to f4a7a91 Compare January 16, 2026 09:40

loci-dev temporarily deployed to PROD__AL_DEMO January 16, 2026 09:40 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 10 times, most recently from 839190f to f1b080b Compare January 18, 2026 02:54

loci-dev force-pushed the main branch 20 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32

loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 Compare February 19, 2026 02:17

loci-dev wants to merge 10 commits into main from upstream-PR18860-branch_Alcpz-Alcpz/arm_q5_K_repack