
ggml-cpu: aarch64: q5_K repack gemm and gemv (and generic) implementations (i8mm) #18860

Merged
ggerganov merged 14 commits into ggml-org:master from Alcpz:Alcpz/arm_q5_K_repack
Jan 23, 2026
Conversation


@Alcpz Alcpz commented Jan 15, 2026

This PR implements the REPACK version of Q5_K, following most of the existing design used for Q4_K, since Q5_K differs from Q4_K only in the qh field that carries the additional high bit.
Most of the code is shared, but I didn't find a way to abstract the common patterns without creating a convoluted mess of functions. Since only Q4_K and Q5_K share the same 6-bit scales-and-mins decode, I opted to duplicate the code.

I also moved some declarations around for Q2_K because the structure seemed odd (it's inverted relative to what I've seen in quants.c). The Q2_K function declarations were left where they were to avoid polluting the diff and messing up the blame. If you want me to revert this, just say so.

I've followed the testing practices from #17494 and #18096, comparing the GEMM and GEMV output of both the repack and generic paths against the current vec_dot version with repack disabled.

Performance

M4 Max (-DGGML_BLAS=OFF -DGGML_METAL=OFF)

| model | threads | test | REPACK OFF t/s | REPACK t/s | speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| qwen3 8B Q5_K - Medium | 8 | pp512 | 39.67 | 83.10 | 2.10 |
| qwen3 8B Q5_K - Medium | 8 | tg128 | 29.08 | 32.28 | 1.11 |
| llama 8B Q5_K - Medium | 8 | pp512 | 37.43 | 79.24 | 2.12 |
| llama 8B Q5_K - Medium | 8 | tg128 | 29.46 | 31.77 | 1.08 |
| lfm2 350M Q5_K - Medium | 8 | pp512 | 1011.04 | 1868.82 | 1.85 |
| lfm2 350M Q5_K - Medium | 8 | tg128 | 485.84 | 476.40 | 0.98 |
| lfm2 1.2B Q5_K - Medium | 8 | pp512 | 299.96 | 581.59 | 1.94 |
| lfm2 1.2B Q5_K - Medium | 8 | tg128 | 190.83 | 195.02 | 1.02 |

Exynos 2400

| model | threads | test | REPACK OFF t/s | REPACK t/s | speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| lfm2 350M Q5_K - Medium | 3 | pp512 | 101.99 | 224.94 | 2.21 |
| lfm2 350M Q5_K - Medium | 3 | tg128 | 60.74 | 67.41 | 1.11 |
| lfm2 1.2B Q5_K - Medium | 3 | pp512 | 29.61 | 67.77 | 2.29 |
| lfm2 1.2B Q5_K - Medium | 3 | tg128 | 21.74 | 22.56 | 1.04 |

Perplexity

| model | Repack ON | Generic | Repack OFF |
| --- | --- | --- | --- |
| LFM2 1.2B Q5_K_M | 16.8450 ± 0.96470 | 16.8693 ± 0.96647 | 16.8795 ± 0.96706 |
| Meta Llama 3.1 8B Instruct Q5_K_M | 8.7533 ± 0.42939 | 8.7704 ± 0.43083 | 8.7506 ± 0.42931 |
| Qwen3 8B 128K Q5_K_M | 11.2690 ± 0.68575 | 11.2525 ± 0.68551 | 11.2632 ± 0.68616 |

Some outputs of llama-cli to test token generation:

llama-cli using repack
build      : b7722-9ba0fe61d
model      : LFM2-1.2B-Q5_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is Rome. Rome is the most historically significant city in Italy and is also the birthplace of the Renaissance. The city has a rich history dating back to ancient times and is home to numerous important cultural and architectural landmarks such as the Colosseum, the Pantheon, and Vatican City.

[ Prompt: 343.8 t/s | Generation: 200.2 t/s ]

> Can I visit the Eiffel Tower there?

Yes, you can visit the Eiffel Tower in Paris, France, which is the closest major city to Rome and within reach of European travelers visiting Italy. However, it's important to note that the Eiffel Tower is located in Paris, not Rome. You would need to plan your trip carefully to ensure that you can visit both Paris and Rome within your travel dates.

Here are a few points to consider:

1. **Transportation**: You can take a train from major Italian cities (like Milan, Turin, or Florence) to Paris. The journey takes about 3.5 to 4 hours, depending on the type of train and layovers. You might want to book a high-speed train to save time.

2. **Visiting the Eiffel Tower**: Once you arrive in Paris, you can easily take a metro ride (lines 4 and 6) to the Eiffel Tower area (Les Halles or Champ de Mars). The tower itself is open to the public several times a day, and ticket prices can vary based on the time of day and whether you prefer a guided tour.

3. **Alternatives**: If you prefer to visit Rome first, you might consider other nearby attractions like the Colosseum, the Vatican City, or Pompeii, which are also within easy reach by train.

Planning ahead and booking your tickets in advance will help ensure a smooth experience. Enjoy your trip!

[ Prompt: 397.7 t/s | Generation: 196.1 t/s ]
llama-cli using generic

Patch for routing through the generic implementation

  diff --git a/ggml/src/ggml-cpu/arch/arm/repack.cpp b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  index ed2f9c668..440168e18 100644
  --- a/ggml/src/ggml-cpu/arch/arm/repack.cpp
  +++ b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  @@ -805,7 +805,7 @@ void ggml_gemv_q5_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
       constexpr int    col_pairs = ncols_interleaved / 2;
       const uint8x16_t m4b       = vdupq_n_u8(0x0f);
       const uint8x16_t mone = vdupq_n_u8(1);
  @@ -3017,7 +3017,7 @@ void ggml_gemm_q5_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
       constexpr int    q8_k_blocklen = 4;
       constexpr int    col_pairs     = ncols_interleaved / 2;
       const uint8x16_t m4b           = vdupq_n_u8(0x0f);
build      : b7722-9ba0fe61d
model      : LFM2-1.2B-Q5_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is **Rome**.

It's also the country's most populous city, steeped in history and Renaissance art and architecture.


Other capital cities in Italy include Florence, Venice, and Milan.


[ Prompt: 32.9 t/s | Generation: 31.2 t/s ]

> Can I visit the Eiffel tower there?

No, you **cannot** visit the Eiffel Tower in Rome, Italy.

The Eiffel Tower is located in Paris, France, and is not part of Italy.

Visiting Rome and the Eiffel Tower are two very different experiences, each with its own unique history and attractions.

While Rome has its own historical sites and landmarks, there isn't anything like the Eiffel Tower within its borders.


[ Prompt: 33.2 t/s | Generation: 30.9 t/s ]


@Alcpz Alcpz requested a review from ggerganov as a code owner January 15, 2026 10:58

Alcpz commented Jan 15, 2026

@tdakhran @ykhrustalev


Alcpz commented Jan 15, 2026

Server failures are due to the removal of CURL. webgpu is failing on an unrelated test (webgpu tests).

@taronaeo commented:

> Server failures are due to the removal of CURL.

Can you try rebasing with the master branch? Another PR I reviewed fixed it with a rebase.

@Alcpz Alcpz force-pushed the Alcpz/arm_q5_K_repack branch from b517a1d to f4a7a91 January 16, 2026 08:52

Alcpz commented Jan 16, 2026

> Server failures are due to the removal of CURL.
>
> Can you try rebasing with the master branch? Another PR I reviewed fixed it with a rebase.

All good. webgpu still fails, but that was a different issue. Thanks!

@Alcpz Alcpz force-pushed the Alcpz/arm_q5_K_repack branch from e1f60b6 to 69b2477 January 22, 2026 18:25
@ggerganov ggerganov merged commit 091a46c into ggml-org:master Jan 23, 2026
76 of 78 checks passed
ronaldmannak pushed a commit to PicoMLX/llama.cpp that referenced this pull request Jan 24, 2026
…ions (i8mm) (ggml-org#18860)

* Boilerplate for q5_Kx8 REPACK on ARM and fallback

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q5_K repack gemm and gemv generics

* Gemm and Gemv ARM implementations (i8mm)

* Improved qh manipulation looking at non-repack vec_dot implementation

* Full unroll

* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix wrong fallback definitions of Q5_K

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed comments. Reverted unnecessary formatting

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed typo in generic definitions

* Switching AND + Shift with Shift Insert. Better op interleaving.

* Vectorize + unroll the block scales

* Apply gemm optimizations to gemv

* Improve bias calculation

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
@Alcpz Alcpz deleted the Alcpz/arm_q5_K_repack branch January 26, 2026 10:42
ggerganov added a commit that referenced this pull request Jan 27, 2026
…ions (i8mm) #18860 (#18888)

* Boilerplate for q6_K repack

* q6_K repack to q6_Kx8 implementation

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q6_K generic gemv and gemm

* wip, gemm_q6_K 8x8

* Still WIP: loading of q8s, q6h and q6l

* first working version of q6_K gemm

* Moved q6 loads outside of sb block, Unrolled inner loop

* Replaced modulo with mask

* First implementation of GEMV

* ggml_vdotq_s32 -> vdotq_s32

* Reduce width of accumulators in q6_K gemv

* Bsums instead of calc bias. Preload scales to use vget_lane. Unroll.

* Reuse scales in GEMM (same GEMV opt)

* Added todos for bsum and different qh repack

* Arch fallback

* VSLIQ for merging qh and ql

* Removed TODO, already tested

* Apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Removed unused import

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026
…ions (i8mm) (ggml-org#18860)
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026
…ions (i8mm) ggml-org#18860 (ggml-org#18888)
Labels

ggml changes relating to the ggml tensor library for machine learning