
ggml-cpu: aarch64: q5_K repack gemm and gemv (and generic) implementations (i8mm) #18860

Merged
ggerganov merged 14 commits into ggml-org:master from Alcpz:Alcpz/arm_q5_K_repack
Jan 23, 2026
Conversation


@Alcpz Alcpz commented Jan 15, 2026

This PR implements the REPACK version of Q5_K, following most of the existing design used for Q4_K, since Q5_K differs from Q4_K only in the qh field that carries the additional high bit.
Most of the code is shared, but I didn't find a way to abstract the common patterns without creating a convoluted mess of functions. Since only Q4_K and Q5_K share the same 6-bit scales-and-mins decode, I opted to duplicate the code.

I also moved some declarations around for Q2_K because the structure seemed odd (it's inverted relative to what I've seen in quants.c). The Q2_K function declarations were left where they were to avoid polluting the diff and messing up the blame. If you want me to revert this, just say so.

I've followed the testing practices from #17494 and #18096, comparing the GEMM and GEMV output of both the repack and generic paths against the current vec_dot version with repack disabled.

Performance

M4 Max (-DGGML_BLAS=OFF -DGGML_METAL=OFF)

| model | threads | test | REPACK OFF t/s | REPACK t/s | speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| qwen3 8B Q5_K - Medium | 8 | pp512 | 39.67 | 83.10 | 2.10 |
| qwen3 8B Q5_K - Medium | 8 | tg128 | 29.08 | 32.28 | 1.11 |
| llama 8B Q5_K - Medium | 8 | pp512 | 37.43 | 79.24 | 2.12 |
| llama 8B Q5_K - Medium | 8 | tg128 | 29.46 | 31.77 | 1.08 |
| lfm2 350M Q5_K - Medium | 8 | pp512 | 1011.04 | 1868.82 | 1.85 |
| lfm2 350M Q5_K - Medium | 8 | tg128 | 485.84 | 476.40 | 0.98 |
| lfm2 1.2B Q5_K - Medium | 8 | pp512 | 299.96 | 581.59 | 1.94 |
| lfm2 1.2B Q5_K - Medium | 8 | tg128 | 190.83 | 195.02 | 1.02 |

Exynos 2400

| model | threads | test | REPACK OFF t/s | REPACK t/s | speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| lfm2 350M Q5_K - Medium | 3 | pp512 | 101.99 | 224.94 | 2.21 |
| lfm2 350M Q5_K - Medium | 3 | tg128 | 60.74 | 67.41 | 1.11 |
| lfm2 1.2B Q5_K - Medium | 3 | pp512 | 29.61 | 67.77 | 2.29 |
| lfm2 1.2B Q5_K - Medium | 3 | tg128 | 21.74 | 22.56 | 1.04 |

Perplexity

| model | Repack ON | Generic | Repack OFF |
| --- | --- | --- | --- |
| LFM2 1.2B Q5_K_M | 16.8450 ± 0.96470 | 16.8693 ± 0.96647 | 16.8795 ± 0.96706 |
| Meta Llama 3.1 8B Instruct Q5_K_M | 8.7533 ± 0.42939 | 8.7704 ± 0.43083 | 8.7506 ± 0.42931 |
| Qwen3 8B 128K Q5_K_M | 11.2690 ± 0.68575 | 11.2525 ± 0.68551 | 11.2632 ± 0.68616 |

Some outputs of llama-cli to test token generation:

llama-cli using repack
build      : b7722-9ba0fe61d
model      : LFM2-1.2B-Q5_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is Rome. Rome is the most historically significant city in Italy and is also the birthplace of the Renaissance. The city has a rich history dating back to ancient times and is home to numerous important cultural and architectural landmarks such as the Colosseum, the Pantheon, and Vatican City.

[ Prompt: 343.8 t/s | Generation: 200.2 t/s ]

> Can I visit the Eiffel Tower there?

Yes, you can visit the Eiffel Tower in Paris, France, which is the closest major city to Rome and within reach of European travelers visiting Italy. However, it's important to note that the Eiffel Tower is located in Paris, not Rome. You would need to plan your trip carefully to ensure that you can visit both Paris and Rome within your travel dates.

Here are a few points to consider:

1. **Transportation**: You can take a train from major Italian cities (like Milan, Turin, or Florence) to Paris. The journey takes about 3.5 to 4 hours, depending on the type of train and layovers. You might want to book a high-speed train to save time.

2. **Visiting the Eiffel Tower**: Once you arrive in Paris, you can easily take a metro ride (lines 4 and 6) to the Eiffel Tower area (Les Halles or Champ de Mars). The tower itself is open to the public several times a day, and ticket prices can vary based on the time of day and whether you prefer a guided tour.

3. **Alternatives**: If you prefer to visit Rome first, you might consider other nearby attractions like the Colosseum, the Vatican City, or Pompeii, which are also within easy reach by train.

Planning ahead and booking your tickets in advance will help ensure a smooth experience. Enjoy your trip!

[ Prompt: 397.7 t/s | Generation: 196.1 t/s ]
llama-cli using generic

Patch for routing through the generic implementation

  diff --git a/ggml/src/ggml-cpu/arch/arm/repack.cpp b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  index ed2f9c668..440168e18 100644
  --- a/ggml/src/ggml-cpu/arch/arm/repack.cpp
  +++ b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  @@ -805,7 +805,7 @@ void ggml_gemv_q5_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
       constexpr int    col_pairs = ncols_interleaved / 2;
       const uint8x16_t m4b       = vdupq_n_u8(0x0f);
       const uint8x16_t mone = vdupq_n_u8(1);
  @@ -3017,7 +3017,7 @@ void ggml_gemm_q5_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
       constexpr int    q8_k_blocklen = 4;
       constexpr int    col_pairs     = ncols_interleaved / 2;
       const uint8x16_t m4b           = vdupq_n_u8(0x0f);
build      : b7722-9ba0fe61d
model      : LFM2-1.2B-Q5_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is **Rome**.

It's also the country's most populous city, steeped in history and Renaissance art and architecture.


Other capital cities in Italy include Florence, Venice, and Milan.


[ Prompt: 32.9 t/s | Generation: 31.2 t/s ]

> Can I visit the Eiffel tower there?

No, you **cannot** visit the Eiffel Tower in Rome, Italy.

The Eiffel Tower is located in Paris, France, and is not part of Italy.

Visiting Rome and the Eiffel Tower are two very different experiences, each with its own unique history and attractions.

While Rome has its own historical sites and landmarks, there isn't anything like the Eiffel Tower within its borders.


[ Prompt: 33.2 t/s | Generation: 30.9 t/s ]


@Alcpz Alcpz requested a review from ggerganov as a code owner January 15, 2026 10:58

Alcpz commented Jan 15, 2026

@tdakhran @ykhrustalev


Alcpz commented Jan 15, 2026

Server failures are due to the removal of CURL. webgpu is failing on an unrelated test (webgpu tests).

@taronaeo commented:

> Server failures are due to the removal of CURL.

Can you try rebasing with the master branch? Another PR I reviewed fixed it with a rebase.

@Alcpz Alcpz force-pushed the Alcpz/arm_q5_K_repack branch from b517a1d to f4a7a91 January 16, 2026 08:52

Alcpz commented Jan 16, 2026

> Server failures are due to the removal of CURL.
>
> Can you try rebasing with the master branch? Another PR I reviewed fixed it with a rebase.

All good. webgpu still fails, but that was a different issue. Thanks!

@Alcpz Alcpz force-pushed the Alcpz/arm_q5_K_repack branch from e1f60b6 to 69b2477 January 22, 2026 18:25
@ggerganov ggerganov merged commit 091a46c into ggml-org:master Jan 23, 2026
76 of 78 checks passed
ronaldmannak pushed a commit to PicoMLX/llama.cpp that referenced this pull request Jan 24, 2026
…ions (i8mm) (ggml-org#18860)

* Boilerplate for q5_Kx8 REPACK on ARM and fallback

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Implements make_block_q5_Kx8 by extending make_block_q4_Kx8

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q5_K repack gemm and gemv generics

* Gemm and Gemv ARM implementations (i8mm)

* Improved qh manipulation looking at non-repack vec_dot implementation

* Full unroll

* Apply Q5_K Gemv vand and vshl optimizations to gemm. Improve comments.

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fix wrong fallback definitions of Q5_K

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed comments. Reverted unnecessary formatting

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* Fixed typo in generic definitions

* Switching AND + Shift with Shift Insert. Better op interleaving.

* Vectorize + unroll the block scales

* Apply gemm optimizations to gemv

* Improve bias calculation

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
@Alcpz Alcpz deleted the Alcpz/arm_q5_K_repack branch January 26, 2026 10:42
ggerganov added a commit that referenced this pull request Jan 27, 2026
…ions (i8mm) #18860 (#18888)

* Boilerplate for q6_K repack

* q6_K repack to q6_Kx8 implementation

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>

* q6_K generic gemv and gemm

* wip, gemm_q6_K 8x8

* Still WIP: loading of q8s, q6h and q6l

* first working version of q6_K gemm

* Moved q6 loads outside of sb block, Unrolled inner loop

* Replaced modulo with mask

* First implementation of GEMV

* ggml_vdotq_s32 -> vdotq_s32

* Reduce width of accumulators in q6_K gemv

* Bsums instead of calc bias. Preload scales to use vget_lane. Unroll.

* Reuse scales in GEMM (same GEMV opt)

* Added todos for bsum and different qh repack

* Arch fallback

* VSLIQ for merging qh and ql

* Removed TODO, already tested

* Apply suggestions

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Removed unused import

---------

Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026
…ions (i8mm) (ggml-org#18860)
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026
…ions (i8mm) ggml-org#18860 (ggml-org#18888)
Labels

ggml changes relating to the ggml tensor library for machine learning