UPSTREAM PR #18860: ggml-cpu: aarm64: q5_K repack gemm and gemv (and generic) implementations (i8mm)#931

Open
loci-dev wants to merge 10 commits into main from upstream-PR18860-branch_Alcpz-Alcpz/arm_q5_K_repack

@loci-dev

Mirrored from ggml-org/llama.cpp#18860

This PR implements the REPACK version of Q5_K, following most of the existing design used for Q4_K, since Q5_K differs from Q4_K only in having the qh field, which holds the additional (fifth) bit of each quant.
Most of the code is shared, but I didn't know how to abstract the common patterns without creating a convoluted mess of functions. Since only Q4_K and Q5_K share the same 6-bit scales and mins decode, I opted to duplicate the code.
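
To make the two shared pieces concrete, here is a minimal sketch of the 6-bit scale/min unpacking and of how the extra qh bit turns a 4-bit nibble into a 5-bit quant. The helper names `decode_scale_min` and `q5_value` are hypothetical, and the byte layout is an assumption based on my reading of the reference K-quant decode in ggml, not code taken from this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: unpack the 6-bit scale and 6-bit min of sub-block j (0..7)
// from the 12 packed bytes that Q4_K and Q5_K are assumed to share.
static void decode_scale_min(int j, const uint8_t packed[12], uint8_t * scale, uint8_t * min) {
    assert(j >= 0 && j < 8);
    if (j < 4) {
        // Sub-blocks 0..3: the low 6 bits of bytes 0..3 (scales) and 4..7 (mins).
        *scale = packed[j] & 63;
        *min   = packed[j + 4] & 63;
    } else {
        // Sub-blocks 4..7: low 4 bits come from bytes 8..11,
        // the top 2 bits are borrowed from the upper bits of bytes 0..7.
        *scale = (packed[j + 4] & 0x0F) | ((packed[j - 4] >> 6) << 4);
        *min   = (packed[j + 4] >> 4)   | ((packed[j]     >> 6) << 4);
    }
}

// Hypothetical helper: rebuild one 5-bit quant (0..31) from a nibble of ql
// and a single bit of qh. This is the only step Q4_K does not need.
static uint8_t q5_value(uint8_t ql_byte, bool use_high_nibble, uint8_t qh_byte, int qh_bit) {
    assert(qh_bit >= 0 && qh_bit < 8);
    const uint8_t low4  = use_high_nibble ? (ql_byte >> 4) : (ql_byte & 0x0F);
    const uint8_t high1 = (qh_byte >> qh_bit) & 1;
    return static_cast<uint8_t>(low4 | (high1 << 4));
}
```

The dequantized weight would then be roughly `d * scale * q - dmin * min`, with `d` and `dmin` the per-super-block factors; the optimized kernels operate on the interleaved (repacked) version of this data rather than calling helpers like these.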

I also moved around some declarations for Q2_K because the structure seemed weird (it's inverted compared with what I've seen in quants.c). The Q2_K function declarations were left where they were to avoid polluting the diff and messing up the blame. If you want me to revert it, just say so.

I've followed the testing practices from #17494 and #18096 and compared the GEMM and GEMV outputs of both the repack and generic implementations against the current vec_dot version, which is used when repack is disabled.
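
The PR does not say which metric or tolerance was used for that comparison. As an illustration only, one common way to compare a candidate kernel output against a reference output is a normalized RMS error; the helper name `normalized_rms_error` below is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: RMS error of `out` relative to `ref`, normalized by the
// energy of `ref`. Falls back to the plain root-sum-square error if `ref` is all zeros.
static double normalized_rms_error(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size() && !ref.empty());
    double err  = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(out[i]) - static_cast<double>(ref[i]);
        err  += d * d;
        norm += static_cast<double>(ref[i]) * static_cast<double>(ref[i]);
    }
    return norm > 0.0 ? std::sqrt(err / norm) : std::sqrt(err);
}
```

In such a setup, `ref` would hold the vec_dot results and `out` the repack or generic GEMM/GEMV results for the same inputs; the acceptable threshold is a judgment call and is not specified here.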

Performance

M4 Max (-DGGML_BLAS=OFF -DGGML_METAL=OFF)

| model | threads | test | REPACK OFF (t/s) | REPACK ON (t/s) | speedup |
| --- | --- | --- | --- | --- | --- |
| qwen3 8B Q5_K - Medium | 8 | pp256 | 39.67 | 82.01 | 2.07x |
| qwen3 8B Q5_K - Medium | 8 | tg128 | 29.08 | 32.28 | 1.11x |
| llama 8B Q5_K - Medium | 8 | pp256 | 37.43 | 78.86 | 2.11x |
| llama 8B Q5_K - Medium | 8 | tg128 | 29.46 | 31.21 | 1.06x |
| lfm2 350M Q5_K - Medium | 8 | pp256 | 1011.04 | 1824.21 | 1.80x |
| lfm2 350M Q5_K - Medium | 8 | tg128 | 490.27 | 472.60 | 0.96x |
| lfm2 1.2B Q5_K - Medium | 8 | pp256 | 299.96 | 556.11 | 1.85x |
| lfm2 1.2B Q5_K - Medium | 8 | tg128 | 190.83 | 188.26 | 0.99x |

Exynos 2400

| model | threads | test | REPACK OFF (t/s) | REPACK ON (t/s) | speedup |
| --- | --- | --- | --- | --- | --- |
| lfm2 350M Q5_K - Medium | 3 | pp512 | 101.99 | 224.94 | 2.21x |
| lfm2 350M Q5_K - Medium | 3 | tg256 | 60.74 | 67.41 | 1.11x |
| lfm2 1.2B Q5_K - Medium | 3 | pp512 | 29.61 | 67.77 | 2.29x |
| lfm2 1.2B Q5_K - Medium | 3 | tg256 | 21.74 | 22.56 | 1.04x |
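
For reference, the speedup column in both tables is the repack throughput divided by the non-repack throughput, so values below 1.0 (such as the lfm2 tg128 rows on the M4 Max) indicate a slight slowdown. A tiny sketch of that arithmetic, with a hypothetical helper name:

```cpp
#include <cassert>

// Hypothetical helper: ratio of tokens/s with repack enabled to tokens/s with
// repack disabled. A result above 1.0 means the repack path is faster.
static double speedup(double off_tps, double on_tps) {
    assert(off_tps > 0.0);
    return on_tps / off_tps;
}
```

For example, `speedup(39.67, 82.01)` rounds to the 2.07x reported for qwen3 8B pp256.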

Perplexity

| model | Repack ON | Generic | Repack OFF |
| --- | --- | --- | --- |
| LFM2 1.2B Q5_K_M | 16.8450 ± 0.96470 | 16.8693 ± 0.96647 | 16.8795 ± 0.96706 |
| Meta Llama 3.1 8B Instruct Q5_K_M | 8.7533 ± 0.42939 | 8.7704 ± 0.43083 | 8.7506 ± 0.42931 |
| Qwen3 8B 128K Q5_K_M | 11.2690 ± 0.68575 | 11.2525 ± 0.68551 | 11.2632 ± 0.68616 |
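
One rough way to read this table is to check whether the difference between two perplexity values is smaller than their combined reported uncertainty. The sketch below does only that; the `Ppl` struct and `within_combined_error` helper are hypothetical, and this is a sanity check rather than a rigorous statistical test, since the runs presumably share the same evaluation text and are therefore correlated.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for one "value ± err" cell of the table.
struct Ppl {
    double value;
    double err;
};

// Returns true when |a - b| does not exceed the two uncertainties added in quadrature.
static bool within_combined_error(Ppl a, Ppl b) {
    assert(a.err >= 0.0 && b.err >= 0.0);
    const double combined = std::sqrt(a.err * a.err + b.err * b.err);
    return std::fabs(a.value - b.value) <= combined;
}
```

Applied to the rows above, every Repack ON and Generic value falls within the combined uncertainty of the corresponding Repack OFF value.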

Some outputs of llama-cli to test token generation:

llama-cli using repack
build      : b7722-9ba0fe61d
model      : LFM2-1.2B-Q5_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is Rome. Rome is the most historically significant city in Italy and is also the birthplace of the Renaissance. The city has a rich history dating back to ancient times and is home to numerous important cultural and architectural landmarks such as the Colosseum, the Pantheon, and Vatican City.

[ Prompt: 343.8 t/s | Generation: 200.2 t/s ]

> Can I visit the Eiffel Tower there?

Yes, you can visit the Eiffel Tower in Paris, France, which is the closest major city to Rome and within reach of European travelers visiting Italy. However, it's important to note that the Eiffel Tower is located in Paris, not Rome. You would need to plan your trip carefully to ensure that you can visit both Paris and Rome within your travel dates.

Here are a few points to consider:

1. **Transportation**: You can take a train from major Italian cities (like Milan, Turin, or Florence) to Paris. The journey takes about 3.5 to 4 hours, depending on the type of train and layovers. You might want to book a high-speed train to save time.

2. **Visiting the Eiffel Tower**: Once you arrive in Paris, you can easily take a metro ride (lines 4 and 6) to the Eiffel Tower area (Les Halles or Champ de Mars). The tower itself is open to the public several times a day, and ticket prices can vary based on the time of day and whether you prefer a guided tour.

3. **Alternatives**: If you prefer to visit Rome first, you might consider other nearby attractions like the Colosseum, the Vatican City, or Pompeii, which are also within easy reach by train.

Planning ahead and booking your tickets in advance will help ensure a smooth experience. Enjoy your trip!

[ Prompt: 397.7 t/s | Generation: 196.1 t/s ]

llama-cli using generic

Patch for routing through the generic implementation

  diff --git a/ggml/src/ggml-cpu/arch/arm/repack.cpp b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  index ed2f9c668..440168e18 100644
  --- a/ggml/src/ggml-cpu/arch/arm/repack.cpp
  +++ b/ggml/src/ggml-cpu/arch/arm/repack.cpp
  @@ -805,7 +805,7 @@ void ggml_gemv_q5_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
       constexpr int    col_pairs = ncols_interleaved / 2;
       const uint8x16_t m4b       = vdupq_n_u8(0x0f);
       const uint8x16_t mone = vdupq_n_u8(1);
  @@ -3017,7 +3017,7 @@ void ggml_gemm_q5_K_8x8_q8_K(int                        n,
       UNUSED(ncols_interleaved);
       UNUSED(blocklen);

  -#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
  +#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
       constexpr int    q8_k_blocklen = 4;
       constexpr int    col_pairs     = ncols_interleaved / 2;
       const uint8x16_t m4b           = vdupq_n_u8(0x0f);
build      : b7722-9ba0fe61d
model      : LFM2-1.2B-Q5_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> What is the capital of Italy?

The capital of Italy is **Rome**.

It's also the country's most populous city, steeped in history and Renaissance art and architecture.


Other capital cities in Italy include Florence, Venice, and Milan.


[ Prompt: 32.9 t/s | Generation: 31.2 t/s ]

> Can I visit the Eiffel tower there?

No, you **cannot** visit the Eiffel Tower in Rome, Italy.

The Eiffel Tower is located in Paris, France, and is not part of Italy.

Visiting Rome and the Eiffel Tower are two very different experiences, each with its own unique history and attractions.

While Rome has its own historical sites and landmarks, there isn't anything like the Eiffel Tower within its borders.


[ Prompt: 33.2 t/s | Generation: 30.9 t/s ]
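
The routing patch used for the generic run works because the optimized kernels sit behind preprocessor guards: prefixing a guard with `0 &&` makes the condition false, so only the generic code is compiled. A minimal sketch of that compile-time dispatch pattern, using a hypothetical `dot_i8` function instead of the actual Q5_K kernels:

```cpp
#include <cassert>
#include <cstdint>

#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
#endif

// Portable fallback: plain scalar int8 dot product.
static int32_t dot_i8_generic(const int8_t * a, const int8_t * b, int n) {
    assert(n >= 0);
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
    }
    return acc;
}

// Hypothetical dispatching entry point. Changing the guard below to
// `#if 0 && defined(__aarch64__) ...` forces the generic path, which is
// the same trick the patch applies to the real kernels.
static int32_t dot_i8(const int8_t * a, const int8_t * b, int n) {
#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
    int32x4_t acc = vdupq_n_s32(0);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        // Each vdotq_s32 accumulates four groups of four int8 products.
        acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
    }
    // Horizontal sum of the vector lanes, then handle the scalar tail.
    return vaddvq_s32(acc) + dot_i8_generic(a + i, b + i, n - i);
#else
    return dot_i8_generic(a, b, n);
#endif
}
```

Because both paths compute the same integer sums, their results should match exactly in this toy case; for the real floating-point kernels, small differences between paths are expected.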


@loci-review

loci-review bot commented Jan 15, 2026

Explore the complete analysis inside the Version Insights


Performance Review Report

Summary

This code change introduces Q5_K quantization optimizations for ARM architectures with minimal performance impact. The target version shows a 3.6% power consumption increase in the CPU backend (5.3kJ increase) while improving a critical matrix multiplication kernel by 21ns. Only two functions show measurable changes, both with negligible absolute impact.

Change Context

The 10 commits by Alberto Cabrera implement Q5_K quantization support with ARM-specific GEMM/GEMV optimizations (i8mm instructions). Changes span 119 modified files, 44 additions, and 23 deletions, primarily affecting quantization kernels in ggml-cpu/ and repack implementations for 5-bit quantized matrix operations.

Performance Impact Analysis

Power Consumption: The libggml-cpu.so binary increased from 148.3kJ to 153.6kJ (+5.3kJ, +3.6%). All other binaries show zero change, indicating the impact is isolated to CPU backend quantization code.

Function-Level Changes:

  1. gemm_bloc<4,6> (ARM NEON kernel): Improved from 547ns to 526ns (-21ns, -3.8%). This 4×6 tile matrix multiplication kernel using FP16/FP32 mixed precision shows better instruction scheduling or binary layout. As a compute-critical function called thousands of times during inference, the cumulative benefit is meaningful despite small absolute improvement.

  2. std::vector::end() (STL accessor): Increased from 80ns to 263ns (+183ns). This initialization-only function shows compiler instrumentation differences (likely debug symbols or security checks) with no runtime impact on inference paths.

Code Change Justification

The commits implement new Q5_K quantization capabilities—a 5-bit weight compression format that reduces memory bandwidth while maintaining model accuracy. The added functionality (repack operations, ARM i8mm optimizations, GEMM/GEMV implementations) justifies the modest power increase. The 21ns improvement in the existing GEMM kernel suggests compiler optimizations benefited from code reorganization.

The 3.6% power increase reflects additional quantization code paths rather than performance regression. Q5_K support enables users to run larger models in memory-constrained environments, trading minimal compute overhead for significant memory savings.


@loci-dev loci-dev force-pushed the main branch 4 times, most recently from bbbac3d to 5194aba Compare January 15, 2026 20:10
@loci-dev loci-dev force-pushed the upstream-PR18860-branch_Alcpz-Alcpz/arm_q5_K_repack branch from b517a1d to f4a7a91 Compare January 16, 2026 09:40
@loci-review

loci-review bot commented Jan 16, 2026

Analysis didn’t complete successfully. Explore Version Insights for details.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 839190f to f1b080b Compare January 18, 2026 02:54
@loci-dev loci-dev force-pushed the main branch 20 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 Compare February 19, 2026 02:17