ggml-cpu: arm64: Q4_K repack (i8mm) scale unroll and vectorization #19108

Merged: ggerganov merged 1 commit into ggml-org:master from Alcpz:Alcpz/arm_q4_K_opt on Jan 28, 2026

ggml-cpu: arm64: Q4_K repack (i8mm) scale unroll and vectorization#19108
ggerganov merged 1 commit intoggml-org:masterfrom
Alcpz:Alcpz/arm_q4_K_opt

Conversation

@Alcpz (Collaborator) commented Jan 26, 2026

While working on #18860, I found a small performance optimization in how the sub-block scales are loaded.
Behavior is unchanged; it is a manual unroll plus vectorization.

Llama-bench:

| model                 | test  | old t/s | new t/s | speedup |
| --------------------- | ----- | ------- | ------- | ------- |
| lfm2 1.2B Q4_K        | pp512 |  658.53 |  682.69 |    1.04 |
| lfm2 350M Q4_K        | pp512 | 2052.76 | 2159.47 |    1.05 |
| Qwen 8B Q4_K - Medium | pp512 |   94.21 |   99.51 |    1.06 |

No changes observed in perplexity for Qwen3 8B 128K Q4_K_M and lfm2 1.2B Q4_K_M.

cc: @tdakhran

@Alcpz Alcpz requested a review from ggerganov as a code owner January 26, 2026 10:39
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jan 26, 2026
@ggerganov ggerganov merged commit 6ad70c5 into ggml-org:master Jan 28, 2026
77 of 78 checks passed
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026
@Alcpz Alcpz deleted the Alcpz/arm_q4_K_opt branch February 10, 2026 17:05
