ggml-cpu: arm64: Q4_K repack (i8mm) scale unroll and vectorization #19108

Merged: ggerganov merged 1 commit into ggml-org:master from Alcpz:Alcpz/arm_q4_K_opt on Jan 28, 2026

ggml-cpu: arm64: Q4_K repack (i8mm) scale unroll and vectorization#19108
ggerganov merged 1 commit intoggml-org:masterfrom
Alcpz:Alcpz/arm_q4_K_opt

Conversation

@Alcpz (Collaborator) commented Jan 26, 2026

While working on #18860, I found a small performance optimization in how the sub-block scales are loaded.
Behavior is unchanged; it is a manual unroll plus vectorization.

Llama-bench:

| model                 | test  | old t/s | new t/s | speedup |
| --------------------- | ----- | ------- | ------- | ------- |
| lfm2 1.2B Q4_K        | pp512 |  658.53 |  682.69 |    1.04 |
| lfm2 350M Q4_K        | pp512 | 2052.76 | 2159.47 |    1.05 |
| Qwen 8B Q4_K - Medium | pp512 |   94.21 |   99.51 |    1.06 |

No changes observed in perplexity for Qwen3 8B 128K Q4_K_M and lfm2 1.2B Q4_K_M.

cc: @tdakhran

@Alcpz Alcpz requested a review from ggerganov as a code owner January 26, 2026 10:39
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jan 26, 2026
@ggerganov ggerganov merged commit 6ad70c5 into ggml-org:master Jan 28, 2026
77 of 78 checks passed
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026
@Alcpz Alcpz deleted the Alcpz/arm_q4_K_opt branch February 10, 2026 17:05
