UPSTREAM PR #19108: ggml-cpu: arm64: Q4_K repack (i8mm) scale unroll and vectorization#1037
Conversation
Performance Review Report

Summary: No functions were identified for performance analysis between the base and target versions, indicating that no meaningful performance changes occurred in this code revision.

Analysis: The absence of functions with significant response-time or throughput-time changes suggests the revision is performance-neutral.

Conclusion: The target version exhibits no measurable performance regression or improvement compared to the base version. The core inference engine and computational kernels remain performance-neutral across this revision. See the complete breakdown in Version Insights.
Force-pushed from 62bf34b to 10471d1
Force-pushed from 048ad94 to 6c1fde6
Force-pushed from 823244c to bab7d39
Mirrored from ggml-org/llama.cpp#19108
While working on ggml-org/llama.cpp#18860, I found a small performance optimization when loading the sub-block scales.
Behavior is unchanged; this is a manual unroll plus vectorization.
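The PR's actual diff is not reproduced in this mirror, so as illustration only, here is a minimal NEON sketch of the general pattern: replacing a scalar per-sub-block scale load with a single vector load and a widening move. The function names (`load_scales_scalar`, `load_scales_neon`) are hypothetical, and Q4_K's real scale layout (6-bit packed scales and mins per superblock) needs extra unpacking that this sketch omits; it is not the PR's code.

```c
#include <arm_neon.h>
#include <stdint.h>

// Scalar baseline: widen 8 sub-block scales to int16 one element at a time.
// (Hypothetical helper for illustration, not the kernel in the PR.)
static inline void load_scales_scalar(const int8_t * scales, int16_t * out) {
    for (int i = 0; i < 8; ++i) {
        out[i] = (int16_t) scales[i];
    }
}

// Unrolled + vectorized variant: one 8-lane load and a single widening
// move replace the per-element loop.
static inline void load_scales_neon(const int8_t * scales, int16_t * out) {
    const int8x8_t  s8  = vld1_s8(scales);  // load all 8 scales at once
    const int16x8_t s16 = vmovl_s8(s8);     // sign-extend each lane to 16 bits
    vst1q_s16(out, s16);                    // store the widened scales
}
```

Since both variants have identical behavior, a change like this should show up only in throughput, not in perplexity, which matches the measurements below.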
Llama-bench:
No changes were observed in the perplexities for Qwen3 8B 128K Q4_K_M and lfm2 1.2B Q4_K_M.
cc: @tdakhran