UPSTREAM PR #19108: ggml-cpu: arm64: Q4_K repack (i8mm) scale unroll and vectorization#1037
Conversation
Performance Review Report

Summary: No functions were identified for performance analysis between the base and target versions, indicating that no meaningful performance changes occurred in this code revision.

Analysis: The absence of functions with significant response-time or throughput-time changes suggests the revision is performance-neutral.

Conclusion: The target version exhibits no measurable performance regression or improvement compared to the base version. The core inference engine and computational kernels remain performance-neutral across this revision. See the complete breakdown in Version Insights.
Force-pushed from 62bf34b to 10471d1
Force-pushed from 048ad94 to 6c1fde6
Force-pushed from 823244c to bab7d39
Mirrored from ggml-org/llama.cpp#19108
While working on ggml-org/llama.cpp#18860, I found a small performance optimization when loading the sub-block scales.
Behavior is unchanged; this is a manual unroll plus vectorization.
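The PR's actual diff is not reproduced in this mirror, so as illustration only, here is a minimal NEON sketch of the general pattern: replacing a scalar per-sub-block scale load with a single vector load and a widening move. The function names (`load_scales_scalar`, `load_scales_neon`) are hypothetical, and Q4_K's real scale layout (6-bit packed scales and mins per superblock) needs extra unpacking that this sketch omits; it is not the PR's code.

```c
#include <arm_neon.h>
#include <stdint.h>

// Scalar baseline: widen 8 sub-block scales to int16 one element at a time.
// (Hypothetical helper for illustration, not the kernel in the PR.)
static inline void load_scales_scalar(const int8_t * scales, int16_t * out) {
    for (int i = 0; i < 8; ++i) {
        out[i] = (int16_t) scales[i];
    }
}

// Unrolled + vectorized variant: one 8-lane load and a single widening
// move replace the per-element loop.
static inline void load_scales_neon(const int8_t * scales, int16_t * out) {
    const int8x8_t  s8  = vld1_s8(scales);  // load all 8 scales at once
    const int16x8_t s16 = vmovl_s8(s8);     // sign-extend each lane to 16 bits
    vst1q_s16(out, s16);                    // store the widened scales
}
```

Since both variants have identical behavior, a change like this should show up only in throughput, not in perplexity, which matches the measurements below.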
Llama-bench:
No changes were observed in the perplexities for Qwen3 8B 128K Q4_K_M and lfm2 1.2B Q4_K_M.
cc: @tdakhran