Performance Review Report

Summary

This code change introduces Q5_K quantization optimizations for ARM architectures with minimal performance impact. The target version shows a 3.6% power consumption increase in the CPU backend (a 5.3 kJ increase) while improving a critical matrix multiplication kernel by 21 ns. Only two functions show measurable changes, both with negligible absolute impact.

Change Context

The 10 commits by Alberto Cabrera implement Q5_K quantization support with ARM-specific GEMM/GEMV optimizations (i8mm instructions). Changes span 119 modified files, 44 additions, and 23 deletions, primarily affecting quantization kernels.

Performance Impact Analysis

Power Consumption:

Function-Level Changes:
Code Change Justification

The commits implement new Q5_K quantization capabilities: a 5-bit weight compression format that reduces memory bandwidth while maintaining model accuracy. The added functionality (repack operations, ARM i8mm optimizations, GEMM/GEMV implementations) justifies the modest power increase. The 21 ns improvement in the existing GEMM kernel suggests compiler optimizations benefited from the code reorganization. The 3.6% power increase reflects additional quantization code paths rather than a performance regression. Q5_K support enables users to run larger models in memory-constrained environments, trading minimal compute overhead for significant memory savings.
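As a rough illustration of that memory trade-off, the sketch below computes bits per weight for Q5_K versus F16, assuming the usual ggml super-block layout of 256 weights per block (half-precision d and dmin, 12 scale bytes, 32 qh bytes, 128 qs bytes). The field sizes are an assumption from reading ggml, not a figure stated in this report.

```c
#include <stdio.h>

int main(void) {
    // Assumed Q5_K super-block layout: d (2 B) + dmin (2 B) + scales (12 B)
    // + qh (32 B) + qs (128 B) = 176 bytes covering 256 weights.
    const int    weights_per_block = 256;
    const int    q5_k_block_bytes  = 2 + 2 + 12 + 32 + 128;
    const double q5_k_bpw = 8.0 * q5_k_block_bytes / weights_per_block; // ~5.5 bits/weight
    const double f16_bpw  = 16.0;
    printf("Q5_K: %.2f bits/weight, ~%.1fx smaller than F16\n",
           q5_k_bpw, f16_bpw / q5_k_bpw); // ~2.9x
    return 0;
}
```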
Mirrored from ggml-org/llama.cpp#18860
This PR implements the REPACK version of q5_K, following most of the existing design used for q4_K, since Q5_K only differs from q4_K in having the `qh` field with the additional bit. Most of the code is shared, but I didn't know how to abstract the common patterns without creating a convoluted mess of functions. Since only Q4_K and Q5_K share the same 6-bit scales and mins decode, I opted to duplicate the code.
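To make the `qh` difference concrete, here is a minimal C sketch of how one 64-weight chunk of a Q5_K super-block is rebuilt from 4 low bits in `qs` plus 1 extra bit in `qh`. It loosely follows the scalar reference dequantization in ggml; the function name is made up for illustration, and the actual repack GEMM/GEMV kernels operate on interleaved blocks with NEON rather than this scalar loop.

```c
#include <stdint.h>

// Illustrative only: rebuild the 64 quants of chunk j (0..3) of a 256-weight
// Q5_K super-block. `ql` points at the 32 qs bytes of this chunk (qs + 32*j),
// `qh` at the 32-byte high-bit array shared by all chunks.
static void q5_chunk_to_quants(const uint8_t *ql, const uint8_t *qh,
                               int j, uint8_t out[64]) {
    const uint8_t u1 = (uint8_t)(1u << (2 * j)); // qh bit for the low-nibble half
    const uint8_t u2 = (uint8_t)(2u << (2 * j)); // qh bit for the high-nibble half
    for (int l = 0; l < 32; ++l) {
        out[l]      = (uint8_t)((ql[l] & 0x0F) | ((qh[l] & u1) ? 16 : 0)); // 5-bit value, 0..31
        out[l + 32] = (uint8_t)((ql[l] >> 4)   | ((qh[l] & u2) ? 16 : 0));
    }
}
```

Q4_K stops at the nibble, which is why the two formats can share the 6-bit scale/min decode while still needing separate bit-unpacking code.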
I also moved around some declarations for Q2_K because the structure seemed weird (it's inverted compared to what I've seen in quants.c). The Q2_K function declarations were left where they were to avoid polluting the diff and messing up the blame. If you want me to revert it, just say so.
I've followed the testing practices from #17494 and #18096 and compared the GEMM and GEMV output of both `repack` and `generic` vs the current `vec_dot` version when repack is disabled.
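For readers unfamiliar with that workflow, the comparison amounts to running the same GEMM/GEMV through two code paths and checking that the outputs agree within a small tolerance. A minimal sketch of such a check follows; the function name and tolerance handling are hypothetical, not taken from the referenced PRs.

```c
#include <stddef.h>
#include <stdio.h>

// Compare a test kernel's output against a reference output using the
// normalized mean squared error; returns 1 if they agree within `tol`.
static int outputs_match(const float *ref, const float *test, size_t n, double tol) {
    double err = 0.0, ref_norm = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double)ref[i] - (double)test[i];
        err      += d * d;
        ref_norm += (double)ref[i] * (double)ref[i];
    }
    const double nmse = ref_norm > 0.0 ? err / ref_norm : err;
    printf("NMSE = %g\n", nmse);
    return nmse <= tol;
}
```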
Performance

M4 max (-DGGML_BLAS=OFF -DGGML_METAL=OFF)
Exynos 2400
Perplexity
Some outputs of llama-cli to test token generation:
llama-cli using repack
llama-cli using generic
Patch for routing through the generic implementation
```diff
diff --git a/ggml/src/ggml-cpu/arch/arm/repack.cpp b/ggml/src/ggml-cpu/arch/arm/repack.cpp
index ed2f9c668..440168e18 100644
--- a/ggml/src/ggml-cpu/arch/arm/repack.cpp
+++ b/ggml/src/ggml-cpu/arch/arm/repack.cpp
@@ -805,7 +805,7 @@ void ggml_gemv_q5_K_8x8_q8_K(int n,
     UNUSED(ncols_interleaved);
     UNUSED(blocklen);
 
-#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
+#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_DOTPROD)
     constexpr int col_pairs = ncols_interleaved / 2;
     const uint8x16_t m4b = vdupq_n_u8(0x0f);
     const uint8x16_t mone = vdupq_n_u8(1);
@@ -3017,7 +3017,7 @@ void ggml_gemm_q5_K_8x8_q8_K(int n,
     UNUSED(ncols_interleaved);
     UNUSED(blocklen);
 
-#if defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
+#if 0 && defined(__aarch64__) && defined(__ARM_NEON) && defined(__ARM_FEATURE_MATMUL_INT8)
     constexpr int q8_k_blocklen = 4;
     constexpr int col_pairs = ncols_interleaved / 2;
     const uint8x16_t m4b = vdupq_n_u8(0x0f);
```