

@loci-dev

Mirrored from ggml-org/llama.cpp#17526

For small shapes where the number of columns is small (e.g., 16), the current logic skipped some chunks due to rounding.

The issue was observed with NB_COLS 8 and ne01 16, and could potentially happen with NB_COLS 4 and other thread/shape combinations.
It also affects the corner case where chunking is disabled.
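For illustration, here is a minimal sketch of how the rounding can go wrong. The variable names (`nr0`, `dr0`) and the align-down step are assumptions modeled on the discussion, not the verbatim repack.cpp code; the actual alignment direction may differ, but the collision is the same:

```cpp
// Sketch only: shows why 2-row chunks collide once starts are aligned
// to NB_COLS. Names and the align-down step are assumptions.
#include <cstdio>

int main() {
    const int nr0     = 16; // rows to split across threads (ne01)
    const int NB_COLS = 8;  // minimum alignment for a chunk start
    const int nth     = 8;  // threads, one chunk per thread

    const int dr0 = (nr0 + nth - 1) / nth; // 2 rows per chunk: below NB_COLS
    for (int ith = 0; ith < nth; ++ith) {
        const int raw   = dr0 * ith;
        const int start = raw - raw % NB_COLS; // align down to NB_COLS
        std::printf("chunk %d -> row %d\n", ith, start);
    }
    // chunks 0-3 all start at row 0 and chunks 4-7 all start at row 8,
    // so ranges overlap and the intended 2-row chunks are lost
    return 0;
}
```

With `dr0 = 2` below `NB_COLS = 8`, the aligned starts collide, which is the situation this PR's guard refuses to create.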

@max-krasnyansky I checked the performance here and didn't see any issues. Let me know if you'd like me to run any particular test.

Performance

RPI5

| model | test | 2f416b265 (7162) t/s | 3e18dba (7161) t/s |
| --- | --- | --- | --- |
| lfm2 350M Q4_0 | pp256 | 174.46 ± 0.07 | 173.41 ± 0.64 |
| lfm2 350M Q4_0 | tg128 | 51.58 ± 0.03 | 51.38 ± 0.26 |
| lfm2 700M Q4_0 | pp256 | 81.79 ± 0.01 | 82.55 ± 0.03 |
| lfm2 700M Q4_0 | tg128 | 25.78 ± 0.00 | 25.86 ± 0.00 |

M4 Max

| model | test | 2f416b265 (7162) t/s | 3e18dba (7161) t/s |
| --- | --- | --- | --- |
| lfm2 1.2B Q4_K Medium | pp256 | 682.39 ± 3.23 | 682.82 ± 2.97 |
| lfm2 1.2B Q4_K Medium | tg128 | 233.77 ± 4.45 | 234.96 ± 0.57 |
| lfm2 700M Q4_K Medium | pp256 | 1070.08 ± 2.77 | 1067.29 ± 7.14 |
| lfm2 700M Q4_K Medium | tg128 | 331.12 ± 1.27 | 333.13 ± 1.32 |
| llama 8B Q4_K Medium | pp256 | 100.26 ± 0.11 | 96.65 ± 1.75 |
| llama 8B Q4_K Medium | tg128 | 43.10 ± 0.50 | 41.69 ± 0.72 |
| qwen3 8B Q4_K Medium | pp256 | 94.40 ± 0.33 | 90.45 ± 0.34 |
| qwen3 8B Q4_K Medium | tg128 | 40.92 ± 0.33 | 40.29 ± 0.27 |


loci-review bot commented Nov 26, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #337

Overview

This PR implements a chunking safety fix in the REPACK matrix multiplication module (`ggml/src/ggml-cpu/repack.cpp`). The change adds a validation check to prevent creating chunks smaller than the minimum alignment requirement (`NB_COLS`) when distributing work across threads in `forward_mul_mat`.

Code Change Analysis

The modification adds a condition that verifies the chunk size before increasing `nchunk0` to match the thread count. Specifically, it calculates `dr0` early and checks `(nr0 + nth - 1) / nth >= min_chunk_size` before setting `nchunk0 = nth`. This prevents the chunk overlap that occurs with small matrix shapes (e.g., `ne01 = 16` with `NB_COLS = 8` and 8 threads), where the original code would create 2-element chunks that, after alignment, would overlap and produce incorrect results.

The fix is a correctness improvement that addresses edge cases in small matrix operations without modifying the core computation logic. The change is localized to approximately 7 lines within a single function template in one file.
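A minimal sketch of the guard described above, with names modeled on the review text (`nr0`, `nth`, `nchunk0`, `min_chunk_size`); the real repack.cpp implementation may differ in detail:

```cpp
// Sketch of the chunking guard; names follow the review text and are
// assumptions, not the verbatim repack.cpp code.
static int pick_nchunk0(int nr0, int nth, int nchunk0, int min_chunk_size) {
    const int dr0 = (nr0 + nth - 1) / nth; // rows per chunk with one chunk per thread
    if (dr0 >= min_chunk_size) {
        nchunk0 = nth; // safe: every chunk stays at least NB_COLS wide
    }
    // otherwise keep the coarser chunk count, so alignment cannot make
    // neighbouring chunks overlap
    return nchunk0;
}
```

The design keeps the fast one-chunk-per-thread path for normal shapes and only falls back to fewer, wider chunks when the per-thread slice would drop below the alignment minimum.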

Performance Impact Assessment

Based on the analysis context provided, this PR shows no measurable performance change in the reported metrics. The modification is a logic fix that only affects the chunking strategy for edge cases involving small matrices. The PR author's benchmarks on RPI5 and M4 Max show variations within ±1.75%, which falls within measurement noise.

Inference Impact: No functions related to tokenization or inference (llama_decode, llama_encode, llama_tokenize) are modified by this PR. The change affects only the internal chunking mechanism in matrix multiplication operations. Therefore, there is no expected impact on tokens per second for inference workloads.

Power Consumption: No changes reported in power consumption metrics, as the fix does not alter the computational workload or instruction count for typical matrix sizes.
