
Conversation


@DajanaV commented Nov 5, 2025

Mirrored from ggml-org/llama.cpp#17030

While testing #16739, perplexities for LFM2 skyrocketed. @ggerganov pointed out that some matrix shapes would probably not be supported.

LFM2 has some layers whose tensors carry two batches, so the MUL_MATs were only partially computed, leading to incorrect results. See ggml-org/llama.cpp#16739 (comment)

This patch adds basic support for tensors with ne2 > 1, using very naive chunking based on the non-repack MUL_MAT.
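For context, the approach can be sketched as follows. This is a minimal illustration, not the actual patch; `forward_mul_mat_2d`, `forward_mul_mat_batched`, and the stride parameter names are assumed:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical 2D kernel: computes one dst plane (declaration only).
void forward_mul_mat_2d(const char * src0, const char * src1, char * dst,
                        int ith, int nth);

// Naive batching sketch: run the existing 2D repack mul_mat once per
// batch plane, offsetting the operands by their dim-2 byte strides.
// All names are illustrative, not the patch's actual symbols.
void forward_mul_mat_batched(const char * src0, size_t nb02,   // weights + stride
                             const char * wdata, size_t nbw2,  // quantized src1 + stride
                             char * dst, size_t nb2,           // output + stride
                             int64_t ne12, int ith, int nth) {
    for (int64_t i2 = 0; i2 < ne12; ++i2) {
        forward_mul_mat_2d(src0  + i2 * nb02,   // weight plane (assuming ne02 == ne12)
                           wdata + i2 * nbw2,   // quantized activations for this batch
                           dst   + i2 * nb2,    // output plane
                           ith, nth);
    }
}
```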

Perplexities using this patch:

bin/llama-perplexity \
  -hf LiquidAI/LFM2-1.2B-GGUF:Q4_0  \
  -f ../wikitext-2-raw/wiki.test.raw --chunks 20
# REPACK
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198
# NO REPACK
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198

I can provide logs for other models if needed.


loci-review bot commented Nov 5, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #94 - 3D Tensor Support in Matrix Multiplication

Overview

PR #94 introduces 3D tensor support for batched matrix multiplication operations in the GGML CPU backend, specifically targeting models like LFM2 that require ne2 > 1 tensor processing. The changes modify the forward_mul_mat function in ggml/src/ggml-cpu/repack.cpp to handle multi-batch operations through enhanced chunking strategies.

Key Findings

Performance Impact:

  • Highest throughput degradation: forward_mul_mat shows a 36% increase in execution time (2,514 ns → 3,423 ns, +909 ns)
  • Highest response-time degradation: ggml_set_op_params_f32 shows a 17% increase (102 ns → 120 ns, +18 ns)

Core Function Impact:
The changes do not directly affect primary inference functions (llama_decode, llama_encode, llama_tokenize) that drive tokens-per-second performance. The modified forward_mul_mat function operates at the GGML backend level for matrix operations, meaning token throughput remains unaffected for standard inference workloads.

Power Consumption Analysis:

  • build.bin.libggml-cpu.so shows 1.44% increase in power consumption (+2,166 nJ)
  • All other binaries show negligible changes (≤0.001%)
  • Power increase correlates directly with increased CPU cycles in matrix multiplication operations

Technical Analysis:

  • Flame Graph: Reveals assertion failure paths consuming 46% more execution time due to relocated error message strings affecting cache locality
  • CFG Comparison: Identical control flow structure with performance regression attributed to memory layout changes rather than algorithmic modifications
  • Code Review: Implementation adds computational overhead through batch index calculations, nested quantization loops, and two-dimensional chunking logic

Implementation Details:
The changes introduce necessary complexity for 3D tensor support while maintaining correctness. The performance overhead stems from enhanced parameter validation, complex memory addressing patterns, and expanded working set requirements (nbw2 * ne12 vs nbw1 * ne11).
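The working-set growth can be pictured with a back-of-the-envelope sketch, assuming GGML's stride convention (nbw2 = nbw1 * ne11) and illustrative names:

```cpp
#include <cstdint>
#include <cstddef>

// Quantized src1 scratch buffer: one row plane before, ne12 planes now.
size_t wsize_2d(size_t nbw1, int64_t ne11) {
    return nbw1 * ne11;            // old path: quantize a single batch plane
}

size_t wsize_3d(size_t nbw1, int64_t ne11, int64_t ne12) {
    const size_t nbw2 = nbw1 * ne11;   // bytes per quantized batch plane
    return nbw2 * ne12;                // new path: quantize all ne12 planes
}
```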

Actionable Recommendations:

  • Consider conditional compilation or a runtime dispatch to avoid the 3D overhead for 2D operations (see the sketch after this list)
  • Optimize string literal placement for improved cache locality
  • Implement adaptive chunking based on tensor dimensions
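As a sketch of the first recommendation, here is a runtime-dispatch variant (the recommendation mentions conditional compilation; forward_mul_mat_2d and forward_mul_mat_batched are assumed names, not the PR's):

```cpp
#include "ggml.h"

void forward_mul_mat_2d(struct ggml_tensor * dst);       // assumed: original path
void forward_mul_mat_batched(struct ggml_tensor * dst);  // assumed: ne2 > 1 path

// Keep the original 2D fast path when no batch dimension is present,
// so single-batch models pay none of the 3D bookkeeping.
void forward_mul_mat(struct ggml_tensor * dst) {
    const struct ggml_tensor * src0 = dst->src[0];
    const struct ggml_tensor * src1 = dst->src[1];

    if (src0->ne[2] == 1 && src1->ne[2] == 1) {
        forward_mul_mat_2d(dst);        // pre-PR behaviour
    } else {
        forward_mul_mat_batched(dst);   // batched path from this PR
    }
}
```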

The modifications successfully enable batched matrix operations for advanced model architectures while introducing acceptable performance overhead in specialized code paths.

@DajanaV force-pushed the upstream-PR17030-branch_Alcpz-Alcpz/batched_repack_mul_mat branch from eadb483 to 0b86651 on November 5, 2025 at 18:41

loci-review bot commented Nov 5, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #94 - 3D Tensor Support

Overview

PR #94 introduces 3D tensor support for repack matrix multiplication operations, specifically targeting models like LFM2 that require batched operations with ne2 > 1. The changes are localized to the ggml-cpu/repack.cpp file but introduce measurable performance overhead.

Key Findings

Performance Impact:

  • Highest response-time change: ggml_set_op_params shows a +22.32% increase (131 ns → 160 ns)
  • Highest throughput change: forward_mul_mat (tensor_traits<block_iq4_nl>) shows a +36.85% increase in execution time (2,514 ns → 3,440 ns)

Core Function Impact:
The changes do not directly affect primary inference functions (llama_decode, llama_encode, llama_tokenize). The performance degradation occurs in lower-level GGML tensor operations, which means tokens per second performance should remain largely unaffected for typical inference workloads.

Power Consumption Analysis:

  • build.bin.libggml-cpu.so shows +1.25% power consumption increase (150,656 nJ → 152,537 nJ)
  • All other binaries show no measurable power impact
  • The increase correlates directly with throughput degradations in CPU tensor operations

Technical Analysis:

  • Flame Graph: Shows 87.4% execution time concentrated in ggml_set_op_params itself, indicating performance regression stems from internal logic rather than external function calls
  • CFG Comparison: Reveals memory layout changes causing 57% execution time increase in assert paths and 36% in error paths due to different data section organization
  • Code Review: Identifies root cause as additional batch index calculations, complex pointer arithmetic, and nested quantization loops for 3D tensor support

Implementation Changes:
The modifications add computational overhead through:

  • Batch index calculations per function call (see the sketch after this list)
  • More complex memory addressing patterns
  • 2D chunking strategy increasing thread coordination complexity
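The 2D chunking strategy amounts to mapping a linear chunk id onto a (batch, row-range) pair. A minimal sketch, with hypothetical names (decode_chunk and chunk_size are not the PR's symbols):

```cpp
#include <algorithm>
#include <cstdint>

struct chunk_range { int64_t i2, row_start, row_end; };

// Decompose a linear chunk id into a batch plane and a dst row range.
chunk_range decode_chunk(int64_t chunk_id, int64_t ne1, int64_t chunk_size) {
    const int64_t chunks_per_batch = (ne1 + chunk_size - 1) / chunk_size;
    const int64_t i2        = chunk_id / chunks_per_batch;           // batch index
    const int64_t row_start = (chunk_id % chunks_per_batch) * chunk_size;
    const int64_t row_end   = std::min(row_start + chunk_size, ne1); // clamp last chunk
    return {i2, row_start, row_end};
}
```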

Scope Assessment:
Changes are functionally necessary for 3D tensor support but introduce measured overhead in matrix multiplication operations. The performance impact is contained within GGML backend operations and should not significantly affect end-user inference performance for standard model operations.

@DajanaV force-pushed the main branch 3 times, most recently from b1ace60 to bff7103 on November 6, 2025 at 08:11
@DajanaV force-pushed the main branch 17 times, most recently from 733e776 to 2c7fec2 on November 9, 2025 at 07:08
@loci-dev force-pushed the main branch 28 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 at 13:32
@loci-dev force-pushed the main branch 2 times, most recently from 0cb533b to ef7afbe on February 13, 2026 at 02:17