
Conversation


@DajanaV commented Nov 5, 2025

Mirrored from ggml-org/llama.cpp#17030

While testing #16739, perplexities for LFM2 skyrocketed. @ggerganov pointed out that some matrix shapes would probably not be supported.

LFM2 has some layers whose tensors carry two batches, so the MUL_MATs were only partially computed, leading to incorrect results. See ggml-org/llama.cpp#16739 (comment)

This patch adds basic support for tensors with ne2 > 1, using very naive chunking based on the non-repack MUL_MAT.
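For context, the approach can be sketched as follows. This is a minimal illustration, not the actual patch; `forward_mul_mat_2d`, `forward_mul_mat_batched`, and the stride parameter names are assumed:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical 2D kernel: computes one dst plane (declaration only).
void forward_mul_mat_2d(const char * src0, const char * src1, char * dst,
                        int ith, int nth);

// Naive batching sketch: run the existing 2D repack mul_mat once per
// batch plane, offsetting the operands by their dim-2 byte strides.
// All names are illustrative, not the patch's actual symbols.
void forward_mul_mat_batched(const char * src0, size_t nb02,   // weights + stride
                             const char * wdata, size_t nbw2,  // quantized src1 + stride
                             char * dst, size_t nb2,           // output + stride
                             int64_t ne12, int ith, int nth) {
    for (int64_t i2 = 0; i2 < ne12; ++i2) {
        forward_mul_mat_2d(src0  + i2 * nb02,   // weight plane (assuming ne02 == ne12)
                           wdata + i2 * nbw2,   // quantized activations for this batch
                           dst   + i2 * nb2,    // output plane
                           ith, nth);
    }
}
```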

Perplexities using this patch:

bin/llama-perplexity \
  -hf LiquidAI/LFM2-1.2B-GGUF:Q4_0  \
  -f ../wikitext-2-raw/wiki.test.raw --chunks 20
# REPACK
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198
# NO REPACK
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198

I can provide logs for other models if needed.


loci-review bot commented Nov 5, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #94 - 3D Tensor Support in Matrix Multiplication

Overview

PR #94 introduces 3D tensor support for batched matrix multiplication operations in the GGML CPU backend, specifically targeting models like LFM2 that require ne2 > 1 tensor processing. The changes modify the forward_mul_mat function in ggml/src/ggml-cpu/repack.cpp to handle multi-batch operations through enhanced chunking strategies.

Key Findings

Performance Impact:

  • Highest throughput degradation: forward_mul_mat shows a 36% increase in execution time (2,514 ns → 3,423 ns, +909 ns)
  • Highest response-time degradation: ggml_set_op_params_f32 shows a 17% increase (102 ns → 120 ns, +18 ns)

Core Function Impact:
The changes do not directly affect primary inference functions (llama_decode, llama_encode, llama_tokenize) that drive tokens-per-second performance. The modified forward_mul_mat function operates at the GGML backend level for matrix operations, meaning token throughput remains unaffected for standard inference workloads.

Power Consumption Analysis:

  • build.bin.libggml-cpu.so shows 1.44% increase in power consumption (+2,166 nJ)
  • All other binaries show negligible changes (≤0.001%)
  • Power increase correlates directly with increased CPU cycles in matrix multiplication operations

Technical Analysis:

  • Flame Graph: Reveals assertion failure paths consuming 46% more execution time due to relocated error message strings affecting cache locality
  • CFG Comparison: Identical control flow structure with performance regression attributed to memory layout changes rather than algorithmic modifications
  • Code Review: Implementation adds computational overhead through batch index calculations, nested quantization loops, and two-dimensional chunking logic

Implementation Details:
The changes introduce necessary complexity for 3D tensor support while maintaining correctness. The performance overhead stems from enhanced parameter validation, complex memory addressing patterns, and expanded working set requirements (nbw2 * ne12 vs nbw1 * ne11).
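The working-set growth can be pictured with a back-of-the-envelope sketch, assuming GGML's stride convention (nbw2 = nbw1 * ne11) and illustrative names:

```cpp
#include <cstdint>
#include <cstddef>

// Quantized src1 scratch buffer: one row plane before, ne12 planes now.
size_t wsize_2d(size_t nbw1, int64_t ne11) {
    return nbw1 * ne11;            // old path: quantize a single batch plane
}

size_t wsize_3d(size_t nbw1, int64_t ne11, int64_t ne12) {
    const size_t nbw2 = nbw1 * ne11;   // bytes per quantized batch plane
    return nbw2 * ne12;                // new path: quantize all ne12 planes
}
```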

Actionable Recommendations:

  • Consider conditional compilation or a runtime dispatch to avoid the 3D overhead for 2D operations (see the sketch after this list)
  • Optimize string literal placement for improved cache locality
  • Implement adaptive chunking based on tensor dimensions
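As a sketch of the first recommendation, here is a runtime-dispatch variant (the recommendation mentions conditional compilation; forward_mul_mat_2d and forward_mul_mat_batched are assumed names, not the PR's):

```cpp
#include "ggml.h"

void forward_mul_mat_2d(struct ggml_tensor * dst);       // assumed: original path
void forward_mul_mat_batched(struct ggml_tensor * dst);  // assumed: ne2 > 1 path

// Keep the original 2D fast path when no batch dimension is present,
// so single-batch models pay none of the 3D bookkeeping.
void forward_mul_mat(struct ggml_tensor * dst) {
    const struct ggml_tensor * src0 = dst->src[0];
    const struct ggml_tensor * src1 = dst->src[1];

    if (src0->ne[2] == 1 && src1->ne[2] == 1) {
        forward_mul_mat_2d(dst);        // pre-PR behaviour
    } else {
        forward_mul_mat_batched(dst);   // batched path from this PR
    }
}
```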

The modifications successfully enable batched matrix operations for advanced model architectures while introducing acceptable performance overhead in specialized code paths.

@DajanaV force-pushed the upstream-PR17030-branch_Alcpz-Alcpz/batched_repack_mul_mat branch from eadb483 to 0b86651 on November 5, 2025 at 18:41

loci-review bot commented Nov 5, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #94 - 3D Tensor Support

Overview

PR #94 introduces 3D tensor support for repack matrix multiplication operations, specifically targeting models like LFM2 that require batched operations with ne2 > 1. The changes are localized to the ggml-cpu/repack.cpp file but introduce measurable performance overhead.

Key Findings

Performance Impact:

  • Highest response-time change: ggml_set_op_params shows a +22.32% increase (131 ns → 160 ns)
  • Highest throughput change: forward_mul_mat (tensor_traits<block_iq4_nl>) shows a +36.85% increase in execution time (2,514 ns → 3,440 ns)

Core Function Impact:
The changes do not directly affect primary inference functions (llama_decode, llama_encode, llama_tokenize). The performance degradation occurs in lower-level GGML tensor operations, which means tokens per second performance should remain largely unaffected for typical inference workloads.

Power Consumption Analysis:

  • build.bin.libggml-cpu.so shows +1.25% power consumption increase (150,656 nJ → 152,537 nJ)
  • All other binaries show no measurable power impact
  • The increase correlates directly with throughput degradations in CPU tensor operations

Technical Analysis:

  • Flame Graph: Shows 87.4% execution time concentrated in ggml_set_op_params itself, indicating performance regression stems from internal logic rather than external function calls
  • CFG Comparison: Reveals memory layout changes causing 57% execution time increase in assert paths and 36% in error paths due to different data section organization
  • Code Review: Identifies root cause as additional batch index calculations, complex pointer arithmetic, and nested quantization loops for 3D tensor support

Implementation Changes:
The modifications add computational overhead through:

  • Batch index calculations per function call (see the sketch after this list)
  • More complex memory addressing patterns
  • 2D chunking strategy increasing thread coordination complexity
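The 2D chunking strategy amounts to mapping a linear chunk id onto a (batch, row-range) pair. A minimal sketch, with hypothetical names (decode_chunk and chunk_size are not the PR's symbols):

```cpp
#include <algorithm>
#include <cstdint>

struct chunk_range { int64_t i2, row_start, row_end; };

// Decompose a linear chunk id into a batch plane and a dst row range.
chunk_range decode_chunk(int64_t chunk_id, int64_t ne1, int64_t chunk_size) {
    const int64_t chunks_per_batch = (ne1 + chunk_size - 1) / chunk_size;
    const int64_t i2        = chunk_id / chunks_per_batch;           // batch index
    const int64_t row_start = (chunk_id % chunks_per_batch) * chunk_size;
    const int64_t row_end   = std::min(row_start + chunk_size, ne1); // clamp last chunk
    return {i2, row_start, row_end};
}
```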

Scope Assessment:
Changes are functionally necessary for 3D tensor support but introduce measured overhead in matrix multiplication operations. The performance impact is contained within GGML backend operations and should not significantly affect end-user inference performance for standard model operations.

@DajanaV force-pushed the main branch 3 times, most recently from b1ace60 to bff7103 on November 6, 2025 at 08:11
@DajanaV force-pushed the main branch 17 times, most recently from 733e776 to 2c7fec2 on November 9, 2025 at 07:08
@loci-dev force-pushed the main branch 28 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 at 13:32
@loci-dev force-pushed the main branch 2 times, most recently from 0cb533b to ef7afbe on February 13, 2026 at 02:17