
Conversation

@DajanaV (Collaborator) commented Nov 13, 2025

Mirrored from ggml-org/llama.cpp#17241

This is a continuation of #17030 after a performance regression was reported.

Perplexity Comparison (Repack vs Non-Repack)

Command:

```sh
MODELS="unsloth/Qwen3-8B-128K-GGUF:Q4_0 ggml-org/Meta-Llama-3.1-8B-Instruct-Q4_0-GGUF:Q4_0 LiquidAI/LFM2-700M-GGUF:Q4_0 LiquidAI/LFM2-1.2B-GGUF:Q4_0"
for d in build-cpu-aarm64 build-cpu-aarm64-norepack; do
    for model in $MODELS; do
        ${d}/bin/llama-perplexity -hf "$model" -f ./wikitext-2-raw/wiki.test.raw --chunks 20 -dev none
    done
done
```
| Model | Repack PPL | Non-Repack PPL |
| --- | --- | --- |
| LFM2-700M Q4_0 | 20.3324 ± 0.87133 | 20.3324 ± 0.87133 |
| LFM2-1.2B Q4_0 | 15.7524 ± 0.63304 | 15.7524 ± 0.63304 |
| Meta-Llama-3.1-8B-Instruct Q4_0 | 8.6578 ± 0.30323 | 8.6578 ± 0.30323 |
| Qwen3-8B-128K Q4_0 | 11.1735 ± 0.48175 | 11.1735 ± 0.48175 |

Llama-bench

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 8B Q4_0 | 4.45 GiB | 8.19 B | CPU | 8 | 1 | pp256 | 148.88 ± 0.60 |
| qwen3 8B Q4_0 | 4.45 GiB | 8.19 B | CPU | 8 | 1 | tg128 | 47.71 ± 0.35 |
| llama 8B Q4_0 | 5.61 GiB | 8.03 B | CPU | 8 | 1 | pp256 | 151.26 ± 1.94 |
| llama 8B Q4_0 | 5.61 GiB | 8.03 B | CPU | 8 | 1 | tg128 | 43.47 ± 0.78 |
| lfm2 350M Q4_0 | 206.87 MiB | 354.48 M | CPU | 8 | 1 | pp256 | 3248.97 ± 32.82 |
| lfm2 350M Q4_0 | 206.87 MiB | 354.48 M | CPU | 8 | 1 | tg128 | 562.68 ± 7.35 |
| lfm2 700M Q4_0 | 423.37 MiB | 742.49 M | CPU | 8 | 1 | pp256 | 1585.66 ± 13.60 |
| lfm2 700M Q4_0 | 423.37 MiB | 742.49 M | CPU | 8 | 1 | tg128 | 349.23 ± 2.42 |

build: c77bafd (6967) THIS PR

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 8B Q4_0 | 4.45 GiB | 8.19 B | CPU | 8 | 1 | pp256 | 148.80 ± 0.18 |
| qwen3 8B Q4_0 | 4.45 GiB | 8.19 B | CPU | 8 | 1 | tg128 | 48.50 ± 0.81 |
| llama 8B Q4_0 | 5.61 GiB | 8.03 B | CPU | 8 | 1 | pp256 | 160.24 ± 0.76 |
| llama 8B Q4_0 | 5.61 GiB | 8.03 B | CPU | 8 | 1 | tg128 | 45.60 ± 0.17 |
| lfm2 350M Q4_0 | 206.87 MiB | 354.48 M | CPU | 8 | 1 | pp256 | 3269.37 ± 22.99 |
| lfm2 350M Q4_0 | 206.87 MiB | 354.48 M | CPU | 8 | 1 | tg128 | 595.18 ± 3.34 |
| lfm2 700M Q4_0 | 423.37 MiB | 742.49 M | CPU | 8 | 1 | pp256 | 1606.13 ± 8.51 |
| lfm2 700M Q4_0 | 423.37 MiB | 742.49 M | CPU | 8 | 1 | tg128 | 362.24 ± 3.19 |

build: 2776db6 (7047) MASTER

loci-review bot commented Nov 13, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: PR #191 - 3D Tensor Support in GGML CPU Repack

Overview

Pull Request #191 introduces 3D tensor support to the GGML CPU repack matrix multiplication system. The changes enable processing of transformer models with batch dimensions while maintaining numerical accuracy, but introduce measurable performance overhead in quantized operations.
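To make the shape of the change concrete, here is a minimal, self-contained sketch (illustrative only, not the actual ggml-cpu repack code; names are simplified) of how a 2D matrix multiplication kernel is typically extended to 3D: an outer loop over the batch dimension offsets each operand by its per-matrix stride and reuses the existing 2D path.

```cpp
#include <cstdint>

// Naive 2D matmul over row-major data:
// src0 is ne01 x ne00, src1 is ne11 x ne00, dst is ne11 x ne01.
static void mul_mat_2d(const float * src0, const float * src1, float * dst,
                       int64_t ne00, int64_t ne01, int64_t ne11) {
    for (int64_t i1 = 0; i1 < ne11; ++i1) {
        for (int64_t i0 = 0; i0 < ne01; ++i0) {
            float sum = 0.0f;
            for (int64_t k = 0; k < ne00; ++k) {
                sum += src0[i0*ne00 + k] * src1[i1*ne00 + k];
            }
            dst[i1*ne01 + i0] = sum;
        }
    }
}

// 3D wrapper: iterate the batch dimension (ne02) and offset each operand by
// its per-matrix element stride before reusing the unchanged 2D kernel.
static void mul_mat_3d(const float * src0, const float * src1, float * dst,
                       int64_t ne00, int64_t ne01, int64_t ne11, int64_t ne02) {
    const int64_t s0 = ne00*ne01; // elements per src0 matrix
    const int64_t s1 = ne00*ne11; // elements per src1 matrix
    const int64_t sd = ne01*ne11; // elements per dst matrix
    for (int64_t i02 = 0; i02 < ne02; ++i02) {
        mul_mat_2d(src0 + i02*s0, src1 + i02*s1, dst + i02*sd, ne00, ne01, ne11);
    }
}
```

In the real repack path the weights are quantized and repacked into block layouts, so strides are expressed in bytes rather than elements, but the batch-loop structure is the same idea.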

Key Findings

Performance Impact:

  • Largest regression: the forward_mul_mat function shows a +37.52% increase in execution time (2489 ns → 3423 ns), a performance regression in the core matrix multiplication path
  • Largest improvement: the quantize_row_iq4_nl function shows a 25.31% reduction in response time (98 ns → 73 ns), indicating faster quantization operations
  • Power consumption: a system-wide increase of +1.275% in build.bin.libggml-cpu.so, adding approximately 1936 nanojoules

Core Function Impact:
The changes affect critical inference components within the GGML backend system. The forward_mul_mat function is part of the high-performance inference pipeline for quantized matrix operations, directly impacting computational efficiency for IQ4_NL quantized models.

Inference Performance Impact:
Based on the reference model performance (ollama://smollm:135m on a 12th Gen Intel i7-1255U), the 934 ns increase in per-call execution time for matrix multiplication may reduce tokens per second for workloads that rely heavily on IQ4_NL quantization, though the impact varies with model architecture and batch size.
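As a rough, purely illustrative way to translate the per-call figure into a per-token effect (the call count and baseline latency below are assumptions, not measurements from this PR):

```cpp
#include <cstdio>

int main() {
    // Assumed values for illustration only.
    const double added_ns_per_call     = 934.0;  // per-call increase cited above
    const double calls_per_token       = 300.0;  // hypothetical affected mul_mat calls per token
    const double baseline_ms_per_token = 20.0;   // hypothetical baseline (~50 t/s)

    const double added_ms = calls_per_token * added_ns_per_call * 1e-6;
    const double slowdown = 100.0 * added_ms / (baseline_ms_per_token + added_ms);
    std::printf("added %.3f ms/token -> ~%.1f%% fewer tokens/s\n", added_ms, slowdown);
    return 0;
}
```

With these assumed numbers the overhead works out to roughly a 1-2% reduction in generation speed; real figures depend entirely on how many affected calls a given model issues per token.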

Technical Analysis:

  • Flame Graph: Shows shallow execution structure with 80.6% time in main function body, indicating efficient core logic despite added complexity
  • CFG Comparison: Reveals identical control flow structure with performance improvements from compiler optimizations and better memory layout (50.6% improvement in assert path timing)
  • Code Review: Identifies increased computational complexity from 3D tensor indexing, nested loop overhead, and more complex pointer arithmetic as primary sources of throughput degradation
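The pointer-arithmetic point in the last bullet can be illustrated with ggml-style byte strides (a sketch with simplified names, not code from the PR): addressing a row in a 3D view costs extra multiply-adds per access compared to the 2D case, and the surrounding kernel gains matching nested loops over the batch indices.

```cpp
#include <cstddef>
#include <cstdint>

// 2D case: one stride term per row lookup.
static inline const char * row_ptr_2d(const char * base, size_t nb1, int64_t i1) {
    return base + i1*nb1;
}

// 3D/4D case: the batch indices contribute additional stride terms,
// so each row address needs more arithmetic than in the 2D path.
static inline const char * row_ptr_3d(const char * base,
                                      size_t nb1, size_t nb2, size_t nb3,
                                      int64_t i1, int64_t i2, int64_t i3) {
    return base + i3*nb3 + i2*nb2 + i1*nb1;
}
```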

Affected Binaries:

  • build.bin.libggml-cpu.so: Primary impact with measurable power consumption increase
  • All other binaries show no performance changes

The implementation successfully enables 3D tensor processing while maintaining backward compatibility, with performance trade-offs concentrated in specific quantized operation paths.

@DajanaV force-pushed the main branch 21 times, most recently from 701e6c7 to 6196a56 on November 16, 2025 at 01:36
@loci-dev force-pushed the main branch 30 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 at 13:32