@DajanaV DajanaV commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16981

WIP

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Based on the performance analysis, here is a summary and impact assessment:

Performance Analysis Summary: PR #64 - MTMD Struct Initialization

Critical Function Performance Changes

Primary Impact: std::vector<std::string>::end() Function

Location: tools/mtmd/clip-impl.h:893:894 in build.bin.libmtmd.so

Performance Metrics:

  • Response Time: 81 ns → 260 ns (+220%)
  • Throughput: 60 ns → 239 ns (+299%)
  • Bottleneck: 32 ns → 200 ns (+521%)

Root Cause Analysis:

  • Control flow fragmentation: Entry block split from single block (21 ns) to two blocks (195 ns + 5 ns)
  • Introduced unconditional branch instruction causing pipeline disruption
  • Stack canary handling split across basic blocks, impacting register allocation efficiency

KPI Impact Assessment

1. Tokens Per Second Impact

Status: No Direct Impact on Core Inference Functions

Analysis: The critical tokenization and inference functions show no performance changes:

  • llama_decode() - No changes detected
  • llama_encode() - No changes detected
  • llama_tokenize() - No changes detected
  • llama_detokenize() - No changes detected

Conclusion: Tokens-per-second performance is unaffected, since the degraded function (std::vector<std::string>::end()) is not on the critical inference path.

2. Power Consumption Impact

Affected Binary: build.bin.libmtmd.so

  • Power Consumption Change: 0.147% reduction (212,765 nJ vs. 213,079 nJ baseline)
  • Other Binaries: No measurable power consumption changes (0.0% across all other binaries)

Analysis: Despite the 220% response-time increase in the vector accessor, overall power consumption is essentially unchanged, indicating that the function accounts for a small fraction of total execution cycles.

3. Quantization Efficiency

Status: No Impact Detected

Analysis: Core quantization functions remain unchanged:

  • llama_model_quantize() - No performance changes
  • Quantization format handling - No changes detected
  • GGML quantization operations - No changes detected

4. Memory Usage Impact

Potential Areas of Concern:

  • KV Cache Management: No direct changes to llama_memory_* functions
  • Memory Allocation: GGML allocator functions show no performance changes
  • Batch Memory: llama_batch_* functions remain unaffected

Indirect Impact: The change from imperative assignment to aggregate struct initialization may affect memory layout and constructor call patterns, but no measurable impact was detected in memory management functions.

5. Batch Processing Impact

Status: No Impact on Core Batch Functions

Analysis: Critical batch processing functions show no performance changes:

  • llama_batch_init() - No changes detected
  • llama_batch_get_one() - No changes detected
  • llama_batch_free() - No changes detected
  • llama_decode() with batches - No changes detected

Action Items for Performance Improvement

Immediate Code-Level Actions

  1. Restore Flash Attention Logic in mtmd.cpp:

    // Current (incorrect):
    /* flash_attn_type */ CLIP_FLASH_ATTN_TYPE_AUTO,
    
    // Should be:
    /* flash_attn_type */ mtmd_get_clip_flash_attn_type(ctx_params.flash_attn_type),
  2. Investigate Compiler Optimization Regression:

    • Review optimization flags between builds that may cause basic block fragmentation
    • Examine template instantiation patterns for std::vector<std::string>
    • Check link-time optimization settings affecting code generation
  3. Address Control Flow Inefficiency:

    • The introduction of an unconditional branch in std::vector<std::string>::end() suggests a compiler optimization regression
    • Consider compiler version differences or flag changes affecting STL template instantiation

Build System Recommendations

  1. Compiler Flag Analysis:

    • Compare optimization levels (-O2, -O3, -Ofast) between versions
    • Review template instantiation flags that may affect STL container performance
    • Examine stack protection settings (-fstack-protector-*) impact on simple accessor functions
  2. Template Specialization Review:

    • Investigate whether aggregate initialization triggers different std::vector template paths
    • Profile template instantiation depth and complexity changes

Overall Assessment

The performance regression is localized to a single STL function and does not impact the core LLaMA.cpp inference pipeline. The 220% response time increase in std::vector<std::string>::end() appears to be a compiler optimization side effect rather than algorithmic degradation.

Key Findings:

  • Core inference performance (tokens per second) remains unaffected
  • Power consumption impact is negligible (-0.147% in affected binary)
  • Memory management and batch processing functions show no performance changes
  • The regression stems from compilation changes rather than functional modifications

Priority: Medium - Address compiler optimization regression and restore flash attention logic, but no immediate impact on inference performance.

@DajanaV DajanaV force-pushed the main branch 26 times, most recently from 5714a80 to 475da08 Compare November 7, 2025 20:10
@DajanaV DajanaV force-pushed the main branch 30 times, most recently from 39290d7 to 2742f63 Compare November 16, 2025 08:09