
Conversation

@DajanaV (Collaborator) commented Nov 5, 2025

Mirrored from ggml-org/llama.cpp#16634

  • Rework matrix-matrix multiplication
  • Use Tensor API when available

TODOs

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version e54f4755-6bba-42c2-b8b2-fcb78022282d compared to base version b43f2432-b966-4c75-8c68-cb69d4ca588c reveals minimal performance variations within measurement precision. The changes primarily involve Metal GPU backend improvements for Apple Silicon hardware rather than core inference modifications.

Key Findings

Performance Metrics:

  • Highest Response Time change: operator void function in build.bin.llama-tts (+0.057%, +0.004 ns)
  • Highest Throughput change: make_unique<llm_graph_input_pos_bucket> in build.bin.libllama.so (-0.117%, -0.122 ns; a slight improvement)
  • No core inference functions (llama_decode, llama_encode, llama_tokenize) show measurable performance changes

Core Function Impact:
The observed changes do not affect critical inference functions. Both modified functions are utility/initialization components rather than token processing or model inference paths. No impact on tokens per second throughput is expected.

Power Consumption Analysis:
System-wide power consumption remains virtually unchanged across all binaries:

  • build.bin.libllama.so: -0.0002% change (-0.682 nJ)
  • build.bin.llama-tts: -0.00005% change (-0.154 nJ)
  • Total system change: <0.001%

Flame Graph and CFG Analysis:
The operator void function shows a simple single-frame execution profile with identical assembly code between versions. CFG comparison confirms byte-for-byte identical instructions, indicating the timing difference represents system-level execution noise rather than code modifications.

GitHub Code Review Insights:
PR #97 introduces Metal4 Tensor API support with comprehensive backward compatibility. The implementation includes runtime capability detection, conservative hardware-specific defaults, and dual code paths maintaining performance on older hardware while enabling optimizations on M5+ Apple Silicon chips.

Conclusion:
The performance variations observed are within normal measurement precision and do not represent functional changes to the inference pipeline. The Metal GPU improvements provide future performance benefits for supported hardware without affecting current CPU-based inference performance.

@DajanaV force-pushed the main branch 14 times, most recently from 44faeaa to d7421a0 on November 8, 2025 at 09:08
@loci-dev force-pushed the main branch 30 times, most recently from fc0f51d to 89ba2e9 on November 29, 2025 at 21:07

3 participants