
Conversation

@DajanaV (Collaborator) commented Nov 5, 2025

Mirrored from ggml-org/llama.cpp#16634

  • Rework matrix-matrix multiplication
  • Use Tensor API when available

TODOs

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version e54f4755-6bba-42c2-b8b2-fcb78022282d compared to base version b43f2432-b966-4c75-8c68-cb69d4ca588c reveals minimal performance variations within measurement precision. The changes primarily involve Metal GPU backend improvements for Apple Silicon hardware rather than core inference modifications.

Key Findings

Performance Metrics:

  • Highest Response Time change: operator void function in build.bin.llama-tts (+0.057%, +0.004 ns)
  • Highest Throughput change: make_unique<llm_graph_input_pos_bucket> in build.bin.libllama.so (-0.117%, -0.122 ns; a slight improvement)
  • No core inference functions (llama_decode, llama_encode, llama_tokenize) show measurable performance changes

Core Function Impact:
The observed changes do not affect critical inference functions. Both modified functions are utility/initialization components rather than token processing or model inference paths. No impact on tokens per second throughput is expected.

Power Consumption Analysis:
System-wide power consumption remains virtually unchanged across all binaries:

  • build.bin.libllama.so: -0.0002% change (-0.682 nJ)
  • build.bin.llama-tts: -0.00005% change (-0.154 nJ)
  • Total system change: <0.001%

Flame Graph and CFG Analysis:
The operator void function shows a simple single-frame execution profile with identical assembly code between versions. CFG comparison confirms byte-for-byte identical instructions, indicating the timing difference represents system-level execution noise rather than code modifications.

GitHub Code Review Insights:
PR #97 introduces Metal4 Tensor API support with comprehensive backward compatibility. The implementation includes runtime capability detection, conservative hardware-specific defaults, and dual code paths maintaining performance on older hardware while enabling optimizations on M5+ Apple Silicon chips.

Conclusion:
The performance variations observed are within normal measurement precision and do not represent functional changes to the inference pipeline. The Metal GPU improvements provide future performance benefits for supported hardware without affecting current CPU-based inference performance.

@DajanaV force-pushed the main branch 14 times, most recently from 44faeaa to d7421a0 on November 8, 2025 at 09:08
@loci-dev force-pushed the main branch 30 times, most recently from fc0f51d to 89ba2e9 on November 29, 2025 at 21:07

3 participants