
Optimize Metal Tensor API usage #20962

Merged
ggerganov merged 7 commits into ggml-org:master from Developer-Ecosystem-Engineering:optimize_metal_tensor on Apr 25, 2026

Conversation

@Developer-Ecosystem-Engineering
Contributor

Overview

Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else.

Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.

Geomean improvement of ~26%, ranging from ~6.8% (TL-Q4_0) to ~71.9% (DeepSeek-8B-f16).

Further test matrix below

Additional information

  • Tile dimensions are now configurable via compile-time macros (NRA×NRB), currently defaulting to 64×128 vs the legacy kernel's fixed 64×32.
  • New dimensions offer better performance across all models tested.
  • Matrix B is read directly from device memory, eliminating the threadgroup staging copy.
  • The new kernel uses cooperative tensor accessors (cT.get_multidimensional_index / cT[i]) for direct per-element device writes
  • Threadgroup memory holds only dequantized A: NRA × NK_TOTAL × sizeof(fp16).

Testing Details

  • Tested on 16-inch M5 Max (Best)

| Model | pp512 | pp1024 | pp2048 | pp4096 | Model GeoMean |
|---|---|---|---|---|---|
| DeepSeek-8B-f16 | +86.2% | +84.3% | +80.4% | +71.9% | +80.6% |
| L2-7B-Q6_K | +49.8% | +47.8% | +46.3% | +43.7% | +46.9% |
| TL-Q3_K_S | +32.7% | +31.9% | +27.3% | +21.5% | +28.3% |
| TL-Q5_K_M | +28.2% | +29.5% | +25.5% | +19.7% | +25.7% |
| TL-Q2_K | +28.6% | +30.5% | +26.2% | +20.0% | +26.3% |
| G-2B-q8_0 | +27.9% | +27.6% | +28.6% | +27.9% | +28.0% |
| TL-Q6_K | +27.8% | +28.4% | +23.9% | +18.1% | +24.5% |
| Q3-4B-Q8_0 | +23.9% | +24.2% | +22.8% | +20.7% | +22.9% |
| TL-IQ4_XS | +23.5% | +23.6% | +20.5% | +16.6% | +21.0% |
| TL-Q4_K_M | +18.7% | +19.5% | +17.0% | +13.2% | +17.1% |
| TL-Q8_0 | +17.1% | +17.1% | +15.4% | +11.9% | +15.4% |
| TL-Q5_0 | +14.8% | +16.3% | +13.8% | +10.2% | +13.8% |
| TL-Q4_0 | +6.1% | +7.7% | +6.6% | +6.8% | +6.8% |
| **Overall GeoMean** | | | | | **+26.4%** |
  • I have read and agree with the contributing guidelines: Yes
  • AI usage disclosure: Yes — assistive tooling was used to navigate and better learn the project codebase, and to split the work into phases.

github-actions bot added the labels **ggml** (changes relating to the ggml tensor library for machine learning) and **Apple Metal** (https://en.wikipedia.org/wiki/Metal_(API)) Mar 24, 2026
@Developer-Ecosystem-Engineering changed the title from "Optimize Metal Tensor API usage for matmul2d" to "Optimize Metal Tensor API usage" Mar 24, 2026
@H-A-Khan

M5 Max (MBP 16"), 6+12 CPU, 40 GPU, 64 GB

Tested with LLaMA 7B v2 (F16, Q8_0, Q4_0) — same models used in the Apple Silicon M-series benchmark discussion.

PR #20962 results (build c3a1128)

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B F16 | 12.55 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3129.64 ± 4.17 |
| llama 7B F16 | 12.55 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 35.93 ± 0.15 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3101.56 ± 8.30 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 68.34 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3246.19 ± 4.60 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 110.48 ± 0.41 |

Comparison with old pinned build (8e672ef)

| Test | Old (8e672ef) | This PR | Change |
|---|---|---|---|
| F16 pp512 | 1018.30 t/s | 3129.64 t/s | +207% |
| F16 tg128 | 37.58 t/s | 35.93 t/s | -4% |
| Q8_0 pp512 | 1051.59 t/s | 3101.56 t/s | +195% |
| Q8_0 tg128 | 64.61 t/s | 68.34 t/s | +6% |
| Q4_0 pp512 | 987.10 t/s | 3246.19 t/s | +229% |
| Q4_0 tg128 | 102.93 t/s | 110.48 t/s | +7% |

Prompt processing ~3x faster across the board. Text generation unchanged as expected (memory-bandwidth bound). Great work! 🚀

@ggerganov
Member

ggerganov commented Mar 28, 2026

Could you also confirm correctness by running llama-perplexity? Thanks.

Edit: also, no need to compare to the old 8e672ef. Compare to current master.

@Hassan-A-K

Hassan-A-K commented Mar 28, 2026

@ggerganov Perplexity results on M5 Max (MBP 16", 40 GPU, 64 GB) — Wikitext-2 test set:

| Model | Perplexity (PPL) |
|---|---|
| F16 | 5.7966 ± 0.03235 |
| Q8_0 | 5.7978 ± 0.03236 |
| Q4_0 | 5.9622 ± 0.03348 |

All within expected ranges. Correctness looks good.

Edit: running perplexity comparison against current master as well. Results incoming.

@Hassan-A-K

Hassan-A-K commented Mar 28, 2026

@ggerganov Updated results — perplexity + benchmark comparison against current master (c46758d).

Perplexity — Wikitext-2 (M5 Max, MBP 16", 40 GPU, 64 GB)

| Model | Master | PR #20962 |
|---|---|---|
| F16 | 5.7962 | 5.7966 |
| Q8_0 | 5.7974 | 5.7978 |
| Q4_0 | 5.9618 | 5.9622 |

Benchmark vs master

| Test | Master (c46758d) | PR #20962 | Change |
|---|---|---|---|
| F16 pp512 | 1,601.60 t/s | 3,129.64 t/s | +95% |
| F16 tg128 | 36.86 t/s | 35.93 t/s | -3% |
| Q8_0 pp512 | 1,909.17 t/s | 3,101.56 t/s | +62% |
| Q8_0 tg128 | 67.86 t/s | 68.34 t/s | +1% |
| Q4_0 pp512 | 2,052.20 t/s | 3,246.19 t/s | +58% |
| Q4_0 tg128 | 109.56 t/s | 110.48 t/s | +1% |

Huge pp improvement across the board. TG unchanged (memory-bound). No correctness regression.

@Hassan-A-K

Perplexity results — M5, 4+6 CPU, 10 GPU, 24 GB

Ran llama-perplexity on wikitext-2-raw with LLaMA 2 7B on the optimize_metal_tensor branch:

| Model | PPL | ± |
|---|---|---|
| F16 | 5.7845 | 0.03242 |
| Q8_0 | 5.7865 | 0.03243 |
| Q4_0 | 5.9581 | 0.03359 |

F16 and Q8_0 are nearly identical as expected. Q4_0 shows the usual small quantization degradation. No correctness issues observed.

build: c3a1128 (8509)

@Developer-Ecosystem-Engineering
Contributor Author

Thank you @Hassan-A-K for the additional details (and confirmation)! I'll keep an eye on this for any follow-up requests or questions related to integration.

@ggerganov ggerganov requested a review from a team as a code owner April 24, 2026 08:24
@ggerganov
Member

@Developer-Ecosystem-Engineering Any reason not to do the same implementation for kernel_mul_mm_id?

@ggerganov
Member

Some numbers on M5 Max:

```shell
scripts/compare-commits.sh master pr/20962 llama-bench \
  -m ~/models/qwen3.5-27b/ggml-model-q8_0.gguf \
  -m ~/models/qwen3.5-27b/ggml-model-q4_0.gguf \
  -m ~/models/qwen3.5-35b-a3b/ggml-model-q8_0.gguf \
  -m ~/models/qwen3.5-35b-a3b/ggml-model-q4_0.gguf \
  -m ~/models/gemma-4-26b-a4b-it/ggml-model-q8_0.gguf \
  -m ~/models/gemma-4-26b-a4b-it/ggml-model-q4_0.gguf \
  -m ~/models/gemma-4-31b-it/ggml-model-q8_0.gguf \
  -m ~/models/gemma-4-31b-it/ggml-model-q4_0.gguf \
  -fa 1 -ub 2048 -p 512,2048 -n 0 -t 1 --delay 20
```
| Model | Test | t/s master | t/s pr/20962 | Speedup |
|---|---|---|---|---|
| gemma4 26B.A4B Q4_0 | pp512 | 3976.48 | 3963.35 | 1.00 |
| gemma4 26B.A4B Q4_0 | pp2048 | 3825.10 | 3943.56 | 1.03 |
| gemma4 26B.A4B Q8_0 | pp512 | 3265.54 | 3398.86 | 1.04 |
| gemma4 26B.A4B Q8_0 | pp2048 | 3390.16 | 3649.88 | 1.08 |
| gemma4 31B Q4_0 | pp512 | 553.94 | 711.68 | 1.28 |
| gemma4 31B Q4_0 | pp2048 | 443.27 | 555.85 | 1.25 |
| gemma4 31B Q8_0 | pp512 | 472.54 | 705.05 | 1.49 |
| gemma4 31B Q8_0 | pp2048 | 354.28 | 555.47 | 1.57 |
| qwen35 27B Q4_0 | pp512 | 643.56 | 838.46 | 1.30 |
| qwen35 27B Q4_0 | pp2048 | 522.53 | 685.44 | 1.31 |
| qwen35 27B Q8_0 | pp512 | 546.49 | 811.19 | 1.48 |
| qwen35 27B Q8_0 | pp2048 | 427.44 | 648.46 | 1.52 |
| qwen35moe 35B.A3B Q4_0 | pp512 | 3486.46 | 3523.32 | 1.01 |
| qwen35moe 35B.A3B Q4_0 | pp2048 | 3775.81 | 3909.26 | 1.04 |
| qwen35moe 35B.A3B Q8_0 | pp512 | 2935.91 | 3065.51 | 1.04 |
| qwen35moe 35B.A3B Q8_0 | pp2048 | 3379.42 | 3619.71 | 1.07 |

@ggerganov
Member

Prompt processing comparison between M4 Max (32 GPU cores) and M5 Max (40 GPU cores)

| model | size | params | test | t/s (M4 Max 32C) | t/s (M5 Max 40C) | speedup |
|---|---|---|---|---|---|---|
| mistral3 8B Q8_0 | 8.40 GiB | 8.49 B | pp2048 | 631.07 ± 37.52 | 2694.79 ± 27.42 | 4.27 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 | 1571.53 ± 15.03 | 6043.62 ± 10.91 | 3.85 |

build: 5605dd6 (8936)

@ggerganov ggerganov merged commit d164904 into ggml-org:master Apr 25, 2026
51 of 53 checks passed
@ggerganov
Member

@Developer-Ecosystem-Engineering Thanks for the nice contribution!

@Developer-Ecosystem-Engineering
Contributor Author

> @Developer-Ecosystem-Engineering Any reason not to do the same implementation for kernel_mul_mm_id?

We're never quite sure how projects will respond to our involvement, so it was easier to start simple! Thank you for the engagement!

I've created 175680154 to investigate the potential benefit in kernel_mul_mm_id.

@ggerganov
Member

Got it. Contributions are very welcome!

IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
…#20962)

* Optimize Metal Tensor API usage for matmul2d

Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR.

The legacy simdgroup_matrix kernel is preserved under #else.

Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.

* cont : cleanup

* cont : cleanup

* cont : cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026