
Optimize Metal Tensor API usage #20962

Merged
ggerganov merged 7 commits into ggml-org:master from Developer-Ecosystem-Engineering:optimize_metal_tensor on Apr 25, 2026

Conversation

@Developer-Ecosystem-Engineering
Contributor

Overview

Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else.

Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.

Geomean improvement of ~26%, ranging from ~6.8% (TL-Q4_0) to ~71.9% (DeepSeek-8B-f16).

Further test matrix below

Additional information

  • Tile dimensions are now configurable via compile-time macros (NRA×NRB), currently defaulting to 64×128 vs the legacy kernel's fixed 64×32.
  • New dimensions offer better performance across all models tested.
  • Matrix B is read directly from device memory, eliminating the threadgroup staging copy.
  • The new kernel uses cooperative tensor accessors (cT.get_multidimensional_index / cT[i]) for direct per-element device writes
  • Threadgroup memory holds only dequantized A: NRA × NK_TOTAL × sizeof(fp16).

Testing Details

  • Tested on 16-inch M5 Max (Best)

| Model | pp512 | pp1024 | pp2048 | pp4096 | Model GeoMean |
|---|---|---|---|---|---|
| DeepSeek-8B-f16 | +86.2% | +84.3% | +80.4% | +71.9% | +80.6% |
| L2-7B-Q6_K | +49.8% | +47.8% | +46.3% | +43.7% | +46.9% |
| TL-Q3_K_S | +32.7% | +31.9% | +27.3% | +21.5% | +28.3% |
| TL-Q5_K_M | +28.2% | +29.5% | +25.5% | +19.7% | +25.7% |
| TL-Q2_K | +28.6% | +30.5% | +26.2% | +20.0% | +26.3% |
| G-2B-q8_0 | +27.9% | +27.6% | +28.6% | +27.9% | +28.0% |
| TL-Q6_K | +27.8% | +28.4% | +23.9% | +18.1% | +24.5% |
| Q3-4B-Q8_0 | +23.9% | +24.2% | +22.8% | +20.7% | +22.9% |
| TL-IQ4_XS | +23.5% | +23.6% | +20.5% | +16.6% | +21.0% |
| TL-Q4_K_M | +18.7% | +19.5% | +17.0% | +13.2% | +17.1% |
| TL-Q8_0 | +17.1% | +17.1% | +15.4% | +11.9% | +15.4% |
| TL-Q5_0 | +14.8% | +16.3% | +13.8% | +10.2% | +13.8% |
| TL-Q4_0 | +6.1% | +7.7% | +6.6% | +6.8% | +6.8% |
| **Overall GeoMean** | | | | | **+26.4%** |
  • I have read and agree with the contributing guidelines: Yes
  • AI usage disclosure: Yes — assistive tooling was used to navigate and better learn the project codebase, and to split the work into phases.

github-actions bot added the labels **ggml** (changes relating to the ggml tensor library for machine learning) and **Apple Metal** (https://en.wikipedia.org/wiki/Metal_(API)) Mar 24, 2026
@Developer-Ecosystem-Engineering changed the title from "Optimize Metal Tensor API usage for matmul2d" to "Optimize Metal Tensor API usage" Mar 24, 2026
@H-A-Khan

M5 Max (MBP 16"), 6+12 CPU, 40 GPU, 64 GB

Tested with LLaMA 7B v2 (F16, Q8_0, Q4_0) — same models used in the Apple Silicon M-series benchmark discussion.

PR #20962 results (build c3a1128)

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B F16 | 12.55 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3129.64 ± 4.17 |
| llama 7B F16 | 12.55 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 35.93 ± 0.15 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3101.56 ± 8.30 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 68.34 ± 0.08 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | MTL,BLAS | 6 | pp512 | 3246.19 ± 4.60 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | MTL,BLAS | 6 | tg128 | 110.48 ± 0.41 |

Comparison with old pinned build (8e672ef)

| Test | Old (8e672ef) | This PR | Change |
|---|---|---|---|
| F16 pp512 | 1018.30 t/s | 3129.64 t/s | +207% |
| F16 tg128 | 37.58 t/s | 35.93 t/s | -4% |
| Q8_0 pp512 | 1051.59 t/s | 3101.56 t/s | +195% |
| Q8_0 tg128 | 64.61 t/s | 68.34 t/s | +6% |
| Q4_0 pp512 | 987.10 t/s | 3246.19 t/s | +229% |
| Q4_0 tg128 | 102.93 t/s | 110.48 t/s | +7% |

Prompt processing ~3x faster across the board. Text generation unchanged as expected (memory-bandwidth bound). Great work! 🚀

@ggerganov
Member

ggerganov commented Mar 28, 2026

Could you also confirm correctness by running llama-perplexity? Thanks.

Edit: also, no need to compare to the old 8e672ef. Compare to current master.

@Hassan-A-K

Hassan-A-K commented Mar 28, 2026

@ggerganov Perplexity results on M5 Max (MBP 16", 40 GPU, 64 GB) — Wikitext-2 test set:

| Model | Perplexity (PPL) |
|---|---|
| F16 | 5.7966 ± 0.03235 |
| Q8_0 | 5.7978 ± 0.03236 |
| Q4_0 | 5.9622 ± 0.03348 |

All within expected ranges. Correctness looks good.

Edit: running perplexity comparison against current master as well. Results incoming.

@Hassan-A-K

Hassan-A-K commented Mar 28, 2026

@ggerganov Updated results — perplexity + benchmark comparison against current master (c46758d).

Perplexity — Wikitext-2 (M5 Max, MBP 16", 40 GPU, 64 GB)

| Model | Master | PR #20962 |
|---|---|---|
| F16 | 5.7962 | 5.7966 |
| Q8_0 | 5.7974 | 5.7978 |
| Q4_0 | 5.9618 | 5.9622 |

Benchmark vs master

| Test | Master (c46758d) | PR #20962 | Change |
|---|---|---|---|
| F16 pp512 | 1,601.60 t/s | 3,129.64 t/s | +95% |
| F16 tg128 | 36.86 t/s | 35.93 t/s | -3% |
| Q8_0 pp512 | 1,909.17 t/s | 3,101.56 t/s | +62% |
| Q8_0 tg128 | 67.86 t/s | 68.34 t/s | +1% |
| Q4_0 pp512 | 2,052.20 t/s | 3,246.19 t/s | +58% |
| Q4_0 tg128 | 109.56 t/s | 110.48 t/s | +1% |

Huge pp improvement across the board. TG unchanged (memory-bound). No correctness regression.

@Hassan-A-K

Perplexity results — M5, 4+6 CPU, 10 GPU, 24 GB

Ran llama-perplexity on wikitext-2-raw with LLaMA 2 7B on the optimize_metal_tensor branch:

| Model | PPL | ± |
|---|---|---|
| F16 | 5.7845 | 0.03242 |
| Q8_0 | 5.7865 | 0.03243 |
| Q4_0 | 5.9581 | 0.03359 |

F16 and Q8_0 are nearly identical as expected. Q4_0 shows the usual small quantization degradation. No correctness issues observed.

build: c3a1128 (8509)

@Developer-Ecosystem-Engineering
Contributor Author

Thank you @Hassan-A-K for the additional details (and confirmation)! I'll keep an eye on this for any follow-up requests or questions related to integration.

@ggerganov ggerganov requested a review from a team as a code owner April 24, 2026 08:24
@ggerganov
Member

@Developer-Ecosystem-Engineering Any reason not to do the same implementation for kernel_mul_mm_id?

@ggerganov
Member

Some numbers on M5 Max:

```shell
scripts/compare-commits.sh master pr/20962 llama-bench \
  -m ~/models/qwen3.5-27b/ggml-model-q8_0.gguf \
  -m ~/models/qwen3.5-27b/ggml-model-q4_0.gguf \
  -m ~/models/qwen3.5-35b-a3b/ggml-model-q8_0.gguf \
  -m ~/models/qwen3.5-35b-a3b/ggml-model-q4_0.gguf \
  -m ~/models/gemma-4-26b-a4b-it/ggml-model-q8_0.gguf \
  -m ~/models/gemma-4-26b-a4b-it/ggml-model-q4_0.gguf \
  -m ~/models/gemma-4-31b-it/ggml-model-q8_0.gguf \
  -m ~/models/gemma-4-31b-it/ggml-model-q4_0.gguf \
  -fa 1 -ub 2048 -p 512,2048 -n 0 -t 1 --delay 20
```
| Model | Test | t/s master | t/s pr/20962 | Speedup |
|---|---|---|---|---|
| gemma4 26B.A4B Q4_0 | pp512 | 3976.48 | 3963.35 | 1.00 |
| gemma4 26B.A4B Q4_0 | pp2048 | 3825.10 | 3943.56 | 1.03 |
| gemma4 26B.A4B Q8_0 | pp512 | 3265.54 | 3398.86 | 1.04 |
| gemma4 26B.A4B Q8_0 | pp2048 | 3390.16 | 3649.88 | 1.08 |
| gemma4 31B Q4_0 | pp512 | 553.94 | 711.68 | 1.28 |
| gemma4 31B Q4_0 | pp2048 | 443.27 | 555.85 | 1.25 |
| gemma4 31B Q8_0 | pp512 | 472.54 | 705.05 | 1.49 |
| gemma4 31B Q8_0 | pp2048 | 354.28 | 555.47 | 1.57 |
| qwen35 27B Q4_0 | pp512 | 643.56 | 838.46 | 1.30 |
| qwen35 27B Q4_0 | pp2048 | 522.53 | 685.44 | 1.31 |
| qwen35 27B Q8_0 | pp512 | 546.49 | 811.19 | 1.48 |
| qwen35 27B Q8_0 | pp2048 | 427.44 | 648.46 | 1.52 |
| qwen35moe 35B.A3B Q4_0 | pp512 | 3486.46 | 3523.32 | 1.01 |
| qwen35moe 35B.A3B Q4_0 | pp2048 | 3775.81 | 3909.26 | 1.04 |
| qwen35moe 35B.A3B Q8_0 | pp512 | 2935.91 | 3065.51 | 1.04 |
| qwen35moe 35B.A3B Q8_0 | pp2048 | 3379.42 | 3619.71 | 1.07 |

@ggerganov
Member

Prompt processing comparison between M4 Max (32 GPU cores) and M5 Max (40 GPU cores)

| model | size | params | test | t/s (M4 Max 32C) | t/s (M5 Max 40C) | speedup |
|---|---|---|---|---|---|---|
| mistral3 8B Q8_0 | 8.40 GiB | 8.49 B | pp2048 | 631.07 ± 37.52 | 2694.79 ± 27.42 | 4.27 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | pp2048 | 1571.53 ± 15.03 | 6043.62 ± 10.91 | 3.85 |

build: 5605dd6 (8936)

@ggerganov ggerganov merged commit d164904 into ggml-org:master Apr 25, 2026
51 of 53 checks passed
@ggerganov
Member

@Developer-Ecosystem-Engineering Thanks for the nice contribution!

@Developer-Ecosystem-Engineering
Contributor Author

> @Developer-Ecosystem-Engineering Any reason not to do the same implementation for kernel_mul_mm_id?

We're never quite sure how projects will respond to our involvement, so it was easier to start simple! Thank you for the engagement!

I've created 175680154 to investigate the potential benefit in kernel_mul_mm_id.

@ggerganov
Member

Got it. Contributions are very welcome!

IntelNav pushed a commit to IntelNav/llama.cpp that referenced this pull request Apr 29, 2026
…#20962)

* Optimize Metal Tensor API usage for matmul2d

Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR.

The legacy simdgroup_matrix kernel is preserved under #else.

Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.

* cont : cleanup

* cont : cleanup

* cont : cleanup

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026