Optimize Metal Tensor API usage #20962
Conversation
Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else. Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.
M5 Max (MBP 16"), 6+12 CPU, 40 GPU, 64 GB
Tested with LLaMA 7B v2 (F16, Q8_0, Q4_0) — same models used in the Apple Silicon M-series benchmark discussion. PR #20962 results (build c3a1128)
Comparison with old pinned build (8e672ef)
Prompt processing ~3x faster across the board. Text generation unchanged as expected (memory-bandwidth bound). Great work! 🚀
Could you also confirm correctness by running a perplexity test?
Edit: also, no need to compare to the old 8e672ef. Compare to current master.
@ggerganov Perplexity results on M5 Max (MBP 16", 40 GPU, 64 GB) — Wikitext-2 test set:
All within expected ranges. Correctness looks good.
Edit: running perplexity comparison against current master as well. Results incoming.
@ggerganov Updated results — perplexity + benchmark comparison against current master.
Perplexity — Wikitext-2 (M5 Max, MBP 16", 40 GPU, 64 GB)
Benchmark vs master
Huge pp improvement across the board. TG unchanged (memory-bound). No correctness regression.
Perplexity results — M5, 4+6 CPU, 10 GPU, 24 GB
Ran the same Wikitext-2 perplexity test:
F16 and Q8_0 are nearly identical as expected. Q4_0 shows the usual small quantization degradation. No correctness issues observed. build: c3a1128 (8509)
Thank you @Hassan-A-K for the additional details (and confirmation)! Will keep an eye out for any follow-up requests or questions related to integration.
@Developer-Ecosystem-Engineering Any reason not to do the same implementation for
Some numbers on M5 Max:

```shell
scripts/compare-commits.sh master pr/20962 llama-bench \
  -m ~/models/qwen3.5-27b/ggml-model-q8_0.gguf \
  -m ~/models/qwen3.5-27b/ggml-model-q4_0.gguf \
  -m ~/models/qwen3.5-35b-a3b/ggml-model-q8_0.gguf \
  -m ~/models/qwen3.5-35b-a3b/ggml-model-q4_0.gguf \
  -m ~/models/gemma-4-26b-a4b-it/ggml-model-q8_0.gguf \
  -m ~/models/gemma-4-26b-a4b-it/ggml-model-q4_0.gguf \
  -m ~/models/gemma-4-31b-it/ggml-model-q8_0.gguf \
  -m ~/models/gemma-4-31b-it/ggml-model-q4_0.gguf \
  -fa 1 -ub 2048 -p 512,2048 -n 0 -t 1 --delay 20
```
Prompt processing comparison between master and this PR:

build: 5605dd6 (8936)
@Developer-Ecosystem-Engineering Thanks for the nice contribution!
Never quite sure how projects will respond to our involvement; easier to start simple! Thank you for the engagement! I've created 175680154 to investigate the potential benefit in
Got it. Contributions are very welcome! |
…#20962) * Optimize Metal Tensor API usage for matmul2d Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else. Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure. * cont : cleanup * cont : cleanup * cont : cleanup --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Overview
Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else.
Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.
- Geomean improvement of ~26%
- TL-Q4_0: ~6.8%
- DeepSeek-8B-f16: ~71.9%

Further test matrix below
Additional information
Requirements
Testing Details
Yes