
Conversation


@loci-dev loci-dev commented Dec 8, 2025

Mirrored from ggml-org/llama.cpp#17866

With #15906, I noticed an important regression when using the Metal backend on an eGPU.
This commit restores the previous behavior and adds an option to force its activation.

Before #15906, llama-bench on gemma 3 gave me this kind of result:

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         48.72 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          5.95 ± 0.00 |

build: 33daece86 (6440)

So above 45 t/s on the pp test, and more than 5 t/s on the tg test.

After #15906, the pp test has improved, but the tg test result has been roughly halved.

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         60.66 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          2.84 ± 0.00 |

build: 0f0a3c285 (6441)

Launching the benchmark with "Metal System Trace" in Instruments.app reveals some usage of the DMA1 channel, which introduces a lot of latency (at least, that is how I interpret it).

With this PR, performance on the eGPU is back to where it was before, and the change should not impact any other configuration (dGPU and M1-M5).

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         47.24 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          6.07 ± 0.00 |

build: b0db6483b (7327)


loci-review bot commented Dec 8, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Overview

PR #488 introduces a targeted 5-line change to ggml/src/ggml-metal/ggml-metal-device.m that addresses eGPU performance regression by enabling shared memory buffers for external GPU devices. The modification occurs during Metal device initialization and does not alter any inference or tokenization functions.

Performance Metrics Analysis

Function-Level Changes: No measurable changes detected in Response Time or Throughput Time across all analyzed functions. The summary report returned no data, indicating zero impact on function execution characteristics.

Power Consumption: All 16 binaries show identical power consumption between versions (0.0% change):

  • libllama.so: 194,312 nJ (unchanged)
  • llama-run: 219,167 nJ (unchanged)
  • llama-cvector-generator: 249,478 nJ (unchanged)
  • All other binaries: no measurable difference

Key Findings

Inference Impact: Zero impact on tokens per second. The code changes affect only Metal buffer allocation strategy during device initialization in ggml_metal_device_init(). No modifications to inference functions:

  • llama_decode: not modified
  • llama_encode: not modified
  • llama_tokenize: not modified
  • ggml_backend_graph_compute: not modified

Code Changes: The PR adds automatic eGPU detection using MTLDeviceLocationExternal and introduces the GGML_METAL_SHARED_BUFFERS_ENABLE environment variable. These changes execute once during initialization and do not affect the hot path of token generation or model inference on the analyzed CPU configuration.
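
As a rough illustration of the mechanism described above, the detection and override could look like the sketch below. This is a hedged reconstruction, not the actual patch: the helper name and the exact precedence between the environment variable and the eGPU check are assumptions; only MTLDeviceLocationExternal, GGML_METAL_SHARED_BUFFERS_ENABLE, and the fact that the decision is made once in ggml_metal_device_init() come from the summary above.

```objc
// Illustrative sketch only (not the actual patch). Assumes the shared-buffer
// decision is made once during Metal device initialization.
#import <Metal/Metal.h>

#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper name; the real change lives in ggml_metal_device_init().
static bool ggml_metal_use_shared_buffers_sketch(id<MTLDevice> device) {
    // Explicit override: force shared buffers regardless of device type.
    // The variable name comes from the summary above; its exact semantics
    // (presence vs. value) are an assumption here.
    if (getenv("GGML_METAL_SHARED_BUFFERS_ENABLE") != NULL) {
        return true;
    }

    // Automatic detection: external GPUs (eGPU enclosures) get the
    // pre-#15906 buffer strategy back. MTLDevice.location and
    // MTLDeviceLocationExternal are available on macOS 10.15+.
    if (device.location == MTLDeviceLocationExternal) {
        return true;
    }

    return false;
}
```

Either way, the check runs once at initialization, so it adds no per-token cost; forcing the same path on a non-eGPU setup would presumably just be a matter of setting GGML_METAL_SHARED_BUFFERS_ENABLE=1 before launching llama-bench.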

Scope: Device-specific optimization for external GPU configurations. The analyzed binaries show no performance variation because the changes target Metal backend initialization logic that does not execute in CPU-only inference scenarios. The modification is isolated to buffer allocation strategy selection, occurring before any computation begins.

loci-dev force-pushed the main branch 27 times, most recently from 2102502 to d772aad on December 12, 2025 at 12:15.
loci-dev force-pushed the main branch 30 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 at 13:32.