
Conversation


@loci-dev loci-dev commented Dec 8, 2025

Mirrored from ggml-org/llama.cpp#17866

With #15906, I noticed an important regression when using the Metal backend on an eGPU.
This commit restores the previous behavior and adds an option to force its activation.

Before #15906, llama-bench on gemma 3 gave me this kind of result:

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         48.72 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          5.95 ± 0.00 |

build: 33daece86 (6440)

So above 45 t/s on the pp test, and more than 5 t/s on the tg test.

After #15906, the pp test has improved, but the tg test result has been roughly halved.

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         60.66 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          2.84 ± 0.00 |

build: 0f0a3c285 (6441)

Launching the benchmark with "Metal System Trace" in Instruments.app reveals some usage of the DMA1 channel, which introduces a lot of latency (at least, that is how I interpret it).

With this PR, performance on the eGPU is back to where it was before, and the change should not impact any other configuration (dGPU and M1-M5).

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         47.24 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          6.07 ± 0.00 |

build: b0db6483b (7327)


loci-review bot commented Dec 8, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Overview

PR #488 introduces a targeted 5-line change to ggml/src/ggml-metal/ggml-metal-device.m that addresses eGPU performance regression by enabling shared memory buffers for external GPU devices. The modification occurs during Metal device initialization and does not alter any inference or tokenization functions.

Performance Metrics Analysis

Function-Level Changes: No measurable changes detected in Response Time or Throughput Time across all analyzed functions. The summary report returned no data, indicating zero impact on function execution characteristics.

Power Consumption: All 16 binaries show identical power consumption between versions (0.0% change):

  • libllama.so: 194,312 nJ (unchanged)
  • llama-run: 219,167 nJ (unchanged)
  • llama-cvector-generator: 249,478 nJ (unchanged)
  • All other binaries: no measurable difference

Key Findings

Inference Impact: Zero impact on tokens per second. The code changes affect only Metal buffer allocation strategy during device initialization in ggml_metal_device_init(). No modifications to inference functions:

  • llama_decode: not modified
  • llama_encode: not modified
  • llama_tokenize: not modified
  • ggml_backend_graph_compute: not modified

Code Changes: The PR adds automatic eGPU detection using MTLDeviceLocationExternal and introduces the GGML_METAL_SHARED_BUFFERS_ENABLE environment variable. These changes execute once during initialization and do not affect the hot path of token generation or model inference on the analyzed CPU configuration.
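
As a rough illustration of the mechanism described above, the detection and override could look like the sketch below. This is a hedged reconstruction, not the actual patch: the helper name and the exact precedence between the environment variable and the eGPU check are assumptions; only MTLDeviceLocationExternal, GGML_METAL_SHARED_BUFFERS_ENABLE, and the fact that the decision is made once in ggml_metal_device_init() come from the summary above.

```objc
// Illustrative sketch only (not the actual patch). Assumes the shared-buffer
// decision is made once during Metal device initialization.
#import <Metal/Metal.h>

#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper name; the real change lives in ggml_metal_device_init().
static bool ggml_metal_use_shared_buffers_sketch(id<MTLDevice> device) {
    // Explicit override: force shared buffers regardless of device type.
    // The variable name comes from the summary above; its exact semantics
    // (presence vs. value) are an assumption here.
    if (getenv("GGML_METAL_SHARED_BUFFERS_ENABLE") != NULL) {
        return true;
    }

    // Automatic detection: external GPUs (eGPU enclosures) get the
    // pre-#15906 buffer strategy back. MTLDevice.location and
    // MTLDeviceLocationExternal are available on macOS 10.15+.
    if (device.location == MTLDeviceLocationExternal) {
        return true;
    }

    return false;
}
```

Either way, the check runs once at initialization, so it adds no per-token cost; forcing the same path on a non-eGPU setup would presumably just be a matter of setting GGML_METAL_SHARED_BUFFERS_ENABLE=1 before launching llama-bench.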

Scope: Device-specific optimization for external GPU configurations. The analyzed binaries show no performance variation because the changes target Metal backend initialization logic that does not execute in CPU-only inference scenarios. The modification is isolated to buffer allocation strategy selection, occurring before any computation begins.

loci-dev force-pushed the main branch 27 times, most recently from 2102502 to d772aad on December 12, 2025 at 12:15.
loci-dev force-pushed the main branch 30 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 at 13:32.