
llama: only use one iGPU device by default#23897

Merged
0cc4m merged 1 commit into master from 0cc4m/igpu-deduplication
May 31, 2026

Conversation

@0cc4m
Contributor

@0cc4m 0cc4m commented May 30, 2026

Overview

After #23007, Vulkan is no longer the only backend that reports devices as iGPUs, so multiple backends can now report the same iGPU. On my DGX Spark that leads to the model being split between CUDA and Vulkan.

The simplest solution is to only ever allow a single iGPU. I think that there should never be a case with multiple iGPUs, so this is okay. The dGPU deduplication logic by device_id would also work on DGX Spark and (Linux) AMD, but I don't think it is needed here.
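
As a rough illustration of the two strategies mentioned above, here is a minimal, self-contained sketch. It does not use ggml and is not the code from this PR: `Device`, `DeviceType`, `keep_first_igpu`, and `dedup_dgpus_by_id` are hypothetical names, and the sketch assumes that devices arrive in enumeration order, so that whichever backend enumerates first wins.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a backend device; not the ggml device type.
enum class DeviceType { DiscreteGpu, IntegratedGpu };

struct Device {
    std::string backend;    // e.g. "CUDA" or "Vulkan"
    DeviceType  type;
    std::string device_id;  // e.g. a PCI bus ID; empty if unknown
};

// Strategy 1: keep at most one iGPU. The first iGPU in enumeration
// order wins; every later iGPU is ignored, whatever backend reports it.
std::vector<Device> keep_first_igpu(const std::vector<Device> & devices) {
    std::vector<Device> igpus;
    for (const Device & dev : devices) {
        if (dev.type == DeviceType::IntegratedGpu && igpus.empty()) {
            igpus.push_back(dev);
        }
    }
    assert(igpus.size() <= 1);
    return igpus;
}

// Strategy 2: deduplicate dGPUs by device_id. A device is dropped only
// if an earlier device reported the same non-empty ID.
std::vector<Device> dedup_dgpus_by_id(const std::vector<Device> & devices) {
    std::vector<Device> gpus;
    for (const Device & dev : devices) {
        if (dev.type != DeviceType::DiscreteGpu) {
            continue;
        }
        bool seen = false;
        for (const Device & kept : gpus) {
            if (!dev.device_id.empty() && kept.device_id == dev.device_id) {
                seen = true;
                break;
            }
        }
        if (!seen) {
            gpus.push_back(dev);
        }
    }
    return gpus;
}
```

With two backends reporting the same iGPU, `keep_first_igpu` returns only the first entry, so the model is no longer split across backends. The ID-based variant needs a reliable `device_id` from every backend, which is one reason the single-iGPU rule is the simpler choice.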

Requirements

@0cc4m 0cc4m requested a review from ggerganov as a code owner May 30, 2026 05:58
@0cc4m 0cc4m merged commit 22cadc1 into master May 31, 2026
27 checks passed
@0cc4m 0cc4m deleted the 0cc4m/igpu-deduplication branch May 31, 2026 06:17
o7si added a commit to o7si/llama.cpp that referenced this pull request May 31, 2026
…wercase

* upstream/master: (27 commits)
  vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756)
  ui: fix ETag truncation with MSVC compiler (ggml-org#23917)
  docs : update ZenDNN docs for Q8 support (ggml-org#23791)
  llama: only use one iGPU device by default (ggml-org#23897)
  webui: add custom CSS injection via config (ggml-org#23904)
  Support `-fa auto` in llama-bench (ggml-org#23714)
  opencl: support bf16 by converting to f16 (ggml-org#23839)
  ui: exclude generated build dirs from prettier and eslint so lint errors stop being masked (ggml-org#23910)
  TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (ggml-org#23843)
  metal : restore im2col implementation for large kernels (ggml-org#23901)
  test: (test-llama-archs) log the config name first (ggml-org#23885)
  ci : update ios-xcode release job to macos-26 (ggml-org#23906)
  ggml : add some lsx support (ggml-org#23798)
  vulkan: add Flash Attention support for BFloat16 KV cache (ggml-org#23420)
  ci : fix s390x release job (ggml-org#23898)
  ci : clear cache instead of "no timestamp" keys + fix macos (ggml-org#23895)
  llama : do not skip iGPU when only RPC devices are present (ggml-org#23868)
  server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)
  ggml-webgpu: Check earlier for WebGPU required features (ggml-org#23879)
  ggml-webgpu: add q4_0/q8_0 SET_ROWS (ggml-org#23760)
  ...

# Conflicts:
#	gguf-py/gguf/vocab.py
#	src/llama-vocab.cpp
@Djip007
Contributor

Djip007 commented Jun 1, 2026

@0cc4m
Contributor Author

0cc4m commented Jun 2, 2026

You can open an issue or propose a solution once you have access to one. Random criticism just according to a datasheet is rather dubious.

turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
@Djip007
Contributor

Djip007 commented Jun 3, 2026

Sorry, I only wanted to help.
And yes, my comment was a bit harsh. (I had a rough day, and it has nothing to do with this project.)

> I think that there should never be a case with multiple iGPUs, so this is okay.

I just wanted to report that this is not the case, although it is the only example I know of (for the record, it powers the fastest HPC system in the TOP500).

> You can open an issue or propose a solution once you have access to one

I would really like to have access to one of them; it would certainly be great. But no, I don't.
And yes, if I get access, I will create an issue and a PR.

> On my DGX Spark that leads to the model being split between CUDA and Vulkan.

Just out of curiosity: in what case do you need to activate both backends?

@0cc4m
Contributor Author

0cc4m commented Jun 3, 2026

No worries then. But the idea of an iGPU is "sharing memory with the host", and I have no idea how that would work with multiple GPUs. Might be an interesting edge case eventually. For now this change should be correct.

I run CUDA+Vulkan on my Spark to be able to test both backends without recompiling. They can (usually) coexist without problems; CUDA is preferred by default for devices that support both.

@Djip007
Contributor

Djip007 commented Jun 3, 2026

Without the proper hardware it is always hard to know what to do. ;)

There are more details here: https://arxiv.org/pdf/2508.11298 . To me it looks like every GPU can access all of the RAM, in the "same" way a CPU accesses RAM on other NUMA nodes. (But that's just my understanding.)

One possibility is to add all devices that belong to the same backend (until we get heterogeneous multi-iGPU setups?).
For example:

                        if (igpus.empty()) {
                            igpus.push_back({false, dev});
                        } else {
                            // add only devices from the same backend as the first iGPU (for MI300A?)
                            ggml_backend_reg_t reg  = ggml_backend_dev_backend_reg(dev);
                            ggml_backend_reg_t reg0 = ggml_backend_dev_backend_reg(igpus[0].dev);
                            // assumption: each backend has a single registry handle, so the handles
                            // can be compared directly; `==` on the names would only compare the
                            // `const char *` addresses, not the strings
                            if (reg == reg0) {
                                igpus.push_back({false, dev});
                            }
                        }

But I can't test it...
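
The same-backend idea can also be sketched in isolation, without ggml, so that at least the filtering logic can be exercised. This is only a sketch under assumptions: `Reg` and `Dev` are hypothetical stand-ins for the ggml registry and device handles, and `same_backend` and `keep_igpus_of_first_backend` are made-up helper names. It accepts two devices as belonging to the same backend if their registry handles are identical or, failing that, if the registry names compare equal with `std::strcmp`.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for the backend registry and device handles.
struct Reg { const char * name; };
struct Dev { const Reg * reg; };

// Two devices share a backend if they point at the same registry or
// if the registry names match as strings (not merely as addresses).
bool same_backend(const Dev & a, const Dev & b) {
    assert(a.reg != nullptr && b.reg != nullptr);
    return a.reg == b.reg || std::strcmp(a.reg->name, b.reg->name) == 0;
}

// Keep the first iGPU, plus any later iGPU reported by the same backend.
std::vector<Dev> keep_igpus_of_first_backend(const std::vector<Dev> & candidates) {
    std::vector<Dev> igpus;
    for (const Dev & dev : candidates) {
        if (igpus.empty() || same_backend(dev, igpus[0])) {
            igpus.push_back(dev);
        }
    }
    return igpus;
}
```

On a machine where a single backend reports one iGPU, this behaves like the single-iGPU rule. It only differs when one backend reports several iGPUs, which is the case nobody in this thread has been able to test.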

@0cc4m
Contributor Author

0cc4m commented Jun 4, 2026

We'll look into it once someone actually tries it and reports a problem. You can always manually override the selection anyway.


4 participants