llama: only use one iGPU device by default #23897
Conversation
|
Not a good idea, it will break the quad MI300A APU. |
|
You can open an issue or propose a solution once you have access to one. Random criticism just according to a datasheet is rather dubious. |
|
Sorry, I only wanted to help.
I just wanted to report that it is not the case... but it is the only one I know of (for the record, it powers the fastest HPC system in the TOP500).
I would really like to have access to one of them; I'm sure it would be really good. But no, I don't.
Just out of curiosity: in what case do you need to activate both backends? |
|
No worries then. But the idea of an iGPU is "sharing memory with the host", and I have no idea how that would work with multiple GPUs. Might be an interesting edge case eventually. For now this change should be correct. I run CUDA+Vulkan on my Spark to be able to test both backends without recompiling. They can (usually) coexist without problem, it will prefer CUDA by default for devices that support both. |
|
Without proper hardware it is always hard to know what to do. ;) There are more details here: https://arxiv.org/pdf/2508.11298 . To me it looks like all GPUs can access all RAM the "same" way a CPU accesses other RAM on NUMA nodes (but that's just my understanding). One possibility is to add all devices on the same backend (until we get heterogeneous multi-iGPU?):

    if (igpus.empty()) {
        igpus.push_back({false, dev});
    } else {
        // add only devices from the same backend (for MI300A?)
        ggml_backend_reg_t reg  = ggml_backend_dev_backend_reg(dev);
        ggml_backend_reg_t reg0 = ggml_backend_dev_backend_reg(igpus[0].dev);
        // compare the reg handles directly: == on the names would compare
        // char pointers, not the strings themselves
        if (reg == reg0) {
            igpus.push_back({false, dev});
        }
    }

But I can't test it... |
|
We'll look into it once someone actually tries it and reports a problem. You can always manually override the selection anyways. |
Overview
After #23007 Vulkan is no longer the only backend reporting devices as iGPU, so we now get the case that multiple backends report the same iGPU. On my DGX Spark that leads to the model being split between CUDA and Vulkan.
This is the simplest solution: only ever allow a single iGPU. I think there should never be a case with multiple iGPUs, so this is okay. The dGPU deduplication logic by device_id would also work on DGX Spark and (Linux) AMD, but I don't think it is needed here.
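As a rough illustration of the "single iGPU" rule described above, here is a minimal, self-contained sketch. It does not use the ggml API: the `Device` struct, the `DeviceType` enum and the `first_igpu_only` function are hypothetical names made up for this example, and the enumeration order (CUDA before Vulkan) is an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for backend devices; not the real ggml types.
enum class DeviceType { CPU, GPU, IGPU };

struct Device {
    std::string name;     // e.g. "CUDA0" or "Vulkan0"
    std::string backend;  // backend that reported the device
    DeviceType  type;
};

// Return at most one iGPU: the first one found in enumeration order.
// Later devices of type IGPU are ignored, even if a different backend
// reports them, since they are assumed to be the same physical iGPU.
std::vector<Device> first_igpu_only(const std::vector<Device> & devices) {
    std::vector<Device> igpus;
    for (const Device & dev : devices) {
        if (dev.type != DeviceType::IGPU) {
            continue;  // only iGPUs are considered in this sketch
        }
        if (igpus.empty()) {
            igpus.push_back(dev);
        }
    }
    return igpus;
}
```

With a device list in which both a CUDA and a Vulkan entry are marked as iGPU, this keeps only whichever entry comes first, so the model would no longer be split across two backends that drive the same hardware.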
Requirements