
server: skip device enumeration in router mode to avoid creating CUDA primary context #23137

Merged
allozaur merged 1 commit into ggml-org:master from ServeurpersoCom:server/avoid-cuda-ctx-on-router
May 16, 2026

Conversation

@ServeurpersoCom
Contributor

ServeurpersoCom commented May 16, 2026

Overview

In router mode (no model loaded), the parent llama-server consumes 500 MiB of VRAM per CUDA device. (The pre-allocated amount scales with the number of SMs on the GPU: a GPU with more SMs needs more memory.)

Fix this

[image: 2CUDA]

Additional information

Regression from #23021, which moved the device info log into common_params_print_info and wired it in main before the router mode gets detected.

The culprit is ggml_backend_dev_memory inside the loop: on CUDA it lands in cudaSetDevice + cudaMemGetInfo, which materialize the primary context on each device. The router parent never touches the GPU, so the context is pure waste and the PID shows up in nvidia-smi. Invisible on Metal because of unified memory, which is why it slipped through.
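The side effect described above can be illustrated with a toy model. This is a minimal sketch, not the CUDA runtime or ggml code: `fake_device`, `fake_dev_free_mib`, `log_devices` and `count_contexts` are hypothetical names, and the "context" is just a boolean that flips on the first memory query, assuming the lazy-creation behaviour the paragraph describes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a GPU. All names here are hypothetical; nothing in this
// block calls CUDA or ggml.
struct fake_device {
    bool        context_created = false; // models the lazily created primary context
    std::size_t total_mib;               // total device memory
    std::size_t context_cost_mib;        // memory the context reserves once created

    fake_device(std::size_t total, std::size_t cost)
        : total_mib(total), context_cost_mib(cost) {
        assert(total >= cost);
    }
};

// Models a "how much memory is free?" query. As with the cudaSetDevice +
// cudaMemGetInfo pair described above, asking the question is what brings
// the context into existence.
std::size_t fake_dev_free_mib(fake_device & dev) {
    dev.context_created = true; // side effect of a read-only looking call
    return dev.total_mib - dev.context_cost_mib;
}

// Models the device info loop: it only wants to log, but touches every device.
void log_devices(std::vector<fake_device> & devs) {
    for (auto & dev : devs) {
        (void) fake_dev_free_mib(dev);
    }
}

// Counts how many devices now hold a context.
std::size_t count_contexts(const std::vector<fake_device> & devs) {
    std::size_t n = 0;
    for (const auto & dev : devs) {
        n += dev.context_created ? 1 : 0;
    }
    return n;
}
```

In this model a process that never calls `log_devices` ends with `count_contexts(...) == 0`, which is the state the router parent should stay in.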

Fix adds an optional print_devices flag (default true, no other caller touched). Router passes false. Build info, verbosity and system_info still log normally, only the device loop is skipped.
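The shape of the fix can be sketched as follows. This is an illustrative stand-in, not the actual `common_params_print_info` signature: `device_entry`, `print_info_sketch` and the placeholder log strings are assumptions made for the example; only the `print_devices` name and its default of `true` come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical device entry: `query_free_mib` stands in for a backend memory
// query that may have side effects on the device.
struct device_entry {
    std::string                  name;
    std::function<std::size_t()> query_free_mib;
};

// Sketch of an info printer with an optional flag. Existing callers that omit
// the argument keep the old behaviour; a router-style caller passes false.
std::vector<std::string> print_info_sketch(const std::vector<device_entry> & devices,
                                           bool print_devices = true) {
    std::vector<std::string> lines;
    lines.push_back("log_info");          // placeholder for the verbosity line

    if (print_devices) {
        lines.push_back("device_info:");
        for (const auto & dev : devices) {
            assert(dev.query_free_mib);   // each entry must provide a query
            lines.push_back("  - " + dev.name + " (" +
                            std::to_string(dev.query_free_mib()) + " MiB free)");
        }
    }

    lines.push_back("system_info");       // placeholder, logged in both modes
    return lines;
}
```

A router-style call would look like `print_info_sketch(devices, /*print_devices=*/false)`. Spelling out the argument name in a comment at the call site makes the intent visible to anyone who later moves the logging code around.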

Same class of bug previously fixed in #20595 (closes #20582). Closes #23120.

router log

0.01.397.131 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.01.397.185 I system_info: n_threads = 16 ... CUDA : ARCHS = 1200 ...
0.01.397.189 I srv main: n_parallel is set to auto ...

child log

[43773] 0.00.043.822 I log_info: verbosity = 3 ...
[43773] 0.00.043.824 I device_info:
[43773] 0.00.121.335 I   - CUDA0   : NVIDIA RTX PRO 6000 Blackwell ... (97247 MiB, 96689 MiB free)
[43773] 0.00.121.343 I   - CPU     : AMD Ryzen 9 9950X3D 16-Core Processor (94190 MiB, ...)
[43773] 0.00.121.414 I system_info: ...

Requirements

@tha80
Contributor

tha80 commented May 16, 2026

Can confirm that it fixes the problem. 👍

@ServeurpersoCom
Contributor Author

Since log handling has already been improved, adding an explicit commented argument should, I think, prevent any regression.

@allozaur allozaur merged commit 64b38b5 into ggml-org:master May 16, 2026
49 checks passed
Development

Successfully merging this pull request may close these issues.

Misc. bug: CUDA: Additional processes are started that waste vram.
Misc. bug: llama-server router mode uses more VRAM than direct loading

4 participants