server: skip device enumeration in router mode to avoid creating CUDA… by ServeurpersoCom · Pull Request #23137 · ggml-org/llama.cpp

ServeurpersoCom · 2026-05-16T07:17:58Z

Overview

In router mode (no model loaded), the parent llama-server consumes about 500 MiB of VRAM per CUDA device. (The amount pre-allocated depends on the number of SMs on the GPU: a GPU with more SMs needs more memory.)

Fix this

Additional information

Regression from #23021, which moved the device info log into common_params_print_info and wired it in main before the router mode gets detected.

The culprit is ggml_backend_dev_memory inside the loop: on CUDA it lands in cudaSetDevice + cudaMemGetInfo, which materialize the primary context on each device. The router parent never touches the GPU, so the context is pure waste and the PID shows up in nvidia-smi. Invisible on Metal because of unified memory, which is why it slipped through.

The fix adds an optional print_devices flag (default true, so no other caller is touched). The router passes false. Build info, verbosity and system_info still log normally; only the device loop is skipped.
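
The shape of the flag can be sketched in isolation. This is a minimal illustration under stated assumptions, not the actual llama.cpp code: `mock_device`, `query_memory`, `print_info` and `count_contexts` are hypothetical stand-ins, and the `context_created` field only models the side effect that the PR attributes to `ggml_backend_dev_memory` on CUDA.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for a GPU device (not a ggml type).
struct mock_device {
    std::string name;
    size_t      total_mib;
    bool        context_created;

    // Models a memory query that lazily creates a per-device context,
    // which is the side effect the router parent wants to avoid.
    void query_memory(size_t & free_mib, size_t & total) {
        context_created = true;
        total    = total_mib;
        free_mib = total_mib; // the mock reports everything as free
    }
};

// Hypothetical logging helper. print_devices defaults to true, so callers
// that do not pass it keep the old behavior.
static void print_info(std::vector<mock_device> & devices, bool print_devices = true) {
    std::printf("build_info: (mock)\n");
    if (print_devices) {
        std::printf("device_info:\n");
        for (auto & dev : devices) {
            size_t free_mib = 0;
            size_t total    = 0;
            dev.query_memory(free_mib, total); // touches the device
            std::printf("  - %s (%zu MiB, %zu MiB free)\n",
                        dev.name.c_str(), total, free_mib);
        }
    }
    std::printf("system_info: (mock)\n");
}

// Counts how many mock devices had a context created on them.
static int count_contexts(const std::vector<mock_device> & devices) {
    int n = 0;
    for (const auto & dev : devices) {
        n += dev.context_created ? 1 : 0;
    }
    return n;
}
```

Calling `print_info(devices, /*print_devices=*/false)` from the router path leaves every mock device untouched, while a child calling `print_info(devices)` still enumerates them. Naming the argument in a comment at the call site makes the intent explicit to later readers.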

Same class of bug previously fixed in #20595 (closes #20582). Closes #23120.

router log

0.01.397.131 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.01.397.185 I system_info: n_threads = 16 ... CUDA : ARCHS = 1200 ...
0.01.397.189 I srv main: n_parallel is set to auto ...

child log

[43773] 0.00.043.822 I log_info: verbosity = 3 ...
[43773] 0.00.043.824 I device_info:
[43773] 0.00.121.335 I   - CUDA0   : NVIDIA RTX PRO 6000 Blackwell ... (97247 MiB, 96689 MiB free)
[43773] 0.00.121.343 I   - CPU     : AMD Ryzen 9 9950X3D 16-Core Processor (94190 MiB, ...)
[43773] 0.00.121.414 I system_info: ...

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES Opus 4.7 + GPU pod

tha80 · 2026-05-16T18:57:42Z

Can confirm that it fixes the problem. 👍

ServeurpersoCom · 2026-05-16T19:02:38Z

Since log handling has already been improved, adding an explicit commented argument should, I think, prevent any regression.


server: skip device enumeration in router mode to avoid creating CUDA…

23633f4

… primary context

ServeurpersoCom requested review from a team as code owners May 16, 2026 07:17

github-actions Bot added examples server labels May 16, 2026

ngxson approved these changes May 16, 2026

allozaur approved these changes May 16, 2026

allozaur merged commit 64b38b5 into ggml-org:master May 16, 2026
49 checks passed

kgrama pushed a commit to kgrama/llama.cpp that referenced this pull request May 19, 2026

server: skip device enumeration in router mode to avoid creating CUDA…

769598f

… primary context (ggml-org#23137)

xxmustafacooTR pushed a commit to xxPlayground/llama-cpp-turboquant that referenced this pull request May 19, 2026

server: skip device enumeration in router mode to avoid creating CUDA…

6a5f17d

… primary context (ggml-org#23137)

rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 19, 2026

server: skip device enumeration in router mode to avoid creating CUDA…

aa44172

… primary context (ggml-org#23137)

ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request May 19, 2026

server: skip device enumeration in router mode to avoid creating CUDA…

e107a02

… primary context (ggml-org#23137)

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026

server: skip device enumeration in router mode to avoid creating CUDA…

c40ae85

… primary context (ggml-org#23137)

ServeurpersoCom mentioned this pull request May 23, 2026

server: add router device memory margin parameter for dynamic unloading #21231

Open

srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026

server: skip device enumeration in router mode to avoid creating CUDA…

07e8540

… primary context (ggml-org#23137)

winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026

server: skip device enumeration in router mode to avoid creating CUDA…

4b15e70

… primary context (ggml-org#23137)

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

server: skip device enumeration in router mode to avoid creating CUDA…

0bfec1b

… primary context (ggml-org#23137)

allozaur merged 1 commit into ggml-org:master from ServeurpersoCom:server/avoid-cuda-ctx-on-router

4 participants
