app: re-inject subcommand when router spawns children under unified binary #23442

Merged: allozaur merged 1 commit into ggml-org:master from ServeurpersoCom:fix/router-spawns-unified on May 21, 2026

@ServeurpersoCom
Contributor

@ServeurpersoCom ServeurpersoCom commented May 20, 2026

Overview

cont #23296

With LLAMA_BUILD_APP=ON, /proc/self/exe resolves to the unified llama binary, so the router spawns each child as "llama --host ...", which exits with an unknown-command error. The dispatcher now exports the subcommand in LLAMA_APP_CMD and the router re-injects it, so the child starts as "llama serve ...". The standalone llama-server binary is unaffected.
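
For illustration only, the re-injection described above could look roughly like the sketch below. This is not the code from the PR: the helper names build_child_argv and read_app_cmd are hypothetical, and it assumes LLAMA_APP_CMD holds the bare subcommand name (for example "serve") and is unset when the standalone llama-server binary is running.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: read the subcommand exported by the dispatcher.
// Returns an empty string when LLAMA_APP_CMD is not set (standalone binary).
std::string read_app_cmd() {
    const char * value = std::getenv("LLAMA_APP_CMD");
    return value ? std::string(value) : std::string();
}

// Hypothetical helper: build the argv used to spawn a child server.
//   exe        - path of the running binary (e.g. resolved from /proc/self/exe)
//   subcommand - value of LLAMA_APP_CMD, or empty for the standalone binary
//   args       - server arguments the router forwards to the child
std::vector<std::string> build_child_argv(const std::string & exe,
                                          const std::string & subcommand,
                                          const std::vector<std::string> & args) {
    assert(!exe.empty());

    std::vector<std::string> argv;
    argv.push_back(exe);

    // Under the unified binary, put the subcommand back in front of the
    // forwarded arguments so the child's dispatcher knows what to run.
    if (!subcommand.empty()) {
        argv.push_back(subcommand);
    }

    argv.insert(argv.end(), args.begin(), args.end());
    return argv;
}
```

With subcommand set to "serve" the child command line becomes "llama serve --host ...", and with an empty subcommand the arguments pass through unchanged, which matches the "no effect on the standalone binary" claim.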

Additional information

@ngxson WDYT? There are several possible approaches here.


@ggerganov
Member

Can we do discovery on startup and avoid setting env variables? Check which of the two tools is available, and print an error if neither is.
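
One possible reading of this suggestion, sketched for illustration and not taken from the PR: at startup, look next to the running executable for either the standalone server or the unified binary, and remember how to spawn children. The names spawn_target and discover_spawn_target are hypothetical, the file names "llama-server" and "llama" and the "serve" prefix are assumptions based on the description above, and the existence check is passed in as a callback so the logic stays independent of the filesystem.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result of the startup discovery.
struct spawn_target {
    bool                     found;   // false when neither tool is available
    std::string              exe;     // binary to execute for each child
    std::vector<std::string> prefix;  // arguments placed before the server args
};

// Hypothetical discovery: prefer the standalone server, fall back to the
// unified binary plus its subcommand, and report failure otherwise.
spawn_target discover_spawn_target(const std::string & dir,
                                   const std::function<bool(const std::string &)> & exists) {
    assert(!dir.empty());

    const std::string standalone = dir + "/llama-server";
    if (exists(standalone)) {
        return spawn_target{true, standalone, {}};
    }

    const std::string unified = dir + "/llama";
    if (exists(unified)) {
        return spawn_target{true, unified, {"serve"}};
    }

    // Caller is expected to print an error and refuse to start the router.
    return spawn_target{false, "", {}};
}
```

The caller would run this once at startup, print an error when found is false, and otherwise reuse exe and prefix for every child it spawns, so no environment variable is needed.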

@angt
Member

angt commented May 21, 2026

The current way of doing it is very fragile, btw.

@angt
Member

angt commented May 21, 2026

Setting the env from llama-app as a quickfix LGTM, but at some point I believe we should reorganize the code to spawn many models.

@ServeurpersoCom
Contributor Author

Yes, intended as a quickfix. We can do better in a follow-up PR.

@allozaur allozaur merged commit c902171 into ggml-org:master May 21, 2026
49 checks passed