common : support manually triggering the reasoning budget end sequence #23949

Merged
pwilkin merged 1 commit into ggml-org:master from aldehir:expose-reasoning-forcing-fn on Jun 1, 2026

@aldehir aldehir commented May 31, 2026

Overview

Add a way to force the reasoning budget end sequence while the sampler is in the COUNTING state. This allows the server to close the reasoning block manually.

bool common_sampler_reasoning_budget_force(struct common_sampler * gsmpl)

When this feature is desired, the reasoning budget sampler must now always be enabled. See #21870: some users have reported that the reasoning budget impacts their performance, while others notice no degradation. Currently the sampler is only added if reasoning_budget > 0, or if grammar_lazy: true (to avoid triggering the grammar during reasoning).
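To illustrate the idea, here is a minimal sketch of a budget sampler whose end sequence can be forced only while it is counting. This is not the code from this PR: `toy_budget_sampler`, `toy_budget_force`, `toy_budget_next`, and every state other than COUNTING are hypothetical names chosen for illustration, and tokens are modelled as plain strings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified states. Only COUNTING is named in the PR description.
enum class toy_budget_state { IDLE, COUNTING, FORCING, DONE };

// Toy stand-in for a reasoning budget sampler (assumption, not the real struct).
struct toy_budget_sampler {
    toy_budget_state         state = toy_budget_state::IDLE;
    std::vector<std::string> end_sequence; // pieces that close the reasoning block
    std::size_t              force_pos = 0; // next piece of end_sequence to emit
};

// Start forcing the end sequence. Succeeds only while COUNTING,
// so calling it before reasoning starts or after it ends is a no-op.
bool toy_budget_force(toy_budget_sampler & s) {
    if (s.state != toy_budget_state::COUNTING) {
        return false;
    }
    s.state     = toy_budget_state::FORCING;
    s.force_pos = 0;
    return true;
}

// Return the piece the sampler forces next, or "" when it imposes no constraint.
std::string toy_budget_next(toy_budget_sampler & s) {
    if (s.state != toy_budget_state::FORCING) {
        return "";
    }
    if (s.force_pos >= s.end_sequence.size()) {
        s.state = toy_budget_state::DONE; // nothing (left) to force
        return "";
    }
    std::string piece = s.end_sequence[s.force_pos++];
    if (s.force_pos == s.end_sequence.size()) {
        s.state = toy_budget_state::DONE; // end sequence fully emitted
    }
    return piece;
}
```

In this sketch the boolean return value tells the caller whether the request took effect, which is one plausible reading of the `bool` returned by `common_sampler_reasoning_budget_force`.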

ref: #23944 (comment)

Additional information

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, simple enough so I guided Slopus 4.8 to my desired implementation.

@aldehir aldehir requested review from a team and ggerganov as code owners May 31, 2026 21:54
@github-actions github-actions Bot added the testing Everything test related label May 31, 2026
@aldehir aldehir requested review from ngxson and pwilkin May 31, 2026 21:58
@pwilkin pwilkin merged commit 5254a79 into ggml-org:master Jun 1, 2026
27 checks passed
ServeurpersoCom added a commit to ServeurpersoCom/llama.cpp that referenced this pull request Jun 1, 2026
Builds on the manual reasoning budget trigger from ggml-org#23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control with { id_slot, action }, opt-in
reasoning_control arms the budget sampler on demand. Router and single
model. Minimal WebUI button as a skeleton for further UI work.
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 1, 2026
* origin/master: (36 commits)
vendor : update cpp-httplib to 0.46.1 (ggml-org#23980)
llama: limit max outputs of `llama_context` (ggml-org#23861)
metal: template GLU kernels to support f16/f32 (ggml-org#23882)
vulkan: don't hold the device mutex while compiling pipelines (ggml-org#23641)
vulkan: reduce host memory lock contention (ggml-org#23376)
vocab: add normalizer.lowercase support to WPM (ggml-org#23899)
TP: quantized KV cache support (ggml-org#23792)
security : disable private disclosures (ggml-org#23963)
model: Add EXAONE 4.5 implementations (ggml-org#21733)
vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (ggml-org#23056)
vulkan: Removed unused functions (ggml-org#23175)
common : support manually triggering the reasoning budget end sequence (ggml-org#23949)
ci : add missing Linux label to cpu-x64-high-perf runner (ggml-org#23958)
[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (ggml-org#23812)
[SYCL] Add more types in GET_ROWS OP (ggml-org#23710)
sycl : Optimize Q3_K mul_mat by reorder (ggml-org#23725)
ci: remove redundant or duplicate jobs (ggml-org#23927)
server : handle If-None-Match weak ETags (ggml-org#23916)
ci : limit trigger paths for the CPU workflow (ggml-org#23938)
vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756)
...
allozaur added a commit that referenced this pull request Jun 2, 2026
* server: real-time reasoning interruption via control endpoint

Builds on the manual reasoning budget trigger from #23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control with { id_slot, action }, opt-in
reasoning_control arms the budget sampler on demand. Router and single
model. Minimal WebUI button as a skeleton for further UI work.

* ui: track reasoning phase via explicit streaming state

Add isReasoning to the chat store, mirroring the isLoading pattern:
per conversation map, private setter, public accessor and reactive
export. Set from the stream callbacks, true on reasoning chunks, false
on the first content chunk, reset on stream end and resynced on
conversation switch. The skip button now keys off isReasoning so it
shows only during the thinking phase, not the whole generation.

* ui: extract control endpoint and action into constants

Move the chat completion routes, the slots route and the reasoning
control action out of chat.service into api-endpoints and a dedicated
control-actions module. No behavior change, drops the magic strings so
the control protocol has a single source of truth.

* server: target reasoning control by completion id

Address @ngxson review on the control endpoint.

Switch from id_slot to the chat completion id to avoid a TOCTOU: the
slot can be reassigned between the lookup and the control request, so
matching the live completion (oaicompat_cmpl_id) is safe and a finished
one simply matches nothing. Rename the action to reasoning_end, guard
it on the reasoning_control flag of the target slot, and reduce the
response to {success} with an optional message.

* ui: target reasoning control by completion id

Keep the streamed completion id on the message and post it back to the
control endpoint instead of probing /slots. Drops the slot discovery
and the TOCTOU that came with it. Action renamed to reasoning_end,
response read as {success}.

* server: address review from @ngxson

Move the control fields into task_params and drop the redundant
comments on the control path.

* server: document the reasoning control endpoint

* Update tools/ui/src/lib/types/database.d.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: rename cmplId to completionId

Per @allozaur review, clearer name for the streamed completion id.

* ui: wire completion id capture through the agentic flow

The webui streams through the agentic flow, which relayed onModel but
not onCompletionId, so the completion id never reached the message and
the control request was never sent. Relay it through the flow and its
callbacks type, declare id on the chunk type, and log an explicit error
when the button fires without a usable id.

* ui: target reasoning control model from the message

The model is a property of the completion, so read it from the streaming
message like the id, not from the model dropdown which is unrelated UI
state. Makes the request self-consistent by construction instead of just
unlikely to drift.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
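
The completion-id targeting described in the commit message above can be sketched roughly as follows. This is an illustrative model only, not the server code: `toy_slot`, `toy_control_result`, `toy_handle_control`, and the message strings are hypothetical, while the `reasoning_end` action, the `reasoning_control` flag, and the `{success}` response with an optional message come from the commit message.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a server slot.
struct toy_slot {
    std::string cmpl_id;                        // id of the live completion, "" when idle
    bool        reasoning_control       = false; // opt-in flag set by the request
    bool        reasoning_end_requested = false; // a real server would call
                                                 // common_sampler_reasoning_budget_force here
};

// Response shape: success plus an optional message.
struct toy_control_result {
    bool        success;
    std::string message;
};

// Match on the completion id rather than a slot index. If the slot has since
// been reassigned to another request, its id differs and nothing matches,
// which avoids the TOCTOU described above.
toy_control_result toy_handle_control(std::vector<toy_slot> & slots,
                                      const std::string & cmpl_id,
                                      const std::string & action) {
    if (action != "reasoning_end") {
        return {false, "unknown action"};
    }
    for (auto & slot : slots) {
        if (slot.cmpl_id.empty() || slot.cmpl_id != cmpl_id) {
            continue; // idle slot or a different completion
        }
        if (!slot.reasoning_control) {
            return {false, "reasoning control not enabled"};
        }
        slot.reasoning_end_requested = true;
        return {true, ""};
    }
    return {false, "no matching completion"};
}
```

A request that names a finished completion falls through the loop and gets `success == false`, so a late click on the skip button cannot affect whichever request now occupies the slot.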
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026