
Handle reasoning budget#20297

Merged
pwilkin merged 13 commits into ggml-org:master from pwilkin:reasoning-budget
Mar 11, 2026
Conversation

@pwilkin
Contributor

@pwilkin pwilkin commented Mar 9, 2026

Adds proper handling for --reasoning-budget.

Currently, --reasoning-budget is just a stub that handles one case, 0, and the only thing it does is set enable_thinking to false.

This PR adds the following flags:

  • --reasoning on (short -rea on) - enable reasoning via kwargs on model
  • --reasoning off (short -rea off) - disable reasoning via kwargs on model
  • --reasoning-budget-message - a message to be appended before the reasoning close marker to inform the model that reasoning was terminated due to budget constraints, e.g. " ... reasoning budget exceeded" or "... okay, now let's answer."

Also, --reasoning-budget now adds an extra grammar with a mechanism called delayed launch. When the opening trigger for the grammar fires, tokens are counted down and we also watch for a disarm trigger. If the disarm trigger doesn't fire before the countdown runs out, the grammar gets launched.

This allows setting a real token budget limit on reasoning for models. It also allows disabling thinking for models that normally do not allow that by setting the budget to 0, which is now a different behavior than --reasoning off. Note: while possible, this isn't recommended, since a model trained to only work with reasoning might exhibit aberrant behavior (for example, trying to open an extra reasoning section).
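For illustration, the delayed-launch countdown described above can be sketched roughly like this. This is a minimal Python sketch of the idea only; the class name, tag strings, and method are hypothetical and do not correspond to the actual llama.cpp C++ code.

```python
# Illustrative sketch of the delayed-launch countdown (hypothetical names).
OPEN_TRIGGER   = "<think>"   # arms the countdown when reasoning opens
DISARM_TRIGGER = "</think>"  # cancels it: the model closed reasoning itself

class DelayedLaunch:
    def __init__(self, budget_tokens):
        self.budget   = budget_tokens
        self.armed    = False
        self.launched = False
        self.count    = 0

    def accept(self, token_text):
        """Feed one generated token; returns True once the grammar should launch."""
        if self.launched:
            return True
        if not self.armed:
            if OPEN_TRIGGER in token_text:
                self.armed = True  # opening trigger fired: start counting down
            return False
        if DISARM_TRIGGER in token_text:
            self.armed = False     # disarm trigger fired within budget: cancel
            return False
        self.count += 1
        if self.count >= self.budget:
            self.launched = True   # budget exhausted: launch the closing grammar
        return self.launched
```

With a budget of 2, the launch fires two tokens after the opening tag unless the model closes reasoning first.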

Supersedes #17750

@pwilkin pwilkin requested review from ggerganov and ngxson as code owners March 9, 2026 15:07
@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

AI disclosure: I used Claude Opus in making most of the changes, auditing and modifying the critical code myself.

@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

Oh, I didn't mention it in the note, but this of course entails support for multiple grammars for one server task, since the tool grammar is still there.

@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

Some interesting observations from early tests (on Qwen3.5 9B Q8_0):

  • full model humaneval is around 93%
  • non-reasoning (-dre) is around 88%
  • using reasoning_budget 1000 and 400 is actually pretty similar and improves to about 89%
  • however, this relies on having a --reasoning-budget-message (I used " ... reasoning budget exceeded, need to answer."). Without one, performance drops to a terrible 79%.

Member

@ggerganov ggerganov left a comment


I'll probably need to understand this deeper, but on first look this seems very heavy logic. How important is this functionality?

Specifically the changes in common_sampler seem disproportionately large compared to what this brings to the existing logic. Look for ways to simplify.

@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

I'll probably need to understand this deeper, but on first look this seems very heavy logic. How important is this functionality?

A lot of people have been requesting this, especially with the Qwen3.5 models that are seen as too verbose with their reasoning.

The changes in the sampler code are basically to the grammar sampler, since the idea is (a) to support more than one grammar simultaneously and (b) to support delayed grammar application (with token counting). Maybe this can be simplified by instead inserting another grammar sampler? Not sure how viable that would be.

@aldehir
Collaborator

aldehir commented Mar 9, 2026

In my opinion, we need to think long term.

The grammar sampler is incredibly inefficient. We had to revert a change @ggerganov wanted to make that shifts the grammar to the start of the chain to support backend sampling.

Merging this will increase the reliance on the grammar sampler and make it more challenging to optimize in the future.

I'm of the opinion that a dedicated, simple reasoning sampler that lives in common would be enough and can be used at the start of the chain, so long as it aligns with the grammar used (if any).

@ggerganov
Member

Yes, framing this as a reasoning sampler should definitely be explored.

@pwilkin pwilkin force-pushed the reasoning-budget branch from 1df4d24 to b201c80 on March 9, 2026 21:15
@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

@ggerganov @aldehir aight, reverted all the grammar changes and instead reimplemented it as a clean new reasoning parser.

I tested on the CLI; there is no noticeable overhead on generation (152 t/s both with and without the sampler).

@aldehir
Collaborator

aldehir commented Mar 9, 2026

Unless @ggerganov thinks otherwise, I would put it under common until it reaches maturity before exposing it in the public API. I imagine there will be quite a bit of churn with all the models to support.

Other notes:

  • Is arm_immediately needed? Why not define an initial state instead?
  • Need to add a soft/hard cap to handle partial UTF-8 sequences. When I tested this with a grammar approach, I would often see incomplete UTF-8 codepoints. Instead, we could enforce a soft cap, then continue until we hit a clean boundary or reach the hard cap.
  • Would like to see some vision for incorporating other reasoning budget strategies. For example, Nemotron Nano 2 (which they claim to also support in their 3-series).
3.4. Budget Control Evaluation
Nemotron Nano V2 allows users to specify how many thinking tokens the model may generate before producing the final answer. The final answer is the portion of text typically shown to end users. This feature is implemented by counting tokens after the model begins generating the <think> token. Once the budget is reached, the inference setup attempts to insert a closing </think> tag. Rather than inserting it immediately, we let the model finish its current sentence and place the tag at the next newline. In extreme cases where no newline appears, the system enforces closure within 500 tokens past the budget: if no newline occurs by the (budget + 500)th token, the </think> tag is forcibly inserted.

https://arxiv.org/abs/2508.14444
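The quoted Nemotron mechanism (a soft budget, then waiting for a newline boundary, then a hard cap at budget + 500) can be sketched as follows. This is an illustrative Python sketch, not llama.cpp code; close_position is a hypothetical helper operating on decoded token strings.

```python
# Sketch of the Nemotron-style soft/hard cap quoted above (hypothetical helper).
# After the soft budget we wait for a clean boundary (here: a trailing newline);
# at budget + grace we force the </think> insertion anyway.

def close_position(think_tokens, budget, grace=500):
    """Return the token index after which </think> should be inserted,
    or None if the reasoning fits within the budget."""
    n = len(think_tokens)
    if n <= budget:
        return None  # within budget: let the model close reasoning itself
    hard_cap = min(budget + grace, n)
    for i in range(budget, hard_cap):
        if think_tokens[i].endswith("\n"):  # clean sentence/line boundary
            return i + 1
    return hard_cap  # no newline found: force closure at the hard cap
```

The same shape would also help with the partial UTF-8 issue mentioned above: the boundary check just becomes "is this a complete codepoint" instead of "is this a newline".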

Overall, I think this is a cleaner approach. It isolates the complexity rather than polluting the already complex grammar sampling logic.

Collaborator

@aldehir aldehir left a comment


Need some tests around the apply/accept logic. I had some in my example, but feel free to improvise.

@CISC
Member

CISC commented Mar 9, 2026

@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

Funny, wonder what happened here: https://github.com/ggml-org/llama.cpp/actions/runs/22877806623/job/66373589713?pr=20297

GitHub merge running on Windows? :D

@pwilkin
Contributor Author

pwilkin commented Mar 10, 2026

Aight I got rid of the explosive terminology and fixed the newlines in the process :)

@pwilkin
Contributor Author

pwilkin commented Mar 10, 2026

Okay, UTF-8 and tests are done, think this one's ready.

@github-actions github-actions bot added the testing (Everything test related) label Mar 10, 2026
@pwilkin pwilkin requested review from CISC, aldehir and ggerganov March 10, 2026 14:11
@pwilkin pwilkin merged commit acb7c79 into ggml-org:master Mar 11, 2026
13 of 75 checks passed
@pwilkin pwilkin deleted the reasoning-budget branch March 11, 2026 09:26
bool enable_chat_template = true;
common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
int enable_reasoning = -1; // -1 = auto, 0 = disable, 1 = enable
int reasoning_budget = -1;
Member


This reasoning_budget parameter in common_params seems like it should be removed in favor of reasoning_budget_tokens in common_params_sampling.

common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
int enable_reasoning = -1; // -1 = auto, 0 = disable, 1 = enable
int reasoning_budget = -1;
std::string reasoning_budget_message; // message injected before end tag when budget exhausted
Member

@ggerganov ggerganov Mar 11, 2026


I think reasoning_budget_message is rather a sampling parameter, so probably better to move to common_params_sampling.

Comment on lines +60 to +61
int reasoning_budget = -1;
std::string reasoning_budget_message;
Member


I think you don't need these vars. Just extract the info from defaults.sampling

@CISC
Member

CISC commented Mar 11, 2026

ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
* v1

* Finished!

* Handlie cli

* Reasoning sampler

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Less explosive terminology :)

* Add utf-8 case and tests

* common : migrate reasoning budget sampler to common

* cont : clean up

* cont : expose state and allow passing as initial state

* cont : remove unused imports

* cont : update state machine doc string

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
@jabr

jabr commented Mar 13, 2026

This is working great for me so far! Thanks @pwilkin!

edit: I seem to get better behavior if the message ends with a newline. Without it, the thinking sometimes "escapes" the first close tag and continues on into the content. (With Qwen3.5 0.8B)

@erazortt

erazortt commented Mar 13, 2026

I am on b8303 now (this was merged in b8287) and neither --enable-reasoning nor --disable-reasoning flags are recognized by llama-server, returning an "invalid argument" error. Is this expected?

@CISC
Member

CISC commented Mar 13, 2026

I am on b8303 now and neither --enable-reasoning nor --disable-reasoning flags are recognized by llama-server, returning an invalid argument. Is this expected?

The OP is outdated, the option is --reasoning on or --reasoning off.

@ZUIcat

ZUIcat commented Mar 13, 2026

I believe the documentation may need an update. I'm currently extremely confused about how to enable or disable the thinking mode of Qwen3.5. Previously, I used reasoning-budget = 0, but it seems that's no longer the case? What exactly is --reasoning? And how is it different from chat-template-kwargs = "{\"enable_thinking\":false}"?

@pwilkin
Contributor Author

pwilkin commented Mar 13, 2026

What exactly is --reasoning? And how is it different from chat-template-kwargs = "{\"enable_thinking\":false}"?

--reasoning is enable_thinking together with a few extra optimizations for handling the thinking path in the parser. So if you want to disable reasoning, use --reasoning off, which also sets enable_thinking = false in the template.

If your template doesn't support disabling thinking, you can use --reasoning-budget 0 as the sampler "forcing" solution, but beware that models might not like it (this is basically the equivalent of hacking the template to insert </think> at the start of the response).
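To illustrate the "forcing" equivalence: with a budget of 0, the effect is the same as closing the reasoning block the instant it opens. This hypothetical Python sketch uses the common <think>/</think> convention; it is not the actual sampler implementation.

```python
# Hypothetical sketch: budget 0 behaves as if </think> were emitted
# immediately after the model opens its reasoning block.

def force_no_reasoning(generated_so_far):
    """If the model just opened a reasoning block, immediately close it."""
    if generated_so_far.endswith("<think>"):
        return generated_so_far + "</think>"
    return generated_so_far
```

This is why models trained to always reason may misbehave under budget 0: from their perspective, the reasoning section was opened and slammed shut before they produced a single thinking token.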

@pwilkin
Contributor Author

pwilkin commented Mar 13, 2026

The OP is outdated, the option is --reasoning on or --reasoning off.

I updated the OP message.

@ZUIcat

ZUIcat commented Mar 14, 2026

--reasoning is enable_thinking together with a few extra optimizations for handling the thinking path in the parser. So if you want to disable reasoning, use --reasoning off, which also sets enable_thinking = false in the template.

If your template doesn't support disabling thinking, you can use --reasoning-budget 0 as the sampler "forcing" solution, but beware that models might not like it (this is basically the equivalent of hacking the template to insert </think> at the start of the response).

Thank you very much for your explanation. However, I still have a small question. In the previous discussion at #13196, it was mentioned that "the preferred way for disabling thinking with a command line argument is now --reasoning-budget 0." Therefore, I have been using only this command to disable thinking.

So now, to disable thinking, do I need to apply both --reasoning-budget 0 and --reasoning off? Or is it sufficient to use only --reasoning off without anything else (considering Qwen3.5 only)?

@SlavikCA

SlavikCA commented Mar 14, 2026

Is it true that all flags and options mentioned in this thread apply only when starting llama?

Is there an option (header) to use with an API request to enable/disable reasoning? For example, send one query with thinking disabled and another with thinking enabled?

Found that this works to disable thinking on per-request basis:

curl 'https://direct.*****/v1/chat/completions' \
 -H "x-api-key: ****" \
 -H "Content-Type: application/json" \
 -d '{
    "model": "qwen35-122b",
    "chat_template_kwargs": {"enable_thinking": false},
    "messages": [
        {
        "role": "user",
        "content": "How large is the solar system?"
        }
    ]
}' | jq

Now, is there an option I can use to configure reasoning-budget per request?

@winstonma

@SlavikCA I can stop the Qwen 3.5 35B A3B from thinking using llama-server and your API call.

But I still have no luck disabling the thinking on llama-cli. Here is what I get:

❯ llama-cli -m ~/model/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf --reasoning off
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8352-d88ccec
model      : Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> hi

[Start thinking]
Thinking Process:

1.  **Analyze the Input:**
...

@pwilkin
Contributor Author

pwilkin commented Mar 15, 2026

@winstonma Yeah, cli reasoning parameter passing is broken. Will be fixed in #20424


Labels

examples, server, testing (Everything test related)
