
Handle reasoning budget#20297

Merged
pwilkin merged 13 commits into ggml-org:master from pwilkin:reasoning-budget
Mar 11, 2026
Conversation

@pwilkin
Contributor

@pwilkin pwilkin commented Mar 9, 2026

Adds proper handling for --reasoning-budget.

Currently, --reasoning-budget is just a stub that handles one case, 0, and the only thing it does is set enable_thinking to false.

This PR adds the following flags:

  • --reasoning on (short -rea on) - enable reasoning via kwargs on model
  • --reasoning off (short -rea off) - disable reasoning via kwargs on model
  • --reasoning-budget-message - a message to be appended before the reasoning close marker to inform the model that reasoning was terminated due to budget constraints, e.g. " ... reasoning budget exceeded" or "... okay, now let's answer."

Also, --reasoning-budget now adds an extra grammar with a mechanism called delayed launch. When the opening trigger for the grammar fires, tokens are counted down and we also watch for a disarm trigger. If the disarm trigger doesn't fire before the countdown runs out, the grammar gets launched.

This allows setting a real token budget limit on reasoning for models. It also allows disabling thinking for models that normally do not allow that by setting the budget to 0, which is now a different behavior than --reasoning off. Note: while possible, this isn't recommended, since a model trained to only work with reasoning might exhibit aberrant behavior (for example, trying to open an extra reasoning section).
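For illustration, the delayed-launch countdown described above can be sketched roughly like this. This is a minimal Python sketch of the idea only; the class name, tag strings, and method are hypothetical and do not correspond to the actual llama.cpp C++ code.

```python
# Illustrative sketch of the delayed-launch countdown (hypothetical names).
OPEN_TRIGGER   = "<think>"   # arms the countdown when reasoning opens
DISARM_TRIGGER = "</think>"  # cancels it: the model closed reasoning itself

class DelayedLaunch:
    def __init__(self, budget_tokens):
        self.budget   = budget_tokens
        self.armed    = False
        self.launched = False
        self.count    = 0

    def accept(self, token_text):
        """Feed one generated token; returns True once the grammar should launch."""
        if self.launched:
            return True
        if not self.armed:
            if OPEN_TRIGGER in token_text:
                self.armed = True  # opening trigger fired: start counting down
            return False
        if DISARM_TRIGGER in token_text:
            self.armed = False     # disarm trigger fired within budget: cancel
            return False
        self.count += 1
        if self.count >= self.budget:
            self.launched = True   # budget exhausted: launch the closing grammar
        return self.launched
```

With a budget of 2, the launch fires two tokens after the opening tag unless the model closes reasoning first.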

Supersedes #17750

@pwilkin pwilkin requested review from ggerganov and ngxson as code owners March 9, 2026 15:07
@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

AI disclosure: I used Claude Opus in making most of the changes, auditing and modifying the critical code myself.

@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

Oh, I didn't mention it in the note, but this of course entails support for multiple grammars for one server task, since the tool grammar is still there.

@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

Some interesting observations from early tests (on Qwen3.5 9B Q8_0):

  • full model humaneval is around 93%
  • non-reasoning (-dre) is around 88%
  • using reasoning_budget 1000 and 400 is actually pretty similar and improves to about 89%
  • however, this relies on having a --reasoning-budget-message (I used " ... reasoning budget exceeded, need to answer."). Without one, performance drops to a terrible 79%.

Member

@ggerganov ggerganov left a comment


I'll probably need to understand this deeper, but on first look this seems very heavy logic. How important is this functionality?

Specifically the changes in common_sampler seem disproportionately large compared to what this brings to the existing logic. Look for ways to simplify.

@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

I'll probably need to understand this deeper, but on first look this seems very heavy logic. How important is this functionality?

A lot of people have been requesting this, especially with the Qwen3.5 models that are seen as too verbose with their reasoning.

The changes in the sampler code are basically to the grammar sampler, since the idea is (a) to support more than one grammar simultaneously and (b) to support delayed grammar application (with token counting). Maybe this can be simplified by instead inserting another grammar sampler? Not sure how viable that would be.

@aldehir
Collaborator

aldehir commented Mar 9, 2026

In my opinion, we need to think long term.

The grammar sampler is incredibly inefficient. We had to revert a change @ggerganov wanted to make that shifts the grammar to the start of the chain to support backend sampling.

Merging this will increase the reliance on the grammar sampler and make it more challenging to optimize in the future.

I'm of the opinion that a dedicated, simple reasoning sampler that lives in common would be enough and can be used at the start of the chain, so long as it aligns with the grammar used (if any).

@ggerganov
Member

Yes, framing this as a reasoning sampler should definitely be explored.

@pwilkin pwilkin force-pushed the reasoning-budget branch from 1df4d24 to b201c80 on March 9, 2026 21:15
@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

@ggerganov @aldehir aight, reverted all the grammar changes and instead reimplemented it as a clean new reasoning parser.

I tested on the CLI; there is no noticeable overhead on generation (152 t/s both with and without the sampler).

@aldehir
Collaborator

aldehir commented Mar 9, 2026

Unless @ggerganov thinks otherwise, I would put it under common until it reaches maturity before exposing it in the public API. I imagine there will be quite a bit of churn with all the models to support.

Other notes:

  • Is arm_immediately needed? Why not define an initial state instead?
  • Need to add a soft/hard cap to handle partial UTF-8 sequences. When I tested this with a grammar approach, I would often see incomplete UTF-8 codepoints. Instead, we could enforce a soft cap, then continue until we hit a clean boundary or reach the hard cap.
  • Would like to see some vision for incorporating other reasoning budget strategies. For example, Nemotron Nano 2 (which they claim to also support in their 3-series).
3.4. Budget Control Evaluation
Nemotron Nano V2 allows users to specify how many thinking tokens the model may generate before producing the final answer. The final answer is the portion of text typically shown to end users. This feature is implemented by counting tokens after the model begins generating the <think> token. Once the budget is reached, the inference setup attempts to insert a closing </think> tag. Rather than inserting it immediately, we let the model finish its current sentence and place the tag at the next newline. In extreme cases where no newline appears, the system enforces closure within 500 tokens past the budget: if no newline occurs by the (budget + 500)th token, the </think> tag is forcibly inserted.

https://arxiv.org/abs/2508.14444
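The quoted Nemotron mechanism (a soft budget, then waiting for a newline boundary, then a hard cap at budget + 500) can be sketched as follows. This is an illustrative Python sketch, not llama.cpp code; close_position is a hypothetical helper operating on decoded token strings.

```python
# Sketch of the Nemotron-style soft/hard cap quoted above (hypothetical helper).
# After the soft budget we wait for a clean boundary (here: a trailing newline);
# at budget + grace we force the </think> insertion anyway.

def close_position(think_tokens, budget, grace=500):
    """Return the token index after which </think> should be inserted,
    or None if the reasoning fits within the budget."""
    n = len(think_tokens)
    if n <= budget:
        return None  # within budget: let the model close reasoning itself
    hard_cap = min(budget + grace, n)
    for i in range(budget, hard_cap):
        if think_tokens[i].endswith("\n"):  # clean sentence/line boundary
            return i + 1
    return hard_cap  # no newline found: force closure at the hard cap
```

The same shape would also help with the partial UTF-8 issue mentioned above: the boundary check just becomes "is this a complete codepoint" instead of "is this a newline".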

Overall, I think this is a cleaner approach. It isolates the complexity rather than polluting the already complex grammar sampling logic.

Collaborator

@aldehir aldehir left a comment


Need some tests around the apply/accept logic. I had some in my example, but feel free to improvise.

@CISC
Member

CISC commented Mar 9, 2026

@pwilkin
Contributor Author

pwilkin commented Mar 9, 2026

Funny, wonder what happened here: https://github.com/ggml-org/llama.cpp/actions/runs/22877806623/job/66373589713?pr=20297

GitHub merge running on Windows? :D

@pwilkin
Contributor Author

pwilkin commented Mar 10, 2026

Aight I got rid of the explosive terminology and fixed the newlines in the process :)

@pwilkin
Contributor Author

pwilkin commented Mar 10, 2026

Okay, UTF-8 and tests are done, think this one's ready.

@github-actions github-actions bot added the testing (Everything test related) label Mar 10, 2026
@pwilkin pwilkin requested review from CISC, aldehir and ggerganov March 10, 2026 14:11
@pwilkin pwilkin merged commit acb7c79 into ggml-org:master Mar 11, 2026
13 of 75 checks passed
@pwilkin pwilkin deleted the reasoning-budget branch March 11, 2026 09:26
bool enable_chat_template = true;
common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
int enable_reasoning = -1; // -1 = auto, 0 = disable, 1 = enable
int reasoning_budget = -1;
Member


This reasoning_budget parameter in common_params seems like it should be removed in favor of reasoning_budget_tokens in common_params_sampling.

common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
int enable_reasoning = -1; // -1 = auto, 0 = disable, 1 = enable
int reasoning_budget = -1;
std::string reasoning_budget_message; // message injected before end tag when budget exhausted
Member

@ggerganov ggerganov Mar 11, 2026


I think reasoning_budget_message is rather a sampling parameter, so probably better to move to common_params_sampling.

Comment on lines +60 to +61
int reasoning_budget = -1;
std::string reasoning_budget_message;
Member


I think you don't need these vars. Just extract the info from defaults.sampling

@CISC
Member

CISC commented Mar 11, 2026

ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
* v1

* Finished!

* Handlie cli

* Reasoning sampler

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Less explosive terminology :)

* Add utf-8 case and tests

* common : migrate reasoning budget sampler to common

* cont : clean up

* cont : expose state and allow passing as initial state

* cont : remove unused imports

* cont : update state machine doc string

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
@jabr

jabr commented Mar 13, 2026

This is working great for me so far! Thanks @pwilkin!

edit: I seem to get better behavior if the message ends with a newline. Without it, the thinking sometimes "escapes" the first close tag and continues on into the content. (With Qwen3.5 0.8B)

@erazortt

erazortt commented Mar 13, 2026

I am on b8303 now (this was merged in b8287) and neither --enable-reasoning nor --disable-reasoning flags are recognized by llama-server, returning an "invalid argument" error. Is this expected?

@CISC
Member

CISC commented Mar 13, 2026

I am on b8303 now and neither --enable-reasoning nor --disable-reasoning flags are recognized by llama-server, returning an invalid argument. Is this expected?

The OP is outdated, the option is --reasoning on or --reasoning off.

@ZUIcat

ZUIcat commented Mar 13, 2026

I believe the documentation may need an update. I'm currently extremely confused about how to enable or disable the thinking mode of Qwen3.5. Previously, I used reasoning-budget = 0, but it seems that's no longer the case? What exactly is --reasoning? And how is it different from chat-template-kwargs = "{\"enable_thinking\":false}"?

@pwilkin
Contributor Author

pwilkin commented Mar 13, 2026

What exactly is --reasoning? And how is it different from chat-template-kwargs = "{\"enable_thinking\":false}"?

--reasoning is enable_thinking together with a few extra optimizations for handling the thinking path in the parser. So if you want to disable reasoning, use --reasoning off, which also sets enable_thinking = false in the template.

If your template doesn't support disabling thinking, you can use --reasoning-budget 0 as the sampler "forcing" solution, but beware that models might not like it (this is basically the equivalent of hacking the template to insert </think> at the start of the response).
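To illustrate the "forcing" equivalence: with a budget of 0, the effect is the same as closing the reasoning block the instant it opens. This hypothetical Python sketch uses the common <think>/</think> convention; it is not the actual sampler implementation.

```python
# Hypothetical sketch: budget 0 behaves as if </think> were emitted
# immediately after the model opens its reasoning block.

def force_no_reasoning(generated_so_far):
    """If the model just opened a reasoning block, immediately close it."""
    if generated_so_far.endswith("<think>"):
        return generated_so_far + "</think>"
    return generated_so_far
```

This is why models trained to always reason may misbehave under budget 0: from their perspective, the reasoning section was opened and slammed shut before they produced a single thinking token.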

@pwilkin
Contributor Author

pwilkin commented Mar 13, 2026

The OP is outdated, the option is --reasoning on or --reasoning off.

I updated the OP message.

@ZUIcat

ZUIcat commented Mar 14, 2026

--reasoning is enable_thinking together with a few extra optimizations for handling the thinking path in the parser. So if you want to disable reasoning, use --reasoning off, which also sets enable_thinking = false in the template.

If your template doesn't support disabling thinking, you can use --reasoning-budget 0 as the sampler "forcing" solution, but beware that models might not like it (this is basically the equivalent of hacking the template to insert </think> at the start of the response).

Thank you very much for your explanation. However, I still have a small question. In the previous discussion at #13196, it was mentioned that "the preferred way for disabling thinking with a command line argument is now --reasoning-budget 0." Therefore, I have been using only this command to disable thinking.

So now, to disable thinking, do I need to apply both --reasoning-budget 0 and --reasoning off? Or is it sufficient to use only --reasoning off without anything else (considering Qwen3.5 only)?

@SlavikCA

SlavikCA commented Mar 14, 2026

Is it true that all flags and options mentioned in this thread apply only when starting llama?

Is there an option (header) to use with an API request to enable/disable reasoning? For example, send one query with thinking disabled and another with thinking enabled?

Found that this works to disable thinking on per-request basis:

curl 'https://direct.*****/v1/chat/completions' \
 -H "x-api-key: ****" \
 -H "Content-Type: application/json" \
 -d '{
    "model": "qwen35-122b",
    "chat_template_kwargs": {"enable_thinking": false},
    "messages": [
        {
        "role": "user",
        "content": "How large is the solar system?"
        }
    ]
}' | jq

Now, is there an option I can use to configure reasoning-budget per request?

@winstonma

@SlavikCA I can stop the Qwen 3.5 35B A3B from thinking using llama-server and your API call.

But I still have no luck disabling the thinking on llama-cli. Here is what I get:

❯ llama-cli -m ~/model/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf --reasoning off
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8352-d88ccec
model      : Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> hi

[Start thinking]
Thinking Process:

1.  **Analyze the Input:**
...

@pwilkin
Contributor Author

pwilkin commented Mar 15, 2026

@winstonma Yeah, cli reasoning parameter passing is broken. Will be fixed in #20424


Labels

examples, server, testing (Everything test related)
