Conversation
|
AI disclosure: I used Claude Opus in making most of the changes, auditing and modifying the critical code myself.
|
Oh, I didn't mention it in the note, but this of course entails support for multiple grammars for one server task, since the tool grammar is still there.
|
Some interesting observations from early tests (on Qwen3.5 9B Q8_0):
|
ggerganov left a comment
I'll probably need to understand this deeper, but on first look this seems like very heavy logic. How important is this functionality?
Specifically, the changes in common_sampler seem disproportionately large compared to what this brings to the existing logic. Look for ways to simplify.
A lot of people have been requesting this, especially with the Qwen3.5 models that are seen as too verbose with their reasoning. The changes in the sampler code are basically to the grammar sampler, since the idea is (a) to support more than one grammar simultaneously and (b) to support delayed grammar application (with token counting). Maybe this can be simplified by instead inserting another grammar sampler? Not sure how viable that would be.
|
In my opinion, we need to think long term. The grammar sampler is incredibly inefficient. We had to revert a change @ggerganov wanted to make that shifts the grammar to the start of the chain to support backend sampling. Merging this will increase the reliance on the grammar sampler and make it more challenging to optimize in the future. I'm of the opinion that a dedicated, simple reasoning sampler that lives in common would be enough and can be used at the start of the chain -- so long as it aligns with the grammar used (if any).
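For reference, a minimal sketch of what such a dedicated reasoning sampler could look like. All names and token ids here are hypothetical illustrations, not the actual llama.cpp API; a real implementation would resolve the reasoning tags from the vocab/chat template and hook into the sampler chain.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical token ids for illustration only -- a real implementation
// would resolve the reasoning open/close tags from the model's vocab.
enum : int32_t { TOK_THINK_OPEN = 1000, TOK_THINK_CLOSE = 1001 };

// Minimal reasoning-budget sampler: O(1) bookkeeping per token, cheap enough
// to sit at the start of the sampler chain, unlike full grammar evaluation.
struct reasoning_budget_sampler {
    int  budget;         // max tokens allowed inside the reasoning block (< 0 = unlimited)
    int  used   = 0;     // tokens emitted since the open tag
    bool inside = false; // currently inside a reasoning block?

    // Track state after each accepted token.
    void accept(int32_t tok) {
        if (tok == TOK_THINK_OPEN)  { inside = true; used = 0; return; }
        if (tok == TOK_THINK_CLOSE) { inside = false; return; }
        if (inside) { used++; }
    }

    // Before sampling: if the budget is exhausted, the chain would mask all
    // candidates except the close tag, forcing the reasoning block to end.
    bool must_close() const { return inside && budget >= 0 && used >= budget; }
    int32_t forced_token() const { return TOK_THINK_CLOSE; }
};
```

The point of the sketch is the cost profile: constant-time counting per token instead of grammar stack evaluation, with the only "constraint" being a forced close tag once the budget runs out.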
|
Yes, framing this as a reasoning sampler should definitely be explored.
Force-pushed 1df4d24 to b201c80
|
@ggerganov @aldehir aight, reverted all the grammar changes and instead reimplemented it as a clean new reasoning parser. I tested on the CLI; there is no noticeable overhead on generation (152 t/s both with and without the sampler).
|
Unless @ggerganov thinks otherwise, I would put it under Other notes: https://arxiv.org/abs/2508.14444

Overall, I think this is a cleaner approach. It isolates the complexity rather than polluting the already complex grammar sampling logic.
aldehir left a comment
Need some tests around the apply/accept logic. I had some in my example, but feel free to improvise.
|
Funny, wonder what happened here:
GitHub merge running on Windows? :D
Force-pushed 1dbfbcb to 24eb88d
|
Aight I got rid of the explosive terminology and fixed the newlines in the process :) |
|
Okay, UTF-8 and tests are done, think this one's ready. |
bool enable_chat_template = true;
common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
int enable_reasoning = -1; // -1 = auto, 0 = disable, 1 = enable
int reasoning_budget = -1;
This reasoning_budget parameter in common_params seems like it should be removed in favor of reasoning_budget_tokens in common_params_sampling.
common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
int enable_reasoning = -1; // -1 = auto, 0 = disable, 1 = enable
int reasoning_budget = -1;
std::string reasoning_budget_message; // message injected before end tag when budget exhausted
There was a problem hiding this comment.
int reasoning_budget = -1;
std::string reasoning_budget_message;
I think you don't need these vars. Just extract the info from defaults.sampling
|
@pwilkin This broke the
* v1
* Finished!
* Handlie cli
* Reasoning sampler
* Apply suggestions from code review (Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>)
* Less explosive terminology :)
* Add utf-8 case and tests
* common : migrate reasoning budget sampler to common
* cont : clean up
* cont : expose state and allow passing as initial state
* cont : remove unused imports
* cont : update state machine doc string

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>
|
This is working great for me so far! Thanks @pwilkin! edit: I seem to get better behavior if the
|
I am on b8303 now (this was merged in b8287) and neither --enable-reasoning nor --disable-reasoning is recognized by llama-server; both return an "invalid argument" error. Is this expected?
The OP is outdated; the option is
|
I believe the documentation may need an update. I'm currently extremely confused about how to enable or disable the thinking mode of Qwen3.5. Previously, I used
If your template doesn't support disabling thinking, you can use
I updated the OP message. |
Thank you very much for your explanation. However, I still have a small question. In the previous discussion at #13196, it was mentioned that "the preferred way for disabling thinking with a command line argument is now --reasoning-budget 0." Therefore, I have been using only this command to disable thinking. So now, to disable thinking, do I need to apply both
|
Found that this works to disable thinking on a per-request basis: Now, is there an option I can use to configure
|
@SlavikCA I can stop the Qwen 3.5 35B A3B from thinking using llama-server and your API call. But I still have no luck disabling thinking in llama-cli. Here is what I get:
|
@winstonma Yeah,
Adds proper handling for --reasoning-budget.

Currently, --reasoning-budget is just a stub that handles one case: 0, and the only thing it does is set enable_thinking to false.

This PR adds the following flags:
* --reasoning on (short -rea on) - enable reasoning via kwargs on the model
* --reasoning off (short -rea off) - disable reasoning via kwargs on the model
* --reasoning-budget-message - a message appended before the reasoning close marker to inform the model that reasoning was terminated due to budget constraints, e.g. "... reasoning budget exceeded" or "... okay, now let's answer."

Also, --reasoning-budget now adds an extra grammar with a mechanism called delayed launch. When the opening trigger for the grammar fires, tokens are counted down while we also watch for the disarm trigger. If the disarm trigger does not fire before the countdown runs out, the grammar is launched.

This allows setting a real token budget on reasoning. It also makes it possible to disable thinking for models that normally do not allow it by setting the budget to 0, which is now distinct behavior from --disable-reasoning. Note: while possible, this is not recommended, since a model trained to only work with reasoning may exhibit aberrant behavior (for example, trying to open an extra reasoning section).

Supersedes #17750
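The delayed-launch countdown described above can be pictured as a small state machine. This is a sketch under assumed semantics with hypothetical names, not the actual llama.cpp implementation: after the arm trigger fires, generated tokens are counted down; if the disarm trigger is not seen before the countdown reaches zero, the constraint launches.

```cpp
#include <cassert>

// Sketch of the "delayed launch" mechanism: arm -> count down tokens,
// watching for a disarm trigger; launch only if the countdown expires first.
struct delayed_launch {
    enum class state { idle, counting, launched, disarmed };

    int   countdown;        // token budget granted after the arm trigger
    state st = state::idle;

    void on_token(bool is_arm, bool is_disarm) {
        switch (st) {
            case state::idle:
                if (is_arm) st = state::counting;
                break;
            case state::counting:
                if (is_disarm)        { st = state::disarmed; break; }
                if (--countdown <= 0) { st = state::launched; }
                break;
            default: // launched/disarmed are terminal in this sketch
                break;
        }
    }

    bool active() const { return st == state::launched; }
};
```

With a budget of 0 this launches on the very first token after the arm trigger, which matches the described use of --reasoning-budget 0 to suppress reasoning entirely.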