
common/parser: add proper reasoning tag prefill reading#20424

Open
pwilkin wants to merge 9 commits into ggml-org:master from pwilkin:reasoning-prefill

Conversation


@pwilkin pwilkin commented Mar 11, 2026

This changes the erroneous behavior of the autoparser, which ascribed thinking behavior to templates. As people rightly pointed out, some models have dynamic or hybrid reasoning: they can reason or not depending on switches, and even the template's behavior can change as a result (e.g. inserting `<think>` in the assistant prefill after a "no_think" appears in a user message).

Therefore, the FORCED_OPEN and FORCED_CLOSED formats are gone. The parser now just detects models with tagged reasoning, i.e. an opening and a closing reasoning marker (DELIMITER is also deleted, since it's a special case with an empty opening marker). However, the parser checks the assistant prefill for those markers and appends them to the input for the grammar and the parser so that they are taken into account. This simplifies the parsing mechanism, since it no longer has to differentiate whether the `<think>` was added by the template or generated by the model.
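The prefill check described above can be sketched roughly like this (Python pseudocode; the function name and logic are illustrative assumptions, not the actual C++ implementation):

```python
# Hypothetical sketch of the prefill check; names and logic are illustrative,
# not the actual C++ implementation in common/chat.cpp.
def extract_reasoning_prefill(prompt: str, start_tag: str, end_tag: str) -> str:
    """Return the reasoning-tag prefill that the template appended to the
    prompt, so it can be replayed through the parser and the grammar."""
    tail = prompt.rstrip()
    if tail.endswith(end_tag):
        # The template emitted a closed (empty) reasoning block,
        # e.g. "<think></think>" - reasoning is disabled for this turn.
        return start_tag + end_tag
    if tail.endswith(start_tag):
        # The template opened a reasoning block the model will continue.
        return prompt[prompt.rfind(start_tag):]
    return ""
```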


pwilkin commented Mar 11, 2026

Fixes #20356
Fixes #20325
Fixes #20265

This also clears the ground for disabling grammar triggers inside reasoning loops in a subsequent PR, which would resolve #20260

@github-actions github-actions bot added documentation Improvements or additions to documentation testing Everything test related examples server labels Mar 11, 2026

aldehir commented Mar 11, 2026

Dumb question: why not find the start of the assistant message and prepend that?

I agree it would be easier to parse if we had a "prefill" of some sort that normalizes the input, such that we can handle the logic in the grammar and not through flags. However, if we're going this route I would look into prepending the start of the entire assistant message. This will also open the door for parsing output from requests with an assistant prefill.


pwilkin commented Mar 11, 2026

Yeah, that would be the logical conclusion, but for now it's easier for me just to extract the reasoning markers since finding the actual start of the assistant message is nontrivial.


aldehir commented Mar 11, 2026

Qwen3.5 uses:

```jinja
{%- if enable_thinking is defined and enable_thinking is false %}
    {{- '<think>\n\n</think>\n\n' }}
{%- else %}
    {{- '<think>\n' }}
{%- endif %}
```

however,

      "reasoning_prefill": "<think></think>\n\n",

It probably doesn't matter for this model, but it is technically not adhering to the template.


aldehir commented Mar 11, 2026

```json
[
  { "id": 248045, "piece": "<|im_start|>" },
  { "id": 74455,  "piece": "assistant" },
  { "id": 198,    "piece": "\n" },
  { "id": 248068, "piece": "<think>" },
  { "id": 271,    "piece": "\n\n" },
  { "id": 248069, "piece": "</think>" },
  { "id": 271,    "piece": "\n\n" }
]
```

Maybe set reasoning_prefill from the start of the opening tag to the end of the prompt?


aldehir commented Mar 11, 2026

finding the actual start of the assistant message is nontrivial.

Run the template once with add_generation_prompt = false, capture the size, run again with true, extract the string content that spans the delta? I think that would work in most cases.
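The differential trick could be sketched as follows (`apply_template` is a stand-in for real chat-template rendering, not llama.cpp's API):

```python
# Illustrative sketch of the differential approach; apply_template stands in
# for real chat-template rendering and is an assumption, not llama.cpp's API.
def generation_prompt_delta(apply_template, messages) -> str:
    """Render the template with and without the generation prompt and
    return the suffix the template appends for the assistant turn."""
    without_gp = apply_template(messages, add_generation_prompt=False)
    with_gp = apply_template(messages, add_generation_prompt=True)
    # The generation prompt is whatever text the second render appended.
    return with_gp[len(without_gp):]
```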


pwilkin commented Mar 12, 2026

That usually works, yeah 😀 I can try that and see what the results are (this is what calculate_diff_split from the analyzer does BTW). I'm just worried about some weird edge cases.


bsdice commented Mar 14, 2026

Nice patch! With the model https://huggingface.co/mradermacher/Qwen3.5-40B-Claude-4.5-Opus-High-Reasoning-Thinking-GGUF, the patches fix the webui getting confused on /think and failing to split the reasoning and generation parts correctly. Build llama.cpp-cuda-git-b8334.r9.710878a7dd-1.

@pwilkin pwilkin force-pushed the reasoning-prefill branch from 3bfb08f to 4083259 Compare March 14, 2026 14:49

pwilkin commented Mar 14, 2026

@aldehir changed the prefill extraction behavior to the differential one you mentioned.

common/chat.h Outdated

```cpp
std::string grammar;
bool grammar_lazy = false;
bool thinking_forced_open = false;
std::string prefill;
```
Contributor

Should we name this generation_prompt? It lines up with the add_generation_prompt flag.

Comment on lines +71 to +95

```cpp
bool clear_reasoning_start = false;
if (inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE &&
    autoparser.reasoning.mode != reasoning_mode::NONE &&
    !autoparser.reasoning.end.empty()) {
    const auto & r_start = autoparser.reasoning.start;
    const auto & r_end = autoparser.reasoning.end;
    auto r_end_t = trim_trailing_whitespace(r_end);
    auto r_start_t = trim_trailing_whitespace(r_start);

    if (!r_start_t.empty()) {
        auto start_pos = prompt_to_search.rfind(r_start_t);
        if (start_pos != std::string::npos) {
            std::string from_start = prompt_to_search.substr(start_pos);
            auto fs_trimmed = trim_trailing_whitespace(from_start);

            if (string_ends_with(fs_trimmed, r_end_t)) {
                data.prefill = r_start + r_end;
            } else if (string_ends_with(fs_trimmed, r_start_t)) {
                data.prefill = from_start;
            } else {
                clear_reasoning_start = true;
            }
        }
    }
}
```
@aldehir aldehir Mar 14, 2026
So my understanding is: we have a generation prompt G, and we can create a parser that accepts G[0:min(G.size(), G.index_of(reasoning_start))] + (reasoning_start + reasoning + reasoning_end)? + .... Then we can do away with all the trim logic.

The benefit is that now the parser can properly parse assistant prefill from the user, since the parser starts from the beginning of the assistant message.

I see that Mistral's templates have no generation prompt, so G = "". But this is fine, because the model emits the [THINK] tag. So the above still works.
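The accepted shape described above can be approximated with a regular expression (a simplification for illustration only; the real mechanism is grammar-based, and the tag names are assumptions):

```python
import re

# A simplified regex approximation of the parser shape being discussed:
# an optional generation-prompt prefix, an optional tagged reasoning block,
# then content. The real mechanism is grammar-based; tags are assumptions.
def build_output_pattern(generation_prompt: str) -> re.Pattern:
    g = re.escape(generation_prompt)
    return re.compile(
        rf"(?:{g})?"                               # G may or may not be present
        r"(?:<think>(?P<reasoning>.*?)</think>)?"  # optional reasoning block
        r"(?P<content>.*)",                        # the rest is content
        re.DOTALL,
    )
```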

Contributor Author

This workaround is mostly for Apriel that has a delimited thinking format and inserts a header "Thinking chain starts here: " or something like that as the generation prompt which acts as a quasi-reasoning marker that we want to strip.


pwilkin commented Mar 15, 2026

@aldehir okay, that rewrite ended up being a bit bigger than I expected... but it's exactly the algorithm you mentioned now.


aldehir commented Mar 15, 2026

Oh jeez, well it's <100 net LOC. I'll give it a whirl.

@pwilkin pwilkin requested review from a team as code owners March 15, 2026 15:02

pwilkin commented Mar 15, 2026

@aldehir happy to report I added another nice piece of code to make it work correctly with grammars / schemas :)

@pwilkin pwilkin force-pushed the reasoning-prefill branch from d0cf846 to a21d219 Compare March 15, 2026 16:18
@github-actions github-actions bot added the python python script changes label Mar 15, 2026
@ggerganov ggerganov left a comment

I'm not sure how this new logic interacts with the existing logic for feeding the prompt to the sampler:

```cpp
void init_sampler() const {
    common_sampler_reset(smpl.get());

    if (!task->need_sampling()) {
        return;
    }

    const int64_t t_start = ggml_time_us();

    int n_text = 0;

    for (int i = 0; i < (int) prompt.tokens.size(); i++) {
        const llama_token id = prompt.tokens[i];

        if (id != LLAMA_TOKEN_NULL) {
            common_sampler_accept(smpl.get(), id, false);
            n_text++;
        }
    }

    SLT_INF(*this, "init sampler, took %0.2f ms, tokens: text = %d, total = %d\n",
            (ggml_time_us() - t_start) / 1000.0, n_text, (int) prompt.tokens.size());
}
```

In general, the new common_params_sampling.grammar_external flag feels off. I would look to avoid it.


pwilkin commented Mar 15, 2026

In general the new common_params_sampling.grammar_external external flag feels off. I would look to avoid it.

I'd really love to avoid it, but there's a key problem here.

Together with @aldehir we figured out an algorithm to cleanly handle cases of template generation-prompt prefill: we simply add the prefill to the parser, this is reflected in the grammar, and therefore we handle it regardless of what was added by the template and what was generated by the model.

If an external json_schema is added to constrain output, it's also not a problem since the json schema is not directly converted into the grammar - instead, it's fed into the parser and only reflected in the grammar via the parser, so we handle this.

However, there's one case we can't handle: the user explicitly specifying a grammar. Because the grammar passed by the user is non-lazy and not aware of the generation-prompt prefill, it will fail exactly on the prefill part. That's why it's critical to notify the mechanism that the grammar was passed externally (and not generated within the parser engine), so that the prefill part is not added.


aldehir commented Mar 15, 2026

Hmm, this one's on me. I didn't think through the grammar impact. The grammar sampler should be limited to just the generation, otherwise it's not technically "sampling." What if we modified how the root rule is defined? This should require no changes to the grammar sampler and its inputs. Alternatively, we can leverage the grammar_root parameter that defaults to root, which might be an easier solution.
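One way the root-rule rewrite could work, sketched in Python over a GBNF-style grammar string (the rule names, escaping, and wrapping strategy are illustrative assumptions, not the actual llama.cpp grammar machinery):

```python
# Sketch of the root-rule rewrite idea over a GBNF-style grammar string.
# Rule names, escaping, and the wrapping strategy are illustrative
# assumptions, not the actual llama.cpp grammar machinery.
def wrap_user_grammar(user_grammar: str, prefill: str) -> str:
    """Make a user-supplied grammar tolerate the generation-prompt prefill
    by prepending a wrapper root that consumes it first."""
    escaped = (prefill.replace("\\", "\\\\")
                      .replace('"', '\\"')
                      .replace("\n", "\\n"))
    wrapper = f'root ::= "{escaped}" user-root\n'
    # Rename the user's root so the wrapper's root takes over.
    return wrapper + user_grammar.replace("root ::=", "user-root ::=", 1)
```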

```diff
 inputs.add_generation_prompt = true;
 inputs.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
-inputs.enable_thinking = common_chat_templates_support_enable_thinking(chat_params.tmpls.get());
+inputs.enable_thinking = chat_params.enable_thinking ? common_chat_templates_support_enable_thinking(chat_params.tmpls.get()) : false;
```
@aldehir aldehir Mar 15, 2026

Can we extract this fix to another PR? Just so we can merge it in sooner. I think this PR needs a bit more brainstorming.

Contributor Author

Ye, sure.

Contributor Author

Extracted to #20606

{%- for tool in message['tool_calls']-%}
{%- if not ns.is_first -%}
{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}
{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] | tojson + '\n' + '```' + '<|tool▁call▁end|>'}}
Contributor

I'm thinking we use the JSON representation when dumping a JSON object in the Jinja engine itself.

@pwilkin pwilkin Mar 15, 2026

Well, no, we do not, as I found out when I removed the "Python dict" support.

We use the Python dict representation.

Or, more precisely, we use double quotes if the internal string contains a single quote, and single quotes when it does not:

```cpp
// TODO: avoid circular references
std::string value_to_string_repr(const value & val) {
    if (is_val<value_string>(val)) {
        const std::string val_str = val->as_string().str();

        if (val_str.find('\'') != std::string::npos) {
            return value_to_json(val);
        } else {
            return "'" + val_str + "'";
        }
    } else {
        return val->as_repr();
    }
}
```

I would really love it to use the JSON representation as standard, but I'm worried there might be some "original Python implementation faithfulness" shenanigans. @ngxson @CISC do you guys know if we could make | tojson the official string representation for objects?

@aldehir aldehir Mar 15, 2026

An argument can be made that Jinja will respect __str__() if wrapped in a JSON type that serializes to JSON.

```python
import json
from jinja2 import Template

tmpl = Template("{{ obj }}")

python_dict = dict(a=1, b=2, c=dict(d=3, e=4))
print("Python Dict:", tmpl.render(obj=python_dict))

class JSON:
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return json.dumps(self.value)

json_obj = JSON(dict(a=1, b=2, c=dict(d=3, e=4)))
print("JSON Object:", tmpl.render(obj=json_obj))
```

```
$ ./example.py
Python Dict: {'a': 1, 'b': 2, 'c': {'d': 3, 'e': 4}}
JSON Object: {"a": 1, "b": 2, "c": {"d": 3, "e": 4}}
```

Contributor Author

I'd be really glad to change it because it would mean we don't have to change the templates and we handle all the templates that don't use | tojson on arguments - but I won't make the call myself :)

Member

That's not what's happening here though, this template does not support non-string arguments, that's all there is to it.

Contributor

Fair, I did not test with string concat, which throws an exception.

There is no winning here. Either we don't parse arguments and break templates that iterate over their values, or we do and break templates that expect a string.

@CISC CISC Mar 15, 2026

I'd be really glad to change it because it would mean we don't have to change the templates and we handle all the templates that don't use | tojson on arguments - but I won't make the call myself :)

It's not acceptable; it's not how this behaves with jinja2. Either arguments is sent as a str, or it's been parsed with json.loads, in which case it's a dict. In other words, this particular template will give you the following error:

TypeError: can only concatenate str (not "dict") to str
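The failure mode above is easy to reproduce outside Jinja; plain Python string concatenation fails the same way (render_call is a made-up stand-in for the template's concatenation, not real code from the repo):

```python
# Reproducing the failure mode outside Jinja: templates that concatenate
# `arguments` with a string only work while arguments is still a str.
def render_call(arguments):
    return "<tool_call>" + arguments + "</tool_call>"

ok = render_call('{"a": 1}')   # str arguments: concatenation works

try:
    render_call({"a": 1})      # parsed dict arguments: TypeError
    raised = False
except TypeError:
    raised = True
```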

Member

Anyway, I don't understand the issue - don't we have a capability check for this?

Contributor

capability check for this

No, the arguments are indiscriminately parsed. I don't see a capability check.

Member

capability check for this

No, the arguments are indiscriminately parsed. I don't see a capability check.

Hmmm, maybe that was minja; either way, we should probably add one.


pwilkin commented Mar 15, 2026

Hmm, this one's on me. I didn't think through the grammar impact. The grammar sampler should be limited to just the generation, otherwise it's not technically "sampling." What if we modified how the root rule is defined? This should require no changes to the grammar sampler and its inputs. Alternatively, we can leverage the grammar_root parameter that defaults to root, which might be an easier solution.

Yes, I thought of this path as well. Thing is, I'm not really sure this would be easier. We still have to differentiate the user-supplied grammars from the internally-generated grammars to know which ones we need to patch the root rule for.

Regarding the sampling, the thing is we are technically simulating the sampling of the generation prompt, that's the whole idea of this approach - to make it basically invisible to the parser / grammar whether the given fragment was generated by the template or by the model itself.


pwilkin commented Mar 15, 2026

BTW @ggerganov is the weird thingy where you have to add a space in front of certain tokens still necessary? It looks really bizarre and it seems to break idempotence (as in detokenize(tokenize(x)) != x).


aldehir commented Mar 15, 2026

I'm starting to think there isn't a clean solution here. To address the original issue, perhaps we should simply adopt the previous checks for the end thinking tag and create a custom parser for templates that deviate.

I believe even a "reasoning prefill" will face challenges when integrated with the grammar and reasoning samplers.


pwilkin commented Mar 15, 2026

There isn't a clean solution, but I quite like this one because it actually realizes the idea of treating template prefill as forced model output. If you think about it, what we're doing is not really that different from the reasoning sampler forcing the end-thinking tag - but instead of forcing the model to generate it, we just simulate the generation for the prefill.

This makes the parsing for reasoning elegant and simple and avoids all the problems with deviant / weird / atypical templates. I think one extra flag, even if a bit hackish, is an appropriate price to pay for it.


pwilkin commented Mar 15, 2026

Also, I tested this with the reasoning sampler and it works fine.

@ggerganov

BTW @ggerganov is the weird thingy where you have to add a space in front of certain tokens still necessary? It looks really bizarre and it seems to break idempotence (as in detokenize(tokenize(x)) != x).

Not sure where the spaces come from. There was something about having to apply normalize() to restore the identity of tokenize + detokenize, but I don't remember the details.

Comment on lines 256 to 286

```diff
@@ -259,7 +282,7 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
         params.reasoning_budget_end,
         params.reasoning_budget_forced,
         params.reasoning_budget_tokens,
-        params.reasoning_budget_activate_immediately ? REASONING_BUDGET_COUNTING : REASONING_BUDGET_IDLE));
+        prefill_tokens));
 }
```
Member

The main issue that I see is that this logic is technically the same as the logic here:

```cpp
void init_sampler() const {
    common_sampler_reset(smpl.get());

    if (!task->need_sampling()) {
        return;
    }

    const int64_t t_start = ggml_time_us();

    int n_text = 0;

    for (int i = 0; i < (int) prompt.tokens.size(); i++) {
        const llama_token id = prompt.tokens[i];

        if (id != LLAMA_TOKEN_NULL) {
            common_sampler_accept(smpl.get(), id, false);
            n_text++;
        }
    }

    SLT_INF(*this, "init sampler, took %0.2f ms, tokens: text = %d, total = %d\n",
            (ggml_time_us() - t_start) / 1000.0, n_text, (int) prompt.tokens.size());
}
```

Unless I am missing something, the common_sampler should work like this:

  • We can initialize an object:
    • without grammar, without reasoning budget
    • without grammar, with reasoning budget
    • with grammar, without reasoning budget
    • with grammar, with reasoning budget
  • In all cases, we need to feed any "initial" (a.k.a. "prefill") tokens to both the grammar and the reasoning budget samplers
  • So this feeding logic should become part of the common_sampler initialization
  • I don't see why we need the params.grammar_external flag - we always have to feed the tokens to the grammar and reasoning samplers, if they exist

Maybe try to remove the old prefill logic from the server and consolidate it in common_sampler_init().
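The consolidation suggested above might look roughly like this (the class and method names are hypothetical, not llama.cpp's actual common_sampler API):

```python
# Hypothetical sketch of consolidating the prefill feeding into sampler
# initialization; the class and method names are illustrative, not
# llama.cpp's actual common_sampler API.
class SamplerChain:
    def __init__(self, samplers, prefill_tokens):
        self.samplers = samplers
        # Feed the prefill tokens to every sampler at construction time, so
        # grammar and reasoning-budget samplers see the template's prefill
        # as if the model had generated it.
        for tok in prefill_tokens:
            self.accept(tok)

    def accept(self, token):
        for s in self.samplers:
            s.accept(token)
```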

Contributor Author

We need the grammar_external flag because otherwise we'll break backwards compatibility.

Some templates add things like the assistant role marker (<|im_begin|>assistant) in the prefill. User grammars generally only expect to parse what's present in the content section. The idea is to allow the new behavior of simulating the sampling of the prefill tokens while also guaranteeing backwards compatibility. This is why the server tests failed before I added the fix:

https://github.com/ggml-org/llama.cpp/actions/runs/23100821002

As for the common_sampler initialization, I'll take a look into it.

Member

User grammars generally only expect to parse what's present in the content section.

Ah yes, I forgot about that.

Contributor Author

@ggerganov okay, so I checked the flow and we can't really move this to common_sampler_init().

The way common_sampler_init() works is that it resets and reinitializes the sampler state for all samplers other than the grammar sampler. The grammar sampler is a different beast from the other samplers, since its state dependence is different: it doesn't depend on the entire prompt, but on the current message (same with the reasoning sampler). Therefore, if we wanted to do this uniformly, we'd have to do something like this:

  • modify the sampler struct to add a scope parameter (PROMPT, MESSAGE)
  • keep track of where the current message starts (non-trivial, since the template does this and templates routinely modify the message history, e.g. by removing reasoning content, so we can't just rely on differential analysis here)
  • run accept for PROMPT-scope samplers over the entire prompt, and for MESSAGE-scope samplers over the current message only

This would make the behavior cleaner, but it is a big change, probably outside the scope of this PR.
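The scope idea could be sketched like this (Scope and feed_prompt are hypothetical names, not existing llama.cpp types):

```python
from enum import Enum

# Illustrative sketch of the scope idea; Scope and feed_prompt are
# hypothetical names, not existing llama.cpp types.
class Scope(Enum):
    PROMPT = 1   # sampler conditions on the whole conversation
    MESSAGE = 2  # sampler conditions only on the current assistant message

def feed_prompt(scoped_samplers, tokens, asst_start):
    """Feed every token to PROMPT-scope samplers, but feed MESSAGE-scope
    samplers only from the start of the current assistant message."""
    for i, tok in enumerate(tokens):
        for scope, sampler in scoped_samplers:
            if scope is Scope.PROMPT or i >= asst_start:
                sampler.accept(tok)
```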

Member

I think it's much simpler than that.

Each sampler has an accept() call that consumes a token. Generally we want to feed all tokens from the current context into the samplers. Hence the loop that I referenced in the server. The grammar sampler is the only exception - the false argument to common_sampler_accept() makes it so that we don't feed the grammar sampler with the "prefill" tokens.

So I am thinking that the newly added logic in common_sampler_init() for collecting "prefill" tokens and passing them to the reasoning sampler is not necessary. The reasoning sampler should accept all of its tokens through common_sampler_accept() - same as all the other samplers (except the grammar).

Most likely the common_reasoning_budget_init() needs to be updated to not accept prefill tokens at all.

@pwilkin pwilkin Mar 16, 2026

@ggerganov We definitely do want to pass prefill tokens to the reasoning sampler, the same way we pass them to the grammar sampler, simply because the two samplers operate on the same principle: their context is the single assistant message being generated, not the entire conversation (prompt).

However, I now finally understand why the entire flow is so confusing. When a completion task gets assigned to a slot, this happens:
-> the task gets assigned to a slot
-> the chat template gets applied, processing the message history into a tokenizable prompt and adding assistant prefill tokens
-> the prompt (with prefill tokens) gets tokenized
-> the tokenized prompt is passed to the budget and reasoning samplers together with the prefill part; those samplers are initialized with just the tokens of the assistant prefill
-> the model processes the prompt
-> now that generation is about to start, all the other samplers get reset and fed the entire tokenized prompt together with the assistant prefill
-> all samplers - the reasoning+grammar ones initialized earlier and the generation ones initialized now - start working on the outputs generated by the model

It's really counterintuitive that the sampler reset + initialization is done on generation start, when it could just as well be done during init (right now it doesn't matter that much, but if we had backend samplers, they could theoretically run the prefill / accept phase in parallel with the model doing prompt processing).

However, my point still stands - we can't really equate the two prefills, because they differ in what they understand as "context". For the normal samplers, the entire prompt is the context - the entire message history together with the assistant prefill, before the generation phase. For the reasoning and grammar samplers, only the assistant prefill (the current assistant message) is the context. We could init them in the init_sampler() stage and have a unified prefill stage, but only if we mark where the assistant message starts in the prompt and the samplers themselves are aware of whether they require full context or just assistant-prefill context.

Contributor Author

In other words, the loop inside init_sampler() would become:

```cpp
for (int i = 0; i < (int) prompt.tokens.size(); i++) {
    const llama_token id = prompt.tokens[i];

    if (id != LLAMA_TOKEN_NULL) {
        common_sampler_accept(smpl.get(), id, i >= asst_prefill_start_pos);
        n_text++;
    }
}
```

where asst_prefill_start_pos would be calculated during task init as the position within the prompt where the assistant prefill tokens start, the third parameter to common_sampler_accept would change its meaning from accept_grammar to is_assistant_prefill_token, and the samplers themselves would know whether they require full context or just assistant-prefill context. Then we could get rid of the loop in common_sampler_init :)

@pwilkin pwilkin force-pushed the reasoning-prefill branch from 8df64b2 to eab5b46 Compare March 16, 2026 13:41