common/parser: add proper reasoning tag prefill reading #20424
pwilkin wants to merge 9 commits into ggml-org:master
Conversation
|
Dumb question, why not find the start of the assistant message and prepend that? I agree it would be easier to parse if we had a "prefill" of some sort that normalizes the input, such that we can handle the logic in the grammar and not through flags. However, if we're going this route I would look into prepending the start of the entire assistant message. This will also open the door for parsing output from requests with an assistant prefill. |
|
Yeah, that would be the logical conclusion, but for now it's easier for me just to extract the reasoning markers since finding the actual start of the assistant message is nontrivial. |
|
Qwen3.5 uses a different reasoning marker, however. It probably doesn't matter for this model, but it is technically not adhering to the template. |
Maybe set add_generation_prompt? |
Run the template once with add_generation_prompt and once without, then diff the two outputs to extract the generation prompt. |
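If I read the suggestion right, the differential extraction could look roughly like this (toy Python sketch; extract_generation_prompt and fake_render are illustrative names, not llama.cpp APIs):

```python
# Hypothetical sketch of the "differential" extraction: render the chat
# template twice and keep the suffix that only appears when the generation
# prompt is added.
def extract_generation_prompt(render):
    # `render` is a callable: render(add_generation_prompt=bool) -> str
    without = render(add_generation_prompt=False)
    with_gp = render(add_generation_prompt=True)
    if with_gp.startswith(without):
        return with_gp[len(without):]
    return ""  # the template diverged in an unexpected way

def fake_render(add_generation_prompt):
    base = "<|user|>hi<|end|>"
    return base + ("<|assistant|><think>" if add_generation_prompt else "")

print(extract_generation_prompt(fake_render))  # -> <|assistant|><think>
```

The nice property is that this works for any template, whether or not its generation prompt happens to contain a reasoning marker.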
|
That usually works, yeah 😀 I can try that and see what the results are. |
|
Nice patch! With model https://huggingface.co/mradermacher/Qwen3.5-40B-Claude-4.5-Opus-High-Reasoning-Thinking-GGUF the patches fix the webui getting confused on /think and failing to split the reasoning and generation parts correctly. Build llama.cpp-cuda-git-b8334.r9.710878a7dd-1. |
Force-pushed 3bfb08f to 4083259
|
@aldehir changed the prefill extraction behavior to the differential one you mentioned. |
common/chat.h (outdated):
    std::string grammar;
    bool grammar_lazy = false;
    bool thinking_forced_open = false;
    std::string prefill;
Do you think we should name this generation_prompt? It lines up with the add_generation_prompt flag.
bool clear_reasoning_start = false;
if (inputs.reasoning_format != COMMON_REASONING_FORMAT_NONE &&
    autoparser.reasoning.mode != reasoning_mode::NONE &&
    !autoparser.reasoning.end.empty()) {
    const auto & r_start = autoparser.reasoning.start;
    const auto & r_end = autoparser.reasoning.end;
    auto r_end_t = trim_trailing_whitespace(r_end);
    auto r_start_t = trim_trailing_whitespace(r_start);

    if (!r_start_t.empty()) {
        auto start_pos = prompt_to_search.rfind(r_start_t);
        if (start_pos != std::string::npos) {
            std::string from_start = prompt_to_search.substr(start_pos);
            auto fs_trimmed = trim_trailing_whitespace(from_start);

            if (string_ends_with(fs_trimmed, r_end_t)) {
                data.prefill = r_start + r_end;
            } else if (string_ends_with(fs_trimmed, r_start_t)) {
                data.prefill = from_start;
            } else {
                clear_reasoning_start = true;
            }
        }
    }
}
So my understanding is: we have a generation prompt G, and we can create a parser that accepts G[0:min(G.size(), G.index_of(reasoning_start))] + (reasoning_start + reasoning + reasoning_end)? + .... Then we can do away with all the trim logic.
The benefit is that now the parser can properly parse assistant prefill from the user, since the parser starts from the beginning of the assistant message.
I see that Mistral's templates have no generation prompt, so G = "". But this is fine, because the model emits the [THINK] tag. So the above still works.
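The prefix computation can be sketched like this (toy Python; parser_prefix is a hypothetical name): take G up to the first occurrence of the reasoning-start marker if present, otherwise all of G, and let the grammar make the reasoning block optional after it.

```python
def parser_prefix(g, reasoning_start):
    # Toy sketch: the parser's fixed prefix is G up to the reasoning-start
    # marker if the marker occurs in G, otherwise all of G.
    idx = g.find(reasoning_start)
    return g if idx == -1 else g[:idx]

# Typical template: the generation prompt ends with the reasoning-start tag.
print(parser_prefix("<|im_start|>assistant\n<think>", "<think>"))
# Mistral-style template with no generation prompt: G = "".
print(parser_prefix("", "[THINK]"))
```

In the Mistral case the prefix is empty, and the optional (reasoning_start + reasoning + reasoning_end)? group still matches the [THINK] tag the model emits itself.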
This workaround is mostly for Apriel, which has a delimited thinking format and inserts a header like "Thinking chain starts here: " as the generation prompt, which acts as a quasi-reasoning marker that we want to strip.
|
@aldehir okay, that rewrite ended up being a bit bigger than I expected... but it's exactly the algorithm you mentioned now. |
|
Oh jeez, well it's <100 net LOC. I'll give it a whirl. |
|
@aldehir happy to report I added another nice piece of code to make it work correctly with grammars / schemas :) |
Force-pushed d0cf846 to a21d219
ggerganov left a comment:
I'm not sure how this new logic interacts with the existing logic for feeding the prompt to the sampler:
llama.cpp/tools/server/server-context.cpp
Lines 198 to 220 in 07c6a59
In general, the new common_params_sampling.grammar_external flag feels off. I would look to avoid it.
I'd really love to avoid it, but there's a key problem here. Together with @aldehir we figured out an algorithm to cleanly handle cases of template generation-prompt prefill: we simply add the prefill to the parser, this is reflected in the grammar, and therefore we handle it regardless of what was added by the template and what was generated by the model. However, there's one case we can't handle: the user explicitly specifying a grammar. Because the grammar passed by the user is non-lazy and not aware of the generation-prompt prefill, it will fail exactly on the prefill part, which is why it's critical to notify the mechanism that it's an externally passed grammar (and not one generated within the parser engine) so that the prefill part is not added. |
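To illustrate the backwards-compatibility problem with a toy example (hypothetical; a user "grammar" reduced to a regex over the final text): a user-supplied grammar describes only the generated content, so feeding it the template prefill makes it reject immediately.

```python
import re

# Toy stand-in for a user-supplied grammar that accepts only digits.
def grammar_accepts(text):
    return re.fullmatch(r"[0-9]+", text) is not None

prefill = "<|im_start|>assistant\n<think></think>"  # added by the template
generation = "42"                                   # produced by the model

print(grammar_accepts(generation))            # -> True: what the user meant
print(grammar_accepts(prefill + generation))  # -> False: prefill breaks it
```

An internally generated grammar can have the prefill baked into its root rule, but an external one cannot, hence the flag.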
|
Hmm, this is on me. I didn't think through the grammar impact. The grammar sampler should be limited to just the generation, otherwise it's not technically "sampling." What if we modified how the root rule is defined? This would require no changes to the grammar sampler and its inputs. |
      inputs.add_generation_prompt = true;
      inputs.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
-     inputs.enable_thinking = common_chat_templates_support_enable_thinking(chat_params.tmpls.get());
+     inputs.enable_thinking = chat_params.enable_thinking ? common_chat_templates_support_enable_thinking(chat_params.tmpls.get()) : false;
Can we extract this fix to another PR? Just so we can merge it in sooner. I think this PR needs a bit more brainstorming.
      {%- for tool in message['tool_calls'] -%}
      {%- if not ns.is_first -%}
-     {{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}
+     {{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] | tojson + '\n' + '```' + '<|tool▁call▁end|>'}}
I'm thinking we use the JSON representation when dumping a JSON object in the Jinja engine itself.
Well, no, we do not, as I found out when I removed the "Python dict" support.
We use the Python dict representation.
Or, more precisely, we use double quotes if the internal string contains a single quote and single quotes when it does not:
// TODO: avoid circular references
std::string value_to_string_repr(const value & val) {
    if (is_val<value_string>(val)) {
        const std::string val_str = val->as_string().str();
        if (val_str.find('\'') != std::string::npos) {
            return value_to_json(val);
        } else {
            return "'" + val_str + "'";
        }
    } else {
        return val->as_repr();
    }
}

I would really love it to use the JSON representation as standard, but I'm worried there might be some "original Python implementation faithfulness" shenanigans. @ngxson @CISC do you guys know if we could make | tojson the official string representation for objects?
An argument can be made that Jinja will respect __str__() if wrapped in a JSON type that serializes to JSON.
import json
from jinja2 import Template

tmpl = Template("{{ obj }}")

python_dict = dict(a=1, b=2, c=dict(d=3, e=4))
print("Python Dict:", tmpl.render(obj=python_dict))

class JSON:
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return json.dumps(self.value)

json_obj = JSON(dict(a=1, b=2, c=dict(d=3, e=4)))
print("JSON Object:", tmpl.render(obj=json_obj))

$ ./example.py
Python Dict: {'a': 1, 'b': 2, 'c': {'d': 3, 'e': 4}}
JSON Object: {"a": 1, "b": 2, "c": {"d": 3, "e": 4}}
I'd be really glad to change it because it would mean we don't have to change the templates and we handle all the templates that don't use | tojson on arguments - but I won't make the call myself :)
That's not what's happening here though, this template does not support non-string arguments, that's all there is to it.
Fair, I did not test with string concat, which throws an exception.
There is no winning here. Either we don't parse arguments and break templates that iterate over their values, or we do, and break templates that expect a string.
I'd be really glad to change it because it would mean we don't have to change the templates and we handle all the templates that don't use | tojson on arguments - but I won't make the call myself :)
It's not acceptable; that's not how this behaves with jinja2. Either arguments is sent as a str, or it's been parsed with json.loads, in which case it's a dict. In other words, this particular template will give you the following error:
TypeError: can only concatenate str (not "dict") to str
Anyway, I don't understand the issue, don't we have a capability check for this?
capability check for this
No, the arguments are indiscriminately parsed. I don't see a capability check.
capability check for this
No, the arguments are indiscriminately parsed. I don't see a capability check.
Hmmm, maybe that was minja. Either way, we should probably add one.
Yes, I thought of this path as well. Thing is, I'm not really sure this would be easier. We still have to differentiate the user-supplied grammars from the internally-generated grammars to know which ones we need to patch the root rule for. Regarding the sampling, the thing is we are technically simulating the sampling of the generation prompt, that's the whole idea of this approach - to make it basically invisible to the parser / grammar whether the given fragment was generated by the template or by the model itself. |
|
BTW @ggerganov is the weird thingy where you have to add a space in front of certain tokens still necessary? It looks really bizarre and it seems to break idempotence. |
|
I'm starting to think there isn't a clean solution here. To address the original issue, perhaps we should simply adopt the previous checks for the end thinking tag and create a custom parser for templates that deviate. I believe even a "reasoning prefill" will have challenges when incorporating with the grammar and reasoning samplers. |
|
There isn't a clean solution, but I quite like this one because it actually realizes the idea of treating template prefill as forced model output. If you think about it, what we're doing is not really that different from the reasoning sampler forcing the end-thinking tag - but instead of forcing the model to generate it, we just simulate the generation for the prefill. This makes the parsing for reasoning elegant and simple and avoids all problems with deviant / weird / atypical templates. I think having one extra flag, even if a bit hackish, is an appropriate price to pay for it. |
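The "simulated generation" idea can be sketched as follows (toy Python; TagReasoningParser is a hypothetical illustration, not the actual common/parser code): the template prefill is fed through the same incremental parser as the model output, so the parser cannot tell whether the <think> tag came from the template or from the model.

```python
class TagReasoningParser:
    # Toy incremental parser for <think>...</think> tagged reasoning.
    def __init__(self, start="<think>", end="</think>"):
        self.start, self.end = start, end
        self.buf = ""

    def feed(self, text):
        self.buf += text

    def result(self):
        s = self.buf.find(self.start)
        e = self.buf.find(self.end)
        if s != -1 and e != -1:
            reasoning = self.buf[s + len(self.start):e]
            content = self.buf[e + len(self.end):]
            return reasoning, content
        return None, self.buf

parser = TagReasoningParser()
parser.feed("<think>")            # simulated: prefill emitted by the template
parser.feed("plan the answer")    # generated by the model
parser.feed("</think>hello")      # generated by the model
print(parser.result())            # -> ('plan the answer', 'hello')
```

Because the prefill goes through the same feed() path as generated text, no special-case flags are needed inside the parser itself.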
|
Also, I tested this with the reasoning sampler and it works fine. |
Not sure where the spaces come from. |
@@ -259,7 +282,7 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, st
                 params.reasoning_budget_end,
                 params.reasoning_budget_forced,
                 params.reasoning_budget_tokens,
-                params.reasoning_budget_activate_immediately ? REASONING_BUDGET_COUNTING : REASONING_BUDGET_IDLE));
+                prefill_tokens));
     }
The main issue that I see is that this logic is technically the same as the logic here:
llama.cpp/tools/server/server-context.cpp
Lines 198 to 220 in d393649
Unless I am missing something, the common_sampler should work like this:
- We can initialize an object:
  - without grammar, without reasoning budget
  - without grammar, with reasoning budget
  - with grammar, without reasoning budget
  - with grammar, with reasoning budget
- In all cases, we need to feed any "initial" (a.k.a. "prefill") tokens to both the grammar and the reasoning budget samplers
- So this feeding logic should become part of the common_sampler initialization
- I don't see why we need the params.grammar_external flag - we always have to feed the tokens to the grammar and reasoning samplers, if they exist
Maybe try to remove the old prefill logic from the server and consolidate it in common_sampler_init().
We need the grammar_external flag because otherwise we'll break backwards compatibility.
Some templates add stuff like the assistant role marker (<|im_begin|>assistant) in the prefill. User grammars generally only expect to parse what's present in the content section. The idea is to allow the new behavior of simulating the sampling of the prefill tokens while also guaranteeing backwards compatibility. This is why the server tests failed before I added the fix:
https://github.com/ggml-org/llama.cpp/actions/runs/23100821002
As for the common_sampler initialization, I'll take a look into it.
User grammars generally only expect to parse what's present in the content section.
Ah yes, I forgot about that.
@ggerganov okay, so I checked the flow and we can't really move this to common_sampler_init().
The way common_sampler_init() works is it resets and reinitializes the sampler state for all samplers other than the grammar sampler. The grammar sampler is a different beast from the other samplers since its state dependence is different - it doesn't depend on the entire prompt, but on the current message (same with the reasoning sampler). Therefore, if we wanted to do this uniformly, we'd have to do something like this:
- modify the sampler struct to add a scope parameter (PROMPT, MESSAGE)
- keep track of where the current message starts (non-trivial, since the template does this and templates routinely modify the message history, e.g. by removing reasoning content, so we can't just rely on differential analysis here)
- run accept for PROMPT-scope samplers for the entire prompt and for MESSAGE-scope samplers for the current message only
This would make the behavior cleaner, but is a big change probably outside the scope of this PR.
I think it's much simpler than that.
Each sampler has an accept() call that consumes a token. Generally we want to feed all tokens from the current context into the samplers. Hence the loop that I referenced in the server. The grammar sampler is the only exception - the false argument to common_sampler_accept() makes it so that we don't feed the grammar sampler with the "prefill" tokens.
So I am thinking that the newly added logic in common_sampler_init() for collecting "prefill" tokens and passing it to the reasoning sampler is not necessary. The reasoning sampler should accept all of its token through common_sampler_accept() - same as all other samplers (except the grammar).
Most likely the common_reasoning_budget_init() needs to be updated to not accept prefill tokens at all.
@ggerganov We definitely do want to pass prefill tokens to the reasoning sampler, the same way we pass it to the grammar sampler, simply because the two samplers operate on the same principle - their context is the single assistant message that is being generated, not the entire conversation (prompt).
However, I now finally understand why the entire flow is so confusing. So when a completion task gets assigned to a slot, this happens:
-> task gets assigned to a slot
-> chat template gets applied, processing the message history to a tokenizable prompt and adding assistant prefill tokens
-> the prompt (with prefill tokens) gets tokenized
-> the tokenized prompt is passed to the reasoning-budget and grammar samplers together with the prefill part; those samplers are initialized with just the tokens of the assistant prefill
-> model processes the prompt
-> now that generation is about to start, all the other samplers get reset and fed the entire tokenized prompt together with the assistant prefill
-> all samplers - the reasoning+grammar ones initialized earlier and the generation ones initialized now - start working on the outputs generated by the model
It's really counterintuitive why the sampler reset + initialization is done on generation start, when it could just as well be done during init (right now it doesn't matter that much, but if we have backend samplers, they could theoretically run the prefill / accept phase in parallel with the model doing prompt processing).
However, my point still stands - we can't really equate the two prefills because they differ in what they understand as "context". For the normal samplers, the entire prompt is the context - the entire message history together with the assistant prefill, before the generation phase. For the reasoning and grammar samplers, only the assistant prefill (the current assistant message) is the context. We could init them in the init_sampler() stage and have a unified prefill stage, but only if we mark where the assistant message starts in the prompt and the samplers themselves are aware whether they require full context or just assistant prefill context.
In other words, the loop inside init_sampler() would become:

for (int i = 0; i < (int) prompt.tokens.size(); i++) {
    const llama_token id = prompt.tokens[i];
    if (id != LLAMA_TOKEN_NULL) {
        common_sampler_accept(smpl.get(), id, i >= asst_prefill_start_pos);
        n_text++;
    }
}

where asst_prefill_start_pos would be calculated during task init as the position within the prompt where the assistant prefill tokens start, the third parameter to common_sampler_accept would change its meaning from accept_grammar to is_assistant_prefill_token, and the samplers would know whether they require full context or just assistant-prefill context. Then we could get rid of the loop in common_sampler_init :)
…art to common/reasoning-budget.cpp
Force-pushed 8df64b2 to eab5b46
This changes the erroneous behavior of the autoparser that ascribed fixed thinking behavior to templates. As people rightly mentioned, some models have dynamic or hybrid reasoning - they can reason or not depending on some switches, and even the template behavior can change because of this (e.g. inserting <think> in the assistant prefill after a "no_think" appears in a user message).

Therefore, the FORCED_OPEN and FORCED_CLOSED formats are gone (DELIMITER is also deleted, since it's just a special case with an empty opening marker). The parser will now just detect models with tagged reasoning, i.e. an opening and closing reasoning marker. However, it will check the assistant prefill for those markers and append them to the input for the grammar and the parser, so that they are taken into account. This simplifies the parsing mechanism, since it no longer has to differentiate whether the <think> was added by the template or generated by the model.