server : support preserving reasoning_content in assistant message #18994

Merged
pwilkin merged 5 commits into ggml-org:master from ngxson:xsn/reasoning_content_input on Jan 22, 2026

Conversation

@ngxson (Collaborator) commented Jan 21, 2026

Ref: #18936 (comment)

Changes included in this PR

  • use json_fwd in chat.h to avoid the template trick
  • deduplicate code between common_chat_msgs_to_json_oaicompat and common_chat_msg::to_json_oaicompat()
  • force clear_thinking = false for GLM 4.7 if it is not specified
  • report supports_preserve_reasoning via the server /props endpoint

(Web UI support is TBD)

Changes in API

The /chat/completions API now accepts reasoning_content for assistant messages:

{
  "messages": [
    {
      "content": "Hello, world!",
      "role": "user"
    },
    {
      "content": "Hey there!",
      "role": "assistant",
      "reasoning_content": "This is my reasoning."
    },
    {
      "content": "Hello, world!",
      "role": "user"
    }
  ],
  "stream": false,
  "max_tokens": 64
}

If the template supports it, the reasoning will be put back into the formatted message (tested with GLM 4.7):

[gMASK]<sop><|user|>Hello, world!<|assistant|><think>This is my reasoning.</think>Hey there!<|user|>Hello, world!<|assistant|><think>

Otherwise, it will be ignored.

To know whether the template supports it, the /props endpoint will indicate:

{
  "chat_template_caps": {
    ...
    "supports_preserve_reasoning": true,
    ...
  }
}
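
A client can therefore probe /props first and only attach reasoning_content when the template will actually use it. Below is a minimal sketch in Python using the requests library; the server address and variable names are illustrative, not part of this PR:

import requests

BASE_URL = "http://localhost:8080"  # assumed llama-server address, adjust as needed

# ask the server whether the loaded chat template can re-insert reasoning
props = requests.get(f"{BASE_URL}/props").json()
caps = props.get("chat_template_caps", {})
preserve_reasoning = caps.get("supports_preserve_reasoning", False)

assistant_msg = {"role": "assistant", "content": "Hey there!"}
if preserve_reasoning:
    # only send the reasoning trace back if the template will use it
    assistant_msg["reasoning_content"] = "This is my reasoning."

payload = {
    "messages": [
        {"role": "user", "content": "Hello, world!"},
        assistant_msg,
        {"role": "user", "content": "Hello, world!"},
    ],
    "stream": False,
    "max_tokens": 64,
}
response = requests.post(f"{BASE_URL}/chat/completions", json=payload).json()
print(response["choices"][0]["message"]["content"])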

@pwilkin (Collaborator) left a comment

Just as a general notion: I am not a fan of splitting reasoning handling into "enable_reasoning", "clear_thinking" and the passive "supports_preserve_reasoning". I think this is a bit messy. I don't have a clear idea of how to handle this yet, but I guess we should (a) detect whether the model supports reasoning, (b) enable reasoning by default if it does, (c) pass reasoning traces if the template supports it, and (d) accept explicit overrides, though I'm not sure whether the explicit overrides are something we should handle at the level of flags or just allow passing them in template_kwargs.

#include "log.h"
#include "regex-partial.h"

// #include <minja/chat-template.hpp>

Should just remove those at this point, we're not going back to Minja.

common/chat.cpp Outdated
}
// std::vector<common_chat_msg> common_chat_msgs_parse_oaicompat(const std::string & messages) {
// return common_chat_msgs_parse_oaicompat(json::parse(messages));
// }

Likewise, I'd just remove this. The code files are littered with comments like this that are left and then never removed.

common/chat.cpp Outdated
}
// std::vector<common_chat_tool> common_chat_tools_parse_oaicompat(const std::string & tools) {
// return common_chat_tools_parse_oaicompat(json::parse(tools));
// }

Ditto.

// TODO @ngxson : no known chat templates support reasoning_content in content parts yet
// this can be useful for models with interleaved thinking (like Kimi-K2)
// if you see any templates explicitly support this, please ping me
// std::string reasoning_content;

I guess you could argue that GPT-OSS does, but I don't know if anyone properly supports that.

Comment on lines +251 to +255
{
{"role", "assistant"},
{"content", "Assistant message"},
{"reasoning_content", "Reasoning content"}
},

Might need a couple more capability checks: "thinking" at the message level (gpt-oss) and "type": "thinking" in content parts (ministral 3).

The current logic for these models transforms reasoning_content to their expected field at init.
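
For reference, the two shapes being described would look roughly like this (a sketch only; the gpt-oss shape follows the template quoted below, while the inner fields of the ministral-style content part are an assumption and should be checked against the actual template):

# message-level "thinking" field (gpt-oss-style template input)
gpt_oss_msg = {
    "role": "assistant",
    "content": "Assistant message",
    "thinking": "Reasoning content",
}

# "type": "thinking" content part (ministral 3-style template input);
# the inner field name here is a guess, not taken from this PR
ministral_msg = {
    "role": "assistant",
    "content": [
        {"type": "thinking", "thinking": "Reasoning content"},
        {"type": "text", "text": "Assistant message"},
    ],
}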

@ngxson (author) replied:

for gpt-oss, it seems like reasoning is only allowed to be added if add_generation_prompt = false, so it's not usable in the llama.cpp use case, I think:

{%- elif loop.last and not add_generation_prompt %}
    {#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
    {#- This is a situation that should only occur in training, never in inference. #}
    {%- if "thinking" in message %}
        {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
    {%- endif %}

Collaborator replied:

Line 293:

            {%- elif message.thinking and not future_final_message.found %}
                {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
            {%- endif %}

@ngxson (author) commented Jan 21, 2026

Just as a general notion: I am not a fan of splitting reasoning handling into "enable_reasoning", "clear_thinking" and the passive "supports_preserve_reasoning"

@pwilkin I'm not splitting them, but they are indeed different notions:

  • enable_reasoning: I think you mean enable_thinking. This flag controls adding a trailing </think> to the formatted chat; it does not overlap with supports_preserve_reasoning (one is user-controlled, the other is read-only). For example, I can enable thinking for older messages in the conversation, then for the next message put the reasoning_content back while disabling enable_thinking; this forces the model to read the reasoning from the earlier message in the conversation (see the sketch after this list).
  • supports_preserve_reasoning: as explained above. However, this is NOT a flag that you can enable or disable; it is simply an indication of whether putting reasoning_content back into the history is accepted by the template.
  • clear_thinking: this is not a llama.cpp notion; it is only mentioned here because the GLM 4.7 template has it. Other models may use different names for it.
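
As a sketch of the scenario in the first point: the reasoning from the earlier assistant turn is passed back in the history while thinking is disabled for the new turn. The chat_template_kwargs field and the enable_thinking kwarg are assumptions here (some templates honor enable_thinking, others do not), so treat this as illustrative only:

payload = {
    "messages": [
        {"role": "user", "content": "Hello, world!"},
        {
            "role": "assistant",
            "content": "Hey there!",
            # reasoning from the earlier turn, preserved in the history
            "reasoning_content": "This is my reasoning.",
        },
        {"role": "user", "content": "Hello, world!"},
    ],
    # assumption: tell the template not to open a new thinking block, so the
    # model has to rely on the reasoning already present in the history
    "chat_template_kwargs": {"enable_thinking": False},
    "max_tokens": 64,
}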

(a) detect whether model supports reasoning (b) enable reasoning by default if it does (c) pass reasoning traces if the template supports it (d) accept explicit overrides

  • (a) Hmm, could you point me to the code where we detect if a model supports reasoning?
  • (b) Aren't we already enabling reasoning by default if the model supports it?
  • (c) You mean reasoning traces parsing (enable_thinking) or preserving reasoning trace inside history (supports_preserve_reasoning)?
  • (d) I think it's what this PR is made to do

Edit: I think this PR already provides the four points (a), (b), (c), (d) that you brought up.

@pwilkin (Collaborator) commented Jan 21, 2026

@ngxson yeah, you're right. I was somehow confused that we're already passing the reasoning_content to the template.

@aldehir (Collaborator) commented Jan 21, 2026

The API supports it, but the WebUI does not. I assume this is setting up the foundation to add first-class support in the WebUI.

By support, I mean it'll pass the reasoning in the message objects fed to the template.

@pwilkin merged commit 51fa458 into ggml-org:master on Jan 22, 2026. 78 checks passed.
ronaldmannak pushed a commit to PicoMLX/llama.cpp that referenced this pull request on Jan 24, 2026
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request on Feb 6, 2026

Labels

examples, jinja parser (Issues related to the jinja parser), server, testing (Everything test related)

3 participants