UPSTREAM PR #18994: server : support preserving reasoning_content in assistant message #992

Open

loci-dev wants to merge 5 commits into main from upstream-PR18994-branch_ngxson-xsn/reasoning_content_input

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18994

Ref: ggml-org/llama.cpp#18936 (comment)

Changes included in this PR

  • use json_fwd in chat.h to avoid the template trick (see the sketch below for the general pattern)
  • deduplicate code between common_chat_msgs_to_json_oaicompat and common_chat_msg::to_json_oaicompat()
  • force clear_thinking = false for GLM 4.7 if it is not specified
  • report supports_preserve_reasoning via the server /props endpoint

(Web UI support is TBD)
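
For context, the json_fwd change follows the standard nlohmann pattern: a header that only needs to name nlohmann::json can include <nlohmann/json_fwd.hpp> instead of hiding the type behind a template parameter. A minimal sketch of the pattern (names are illustrative, not the actual chat.h contents):

// chat_example.h (illustrative only, not the real chat.h)
#pragma once

#include <nlohmann/json_fwd.hpp>  // forward-declares nlohmann::json; far cheaper than json.hpp
#include <string>

struct chat_msg_example {
    std::string role;
    std::string content;
    std::string reasoning_content;

    // Declaring a member function with an incomplete return type is fine;
    // only the .cpp that defines it needs the full <nlohmann/json.hpp>.
    nlohmann::json to_json_oaicompat() const;
};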

Changes in API

The /chat/completions API now accepts a reasoning_content field in assistant messages:

{
  "messages": [
    {
      "content": "Hello, world!",
      "role": "user"
    },
    {
      "content": "Hey there!",
      "role": "assistant",
      "reasoning_content": "This is my reasoning."
    },
    {
      "content": "Hello, world!",
      "role": "user"
    }
  ],
  "stream": false,
  "max_tokens": 64
}

If the template supports it, the reasoning will be put back into the rendered prompt (tested with GLM 4.7):

[gMASK]<sop><|user|>Hello, world!<|assistant|><think>This is my reasoning.</think>Hey there!<|user|>Hello, world!<|assistant|><think>

Otherwise, it will be ignored.
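
For illustration, the request above can be sent with cpp-httplib and nlohmann::json (both already used by the llama.cpp server); the host, port, and error handling below are assumptions for a locally running llama-server, not part of the PR:

#include <httplib.h>
#include <nlohmann/json.hpp>

#include <iostream>

int main() {
    // Assumes llama-server is listening locally; adjust host/port as needed.
    httplib::Client cli("localhost", 8080);

    const nlohmann::json body = {
        {"messages", {
            {{"role", "user"},      {"content", "Hello, world!"}},
            {{"role", "assistant"}, {"content", "Hey there!"},
                                    {"reasoning_content", "This is my reasoning."}},
            {{"role", "user"},      {"content", "Hello, world!"}},
        }},
        {"stream", false},
        {"max_tokens", 64},
    };

    auto res = cli.Post("/chat/completions", body.dump(), "application/json");
    if (res && res->status == 200) {
        std::cout << nlohmann::json::parse(res->body).dump(2) << std::endl;
    } else {
        std::cerr << "request failed" << std::endl;
    }
    return 0;
}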

To know whether the template supports it, the /props endpoint will indicate:

{
  "chat_template_caps": {
    ...
    "supports_preserve_reasoning": true,
    ...
  }
}
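
A client can read this flag before deciding whether to echo reasoning_content back into the conversation history. A minimal sketch under the same assumptions as above (the field names come from the example, everything else is illustrative):

#include <httplib.h>
#include <nlohmann/json.hpp>

// Returns true only if the loaded chat template reports the capability;
// any transport or parse failure is treated as "not supported".
static bool server_preserves_reasoning(httplib::Client & cli) {
    auto res = cli.Get("/props");
    if (!res || res->status != 200) {
        return false;
    }
    const auto props = nlohmann::json::parse(res->body, /*cb=*/nullptr, /*allow_exceptions=*/false);
    if (!props.is_object()) {
        return false;
    }
    return props.value("chat_template_caps", nlohmann::json::object())
                .value("supports_preserve_reasoning", false);
}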

@loci-review

loci-review bot commented Jan 21, 2026

Performance Review Report: llama.cpp Version Comparison

Impact Classification: Minor Impact

Total Functions Analyzed: 12 functions across 2 binaries (llama-tts, llama-cvector-generator)

Executive Summary

Performance analysis reveals no meaningful impact on LLM inference. All 12 functions are non-critical utility operations (logging, templates, argument parsing) that execute during initialization, not in inference loops. Changes stem primarily from compiler optimization differences, with one intentional maintainability refactoring.

Key Findings

Commit Context (3 commits by Xuan Son Nguyen):

  • Added template capability reporting to web UI
  • Implemented reasoning_content support for reasoning-capable models
  • Documentation updates

Most Impacted Function:

  • jinja::caps::to_string() (both binaries): +8,265ns response time (+2,100%)
  • Cause: Intentional refactoring from hardcoded output to map-based iteration
  • Justification: Enables code reuse for server capability reporting; executes once per template load
  • Impact: Negligible (8.3μs overhead in initialization, not inference)
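
The shape of that refactor (a single map of capability flags consumed by both the debug printer and the server capability report) is roughly the following; the struct, field, and function names here are illustrative, not the actual jinja::caps code:

#include <map>
#include <string>

// Hypothetical capability flags, mirroring the pattern described above.
struct template_caps_example {
    bool supports_tools              = false;
    bool supports_system_role        = true;
    bool supports_preserve_reasoning = false;

    // Single source of truth that both to_string() and /props reporting can iterate.
    std::map<std::string, bool> to_map() const {
        return {
            {"supports_tools",              supports_tools},
            {"supports_system_role",        supports_system_role},
            {"supports_preserve_reasoning", supports_preserve_reasoning},
        };
    }

    // Debug-only pretty printer: iterates the map instead of hardcoding each flag.
    std::string to_string() const {
        std::string out;
        for (const auto & [name, enabled] : to_map()) {
            out += name + ": " + (enabled ? "true" : "false") + "\n";
        }
        return out;
    }
};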

STL Functions (9 functions):

  • Response time changes: -183ns to +203ns per call
  • Throughput improvements: +88% to +306% in most cases
  • Cause: Compiler optimization strategy differences (inlining, instruction scheduling)
  • Impact: Negligible (sub-microsecond changes in infrequent operations)

JSON Library (1 function):

  • nlohmann::json::iteration_proxy::end(): +111ns response time, +163% throughput
  • Cause: Compiler optimization improvements
  • Impact: Negligible (configuration parsing only)

Performance-Critical Assessment

None of the 12 functions are performance-critical. llama.cpp's bottlenecks are:

  • Matrix multiplication (70-90% of inference time)
  • Attention computation
  • KV cache operations
  • Quantization kernels

All analyzed functions execute outside these hot paths.

Power Consumption & GPU Impact

Power Impact: Negligible (<0.01% of total system power)
GPU Operations: Zero GPU-related changes detected
Inference Metrics: No impact on tokens/second, TTFT, or batch throughput

Conclusion

The performance changes are appropriate and justified. The jinja refactoring prioritizes maintainability over micro-optimization in non-critical code. Compiler-driven STL optimizations show expected variance between builds. No optimization action required.
See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

@loci-dev force-pushed the main branch 3 times, most recently from 095e526 to db6cb7a on January 21, 2026 19:15
@loci-review

loci-review bot commented Jan 21, 2026

Performance Review Report: llama.cpp Reasoning Content Support

Executive Summary

Analysis of 15 functions across llama-tts and llama-cvector-generator binaries reveals moderate performance impact from architectural changes supporting reasoning-focused LLMs (DeepSeek R1, OpenAI o1). The largest regression (+8,268 ns) occurs in debug-only code, while core inference paths remain unaffected.

Commit Context

Five commits by Xuan Son Nguyen implement reasoning content support:

  • Added supports_preserve_reasoning capability flag
  • Introduced caps::to_map() for JSON serialization to web UI
  • Refactored template capability reporting for maintainability

Files changed: 10 modified, 37 added, 3 deleted

Performance Impact Analysis

Most-Impacted Functions

1. jinja::caps::to_string() (both binaries)

  • Response time: 395 ns → 8,664 ns (+8,268 ns, +2,090%)
  • Throughput: 241 → 268 ops/sec (+11%)
  • Cause: Refactored from 5 hardcoded string concatenations to dynamic map iteration via to_map()
  • Justification: Debug-only function (JJ_DEBUG macro), enables server capability reporting, improves maintainability through DRY principles
  • Impact: Negligible - called once during initialization, not in inference loops

2. std::vector<>::begin() (2 instances)

  • Response time: 84 ns → 265 ns (+181 ns, +215%)
  • Cause: Build configuration changes (likely debug assertions enabled)
  • Impact: Minimal - called during terminal rendering (llama-tts) and unused in cvector-generator

3. httplib::Request::Request() (cvector-generator)

  • Response time: 2,285 ns → 2,467 ns (+183 ns, +8%)
  • Throughput: 196 → 378 ops/sec (+92%)
  • Cause: Compiler optimizations favoring concurrent request handling
  • Impact: None - dead code in CLI tool, relevant only for llama-server

Performance Improvements

std::vector operations (6 functions): 28-57% response time improvements (-64 to -203 ns) from compiler optimizations and template consolidation in chat message handling.

Code Changes Assessment

The refactoring prioritizes maintainability over micro-optimization in non-critical paths:

  • Added map-based capability reporting enables dynamic feature detection
  • Template consolidation reduces binary size and improves cache locality
  • New supports_preserve_reasoning field supports emerging reasoning models

All changes are justified - regressions occur exclusively in initialization and debug code, while actual execution paths show improvements or no change.

Power Consumption

Net execution time increase: +16,446 ns per cycle across affected functions. However, power impact is negligible (<0.01% of total application consumption) because:

  • Largest regressions are in one-time initialization code
  • High-frequency functions (iterators) show net improvements
  • No changes in inference hot paths (matrix operations, attention, KV cache)

GPU/ML Operations

No impact - zero analyzed functions involve GPU operations, tensor computations, or inference kernels. Changes affect only CPU-side template processing and initialization.

Conclusion

Performance changes are acceptable and justified. The 2,090% regression in to_string() appears severe but represents only 8.3 microseconds in debug-only code. The architectural improvements enable critical functionality for reasoning models while maintaining inference performance. No optimization required.
See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

@loci-dev force-pushed the main branch 2 times, most recently from 984fada to 54e0744 on January 22, 2026 06:14
@noamshiovits-Aurora

@loci-dev how many functions were analyzed?

@loci-review

loci-review bot commented Jan 22, 2026

According to the performance review report, 15 functions were analyzed across the llama-tts and llama-cvector-generator binaries.

These 15 functions include:

  • 1 function with the largest regression (jinja::caps::to_string)
  • 2 instances of std::vector<>::begin() with regressions
  • 1 function (httplib::Request::Request) with a small regression
  • 6 functions showing performance improvements (std::vector operations)
  • 5 additional functions (to make up the total of 15)

The analysis focused on functions impacted by the architectural changes supporting reasoning-focused LLMs, with changes ranging from significant regressions in debug-only code to improvements in vector operations.

@loci-dev force-pushed the main branch 13 times, most recently from 0e2fcc8 to 5668a6a on January 24, 2026 07:09
@loci-dev force-pushed the main branch 18 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 13:32
@loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 on February 19, 2026 02:17
@loci-dev force-pushed the main branch 2 times, most recently from 10f8f26 to a6ecec6 on February 20, 2026 02:17