UPSTREAM PR #18994: server : support preserving reasoning_content in assistant message #992

Open

loci-dev wants to merge 5 commits into main from upstream-PR18994-branch_ngxson-xsn/reasoning_content_input

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18994

Ref: ggml-org/llama.cpp#18936 (comment)

Changes included in this PR

  • use json_fwd in chat.h to avoid the template trick (see the sketch below for the general pattern)
  • deduplicate code between common_chat_msgs_to_json_oaicompat and common_chat_msg::to_json_oaicompat()
  • force clear_thinking = false for GLM 4.7 if it is not specified
  • report supports_preserve_reasoning via the server /props endpoint

(Web UI support is TBD)
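
For context, the json_fwd change follows the standard nlohmann pattern: a header that only needs to name nlohmann::json can include <nlohmann/json_fwd.hpp> instead of hiding the type behind a template parameter. A minimal sketch of the pattern (names are illustrative, not the actual chat.h contents):

// chat_example.h (illustrative only, not the real chat.h)
#pragma once

#include <nlohmann/json_fwd.hpp>  // forward-declares nlohmann::json; far cheaper than json.hpp
#include <string>

struct chat_msg_example {
    std::string role;
    std::string content;
    std::string reasoning_content;

    // Declaring a member function with an incomplete return type is fine;
    // only the .cpp that defines it needs the full <nlohmann/json.hpp>.
    nlohmann::json to_json_oaicompat() const;
};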

Changes in API

The /chat/completions API now accepts a reasoning_content field in assistant messages:

{
  "messages": [
    {
      "content": "Hello, world!",
      "role": "user"
    },
    {
      "content": "Hey there!",
      "role": "assistant",
      "reasoning_content": "This is my reasoning."
    },
    {
      "content": "Hello, world!",
      "role": "user"
    }
  ],
  "stream": false,
  "max_tokens": 64
}

If the template supports it, the reasoning will be put back into the rendered prompt (tested with GLM 4.7):

[gMASK]<sop><|user|>Hello, world!<|assistant|><think>This is my reasoning.</think>Hey there!<|user|>Hello, world!<|assistant|><think>

Otherwise, it will be ignored.
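
For illustration, the request above can be sent with cpp-httplib and nlohmann::json (both already used by the llama.cpp server); the host, port, and error handling below are assumptions for a locally running llama-server, not part of the PR:

#include <httplib.h>
#include <nlohmann/json.hpp>

#include <iostream>

int main() {
    // Assumes llama-server is listening locally; adjust host/port as needed.
    httplib::Client cli("localhost", 8080);

    const nlohmann::json body = {
        {"messages", {
            {{"role", "user"},      {"content", "Hello, world!"}},
            {{"role", "assistant"}, {"content", "Hey there!"},
                                    {"reasoning_content", "This is my reasoning."}},
            {{"role", "user"},      {"content", "Hello, world!"}},
        }},
        {"stream", false},
        {"max_tokens", 64},
    };

    auto res = cli.Post("/chat/completions", body.dump(), "application/json");
    if (res && res->status == 200) {
        std::cout << nlohmann::json::parse(res->body).dump(2) << std::endl;
    } else {
        std::cerr << "request failed" << std::endl;
    }
    return 0;
}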

To know whether the template supports it, the /props endpoint will indicate:

{
  "chat_template_caps": {
    ...
    "supports_preserve_reasoning": true,
    ...
  }
}
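
A client can read this flag before deciding whether to echo reasoning_content back into the conversation history. A minimal sketch under the same assumptions as above (the field names come from the example, everything else is illustrative):

#include <httplib.h>
#include <nlohmann/json.hpp>

// Returns true only if the loaded chat template reports the capability;
// any transport or parse failure is treated as "not supported".
static bool server_preserves_reasoning(httplib::Client & cli) {
    auto res = cli.Get("/props");
    if (!res || res->status != 200) {
        return false;
    }
    const auto props = nlohmann::json::parse(res->body, /*cb=*/nullptr, /*allow_exceptions=*/false);
    if (!props.is_object()) {
        return false;
    }
    return props.value("chat_template_caps", nlohmann::json::object())
                .value("supports_preserve_reasoning", false);
}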

@loci-review

loci-review bot commented Jan 21, 2026

Performance Review Report: llama.cpp Version Comparison

Impact Classification: Minor Impact

Total Functions Analyzed: 12 functions across 2 binaries (llama-tts, llama-cvector-generator)

Executive Summary

Performance analysis reveals no meaningful impact on LLM inference. All 12 functions are non-critical utility operations (logging, templates, argument parsing) that execute during initialization, not in inference loops. Changes stem primarily from compiler optimization differences, with one intentional maintainability refactoring.

Key Findings

Commit Context (3 commits by Xuan Son Nguyen):

  • Added template capability reporting to web UI
  • Implemented reasoning_content support for reasoning-capable models
  • Documentation updates

Most Impacted Function:

  • jinja::caps::to_string() (both binaries): +8,265ns response time (+2,100%)
  • Cause: Intentional refactoring from hardcoded output to map-based iteration
  • Justification: Enables code reuse for server capability reporting; executes once per template load
  • Impact: Negligible (8.3μs overhead in initialization, not inference)
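
The shape of that refactor (a single map of capability flags consumed by both the debug printer and the server capability report) is roughly the following; the struct, field, and function names here are illustrative, not the actual jinja::caps code:

#include <map>
#include <string>

// Hypothetical capability flags, mirroring the pattern described above.
struct template_caps_example {
    bool supports_tools              = false;
    bool supports_system_role        = true;
    bool supports_preserve_reasoning = false;

    // Single source of truth that both to_string() and /props reporting can iterate.
    std::map<std::string, bool> to_map() const {
        return {
            {"supports_tools",              supports_tools},
            {"supports_system_role",        supports_system_role},
            {"supports_preserve_reasoning", supports_preserve_reasoning},
        };
    }

    // Debug-only pretty printer: iterates the map instead of hardcoding each flag.
    std::string to_string() const {
        std::string out;
        for (const auto & [name, enabled] : to_map()) {
            out += name + ": " + (enabled ? "true" : "false") + "\n";
        }
        return out;
    }
};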

STL Functions (9 functions):

  • Response time changes: -183ns to +203ns per call
  • Throughput improvements: +88% to +306% in most cases
  • Cause: Compiler optimization strategy differences (inlining, instruction scheduling)
  • Impact: Negligible (sub-microsecond changes in infrequent operations)

JSON Library (1 function):

  • nlohmann::json::iteration_proxy::end(): +111ns response time, +163% throughput
  • Cause: Compiler optimization improvements
  • Impact: Negligible (configuration parsing only)

Performance-Critical Assessment

None of the 12 functions are performance-critical. llama.cpp's bottlenecks are:

  • Matrix multiplication (70-90% of inference time)
  • Attention computation
  • KV cache operations
  • Quantization kernels

All analyzed functions execute outside these hot paths.

Power Consumption & GPU Impact

Power Impact: Negligible (<0.01% of total system power)
GPU Operations: Zero GPU-related changes detected
Inference Metrics: No impact on tokens/second, TTFT, or batch throughput

Conclusion

The performance changes are appropriate and justified. The jinja refactoring prioritizes maintainability over micro-optimization in non-critical code. Compiler-driven STL optimizations show expected variance between builds. No optimization action required.
See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

@loci-dev force-pushed the main branch 3 times, most recently from 095e526 to db6cb7a on January 21, 2026 19:15
@loci-review

loci-review bot commented Jan 21, 2026

Performance Review Report: llama.cpp Reasoning Content Support

Executive Summary

Analysis of 15 functions across llama-tts and llama-cvector-generator binaries reveals moderate performance impact from architectural changes supporting reasoning-focused LLMs (DeepSeek R1, OpenAI o1). The largest regression (+8,268 ns) occurs in debug-only code, while core inference paths remain unaffected.

Commit Context

Five commits by Xuan Son Nguyen implement reasoning content support:

  • Added supports_preserve_reasoning capability flag
  • Introduced caps::to_map() for JSON serialization to web UI
  • Refactored template capability reporting for maintainability

Files changed: 10 modified, 37 added, 3 deleted

Performance Impact Analysis

Most-Impacted Functions

1. jinja::caps::to_string() (both binaries)

  • Response time: 395 ns → 8,664 ns (+8,268 ns, +2,090%)
  • Throughput: 241 → 268 ops/sec (+11%)
  • Cause: Refactored from 5 hardcoded string concatenations to dynamic map iteration via to_map()
  • Justification: Debug-only function (JJ_DEBUG macro), enables server capability reporting, improves maintainability through DRY principles
  • Impact: Negligible - called once during initialization, not in inference loops

2. std::vector<>::begin() (2 instances)

  • Response time: 84 ns → 265 ns (+181 ns, +215%)
  • Cause: Build configuration changes (likely debug assertions enabled)
  • Impact: Minimal - called during terminal rendering (llama-tts) and unused in cvector-generator

3. httplib::Request::Request() (cvector-generator)

  • Response time: 2,285 ns → 2,467 ns (+183 ns, +8%)
  • Throughput: 196 → 378 ops/sec (+92%)
  • Cause: Compiler optimizations favoring concurrent request handling
  • Impact: None - dead code in CLI tool, relevant only for llama-server

Performance Improvements

std::vector operations (6 functions): 28-57% response time improvements (-64 to -203 ns) from compiler optimizations and template consolidation in chat message handling.

Code Changes Assessment

The refactoring prioritizes maintainability over micro-optimization in non-critical paths:

  • Added map-based capability reporting enables dynamic feature detection
  • Template consolidation reduces binary size and improves cache locality
  • New supports_preserve_reasoning field supports emerging reasoning models

All changes are justified - regressions occur exclusively in initialization and debug code, while actual execution paths show improvements or no change.

Power Consumption

Net execution time increase: +16,446 ns per cycle across affected functions. However, power impact is negligible (<0.01% of total application consumption) because:

  • Largest regressions are in one-time initialization code
  • High-frequency functions (iterators) show net improvements
  • No changes in inference hot paths (matrix operations, attention, KV cache)

GPU/ML Operations

No impact - zero analyzed functions involve GPU operations, tensor computations, or inference kernels. Changes affect only CPU-side template processing and initialization.

Conclusion

Performance changes are acceptable and justified. The 2,090% regression in to_string() appears severe but represents only 8.3 microseconds in debug-only code. The architectural improvements enable critical functionality for reasoning models while maintaining inference performance. No optimization required.
See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

@loci-dev force-pushed the main branch 2 times, most recently from 984fada to 54e0744 on January 22, 2026 06:14
@noamshiovits-Aurora

@loci-dev how many functions were analyzed?

@loci-review

loci-review bot commented Jan 22, 2026

According to the performance review report, 15 functions were analyzed across the llama-tts and llama-cvector-generator binaries.

These 15 functions include:

  • 1 function with the largest regression (jinja::caps::to_string)
  • 2 instances of std::vector<>::begin() with regressions
  • 1 function (httplib::Request::Request) with a small regression
  • 6 functions showing performance improvements (std::vector operations)
  • 5 additional functions (to make up the total of 15)

The analysis focused on functions impacted by the architectural changes supporting reasoning-focused LLMs, with changes ranging from significant regressions in debug-only code to improvements in vector operations.

@loci-dev force-pushed the main branch 13 times, most recently from 0e2fcc8 to 5668a6a on January 24, 2026 07:09
@loci-dev force-pushed the main branch 18 times, most recently from 048ad94 to 6c1fde6 on February 3, 2026 13:32
@loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 on February 19, 2026 02:17
@loci-dev force-pushed the main branch 2 times, most recently from 10f8f26 to a6ecec6 on February 20, 2026 02:17