
UPSTREAM PR #17750: server: handle limiting maximum reasoning budget #425

Open
loci-dev wants to merge 2 commits into main from upstream-PR17750-branch_aviallon-feat/handle-limiting-maximum-budget

Conversation


@loci-dev loci-dev commented Dec 4, 2025

Mirrored from ggml-org/llama.cpp#17750

Work in progress. Written at ungodly hours by a very tired me + LLMs. NOT ready for review, unless you want to comment early on the idea.

Basically, this adds support for limiting the maximum reasoning length by counting how many tokens we have output since entering reasoning mode, and then appending a closing sentence + the reasoning end token.

I hardcoded some often used tokens, and the closing phrase, so this is extremely hacky. There is surely a better way of doing that.

In any case, it does work.

Make sure to read the contributing guidelines before submitting a PR


loci-review bot commented Dec 4, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #425

Overview

PR #425 implements reasoning budget enforcement for the llama.cpp server, adding runtime token counting and automatic closure of reasoning blocks when token limits are exceeded. The changes are isolated to server components with no modifications to core inference functions.

Key Findings

Inference Performance Impact:

No impact on tokens per second. The core inference functions (llama_decode, llama_encode, llama_tokenize) remain unmodified. All changes are confined to server-level request handling in tools/server/server-context.cpp, tools/server/server-task.cpp, and tools/server/server-task.h.

The implementation adds per-token overhead of approximately 25-205 ns for string-based reasoning marker detection when reasoning_budget is enabled. For a typical 1000-token generation, this accumulates to 25,000-205,000 ns of total overhead. Relative to typical per-token inference times of 1-10 ms (1,000,000-10,000,000 ns), this represents roughly 0.02% impact or less.

Token sampling logic modification adds a conditional branch that bypasses sampling when forced tokens are queued. This branch is highly predictable with approximately 2 ns overhead during normal generation. During forced closure events, sampling bypass actually reduces overhead by 100-1000 ns per forced token.

Power Consumption Analysis:

Four binaries show complete elimination of measured power consumption: build.bin.libllama.so (194,249 nJ reduction), build.bin.llama-cvector-generator (249,018 nJ reduction), build.bin.llama-run (218,769 nJ reduction), and build.bin.llama-tts (253,758 nJ reduction). The total reduction of 915,794 nJ represents a 58.5% decrease in overall power consumption.

Core GGML libraries remain stable with zero change: build.bin.libggml-cpu.so (116,309 nJ), build.bin.libmtmd.so (130,976 nJ), and build.bin.libggml-base.so (59,071 nJ). This indicates the power consumption changes are unrelated to PR #425's code modifications and likely reflect architectural refactoring or build system changes between versions.

Implementation Details:

The PR adds five new state variables per server slot (40 bytes overhead) for tracking reasoning tokens. String search operations using rfind() execute on every token generation when reasoning_budget is positive, searching for hardcoded markers like "" and "<|START_THINKING|>". Speculative decoding is disabled during forced token injection, affecting approximately 5-15 tokens per forced closure event.


loci-review bot commented Dec 4, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #425

Overview

PR #425 introduces reasoning budget enforcement for LLM inference, adding logic to track and limit tokens generated during thinking phases. The implementation modifies 5 files with 136 additions and 6 deletions, primarily in server-context.cpp.

Key Findings

Performance-Critical Area Impact

The changes do not directly affect core inference functions. Analysis of the top 10 regressed functions reveals they are all STL template instantiations (iterator accessors: begin(), end(), predicate operators) from system headers, not application code. These functions show modifications in assembly but are not part of the PR's source changes.

Most-Impacted Functions (Absolute Changes):

  • std::_Rb_tree<...>::end (llama-tts): +116 ns response time, +135 ns throughput
  • std::vector<std::pair<...>>::end (llama-cvector-generator): +114 ns response time, +135 ns throughput
  • __gnu_cxx::__ops::__negate (llama-cvector-generator): +86 ns response time, +121 ns throughput
  • Remaining 7 functions: +31 ns response time, +24 ns throughput each

These regressions appear systematic across STL containers, suggesting compiler or build configuration differences rather than code-level changes. The absolute deltas are small (24-135 ns per call).

Inference Performance Impact

Core inference functions remain unaffected. The PR adds budget tracking logic that executes per-token during reasoning mode but does not modify:

  • llama_decode - No changes detected
  • llama_encode - No changes detected
  • llama_tokenize - No changes detected

The new code path introduces string searches (rfind()) and deque operations in the token generation loop, but only when reasoning budget is enabled and active. For typical inference without reasoning limits or outside thinking blocks, the overhead is a single branch check per token.

Tokens per second impact: Given that core tokenization and inference functions show no performance changes (a reference case with a 7% tokens/sec reduction correlated with a 2 ms slower llama_decode, which is not observed here), the expected impact is negligible for standard inference. Reasoning-enabled requests may see a 5-15% throughput reduction due to per-token string operations, but this affects only the subset of requests using the reasoning budget feature.

Power Consumption Analysis

Power consumption changes are minimal across all binaries:

  • llama-cvector-generator: +0.073% (+183 nJ)
  • llama-tts: -0.041% (-105 nJ)
  • All other binaries: 0.0% change

These changes are within measurement noise and do not indicate meaningful efficiency impact.

Code Change Context

The PR implements a state machine to track reasoning tokens and inject closing tags when budgets are exceeded. The systematic STL iterator regressions (all showing is_modified: true) indicate assembly-level changes unrelated to the PR's functional modifications. The correlation suggests build environment differences (compiler flags, optimization levels, or toolchain version) rather than performance degradation from the feature implementation itself.

@loci-dev loci-dev force-pushed the main branch 26 times, most recently from cb46586 to 1a14b3a (December 6, 2025 13:13)
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 074b005 to ff6ae69 (December 9, 2025 12:15)
@loci-dev loci-dev force-pushed the upstream-PR17750-branch_aviallon-feat/handle-limiting-maximum-budget branch from 687c4b0 to 73e95c3 (December 9, 2025 12:46)
@loci-dev loci-dev force-pushed the upstream-PR17750-branch_aviallon-feat/handle-limiting-maximum-budget branch from 73e95c3 to fb40266 (December 9, 2025 15:41)

loci-review bot commented Dec 9, 2025

Explore the complete analysis inside the Version Insights

@loci-dev loci-dev force-pushed the main branch 13 times, most recently from 1daebfe to 75a97fd (December 10, 2025 23:07)