UPSTREAM PR #17750: server: handle limiting maximum reasoning budget#425
UPSTREAM PR #17750: server: handle limiting maximum reasoning budget#425
Conversation
|
Explore the complete analysis inside the Version Insights Performance Analysis Summary - PR #425OverviewPR #425 implements reasoning budget enforcement for the llama.cpp server, adding runtime token counting and automatic closure of reasoning blocks when token limits are exceeded. The changes are isolated to server components with no modifications to core inference functions. Key FindingsInference Performance Impact: No impact on tokens per second. The core inference functions (llama_decode, llama_encode, llama_tokenize) remain unmodified. All changes are confined to server-level request handling in tools/server/server-context.cpp, tools/server/server-task.cpp, and tools/server/server-task.h. The implementation adds per-token overhead of approximately 25-205 ns for string-based reasoning marker detection when reasoning_budget is enabled. For a typical 1000-token generation, this accumulates to 25000-205000 ns total overhead. Relative to typical per-token inference time of 1000000-10000000 ns, this represents less than 0.02% impact. Token sampling logic modification adds a conditional branch that bypasses sampling when forced tokens are queued. This branch is highly predictable with approximately 2 ns overhead during normal generation. During forced closure events, sampling bypass actually reduces overhead by 100-1000 ns per forced token. Power Consumption Analysis: Four binaries show complete elimination of measured power consumption: build.bin.libllama.so (194249 nJ reduction), build.bin.llama-cvector-generator (249018 nJ reduction), build.bin.llama-run (218769 nJ reduction), and build.bin.llama-tts (253758 nJ reduction). Total reduction of 915794 nJ represents 58.5% decrease in overall power consumption. Core GGML libraries remain stable with zero change: build.bin.libggml-cpu.so (116309 nJ), build.bin.libmtmd.so (130976 nJ), and build.bin.libggml-base.so (59071 nJ). This indicates the power consumption changes are unrelated to PR #425 code modifications and likely reflect architectural refactoring or build system changes between versions. Implementation Details: The PR adds five new state variables per server slot (40 bytes overhead) for tracking reasoning tokens. String search operations using rfind() execute on every token generation when reasoning_budget is positive, searching for hardcoded markers like "" and "<|START_THINKING|>". Speculative decoding is disabled during forced token injection, affecting approximately 5-15 tokens per forced closure event. |
|
Explore the complete analysis inside the Version Insights Performance Analysis Summary: PR #425OverviewPR #425 introduces reasoning budget enforcement for LLM inference, adding logic to track and limit tokens generated during thinking phases. The implementation modifies 5 files with 136 additions and 6 deletions, primarily in Key FindingsPerformance-Critical Area ImpactThe changes do not directly affect core inference functions. Analysis of the top 10 regressed functions reveals they are all STL template instantiations (iterator accessors: Most-Impacted Functions (Absolute Changes):
These regressions appear systematic across STL containers, suggesting compiler or build configuration differences rather than code-level changes. The absolute deltas are small (24-135 ns per call). Inference Performance ImpactCore inference functions remain unaffected. The PR adds budget tracking logic that executes per-token during reasoning mode but does not modify:
The new code path introduces string searches ( Tokens per second impact: Given that core tokenization and inference functions show no performance changes, and the reference showing 7% tokens/sec reduction correlates with 2 ms slower Power Consumption AnalysisPower consumption changes are minimal across all binaries:
These changes are within measurement noise and do not indicate meaningful efficiency impact. Code Change ContextThe PR implements a state machine to track reasoning tokens and inject closing tags when budgets are exceeded. The systematic STL iterator regressions (all showing |
cb46586 to
1a14b3a
Compare
074b005 to
ff6ae69
Compare
687c4b0 to
73e95c3
Compare
73e95c3 to
fb40266
Compare
|
Explore the complete analysis inside the Version Insights |
1daebfe to
75a97fd
Compare
Mirrored from ggml-org/llama.cpp#17750
Work in progress. Written at ungodly hours by a very tired me + LLMs. NOT ready for review, unless you want to comment early on the idea.
Basically, this enables support for limiting the maximum reasoning by counting how much token we output since entered reasoning mode, and then append a closing sentence + reasoning end token.
I hardcoded some often used tokens, and the closing phrase, so this is extremely hacky. There is surely a better way of doing that.
In any case, it does work.
Make sure to read the contributing guidelines before submitting a PR