UPSTREAM PR #20424: common/parser: add proper reasoning tag prefill reading #1247
Conversation
Overview

Analysis of 120,012 functions across 15 binaries (205 modified, 55 new, 0 removed) implementing "Reasoning prefill" functionality. Overall impact: Minor; negligible performance changes with zero impact on inference hot paths.

Power consumption changes:

Function Analysis

Intentional improvements:

Compiler artifacts: STL functions

🔎 Full breakdown: Loci Inspector
Force-pushed from 56aaa36 to 21147c2
Force-pushed from 6fa8e23 to f2637dc
Force-pushed from 5601b45 to d288b5e
Overview

Analysis of 121,172 functions across 15 binaries shows negligible overall performance impact from the "Reasoning prefill" feature implementation (6 commits, 28 files modified). 193 functions modified (0.16%), 1,081 new, 1,106 removed. All performance-critical paths (GGML kernels, attention, KV cache, quantization) remain completely unchanged.

Power consumption changes: aggregate system impact +0.014% (+223 nJ total).

Function Analysis

Intentional optimizations (improved performance):

Correctness improvements with acceptable costs:

Compiler artifacts (no source changes):

Other analyzed functions showed compiler-generated code layout changes with minimal absolute impact (<200 ns) in non-critical paths (logging, JSON operations, HTTP processing).

Additional Findings

Zero GPU/ML impact: All GGML libraries (CPU, CUDA, Metal backends), matrix operations, attention mechanisms, and quantization kernels show 0.000% power consumption change. Modified components are CPU-only preprocessing (chat template parsing, reasoning mode detection).

Architectural improvement: Reasoning mode consolidation (6→3 enum values) and Python dict removal represent strategic simplification, improving both performance (37-63% gains in hot paths) and maintainability while focusing on standard JSON conventions.

🔎 Full breakdown: Loci Inspector
3c7b997 to
5ac00d6
Compare
Note
Source pull request: ggml-org/llama.cpp#20424
This fixes the erroneous behavior of the autoparser, which ascribed thinking behavior to templates. As people rightly pointed out, some models have dynamic or hybrid reasoning: they can reason or not depending on switches, and even the template's behavior can change as a result (e.g. inserting `<think>` in the assistant prefill after a "no_think" appears in a user message).

Therefore, the `FORCED_OPEN` and `FORCED_CLOSED` formats are gone (`DELIMITER` was also deleted, since it is just a special case where the opening marker is empty). The parser now only detects models with tagged reasoning, i.e. models with an opening and a closing reasoning marker. It checks the assistant prefill for those markers and appends them to the input for the grammar and the parser, so that they are taken into account. This simplifies the parsing mechanism, since it no longer has to differentiate whether the `<think>` marker was added by the template or generated by the model.