
UPSTREAM PR #20424: common/parser: add proper reasoning tag prefill reading #1247

Open
loci-dev wants to merge 6 commits into main from loci/pr-20424-reasoning-prefill

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#20424

This changes the erroneous behavior of the autoparser, which ascribed thinking behavior to templates. As commenters rightly noted, some models have dynamic or hybrid reasoning: they can reason or not depending on certain switches, and even the template's behavior can change as a result (e.g. inserting <think> into the assistant prefill after a "no_think" appears in a user message).

Therefore, the FORCED_OPEN and FORCED_CLOSED formats are gone. The parser now simply detects models with tagged reasoning, i.e. an opening and a closing reasoning marker (DELIMITER was deleted as well, since it is just the special case of an empty opening marker). However, the parser checks the assistant prefill for those markers and appends them to the input passed to the grammar and the parser so that they are taken into account. This simplifies the parsing mechanism, since it no longer has to differentiate whether the `<think>` was added by the template or generated by the model.

@loci-review

loci-review bot commented Mar 12, 2026

Overview

Analysis of 120,012 functions across 15 binaries (205 modified, 55 new, 0 removed) implementing "Reasoning prefill" functionality. Overall impact: Minor — negligible performance changes with zero impact on inference hot paths.

Power consumption changes:

  • build.bin.llama-tts: +0.108%
  • build.bin.llama-cvector-generator: +0.133%
  • build.bin.llama-tokenize: +0.411%
  • build.bin.llama-quantize: +0.386%
  • build.bin.llama-bench: -0.019%
  • build.bin.libllama.so: 0.0% (inference library)
  • build.bin.libggml-base.so: 0.0%
  • build.bin.libggml-cpu.so: 0.0%
  • build.bin.libggml.so: 0.0%
  • build.bin.libmtmd.so: 0.0%
  • build.bin.llama-qwen2vl-cli: 0.0%
  • build.bin.llama-minicpmv-cli: 0.0%
  • build.bin.llama-llava-cli: 0.0%
  • build.bin.llama-gguf-split: 0.0%
  • build.bin.llama-gemma3-cli: 0.0%

Function Analysis

Intentional improvements:

  • autoparser::peg_generator::generate_parser() (llama-cvector-generator, llama-tts): Response time +1.3-6.6% (+12-64 μs), throughput time +95.6% (+596 ns). Added 40+ lines of reasoning prefill extraction logic to detect template artifacts in last 500 chars of prompts. Prevents parser corruption from <think></think> tags. Initialization-time code, not inference hot path.

  • common_chat_peg_mapper::from_ast() (llama-cvector-generator, llama-tts): Response time +5.3-5.4% (+290 ns), throughput time +83-84% (+198 ns). Added 16-line whitespace validation loop to filter empty reasoning blocks. Improves output quality for reasoning models.

  • autoparser::operator<<(reasoning_mode) (llama-tts, llama-cvector-generator): Response time -37.5% to -38% (-41 ns), throughput time -40% (-41 ns). Simplified from 6 to 3 reasoning modes, reducing switch cases and branching complexity.

Compiler artifacts:

STL functions (std::vector::end(), std::unordered_map::end(), char_traits::length()) show 115-225% regressions (+160-185 ns absolute) due to compiler code generation differences, not source changes. Other STL functions improved 28-76% from better optimization. All changes are sub-microsecond and not in inference paths.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 56aaa36 to 21147c2 Compare March 13, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 6fa8e23 to f2637dc Compare March 15, 2026 02:18
@loci-dev loci-dev force-pushed the loci/pr-20424-reasoning-prefill branch from 5601b45 to d288b5e Compare March 15, 2026 03:09
@loci-review

loci-review bot commented Mar 15, 2026

Overview

Analysis of 121,172 functions across 15 binaries shows negligible overall performance impact from the "Reasoning prefill" feature implementation (6 commits, 28 files modified). 193 functions modified (0.16%), 1,081 new, 1,106 removed. All performance-critical paths (GGML kernels, attention, KV cache, quantization) remain completely unchanged.

Power consumption changes:

  • build.bin.llama-tokenize: +0.43%
  • build.bin.llama-quantize: +0.37%
  • build.bin.llama-bench: +0.17%
  • build.bin.llama-tts: -0.08%
  • build.bin.llama-cvector-generator: -0.06%
  • build.bin.libllama.so: 0.00%
  • build.bin.libmtmd.so: 0.00%
  • build.bin.libggml-base.so: 0.00%
  • build.bin.libggml-cpu.so: 0.00%
  • build.bin.libggml.so: 0.00%
  • build.bin.llama-qwen2vl-cli: 0.00%
  • build.bin.llama-gemma3-cli: 0.00%
  • build.bin.llama-gguf-split: 0.00%
  • build.bin.llama-llava-cli: 0.00%
  • build.bin.llama-minicpmv-cli: 0.00%

Aggregate system impact: +0.014% (+223 nJ total)

Function Analysis

Intentional optimizations (improved performance):

  • autoparser::operator<< (reasoning_mode): Response time -37.8% (110ns → 68ns), throughput -40.3% (103ns → 61ns). Simplified enum from 6 to 3 values, reducing comparison blocks by 64%. Affects both llama-tts and llama-cvector-generator.

  • Lambda in analyze_tool_call_format: Response time -32.7% (21.0µs → 14.1µs), throughput -63.1% (334ns → 123ns). Removed Python dict format detection, eliminated map operations and string allocations. Reduced basic blocks by 56%. Affects both llama-tts and llama-cvector-generator.

  • std::vector::back() (llama-cvector-generator): Response time -42.1% (451ns → 261ns), throughput -73.1% (260ns → 70ns). Compiler optimization consolidated entry block.

Correctness improvements with acceptable costs:

  • autoparser::build_parser() (llama-tts): Response time +31.6% (73.6µs → 96.9µs), throughput -35.3% (355ns → 230ns). Added p.space() call for proper whitespace handling (+7.7µs), simplified from 6 to 3 reasoning modes. One-time setup cost, not in inference path.

  • common_chat_peg_mapper::from_ast() (llama-cvector-generator): Response time +5.4% (5.4µs → 5.7µs), throughput +83.8% (238ns → 437ns). Added whitespace-only reasoning content filtering to prevent empty <think></think> blocks. Improves output quality for streaming.

Compiler artifacts (no source changes):

  • std::unordered_set::begin() (llama-tts): Response time +179.9% (104ns → 290ns), throughput +310.1% (60ns → 247ns). Extra unconditional jump in entry block. Used in HTTP header processing, not inference path.

  • std::_Rb_tree_const_iterator::operator++ (llama-tts): Response time +138.5% (81ns → 194ns). Used in ProgressBar map operations during downloads.

  • std::future::_M_complete_async() (llama-tts): Response time +17.9% (1.0µs → 1.2µs), throughput +191.9% (97ns → 283ns). Async download infrastructure, called once per task.

Other analyzed functions showed compiler-generated code layout changes with minimal absolute impact (<200ns) in non-critical paths (logging, JSON operations, HTTP processing).

Additional Findings

Zero GPU/ML impact: All GGML libraries (CPU, CUDA, Metal backends), matrix operations, attention mechanisms, and quantization kernels show 0.000% power consumption change. Modified components are CPU-only preprocessing (chat template parsing, reasoning mode detection).

Architectural improvement: Reasoning mode consolidation (6→3 enum values) and Python dict removal represent strategic simplification, improving both performance (37-63% gains in hot paths) and maintainability while focusing on standard JSON conventions.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 3c7b997 to 5ac00d6 Compare March 17, 2026 02:18