UPSTREAM PR #20660: Fix chat parser regressions: inference crashes/frozen; output backtracked#1266

Open
loci-dev wants to merge 1 commit into main from loci/pr-20660-master

Conversation

@loci-dev
Note

Source pull request: ggml-org/llama.cpp#20660

Two regressions were introduced in #18675.

Bug 1: Final parse throws instead of using AST fallback (all models)

common_chat_peg_parse gates the AST fallback on is_partial. During streaming, every partial parse that can't fully match the grammar falls back to AST extraction: reasoning streams, tool names appear, everything works. Then server_task_result_cmpl_final::update() calls the same function on the same accumulated text with is_partial=false, and it throws std::runtime_error instead of using the fallback. The client never receives finish_reason; in practice it sees "Failed to parse input at pos X:" at some point after inference starts.

This means every model is one slightly-unexpected token away from a 500 error. Truncated output, hitting max_tokens mid-tool-call, a single malformed character in the args: any of these triggers it.

Repro: curl -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":1}' http://localhost:8080/v1/chat/completions on any thinking model.

Fix: remove the is_partial guard. One line.
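A minimal sketch of the control flow (hypothetical names and signature; the real common_chat_peg_parse differs), showing why gating the fallback on is_partial makes only the final parse throw:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for common_chat_peg_parse, not the real llama.cpp code.
// strict_ok stands for "the PEG grammar matched the accumulated text completely".
std::string parse_chat_output(const std::string &text, bool strict_ok,
                              bool is_partial, bool gate_fallback_on_partial) {
    if (strict_ok) {
        return text; // full grammar match, no fallback needed
    }
    if (!gate_fallback_on_partial || is_partial) {
        return text; // AST fallback: extract whatever parsed successfully
    }
    // Buggy path: the final (is_partial=false) parse of the same text throws
    // instead of falling back, so the client never receives finish_reason.
    throw std::runtime_error("Failed to parse input at pos 0");
}
```

With the gate in place, every streaming call succeeds and only the final call on the identical text throws; removing the gate makes the final call take the same fallback path.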

Bug 2: Strict JSON validation in TAG_WITH_TAGGED causes catastrophic PEG backtrack (Qwen 3.5, Qwen3-Coder, Nemotron, etc.)

build_tool_parser_tag_tagged validates parameter values with schema(json()). Small models frequently produce slightly malformed JSON (e.g. extra trailing brace). When json() rejects the value, the tool_call rule fails, PEG backtracks the entire root sequence, and wipes all AST nodes — including reasoning and tool names that parsed successfully. A second <tool_call> after the first is never attempted.

Fix: use until(value_suffix) for parser-side arg capture. Grammar constraints still enforce the schema during constrained generation; this only affects parsing of already-generated output.
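A hedged sketch of the failure mode, not the real PEG machinery: "strict" mimics schema(json()) rejecting a malformed body and forcing the root rule to backtrack, which drops every node parsed so far; "lenient" mimics until(value_suffix), capturing raw text up to the closing tag so later tool calls are still attempted. The tag names match the PR; everything else (function name, brace-balancing stand-in for JSON validation) is illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: extract <tool_call> bodies from generated text.
std::vector<std::string> extract_tool_calls(const std::string &out, bool strict) {
    const std::string open = "<tool_call>", close = "</tool_call>";
    std::vector<std::string> calls;
    size_t pos = 0;
    while (true) {
        size_t a = out.find(open, pos);
        if (a == std::string::npos) break;
        size_t b = out.find(close, a + open.size());
        if (b == std::string::npos) break;
        std::string body = out.substr(a + open.size(), b - (a + open.size()));
        // crude stand-in for JSON validation: brace counts must balance
        bool valid = std::count(body.begin(), body.end(), '{') ==
                     std::count(body.begin(), body.end(), '}');
        if (strict && !valid) {
            return {}; // catastrophic backtrack: the whole AST is wiped
        }
        calls.push_back(body); // lenient: keep the raw capture
        pos = b + close.size();
    }
    return calls;
}
```

The key behavioral difference: with an extra trailing brace in the first call, strict mode returns nothing at all, while lenient mode still yields both tool calls for downstream handling.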

Related issues:

  • #20193
  • #20198
  • #20229
  • #20345

loci-review bot commented Mar 17, 2026

Overview

Performance Impact: Negligible. Analysis of 120,707 functions across 15 binaries shows 97 modified functions (0.08%), all in non-critical preprocessing paths. Changes stem from a single commit improving chat template parsing robustness by switching from strict schema validation to lenient parsing, eliminating catastrophic PEG backtracking.

Function Counts: 97 modified, 0 new, 0 removed, 120,610 unchanged

Power Consumption: Negligible changes (≤0.1%) across all binaries:

Binary                                 Change
build.bin.llama-tts                    -0.031%
build.bin.llama-cvector-generator      -0.101%
build.bin.libmtmd.so                    0.000%
build.bin.libllama.so                  -0.000%
build.bin.llama-bench                  -0.000%
build.bin.llama-llava-cli               0.000%
build.bin.llama-minicpmv-cli            0.000%
build.bin.llama-quantize                0.000%
build.bin.llama-qwen2vl-cli             0.000%
build.bin.llama-tokenize                0.000%
build.bin.llama-gemma3-cli              0.000%
build.bin.llama-gguf-split              0.000%
build.bin.libggml.so                    0.000%
build.bin.libggml-base.so               0.000%
build.bin.libggml-cpu.so                0.000%

Function Analysis

Modified functions show mixed compiler optimization artifacts (40-74% improvements vs 14-306% regressions) with no source code changes to the functions themselves:

Improvements:

  • std::vector::begin() (llama-tts): 264.64ns → 83.83ns (-68.3%, -180.81ns) — entry block consolidation
  • std::vector::empty() (llama-tts): 465.90ns → 276.02ns (-40.8%, -189.88ns) — eliminated redundant control flow
  • std::_Rb_tree_const_iterator::_M_const_cast() (cvector-generator): 265.86ns → 84.35ns (-68.3%, -181.51ns) — entry block optimization

Regressions:

  • std::_Rb_tree::end() (llama-tts): 79.66ns → 262.94ns (+230.1%, +183.28ns) — added entry block indirection
  • jinja::value_kwarg_t::type() (llama-tts): 681.89ns → 782.36ns (+14.7%, +100.47ns) — extra memory load at entry

All other analyzed functions showed compiler-generated variations in STL containers, JSON parsing, and template processing infrastructure. Critical finding: No changes to inference hot path (matrix operations, attention, KV cache, tokenization) where 70-90% of execution time is spent.

Additional Findings

Architectural Benefit: The commit successfully eliminates catastrophic PEG backtracking when models generate malformed JSON with tool_choice=auto, preventing complete AST loss. This robustness improvement justifies minor performance variations in preprocessing code.

GPU/ML Operations: Zero impact — no changes to CUDA, Metal, HIP, Vulkan, or SYCL backends, quantization operations, or inference pipeline.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

loci-review bot commented Mar 18, 2026

Overview

Performance Impact: Minor — Neutral with Functional Improvements

Analysis of 120,779 functions shows 106 modified (0.09%), 0 new, 0 removed, 120,673 unchanged. Commit 0083bb1 improves chat template parsing robustness by replacing strict JSON schema validation with lenient capture, trading validation strictness for real-world reliability.

Power Consumption Changes:

  • build.bin.llama-cvector-generator: -0.088%
  • build.bin.llama-tts: +0.039%
  • build.bin.llama-bench, libllama.so, libmtmd.so, libggml-base.so, libggml-cpu.so, libggml.so, llama-quantize, llama-qwen2vl-cli, llama-tokenize, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli: 0.000%

All changes <0.1%, within measurement noise.

Function Analysis

Primary Improvements:

  • chat-auto-parser-generator.cpp::build_tool_parser_tag_tagged::operator() (both binaries): Response time -12.3% (-161μs), throughput time -17.5% (-394ns). Source change removed expensive p.schema() validation, replaced with lenient p.until() capture to prevent AST corruption on malformed JSON.
  • jinja::statement::type() (cvector-generator): Throughput time -50.7% (-110ns). Compiler optimization merged entry blocks, eliminated unnecessary branch.
  • std::vector<llama_chat_message>::begin() (cvector-generator): Response time -68.2% (-181ns). Compiler consolidated prologue operations.

Notable Regressions:

  • std::vector<std::pair<string,json>>::end() (cvector-generator): Throughput time +306.6% (+183ns). Compiler inlining decision change.
  • std::vector<std::thread>::begin() (llama-tts): Throughput time +289.3% (+181ns). Added intermediate block with extra indirection.

Other analyzed functions (STL utilities, function wrappers, variant visitors) showed sub-microsecond changes from compiler optimization variations. None are in inference hot paths (matrix ops, attention, KV cache). Net improvement: ~161μs in parser construction vs ~438ns cumulative regression in utilities.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 11 times, most recently from 88f82d8 to 8c39ead Compare March 25, 2026 02:17