UPSTREAM PR #20660: Fix chat parser regressions: inference crashes/frozen; output backtracked#1266

Open
loci-dev wants to merge 1 commit into main from loci/pr-20660-master

Conversation

@loci-dev
Note

Source pull request: ggml-org/llama.cpp#20660

Two regressions were introduced in #18675.

Bug 1: Final parse throws instead of using AST fallback (all models)

common_chat_peg_parse gates the AST fallback on is_partial. During streaming, every partial parse that can't fully match the grammar falls back to AST extraction: reasoning streams, tool names appear, everything works. Then server_task_result_cmpl_final::update() calls the same function on the same accumulated text with is_partial=false, and it throws std::runtime_error instead of using the fallback. The client never receives finish_reason; in practice it sees "Failed to parse input at pos X:" at some point after inference starts.

This means every model is one slightly-unexpected token away from a 500 error. Truncated output, hitting max_tokens mid-tool-call, a single malformed character in the args: any of these triggers it.

Repro: curl -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":1}' http://localhost:8080/v1/chat/completions on any thinking model.

Fix: remove the is_partial guard. One line.
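A minimal sketch of the control flow (hypothetical names and signature; the real common_chat_peg_parse differs), showing why gating the fallback on is_partial makes only the final parse throw:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for common_chat_peg_parse, not the real llama.cpp code.
// strict_ok stands for "the PEG grammar matched the accumulated text completely".
std::string parse_chat_output(const std::string &text, bool strict_ok,
                              bool is_partial, bool gate_fallback_on_partial) {
    if (strict_ok) {
        return text; // full grammar match, no fallback needed
    }
    if (!gate_fallback_on_partial || is_partial) {
        return text; // AST fallback: extract whatever parsed successfully
    }
    // Buggy path: the final (is_partial=false) parse of the same text throws
    // instead of falling back, so the client never receives finish_reason.
    throw std::runtime_error("Failed to parse input at pos 0");
}
```

With the gate in place, every streaming call succeeds and only the final call on the identical text throws; removing the gate makes the final call take the same fallback path.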

Bug 2: Strict JSON validation in TAG_WITH_TAGGED causes catastrophic PEG backtrack (Qwen 3.5, Qwen3-Coder, Nemotron, etc.)

build_tool_parser_tag_tagged validates parameter values with schema(json()). Small models frequently produce slightly malformed JSON (e.g. extra trailing brace). When json() rejects the value, the tool_call rule fails, PEG backtracks the entire root sequence, and wipes all AST nodes — including reasoning and tool names that parsed successfully. A second <tool_call> after the first is never attempted.

Fix: use until(value_suffix) for parser-side arg capture. Grammar constraints still enforce the schema during constrained generation; this only affects parsing of already-generated output.
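A hedged sketch of the failure mode, not the real PEG machinery: "strict" mimics schema(json()) rejecting a malformed body and forcing the root rule to backtrack, which drops every node parsed so far; "lenient" mimics until(value_suffix), capturing raw text up to the closing tag so later tool calls are still attempted. The tag names match the PR; everything else (function name, brace-balancing stand-in for JSON validation) is illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: extract <tool_call> bodies from generated text.
std::vector<std::string> extract_tool_calls(const std::string &out, bool strict) {
    const std::string open = "<tool_call>", close = "</tool_call>";
    std::vector<std::string> calls;
    size_t pos = 0;
    while (true) {
        size_t a = out.find(open, pos);
        if (a == std::string::npos) break;
        size_t b = out.find(close, a + open.size());
        if (b == std::string::npos) break;
        std::string body = out.substr(a + open.size(), b - (a + open.size()));
        // crude stand-in for JSON validation: brace counts must balance
        bool valid = std::count(body.begin(), body.end(), '{') ==
                     std::count(body.begin(), body.end(), '}');
        if (strict && !valid) {
            return {}; // catastrophic backtrack: the whole AST is wiped
        }
        calls.push_back(body); // lenient: keep the raw capture
        pos = b + close.size();
    }
    return calls;
}
```

The key behavioral difference: with an extra trailing brace in the first call, strict mode returns nothing at all, while lenient mode still yields both tool calls for downstream handling.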

Related issues:

  • #20193
  • #20198
  • #20229
  • #20345

loci-review bot commented Mar 17, 2026

Overview

Performance Impact: Negligible. Analysis of 120,707 functions across 15 binaries shows 97 modified functions (0.08%), all in non-critical preprocessing paths. Changes stem from a single commit improving chat template parsing robustness by switching from strict schema validation to lenient parsing, eliminating catastrophic PEG backtracking.

Function Counts: 97 modified, 0 new, 0 removed, 120,610 unchanged

Power Consumption: Negligible changes (≤0.1%) across all binaries:

Binary                                 Change
build.bin.llama-tts                    -0.031%
build.bin.llama-cvector-generator      -0.101%
build.bin.libmtmd.so                    0.000%
build.bin.libllama.so                  -0.000%
build.bin.llama-bench                  -0.000%
build.bin.llama-llava-cli               0.000%
build.bin.llama-minicpmv-cli            0.000%
build.bin.llama-quantize                0.000%
build.bin.llama-qwen2vl-cli             0.000%
build.bin.llama-tokenize                0.000%
build.bin.llama-gemma3-cli              0.000%
build.bin.llama-gguf-split              0.000%
build.bin.libggml.so                    0.000%
build.bin.libggml-base.so               0.000%
build.bin.libggml-cpu.so                0.000%

Function Analysis

Modified functions show mixed compiler optimization artifacts (40-74% improvements vs 14-306% regressions) with no source code changes to the functions themselves:

Improvements:

  • std::vector::begin() (llama-tts): 264.64ns → 83.83ns (-68.3%, -180.81ns) — entry block consolidation
  • std::vector::empty() (llama-tts): 465.90ns → 276.02ns (-40.8%, -189.88ns) — eliminated redundant control flow
  • std::_Rb_tree_const_iterator::_M_const_cast() (cvector-generator): 265.86ns → 84.35ns (-68.3%, -181.51ns) — entry block optimization

Regressions:

  • std::_Rb_tree::end() (llama-tts): 79.66ns → 262.94ns (+230.1%, +183.28ns) — added entry block indirection
  • jinja::value_kwarg_t::type() (llama-tts): 681.89ns → 782.36ns (+14.7%, +100.47ns) — extra memory load at entry

All other analyzed functions showed compiler-generated variations in STL containers, JSON parsing, and template processing infrastructure. Critical finding: No changes to inference hot path (matrix operations, attention, KV cache, tokenization) where 70-90% of execution time is spent.

Additional Findings

Architectural Benefit: The commit successfully eliminates catastrophic PEG backtracking when models generate malformed JSON with tool_choice=auto, preventing complete AST loss. This robustness improvement justifies minor performance variations in preprocessing code.

GPU/ML Operations: Zero impact — no changes to CUDA, Metal, HIP, Vulkan, or SYCL backends, quantization operations, or inference pipeline.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

loci-review bot commented Mar 18, 2026

Overview

Performance Impact: Minor — Neutral with Functional Improvements

Analysis of 120,779 functions shows 106 modified (0.09%), 0 new, 0 removed, 120,673 unchanged. Commit 0083bb1 improves chat template parsing robustness by replacing strict JSON schema validation with lenient capture, trading validation strictness for real-world reliability.

Power Consumption Changes:

  • build.bin.llama-cvector-generator: -0.088%
  • build.bin.llama-tts: +0.039%
  • build.bin.llama-bench, libllama.so, libmtmd.so, libggml-base.so, libggml-cpu.so, libggml.so, llama-quantize, llama-qwen2vl-cli, llama-tokenize, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli: 0.000%

All changes <0.1%, within measurement noise.

Function Analysis

Primary Improvements:

  • chat-auto-parser-generator.cpp::build_tool_parser_tag_tagged::operator() (both binaries): Response time -12.3% (-161μs), throughput time -17.5% (-394ns). Source change removed expensive p.schema() validation, replaced with lenient p.until() capture to prevent AST corruption on malformed JSON.
  • jinja::statement::type() (cvector-generator): Throughput time -50.7% (-110ns). Compiler optimization merged entry blocks, eliminated unnecessary branch.
  • std::vector<llama_chat_message>::begin() (cvector-generator): Response time -68.2% (-181ns). Compiler consolidated prologue operations.

Notable Regressions:

  • std::vector<std::pair<string,json>>::end() (cvector-generator): Throughput time +306.6% (+183ns). Compiler inlining decision change.
  • std::vector<std::thread>::begin() (llama-tts): Throughput time +289.3% (+181ns). Added intermediate block with extra indirection.

Other analyzed functions (STL utilities, function wrappers, variant visitors) showed sub-microsecond changes from compiler optimization variations. None are in inference hot paths (matrix ops, attention, KV cache). Net improvement: ~161μs in parser construction vs ~438ns cumulative regression in utilities.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 11 times, most recently from 88f82d8 to 8c39ead Compare March 25, 2026 02:17