UPSTREAM PR #20660: Fix chat parser regressions: inference crashes/frozen; output backtracked #1266
Overview

Performance Impact: Negligible. Analysis of 120,707 functions across 15 binaries shows 97 modified functions (0.08%), all in non-critical preprocessing paths. Changes stem from a single commit improving chat template parsing robustness by switching from strict schema validation to lenient parsing, eliminating catastrophic PEG backtracking.

Function Counts: 97 modified, 0 new, 0 removed, 120,610 unchanged

Power Consumption: Negligible changes (≤0.1%) across all binaries:
Function Analysis

Modified functions show mixed compiler optimization artifacts (40-74% improvements vs. 14-306% regressions) with no source code changes to the functions themselves.

Improvements:
Regressions:
All other analyzed functions showed compiler-generated variations in STL containers, JSON parsing, and template processing infrastructure.

Critical finding: No changes to the inference hot path (matrix operations, attention, KV cache, tokenization), where 70-90% of execution time is spent.

Additional Findings

Architectural Benefit: The commit successfully eliminates catastrophic PEG backtracking when models generate malformed JSON.

GPU/ML Operations: Zero impact; no changes to CUDA, Metal, HIP, Vulkan, or SYCL backends, quantization operations, or the inference pipeline.

🔎 Full breakdown: Loci Inspector
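The backtracking problem the report refers to can be illustrated with a minimal sketch. This is hypothetical simplified code, not the actual llama.cpp parser: a strict all-or-nothing parse discards every extracted tool call when one payload holds malformed JSON, while lenient capture keeps the well-formed ones.

```cpp
#include <string>
#include <vector>

// Crude brace-balance check standing in for full JSON validation.
static bool balanced_json(const std::string& s) {
    int depth = 0;
    for (char c : s) {
        if (c == '{') depth++;
        if (c == '}') { depth--; if (depth < 0) return false; }
    }
    return depth == 0;
}

// Capture every <tool_call>...</tool_call> payload as raw text.
static std::vector<std::string> extract_calls(const std::string& text) {
    std::vector<std::string> calls;
    const std::string open = "<tool_call>", close = "</tool_call>";
    size_t pos = 0;
    while (true) {
        size_t b = text.find(open, pos);
        if (b == std::string::npos) break;
        size_t e = text.find(close, b);
        if (e == std::string::npos) break;
        calls.push_back(text.substr(b + open.size(), e - b - open.size()));
        pos = e + close.size();
    }
    return calls;
}

// Strict: one invalid payload wipes all nodes, like a PEG backtracking
// the entire root sequence.
static std::vector<std::string> parse_strict(const std::string& text) {
    std::vector<std::string> calls = extract_calls(text);
    for (const std::string& c : calls)
        if (!balanced_json(c)) return {};  // "backtrack": drop everything
    return calls;
}

// Lenient: keep whatever was captured; validate per call, later.
static std::vector<std::string> parse_lenient(const std::string& text) {
    return extract_calls(text);
}
```

With input containing one extra trailing brace in the first call, the strict parser returns nothing at all, while the lenient one still surfaces both tool calls; that is the difference in observed behavior the report describes.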
773a7db to 0083bb1
Overview

Performance Impact: Minor; neutral with functional improvements. Analysis of 120,779 functions shows 106 modified (0.09%), 0 new, 0 removed, 120,673 unchanged. Commit 0083bb1 improves chat template parsing robustness by replacing strict JSON schema validation with lenient capture, trading validation strictness for real-world reliability.

Power Consumption Changes:
All changes <0.1%, within measurement noise.

Function Analysis

Primary Improvements:
Notable Regressions:
Other analyzed functions (STL utilities, function wrappers, variant visitors) showed sub-microsecond changes from compiler optimization variations. None are in inference hot paths (matrix ops, attention, KV cache). Net improvement: ~161μs in parser construction vs. ~438ns cumulative regression in utilities.

🔎 Full breakdown: Loci Inspector
88f82d8 to 8c39ead
Note: Source pull request: ggml-org/llama.cpp#20660
Two regressions introduced in #18675.
Bug 1: Final parse throws instead of using AST fallback (all models)
`common_chat_peg_parse` gates the AST fallback on `is_partial`. During streaming, every partial parse that can't fully match the grammar falls back to AST extraction: reasoning streams, tool names appear, everything works. Then `server_task_result_cmpl_final::update()` calls the same function on the same accumulated text with `is_partial=false`, and it throws `std::runtime_error` instead of using the fallback. The client never receives `finish_reason`; in practice, it sees "Failed to parse input at pos X:" at some point after inference starts.

This means every model is one slightly-unexpected token away from a 500 error. Truncated output, hitting `max_tokens` mid-tool-call, or a single malformed character in the args: any of these trigger it.

Repro:
`curl -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":1}' http://localhost:8080/v1/chat/completions` on any thinking model.

Fix: remove the `is_partial` guard. One line.

Bug 2: Strict JSON validation in TAG_WITH_TAGGED causes catastrophic PEG backtrack (Qwen 3.5, Qwen3-Coder, Nemotron, etc.)
`build_tool_parser_tag_tagged` validates parameter values with `schema(json())`. Small models frequently produce slightly malformed JSON (e.g. an extra trailing brace). When `json()` rejects the value, the tool_call rule fails, PEG backtracks the entire root sequence, and wipes all AST nodes, including reasoning and tool names that parsed successfully. A second `<tool_call>` after the first is never attempted.

Fix: use `until(value_suffix)` for parser-side arg capture. Grammar constraints still enforce the schema during constrained generation; this only affects parsing of already-generated output.

Related issues:
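For reference, Bug 1's failure mode (the final non-partial parse throwing instead of falling back) can be sketched as follows. This is hypothetical simplified code, not the actual `common_chat_peg_parse` implementation: a strict parse that throws on truncated tool-call markup, with a best-effort fallback that is wrongly gated on `is_partial`.

```cpp
#include <stdexcept>
#include <string>

struct result { std::string content; bool ok; };

// Strict path: throws when a tool call was opened but never closed
// (e.g. generation cut off by max_tokens).
static result strict_parse(const std::string& text) {
    if (text.find("<tool_call>")  != std::string::npos &&
        text.find("</tool_call>") == std::string::npos)
        throw std::runtime_error("Failed to parse input");
    return { text, true };
}

// Best-effort AST extraction: keep whatever was captured so far.
static result fallback_extract(const std::string& text) {
    return { text, true };
}

// Buggy: fallback only while streaming (is_partial == true), so the
// final parse of the exact same accumulated text throws instead.
static result parse_buggy(const std::string& text, bool is_partial) {
    try { return strict_parse(text); }
    catch (const std::runtime_error&) {
        if (is_partial) return fallback_extract(text);
        throw;  // client gets a 500 and never sees finish_reason
    }
}

// Fixed: always fall back to AST extraction.
static result parse_fixed(const std::string& text, bool /*is_partial*/) {
    try { return strict_parse(text); }
    catch (const std::runtime_error&) { return fallback_extract(text); }
}
```

Every streaming call on the truncated text succeeds, then the single `is_partial=false` call at the end throws; removing the gate makes the final call take the same fallback path the streaming calls already exercised.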