UPSTREAM PR #18675: Autoparser - complete refactoring of parser architecture#1141
Conversation
The analysis encountered an error. Please review the Processing Details for more information.
Overview

Analysis of 123,002 functions across 14 binaries reveals localized performance changes from the "GIANT AUTOPARSER SQUISH" refactoring. Modified: 500 functions (0.4%), new: 7,530, removed: 9,276, unchanged: 105,696. Power consumption improved 6.5% in llama-cvector-generator (355,537→332,380 nJ) and llama-tts (361,083→337,746 nJ). All other binaries (libllama.so, libmtmd.so, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-quantize, llama-qwen2vl-cli, llama-tokenize, llama-gemma3-cli, llama-bench, libggml-base.so, libggml-cpu.so, libggml.so) showed zero change, confirming inference hot paths remain unaffected.

Function Analysis

Jinja2 Template Builtins: The indent lambda (value.cpp) transitioned from non-functional stub to production implementation, showing response time increases of 1,466→39,380ns (+2,586%) in llama-tts and 1,470→39,365ns (+2,578%) in llama-cvector-generator. Throughput time increased ~117% (211→459ns). The test_is_in lambda refactored to use value_compare() instead of operator==, resulting in response time increases of 1,470→29,657ns (+1,918%) in llama-tts and 1,473→29,629ns (+1,911%) in llama-cvector-generator, with throughput increases of ~170% (215→583ns).

PEG Parser Operators: Choice and sequence parsers gained debug instrumentation (guarded by ctx.debug flag, default: false). Choice parser response time increased 1,498→10,861ns (+625%) with throughput 215→796ns (+270%). Sequence parser response time increased 6,771→35,651ns (+427%) with throughput 408→1,276ns (+213%). Production overhead is minimal when debug is disabled.

PEG Arena Dump: Added cycle detection to prevent stack overflow on recursive grammars. Response time increased 529→6,955ns (+1,214%) in llama-cvector-generator and 537→6,960ns (+1,196%) in llama-tts. Throughput increased ~27% (69→87ns). This is a debugging utility, not used in production inference.

Trie Destructor: Data structure changed from std::string to std::vector<uint32_t> for Unicode support. Response time increased 35→403ns (+1,058%) in llama-tts and 35→403ns (+1,053%) in llama-cvector-generator due to loss of small string optimization. Called during grammar cleanup, not inference.

Chat Grammar Builder: Refactored from manual string construction to PEG parser delegation. Response time increased 11,773→122,524ns (+941%) in llama-cvector-generator and 11,881→122,628ns (+932%) in llama-tts. Throughput improved 87% (815→103ns), indicating efficient lambda code but expensive child calls. One-time initialization cost.

Other analyzed functions (log verbosity setter, JSON iterator, vector allocator) showed regressions from binary layout or template instantiation changes, all in non-critical paths.

Additional Findings

Zero impact on GPU/ML operations: no changes to CUDA/Metal/HIP backends, matrix multiplication (GEMM), attention mechanisms, KV cache, quantization kernels, or SIMD optimizations. Core inference libraries (libllama.so, libggml-*.so) unchanged. All regressions are isolated to template processing, grammar compilation, and debugging utilities, outside the 70-90% of inference time spent in matrix operations. The 6.5% power consumption reduction despite individual function regressions indicates successful code consolidation that eliminated inefficiencies in frequently-executed paths. Changes provide correctness improvements (cycle detection, Unicode support, type safety) and maintainability benefits (50% code reduction in grammar builders) that justify initialization-time performance costs.

🔎 Full breakdown: Loci Inspector.
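The trie change called out above (std::string keys replaced by std::vector&lt;uint32_t&gt;) is essentially a switch from byte units to codepoint units. A minimal Python illustration of the difference (my own sketch, not code from the PR):

```python
# Illustration (not code from the PR): why keying a trie on Unicode
# codepoints (the std::vector<uint32_t> change) differs from keying on
# the raw bytes of a std::string.
word = "naïve"

byte_key = list(word.encode("utf-8"))   # 'ï' is split into two UTF-8 bytes
codepoint_key = [ord(c) for c in word]  # 'ï' stays a single 32-bit unit

print(byte_key)       # [110, 97, 195, 175, 118, 101] - six byte-sized units
print(codepoint_key)  # [110, 97, 239, 118, 101] - five codepoint units
```

With byte keys, a trie edge can land in the middle of a multibyte character; codepoint keys keep each character atomic, at the cost of losing small-string optimization in the C++ container.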
Force-pushed from 048ad94 to 6c1fde6
Force-pushed from 876531a to 3096eca
Overview

Analysis of 29 commits implementing a PEG parser refactoring across 123,187 functions (506 modified, 7,414 new, 9,211 removed). Two binaries show measurable changes: build.bin.llama-tts (-6.23% power consumption) and build.bin.llama-cvector-generator (-6.18% power consumption). Thirteen binaries remain unchanged: build.bin.libllama.so, build.bin.libmtmd.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, and build.bin.llama-bench (all 0.00% change).

Function Analysis

Jinja Template Operators (both binaries): Response time increased from 1.5µs to 39.4µs (+2,586%). The base version was a non-functional stub throwing exceptions; the target version implements complete Jinja2 indent/slice filters with argument validation, type checking, and string processing. This is feature completion, not regression.

PEG Parser Operators (both binaries): Choice parser response time increased from 1.5µs to 10.9µs (+625%); sequence parser from 6.8µs to 35.6µs (+427%). Changes added comprehensive debug logging (6 fprintf calls, arena.dump() tree traversals, string generation) and streaming support for partial input handling. Debug overhead dominates measurements; production builds with debug=false should see minimal impact.

Chat Initialization Functions (both binaries): GPT-OSS initialization increased from 11.8µs to 122.8µs (+938%); Functionary v3.2 from 550µs to 970µs (+76%). Replaced manual grammar construction with the PEG parser framework, introducing parser object creation overhead but improving code maintainability and enabling streaming support. One-time initialization cost, not per-token.

PEG Arena Dump (both binaries): Response time increased from 537ns to 6,957ns (+1,196%). Added cycle detection with std::unordered_set to prevent infinite recursion crashes. This is a debugging utility, not production code.

Trie Destructors (both binaries): Response time increased from 35ns to 403ns (+1,058%). Structure changed from std::string to std::vector<uint32_t> for proper Unicode codepoint handling, increasing memory deallocation overhead but enabling multilingual support.

Other analyzed functions showed infrastructure-related changes with negligible production impact.

Additional Findings

Core inference paths are completely unaffected: libllama.so, GGML libraries, and all GPU backends show 0% change. Matrix operations, attention mechanisms, and quantization kernels remain unchanged. The 6% power consumption reduction despite localized function regressions indicates successful elimination of architectural inefficiencies through code consolidation (net -1,797 functions). Changes are entirely CPU-side utilities for chat template processing, with zero impact on GPU/ML operations or inference hot paths. Debug mode appears enabled during profiling; production performance should be significantly better.

🔎 Full breakdown: Loci Inspector.
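The arena-dump fix described in these reports is plain visited-set cycle detection. A minimal sketch of the idea in Python (the PR presumably does this in C++ with std::unordered_set; the Node type and names here are invented for illustration):

```python
# Sketch of a cycle-safe recursive dump, analogous to the arena.dump()
# fix described above. Node/field names are illustrative, not the PR's.

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

def dump(node, visited=None, depth=0):
    """Return the dump lines, marking already-seen nodes so that
    recursive grammars (a node reachable from itself) cannot loop forever."""
    if visited is None:
        visited = set()
    if id(node) in visited:
        return ["  " * depth + node.name + " (cycle)"]
    visited.add(id(node))
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(dump(child, visited, depth + 1))
    return lines

# A recursive grammar: expr -> term -> expr -> ...
expr, term = Node("expr"), Node("term")
expr.children.append(term)
term.children.append(expr)  # back-edge that would recurse forever without the guard
print("\n".join(dump(expr)))
```

The guard turns unbounded recursion into a single "(cycle)" marker line, which matches the reported cost profile: a constant-factor slowdown per node in exchange for never overflowing the stack.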
Force-pushed from ef7afbe to d4c3480
Overview

Analysis of 123,517 functions across 39 commits reveals concentrated performance regressions in non-critical code paths, offset by 6.15-6.19% power consumption reductions in primary binaries. Modified: 462 functions (0.37%), New: 7,754, Removed: 9,365, Unchanged: 105,936.

Power Consumption Changes:

Function Analysis

Jinja Template Filters (llama-tts, llama-cvector-generator): Response time +2,578-2,586% (1,467→39,315ns), throughput +114-116% (211→452ns). Indent filter changed from non-functional stub to complete implementation with argument validation, type checking, and heap allocation. Slice filter unchanged but affected by a systemic shift from move to copy semantics in argument passing (runtime.cpp:840), impacting all Jinja builtins.

PEG Parser Operators (llama-tts, llama-cvector-generator): Choice parser response time +625-636% (1,499→10,867ns), throughput +270% (215→796ns). Sequence parser response time +427-429% (6,758→35,643ns), throughput +213% (408→1,277ns). Added comprehensive debug instrumentation (5+ fprintf calls, debug_indent, debug_input_snippet, arena.dump) and enhanced partial parsing support. Debug overhead is conditional (ctx.debug flag), zero-cost in production.

PEG Arena Dump (llama-tts, llama-cvector-generator): Response time +1,197-1,244% (530→6,960ns), throughput +27.5% (68.6→87.5ns). Refactored to add cycle detection using std::unordered_set, preventing infinite loops in recursive grammar structures. Debugging utility, not production inference path.

Trie Structure Functions (llama-tts, llama-cvector-generator): Destructor response time +1,056-1,061% (34.8→404ns), throughput +0.28% (20.6→20.7ns). Move constructor response time +600-607% (38.4→271ns), throughput +0.85% (23.6→23.8ns). Changed from std::string to std::vector<uint32_t> for proper UTF-8 codepoint handling, enabling Unicode support. Loss of Small String Optimization increases destructor overhead but ensures correctness for international characters.

STL Functions (llama-tts): Hashtable begin +310% throughput (60→247ns), vector cbegin +289% throughput (62→243ns), tree iterator _M_const_cast +284% throughput (64→245ns), vector _S_max_size +170% throughput (123→333ns). Compiler optimization differences in standard library template instantiations. Used in initialization (CPU topology detection, progress tracking), not inference loops.

Additional Findings

Changes are orthogonal to GPU/ML operations: no modifications to CUDA/Metal kernels, attention computation, KV cache management, or matrix operations. All regressions occur in CPU-side preprocessing (template processing, grammar parsing) and debugging utilities. The 6% power reduction indicates net efficiency gains from code removal (9,365 functions) and optimizations elsewhere, offsetting localized regressions in non-critical paths.

🔎 Full breakdown: Loci Inspector.
Force-pushed from f998d1f to 30ef9d0
Force-pushed from e384c6f to 2ebbae5
Force-pushed from c824910 to 2f4d02d
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Force-pushed from 7028a96 to a2699e2
No summary available at this time. Visit Loci Inspector to review detailed analysis.
Force-pushed from 9f4f332 to 4298c74
Force-pushed from 56aaa36 to 21147c2
Force-pushed from 6fa8e23 to f2637dc
Note
Source pull request: ggml-org/llama.cpp#18675
This is a huge endeavor that I promised back when I applied to maintain the parser code. The legacy parser code was hard to maintain and buggy, and supporting new models with it was really annoying. There was a worthwhile contribution by @hksdpc255 to add some XML tool-calling abstractions, but that was still just a patch on an open wound.
Thanks to @aldehir and his PEG parser, I managed to create an autoparser mechanism, using all the currently supported templates, their parsers, and their test cases as a base. The idea is simple: most models' syntax follows the general pattern of:
<reasoning_markers> <reasoning_content> <end_of_reasoning_markers>
<content_markers> <main_content> <end_of_content_markers>
<tool_call_markers>
  ( <json>
  | <function marker> <args json>
  | <function marker> <args marker> <value json> )
<end_of_tool_call_marker>

Of course, some elements might not be present in a given template, but that's the general structure. Since this is a pretty finite structure, it's possible to determine the relevant elements by differential analysis - similar to how Minja already does capability detection, but more fine-grained, because by comparing various template outputs, we get to actually extract the relevant markers.
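As a toy illustration of that differential idea (entirely my own sketch - the marker strings, function name, and renders below are invented, not the PR's code): render the same conversation with and without a tool call, and the span the template added must contain the tool-call markers.

```python
# Illustrative sketch, NOT the PR's implementation: find what a template
# wraps around a tool call by diffing two renders of the same conversation.

def inserted_span(without_call: str, with_call: str) -> str:
    """Return the substring the second render added, located via the
    longest common prefix and suffix of the two renders."""
    limit = min(len(without_call), len(with_call))
    i = 0
    while i < limit and without_call[i] == with_call[i]:
        i += 1
    j = 0
    while j < limit - i and without_call[-1 - j] == with_call[-1 - j]:
        j += 1
    return with_call[i:len(with_call) - j]

# Two synthetic renders; the marker tokens are made up for the example.
a = "<|assistant|>Let me check.<|end|>"
b = '<|assistant|>Let me check.<|tool_call|>{"name": "get_weather"}<|/tool_call|><|end|>'
print(inserted_span(a, b))
```

Note that a naive prefix/suffix diff can fold characters shared between markers (such as a leading `<|`) into the common prefix, so a real implementation has to compare multiple renders to delimit the markers precisely.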
Some models will obviously not get handled so easily. However, in the course of implementing the mechanism, only two models remained that needed their own separate parsers: Ministral and GPT-OSS - and the former not because of its complexity, but because of the need to rewrite the message structure passed to the template. GPT-OSS is a different beast, since it supports arbitrarily many interleaved blocks, so it doesn't fit into the scheme I mentioned above (but its parser has been rewritten to PEG as well).
This is currently anchored on Minja and uses its capability detection, but since the differential analysis already does its own capability detection, I fully expect to throw that part out and base this on @ngxson 's ggml-org/llama.cpp#18462 instead.
Obsoletes ggml-org/llama.cpp#18353 (sorry @ochafik - I know you put a lot of work into that).
Old parsers, tests, and all supporting code have been thrown out; templates got new PEG-parser-based test cases, all of which now also test streaming behavior. I have tested this extensively on agentic coding (mostly with OpenCode) to ensure that it actually works. My wish to refactor the parser code was mostly caused by my prior experience with agentic coding on llama.cpp, which was extremely buggy with a lot of models; this is an attempt to remedy that. Hopefully, having one unified codebase with a largely reduced line-of-code count will make it easier to fix any potential errors.
This also means that there is no longer a need to provide support for new models' specific templates unless they have some odd constructs - they should be supported out of the box. There's a new tool called debug-template-parser that you can point at any Jinja template file, or at a GGUF model with an embedded Jinja template, and have it spit out the details of the generated autoparser + tool-calling grammar.

Oh, important note: all Minja polyfills have been disabled. Working templates are now required. While I see why, a year and a half ago, having proof-of-concept code that supported tool calling on models without native tool calling might've been useful, right now supporting that makes it harder to properly support current and actually-used models. Therefore, a functional template with tool calling is required if someone wants tool calling.
I want to ask everyone in the community who can to test this. I will keep this branch current with master. I tried to test this as much as I could, but I'm just one person doing this after work, so obviously my testing abilities were limited. I will keep this as a draft until I've gathered enough feedback and testing data.
To not clutter the main repository's issue tracker, please report bugs either (a) in this thread or (b) in my issue tracker https://github.com/pwilkin/llama.cpp/issues
AI DISCLOSURE: Gemini Pro 3, Flash 3, Opus 4.5 and GLM 4.7 would like to admit that a human element did at some points interfere in the coding process, being so bold as to even throw most of the code out at one point and demand it be rewritten from scratch. The human also tinkered with the code massively, removing a lot of our beautiful comments and some code fragments that they claimed were useless. They had no problem, however, in using us to do all the annoying marker arithmetic. Therefore, we disavow any claim to this code and cede all responsibility to the human.