UPSTREAM PR #18675: Autoparser - complete refactoring of parser architecture#1141
Conversation
The analysis encountered an error. Please review the Processing Details for more information.
Overview

Analysis of 123,002 functions across 14 binaries reveals localized performance changes from the "GIANT AUTOPARSER SQUISH" refactoring. Modified: 500 functions (0.4%), new: 7,530, removed: 9,276, unchanged: 105,696. Power consumption improved 6.5% in llama-cvector-generator (355,537→332,380 nJ) and llama-tts (361,083→337,746 nJ). All other binaries (libllama.so, libmtmd.so, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-quantize, llama-qwen2vl-cli, llama-tokenize, llama-gemma3-cli, llama-bench, libggml-base.so, libggml-cpu.so, libggml.so) showed zero change, confirming inference hot paths remain unaffected.

Function Analysis

Jinja2 Template Builtins: The indent lambda (value.cpp) transitioned from non-functional stub to production implementation, showing response time increases of 1,466→39,380ns (+2,586%) in llama-tts and 1,470→39,365ns (+2,578%) in llama-cvector-generator. Throughput time increased ~117% (211→459ns). The test_is_in lambda refactored to use value_compare() instead of operator==, resulting in response time increases of 1,470→29,657ns (+1,918%) in llama-tts and 1,473→29,629ns (+1,911%) in llama-cvector-generator, with throughput increases of ~170% (215→583ns).

PEG Parser Operators: Choice and sequence parsers gained debug instrumentation (guarded by ctx.debug flag, default: false). Choice parser response time increased 1,498→10,861ns (+625%) with throughput 215→796ns (+270%). Sequence parser response time increased 6,771→35,651ns (+427%) with throughput 408→1,276ns (+213%). Production overhead is minimal when debug is disabled.

PEG Arena Dump: Added cycle detection to prevent stack overflow on recursive grammars. Response time increased 529→6,955ns (+1,214%) in llama-cvector-generator and 537→6,960ns (+1,196%) in llama-tts. Throughput increased ~27% (69→87ns). This is a debugging utility, not used in production inference.

Trie Destructor: Data structure changed from std::string to std::vector<uint32_t> for Unicode support. Response time increased 35→403ns (+1,058%) in llama-tts and 35→403ns (+1,053%) in llama-cvector-generator due to loss of small string optimization. Called during grammar cleanup, not inference.

Chat Grammar Builder: Refactored from manual string construction to PEG parser delegation. Response time increased 11,773→122,524ns (+941%) in llama-cvector-generator and 11,881→122,628ns (+932%) in llama-tts. Throughput improved 87% (815→103ns), indicating efficient lambda code but expensive child calls. One-time initialization cost.

Other analyzed functions (log verbosity setter, JSON iterator, vector allocator) showed regressions from binary layout or template instantiation changes, all in non-critical paths.

Additional Findings

Zero impact on GPU/ML operations: no changes to CUDA/Metal/HIP backends, matrix multiplication (GEMM), attention mechanisms, KV cache, quantization kernels, or SIMD optimizations. Core inference libraries (libllama.so, libggml-*.so) unchanged. All regressions are isolated to template processing, grammar compilation, and debugging utilities, outside the 70-90% of inference time spent in matrix operations. The 6.5% power consumption reduction despite individual function regressions indicates successful code consolidation that eliminated inefficiencies in frequently-executed paths. Changes provide correctness improvements (cycle detection, Unicode support, type safety) and maintainability benefits (50% code reduction in grammar builders) that justify initialization-time performance costs.

🔎 Full breakdown: Loci Inspector.
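The trie change called out above (std::string keys replaced by std::vector&lt;uint32_t&gt;) is essentially a switch from byte units to codepoint units. A minimal Python illustration of the difference (my own sketch, not code from the PR):

```python
# Illustration (not code from the PR): why keying a trie on Unicode
# codepoints (the std::vector<uint32_t> change) differs from keying on
# the raw bytes of a std::string.
word = "naïve"

byte_key = list(word.encode("utf-8"))   # 'ï' is split into two UTF-8 bytes
codepoint_key = [ord(c) for c in word]  # 'ï' stays a single 32-bit unit

print(byte_key)       # [110, 97, 195, 175, 118, 101] - six byte-sized units
print(codepoint_key)  # [110, 97, 239, 118, 101] - five codepoint units
```

With byte keys, a trie edge can land in the middle of a multibyte character; codepoint keys keep each character atomic, at the cost of losing small-string optimization in the C++ container.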
Force-pushed from 048ad94 to 6c1fde6
Force-pushed from 876531a to 3096eca
Overview

Analysis of 29 commits implementing a PEG parser refactoring across 123,187 functions (506 modified, 7,414 new, 9,211 removed). Two binaries show measurable changes: build.bin.llama-tts (-6.23% power consumption) and build.bin.llama-cvector-generator (-6.18% power consumption). Thirteen binaries remain unchanged: build.bin.libllama.so, build.bin.libmtmd.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, and build.bin.llama-bench (all 0.00% change).

Function Analysis

Jinja Template Operators (both binaries): Response time increased from 1.5µs to 39.4µs (+2,586%). The base version was a non-functional stub throwing exceptions; the target version implements complete Jinja2 indent/slice filters with argument validation, type checking, and string processing. This is feature completion, not regression.

PEG Parser Operators (both binaries): Choice parser response time increased from 1.5µs to 10.9µs (+625%); sequence parser from 6.8µs to 35.6µs (+427%). Changes added comprehensive debug logging (6 fprintf calls, arena.dump() tree traversals, string generation) and streaming support for partial input handling. Debug overhead dominates measurements; production builds with debug=false should see minimal impact.

Chat Initialization Functions (both binaries): GPT-OSS initialization increased from 11.8µs to 122.8µs (+938%); Functionary v3.2 from 550µs to 970µs (+76%). Replaced manual grammar construction with the PEG parser framework, introducing parser object creation overhead but improving code maintainability and enabling streaming support. One-time initialization cost, not per-token.

PEG Arena Dump (both binaries): Response time increased from 537ns to 6,957ns (+1,196%). Added cycle detection with std::unordered_set to prevent infinite recursion crashes. This is a debugging utility, not production code.

Trie Destructors (both binaries): Response time increased from 35ns to 403ns (+1,058%). Structure changed from std::string to std::vector<uint32_t> for proper Unicode codepoint handling, increasing memory deallocation overhead but enabling multilingual support.

Other analyzed functions showed infrastructure-related changes with negligible production impact.

Additional Findings

Core inference paths are completely unaffected: libllama.so, GGML libraries, and all GPU backends show 0% change. Matrix operations, attention mechanisms, and quantization kernels remain unchanged. The 6% power consumption reduction despite localized function regressions indicates successful elimination of architectural inefficiencies through code consolidation (net -1,797 functions). Changes are entirely CPU-side utilities for chat template processing, with zero impact on GPU/ML operations or inference hot paths. Debug mode appears enabled during profiling; production performance should be significantly better.

🔎 Full breakdown: Loci Inspector.
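The arena-dump fix described in these reports is plain visited-set cycle detection. A minimal sketch of the idea in Python (the PR presumably does this in C++ with std::unordered_set; the Node type and names here are invented for illustration):

```python
# Sketch of a cycle-safe recursive dump, analogous to the arena.dump()
# fix described above. Node/field names are illustrative, not the PR's.

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

def dump(node, visited=None, depth=0):
    """Return the dump lines, marking already-seen nodes so that
    recursive grammars (a node reachable from itself) cannot loop forever."""
    if visited is None:
        visited = set()
    if id(node) in visited:
        return ["  " * depth + node.name + " (cycle)"]
    visited.add(id(node))
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(dump(child, visited, depth + 1))
    return lines

# A recursive grammar: expr -> term -> expr -> ...
expr, term = Node("expr"), Node("term")
expr.children.append(term)
term.children.append(expr)  # back-edge that would recurse forever without the guard
print("\n".join(dump(expr)))
```

The guard turns unbounded recursion into a single "(cycle)" marker line, which matches the reported cost profile: a constant-factor slowdown per node in exchange for never overflowing the stack.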
Force-pushed from ef7afbe to d4c3480
Overview

Analysis of 123,517 functions across 39 commits reveals concentrated performance regressions in non-critical code paths, offset by 6.15-6.19% power consumption reductions in primary binaries. Modified: 462 functions (0.37%), New: 7,754, Removed: 9,365, Unchanged: 105,936.

Power Consumption Changes:

Function Analysis

Jinja Template Filters (llama-tts, llama-cvector-generator): Response time +2,578-2,586% (1,467→39,315ns), throughput +114-116% (211→452ns). Indent filter changed from non-functional stub to complete implementation with argument validation, type checking, and heap allocation. Slice filter unchanged but affected by a systemic shift from move to copy semantics in argument passing (runtime.cpp:840), impacting all Jinja builtins.

PEG Parser Operators (llama-tts, llama-cvector-generator): Choice parser response time +625-636% (1,499→10,867ns), throughput +270% (215→796ns). Sequence parser response time +427-429% (6,758→35,643ns), throughput +213% (408→1,277ns). Added comprehensive debug instrumentation (5+ fprintf calls, debug_indent, debug_input_snippet, arena.dump) and enhanced partial parsing support. Debug overhead is conditional (ctx.debug flag), zero-cost in production.

PEG Arena Dump (llama-tts, llama-cvector-generator): Response time +1,197-1,244% (530→6,960ns), throughput +27.5% (68.6→87.5ns). Refactored to add cycle detection using std::unordered_set, preventing infinite loops in recursive grammar structures. Debugging utility, not production inference path.

Trie Structure Functions (llama-tts, llama-cvector-generator): Destructor response time +1,056-1,061% (34.8→404ns), throughput +0.28% (20.6→20.7ns). Move constructor response time +600-607% (38.4→271ns), throughput +0.85% (23.6→23.8ns). Changed from std::string to std::vector<uint32_t> for proper UTF-8 codepoint handling, enabling Unicode support. Loss of Small String Optimization increases destructor overhead but ensures correctness for international characters.

STL Functions (llama-tts): Hashtable begin +310% throughput (60→247ns), vector cbegin +289% throughput (62→243ns), tree iterator _M_const_cast +284% throughput (64→245ns), vector _S_max_size +170% throughput (123→333ns). Compiler optimization differences in standard library template instantiations. Used in initialization (CPU topology detection, progress tracking), not inference loops.

Additional Findings

Changes are orthogonal to GPU/ML operations: no modifications to CUDA/Metal kernels, attention computation, KV cache management, or matrix operations. All regressions occur in CPU-side preprocessing (template processing, grammar parsing) and debugging utilities. The 6% power reduction indicates net efficiency gains from code removal (9,365 functions) and optimizations elsewhere, offsetting localized regressions in non-critical paths.

🔎 Full breakdown: Loci Inspector.
Force-pushed from f998d1f to 30ef9d0
Force-pushed from e384c6f to 2ebbae5
Force-pushed from c824910 to 2f4d02d
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Force-pushed from 7028a96 to a2699e2
No summary available at this time. Visit Loci Inspector to review detailed analysis.
Force-pushed from 9f4f332 to 4298c74
Force-pushed from 56aaa36 to 21147c2
Force-pushed from 6fa8e23 to f2637dc
Note
Source pull request: ggml-org/llama.cpp#18675
This is a huge endeavor that I promised back when I applied to maintain the parser code. The legacy parser code was hard to maintain and buggy, and supporting new models with it was really annoying. There was a worthwhile contribution by @hksdpc255 to add some XML tool-calling abstractions, but that was still just a patch on an open wound.
Thanks to @aldehir and his PEG parser, I managed to create an autoparser mechanism, using all the currently supported templates, their parsers, and their test cases as a base. The idea is simple: most models' syntax follows the general pattern of:
<reasoning_markers> <reasoning_content> <end_of_reasoning_markers>
<content_markers> <main_content> <end_of_content_markers>
<tool_call_markers>
  ( <json>
  | <function marker> <args json>
  | <function marker> <args marker> <value json> )
<end_of_tool_call_marker>

Of course, some elements might not be present in a given template, but that's the general structure. Since this is a pretty finite structure, it's possible to determine the relevant elements by differential analysis - similar to how Minja already does capability detection, but more fine-grained, because by comparing various template outputs, we get to actually extract the relevant markers.
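As a toy illustration of that differential idea (entirely my own sketch - the marker strings, function name, and renders below are invented, not the PR's code): render the same conversation with and without a tool call, and the span the template added must contain the tool-call markers.

```python
# Illustrative sketch, NOT the PR's implementation: find what a template
# wraps around a tool call by diffing two renders of the same conversation.

def inserted_span(without_call: str, with_call: str) -> str:
    """Return the substring the second render added, located via the
    longest common prefix and suffix of the two renders."""
    limit = min(len(without_call), len(with_call))
    i = 0
    while i < limit and without_call[i] == with_call[i]:
        i += 1
    j = 0
    while j < limit - i and without_call[-1 - j] == with_call[-1 - j]:
        j += 1
    return with_call[i:len(with_call) - j]

# Two synthetic renders; the marker tokens are made up for the example.
a = "<|assistant|>Let me check.<|end|>"
b = '<|assistant|>Let me check.<|tool_call|>{"name": "get_weather"}<|/tool_call|><|end|>'
print(inserted_span(a, b))
```

Note that a naive prefix/suffix diff can fold characters shared between markers (such as a leading `<|`) into the common prefix, so a real implementation has to compare multiple renders to delimit the markers precisely.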
Some models will obviously not get handled so easily. However, in the course of implementing the mechanism, only two models remained that needed their own separate parsers: Ministral and GPT-OSS - and the former not because of its complexity, but because of the need to rewrite the message structure passed to the template. GPT-OSS is a different beast, since it supports arbitrarily many interleaved blocks, so it doesn't fit into the scheme I mentioned above (but its parser has been rewritten to PEG as well).
This is currently anchored on Minja and uses its capability detection, but since the differential analysis already does its own capability detection, I fully expect to throw that part out and base this on @ngxson 's ggml-org/llama.cpp#18462 instead.
Obsoletes ggml-org/llama.cpp#18353 (sorry @ochafik - I know you put a lot of work into that).
Old parsers, tests, and all supporting code have been thrown out; templates got new PEG-parser-based test cases, all of which now also test streaming behavior. I have tested this extensively on agentic coding (mostly with OpenCode) to ensure that it actually works. My wish to refactor the parser code was mostly caused by my prior experience with agentic coding on llama.cpp, which was extremely buggy with a lot of models; this is an attempt to remedy that. Hopefully, having one unified codebase with a largely reduced line-of-code count will make it easier to fix any potential errors.
This also means that there is no longer a need to provide support for new models' specific templates unless they have some odd constructs - they should be supported out of the box. There's a new tool called debug-template-parser that you can point at any Jinja template file, or at a GGUF model with an embedded Jinja template, and have it spit out the details of the generated autoparser + tool-calling grammar.

Oh, important note: all Minja polyfills have been disabled. Working templates are now required. While I see why, a year and a half ago, having proof-of-concept code that supported tool calling on models without native tool calling might've been useful, right now supporting that makes it harder to properly support current and actually-used models. Therefore, a functional template with tool calling is required if someone wants tool calling.
I want to ask everyone in the community who can to test this. I will keep this branch current with master. I tried to test this as much as I could, but I'm just one person doing this after work, so obviously my testing abilities were limited. I will keep this as a draft until I've gathered enough feedback and testing data.
To not clutter the main repository's issue tracker, please report bugs either (a) in this thread or (b) in my issue tracker https://github.com/pwilkin/llama.cpp/issues
AI DISCLOSURE: Gemini Pro 3, Flash 3, Opus 4.5 and GLM 4.7 would like to admit that a human element did at some points interfere in the coding process, being so bold as to even throw most of the code out at one point and demand it be rewritten from scratch. The human also tinkered with the code massively, removing a lot of our beautiful comments and some code fragments that they claimed were useless. They had no problem, however, in using us to do all the annoying marker arithmetic. Therefore, we disavow any claim to this code and cede all responsibility to the human.