
UPSTREAM PR #17136: common : introduce composable PEG parser combinators for chat parsing#359

Open
loci-dev wants to merge 219 commits into main from upstream-PR17136-branch_aldehir-parser-combinators

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17136

Supporting new models requires implementing several features:

  • Lazy grammar for tool calling (tool_choice = auto)
  • Full grammar for forced tool calls and response_format (reasoning models)
  • Parallel tool calls support
  • Parsing of reasoning and tool call outputs

For reasoning models, the grammar must include reasoning or performance degrades significantly.

The real challenge is that each model uses a different output format:

  • Harmony response output (gpt-oss)
  • XML with typed parameters (Qwen3-Coder, MiniMax M2)
    • These models expect string arguments as raw content rather than JSON, which requires type awareness at parse time.
  • Pseudo-function call (LFM2 e.g. [get_weather(location="..."), ...])

Currently, the grammar and parsing exist as separate functions, which works but feels a bit fragile. I believe we can unify the two by using parser combinators to compose a PEG parser. That way the grammar definition becomes the parser.

Proposed Solution

This PR introduces a generic PEG (Parsing Expression Grammar) parser to the common library, along with chat-specific extensions and a complete reference implementation for Qwen3-Coder.

I've noticed there's often a lag between when a model is supported by llama.cpp and when proper tool calling is fully implemented. This parser aims to close that gap by letting you define the grammar and parser at the same time, making it easier to add full tool calling support for new models.

Parsing Expression Grammars (PEG)

PEG parsers are straightforward to implement as recursive descent parsers. While recursive descent parsers are known for backtracking, the majority of model output can be parsed with minimal backtracking, making them practical for this use case.

Parser combinators allow us to compose complex parsers from simple, reusable building blocks. This creates a DSL that closely mimics the grammar itself.
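To make the combinator idea concrete, here is a minimal, self-contained sketch (not the PR's API — the function names and representation are invented for illustration): a parser is a function from an input and a position to the position after a successful match, and `lit`, `seq`, and `choice` compose such functions.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// A parser maps (input, position) to the position after a successful
// match, or nullopt on failure.
using parser = std::function<std::optional<size_t>(std::string_view, size_t)>;

// Match an exact string at the current position.
parser lit(std::string s) {
    return [s](std::string_view in, size_t pos) -> std::optional<size_t> {
        if (pos + s.size() <= in.size() && in.compare(pos, s.size(), s) == 0) {
            return pos + s.size();
        }
        return std::nullopt;
    };
}

// Match a sequence of parsers, threading the position through each one.
parser seq(std::vector<parser> ps) {
    return [ps](std::string_view in, size_t pos) -> std::optional<size_t> {
        for (const auto & p : ps) {
            auto next = p(in, pos);
            if (!next) return std::nullopt;
            pos = *next;
        }
        return pos;
    };
}

// Ordered choice: try alternatives in order, like PEG's '/' operator.
parser choice(std::vector<parser> ps) {
    return [ps](std::string_view in, size_t pos) -> std::optional<size_t> {
        for (const auto & p : ps) {
            if (auto next = p(in, pos)) return next;
        }
        return std::nullopt;
    };
}
```

A grammar fragment like `"<tool_call>" ("[" / "{")` then composes as `seq({lit("<tool_call>"), choice({lit("["), lit("{")})})`, which is exactly the sense in which the grammar definition becomes the parser.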

Rather than defining both a grammar and parsing function, we can build a PEG parser that generates a compatible GBNF grammar (with exceptions) and parses model output.

Features

  • Partial parsing for streaming input
  • Built-in JSON parsers for common patterns
  • Generation of compatible GBNF grammars
  • AST generation with semantic tags for structured extraction
  • Three common AST shapes covering most model formats:
    • simple - Content with optional reasoning
    • native - Tool arguments as JSON objects
    • constructed - Tool arguments as separate entities (XML or pseudo-functions)

Examples

Parser for models that emit tool arguments as JSON
auto parser = build_chat_peg_native_parser([&](common_chat_peg_native_builder & p) {
    // Build choice of available tools
    auto tool_choice = p.choice();
    for (const auto & tool : tools) {
        const auto & function = tool.at("function");
        std::string name = function.at("name");
        const auto & schema = function.at("parameters");

        auto tool_name = p.json_member("name", "\"" + p.literal(name) + "\"");
        auto tool_args = p.json_member("arguments", p.schema(p.json(), "tool-" + name + "-schema", schema));

        tool_choice |= p.rule("tool-" + name, "{" << tool_name << "," << tool_args << "}");
    }

    // Define tool call structure
    auto tool_call = p.trigger_rule("tool-call",
        p.sequence({
            p.literal("<tool_call>["),
            tool_choice,
            p.literal("]</tool_call>")
        })
    );

    return p.sequence({
        p.content(p.until("<tool_call>")),
        p.optional(tool_call),
        p.end()
    });
});
Parser for models that emit XML tags for each argument
auto parser = build_chat_peg_constructed_parser([&](common_chat_peg_constructed_builder & p) {
    auto location_arg = p.tool_arg(
        p.tool_arg_open("<parameter name=\"" + p.tool_arg_name(p.literal("location")) + "\">"),
        p.tool_arg_string_value(p.until("</parameter>")),
        p.tool_arg_close(p.literal("</parameter>"))
    );

    auto get_weather_tool = p.tool(p.sequence({
        p.tool_open("<function name=\"" + p.tool_name(p.literal("get_weather")) + "\">"),
        location_arg,
        p.tool_close(p.literal("</function>"))
    }));

    return p.sequence({
        p.content(p.until("<tool_call>")),
        p.literal("<tool_call>"),
        get_weather_tool,
        p.literal("</tool_call>"),
        p.end()
    });
});
Grammar generation
data.grammar = build_grammar([&](const common_grammar_builder & builder) {
    foreach_function(params.tools, [&](const json & fn) {
        builder.resolve_refs(fn.at("parameters"));
    });
    parser.build_grammar(builder, data.grammar_lazy);
});

Implementation Details

The PEG parsers are implemented using std::variant rather than traditional inheritance. This reduces boilerplate and leverages std::visit for type-safe dispatch. I initially had an OOP implementation, but it became quite cumbersome, and this seems like the lesser evil of the two.

using common_peg_parser_variant = std::variant<
    common_peg_epsilon_parser,
    common_peg_start_parser,
    common_peg_end_parser,
    common_peg_literal_parser,
    common_peg_sequence_parser,
    common_peg_choice_parser,
    common_peg_repetition_parser,
    common_peg_and_parser,
    common_peg_not_parser,
    common_peg_any_parser,
    common_peg_space_parser,
    common_peg_chars_parser,
    common_peg_json_string_parser,
    common_peg_until_parser,
    common_peg_schema_parser,
    common_peg_rule_parser,
    common_peg_ref_parser,
    common_peg_atomic_parser,
    common_peg_tag_parser
>;
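As a simplified illustration of this pattern (the real variant above has ~19 alternatives; the struct and function names here are invented), each parser kind is a plain struct and dispatch happens through std::visit rather than virtual functions:

```cpp
#include <string_view>
#include <type_traits>
#include <variant>

// Each parser kind is a plain struct with no vtable.
struct lit_parser { std::string_view text; };
struct any_parser { };  // matches exactly one character

using parser_variant = std::variant<lit_parser, any_parser>;

// Returns the number of characters consumed, or -1 on failure.
int parse(const parser_variant & v, std::string_view in) {
    return std::visit([&](const auto & p) -> int {
        using T = std::decay_t<decltype(p)>;
        if constexpr (std::is_same_v<T, lit_parser>) {
            return in.substr(0, p.text.size()) == p.text ? (int) p.text.size() : -1;
        } else {
            return in.empty() ? -1 : 1;
        }
    }, v);
}
```

Because the variant closes the set of parser kinds, adding a new one is a compile error until every visitor handles it, which is harder to guarantee with an open class hierarchy.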

Both parsers and AST nodes are allocated in arena structures to minimize memory allocations.

class common_peg_arena {
    std::vector<common_peg_parser_variant> parsers_;
    std::unordered_map<std::string, common_peg_parser_id> rules_;
    common_peg_parser_id root_ = COMMON_PEG_INVALID_PARSER_ID;
    ...

class common_peg_ast_arena {
    std::vector<common_peg_ast_node> nodes_;
    ...
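The arena idea can be sketched as follows (a hypothetical reduction, not the PR's actual classes): nodes live contiguously in a vector and refer to each other by index, so building a tree performs no per-node heap allocation beyond the vector's growth.

```cpp
#include <cstddef>
#include <vector>

using node_id = size_t;

struct ast_node {
    int                  tag;       // semantic tag used for extraction
    std::vector<node_id> children;  // indices into the same arena
};

class ast_arena {
    std::vector<ast_node> nodes_;
public:
    // Append a node and hand back its stable index.
    node_id add(int tag, std::vector<node_id> children = {}) {
        nodes_.push_back({tag, std::move(children)});
        return nodes_.size() - 1;
    }
    const ast_node & get(node_id id) const { return nodes_[id]; }
    size_t size() const { return nodes_.size(); }
};
```

Index-based references also make the structure trivially copyable and serializable, since there are no pointers to fix up.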

Each parser variant is wrapped in a common_peg_parser value type to produce a DSL for composing parser combinators.
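The examples above use operators like `|` and `<<` on parsers; one way such a DSL can work (a hypothetical sketch, not the PR's implementation) is a small value type wrapping a shared node, with overloaded operators that build sequence and choice nodes:

```cpp
#include <memory>
#include <string>
#include <vector>

struct node;

// Cheap-to-copy value type the DSL operates on.
struct expr {
    std::shared_ptr<node> n;
};

struct node {
    enum kind { LIT, SEQ, CHOICE } k;
    std::string text;            // payload for LIT
    std::vector<expr> children;  // payload for SEQ / CHOICE
};

expr lit(std::string s) {
    return { std::make_shared<node>(node{node::LIT, std::move(s), {}}) };
}

// `a << b` builds a sequence node.
expr operator<<(expr a, expr b) {
    return { std::make_shared<node>(node{node::SEQ, "", {std::move(a), std::move(b)}}) };
}

// `a | b` builds an ordered-choice node.
expr operator|(expr a, expr b) {
    return { std::make_shared<node>(node{node::CHOICE, "", {std::move(a), std::move(b)}}) };
}
```

With this shape, `lit("<tool_call>") << (lit("[") | lit("{"))` reads almost exactly like the PEG rule it denotes, while still just constructing a tree of parser nodes.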

A parser can return one of three results: FAIL, SUCCESS, or NEED_MORE_INPUT. This is how partial parsing is implemented. Unlike common/chat-parser.cpp, it does not raise an exception on a partial parse, because a partial parse is still valid during streaming.
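The three-state result can be illustrated with a literal matcher (a sketch of the assumed semantics, not the PR's code): a definite mismatch is distinguished from input that simply ended mid-match, which is what makes streaming work.

```cpp
#include <cstddef>
#include <string_view>

enum class peg_result { FAIL, SUCCESS, NEED_MORE_INPUT };

peg_result match_literal(std::string_view input, std::string_view lit) {
    size_t n = input.size() < lit.size() ? input.size() : lit.size();
    if (input.substr(0, n) != lit.substr(0, n)) {
        return peg_result::FAIL;             // definite mismatch
    }
    if (input.size() < lit.size()) {
        return peg_result::NEED_MORE_INPUT;  // prefix matches; more may stream in
    }
    return peg_result::SUCCESS;              // full literal consumed
}
```

An exception-based parser would have to treat the NEED_MORE_INPUT case as an error; modeling it as an ordinary result lets the caller keep the partial AST and resume when the next chunk arrives.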

Additional Changes

  • Added common_chat_peg_parse() to common/chat.cpp and chat formats COMMON_CHAT_FORMAT_PEG_(SIMPLE|NATIVE|CONSTRUCTED) to support models parsed by a PEG parser.
    • The parser must be passed from chat param initialization to the parse function. To do this, I currently serialize the parser to JSON and then deserialize it into common_chat_syntax.parser. I'm not a fan of this approach, but it seems the least intrusive way to integrate. I'm happy to implement an alternative mechanism if desired.
  • Added common/unicode.{cpp,h}, derived from src/unicode.{cpp,h}. As I understand it, we should not include headers from src/, so I had to copy the implementation. It deviates by returning a result rather than raising an exception.

More comprehensive documentation is available in docs/development/parsing.md, and the tests in tests/test-chat-peg-parser.cpp are fairly thorough.


I know this is a big PR. I tried to minimize the implementation, while keeping enough to demonstrate value. #15703 shows community desire for something like this, although it doesn't have to be this implementation.

@loci-review

loci-review bot commented Dec 17, 2025

Explore the complete analysis inside the Version Insights

@loci-review

loci-review bot commented Dec 18, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Project: llama.cpp
Versions Compared: 5b4d46ec (target) vs 56a6e1ed (base)


Analysis Scope

This PR introduces minimal code changes: one blank line in chat-parser.cpp and a 6-line validation check in server-common.cpp for the OpenAI-compatible API n parameter. The PR description references a large PEG parser implementation, but only these minor changes are visible in the current diff.

Performance Impact

Modified Function:

  • common_chat_msg_parser::str in chat-parser.cpp shows an execution-time increase of 5-7 ns across three binaries (llama-run: +7 ns, llama-cvector-generator: +6 ns, llama-tts: +5 ns)

Inference Path: No functions in the critical inference path (llama_decode, llama_encode, llama_tokenize) were modified. The chat parser function operates outside the token generation pipeline, handling post-processing of model outputs for tool calling and structured responses.

Tokens Per Second: No impact. The modified function processes chat message parsing after token generation completes, not during the inference loop. Model throughput remains unchanged.

Power Consumption: All binaries show changes within measurement noise (< 0.001%). The 5-7 ns overhead in a non-critical utility function has no measurable energy impact.

Assessment: The visible code changes (whitespace and parameter validation) introduce negligible overhead. The 5-7 ns increase in common_chat_msg_parser::str likely originates from changes not present in the current diff, possibly related to the PEG parser implementation mentioned in the PR description but not yet committed.
