UPSTREAM PR #17816: llama : add token matching support to llama-grammar (#468)

Open
loci-dev wants to merge 5 commits into main from upstream-PR17816-branch_aldehir-grammar-token

loci-dev commented Dec 6, 2025

Mirrored from ggml-org/llama.cpp#17816

Implementation of idea by @ngxson: ggml-org/llama.cpp#17750 (comment)

cc: @pwilkin @aviallon

Problem

The llama-grammar implementation doesn't have a way to accept tokens directly, which creates a few problems:

  • Can't disambiguate between a special token (e.g. <|end|>) and the tokenized form <|, end, |> that may occur in content.
  • Requires awkward "exclusion" rules such as ( [^<] | "<" [^|] | "<|" [^e] | ... | "<|end|" [^>] )* to match chunks of characters that don't accumulate to the desired delimiter (<|end|>).
  • Adds extra work to grammar sampling from recursively applying character rules.
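
To illustrate why the exclusion rules in the second point get awkward, here is a minimal Python sketch that generates one for an arbitrary delimiter. The helper names (`gbnf_literal`, `gbnf_not_char`, `exclusion_rule`) are hypothetical and not part of llama.cpp, and the escaping is deliberately simplified.

```python
def gbnf_literal(text: str) -> str:
    """Quote text as a grammar string literal (escapes backslash and quote only)."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def gbnf_not_char(ch: str) -> str:
    """Build a negated one-character class such as [^<]."""
    if ch in "\\[]":  # simplified escaping; an assumption, not a full escaper
        ch = "\\" + ch
    return f"[^{ch}]"


def exclusion_rule(delimiter: str) -> str:
    """Return a rule body in the same shape as the hand-written pattern.

    For each position in the delimiter, emit the prefix seen so far followed
    by "any character except the one that would extend the prefix".
    """
    alternatives = []
    for i, ch in enumerate(delimiter):
        prefix = delimiter[:i]
        if prefix:
            alternatives.append(f"{gbnf_literal(prefix)} {gbnf_not_char(ch)}")
        else:
            alternatives.append(gbnf_not_char(ch))
    return "( " + " | ".join(alternatives) + " )*"


rule = exclusion_rule("<|end|>")
print(rule)
```

The rule needs one alternative per delimiter character, so a seven-character delimiter such as `<|end|>` already yields seven alternatives, and every one of them is evaluated character by character during sampling.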

Proposed Solution

Borrowing some ideas from llguidance, you can define a token by id with <[id]>, or by its raw token text (<token>) if that text is itself wrapped in angle brackets (< and >). I'm leaving out support for token id ranges/alternates since I don't see an immediate need for them.

You can negate by prefixing the token with !, e.g. !<|end|>.
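
As a rough illustration of the three forms described above (`<[id]>`, `<token>`, and the `!` prefix), the following Python sketch classifies a single token reference. `TokenRef` and `parse_token_ref` are hypothetical names made up for this example; the actual parser in the PR is written in C++ and may accept or reject inputs differently.

```python
import re
from typing import NamedTuple, Union


class TokenRef(NamedTuple):
    negated: bool            # True when the reference was prefixed with "!"
    by_id: bool              # True for <[id]>, False for raw token text
    value: Union[int, str]   # token id, or the token text including < and >


_TOKEN_ID = re.compile(r"<\[(\d+)\]>")
_TOKEN_TEXT = re.compile(r"<[^<>\s]+>")


def parse_token_ref(src: str) -> TokenRef:
    """Parse one token reference such as <[200007]>, <|end|> or !<|end|>."""
    negated = src.startswith("!")
    body = src[1:] if negated else src

    match = _TOKEN_ID.fullmatch(body)
    if match:
        return TokenRef(negated, True, int(match.group(1)))
    if _TOKEN_TEXT.fullmatch(body):
        # The token text keeps its angle brackets, e.g. "<|end|>".
        return TokenRef(negated, False, body)
    raise ValueError(f"not a token reference: {src!r}")


print(parse_token_ref("<[200007]>"))
print(parse_token_ref("!<|end|>"))
```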

Example (gpt-oss)

By token id:

root ::= analysis response
analysis ::= <[200005]> "analysis" <[200008]> (!<[200007]>)* <[200007]>
response ::= <[200006]> "assistant" <[200005]> "final" <[200008]> .*

That's not very readable, but it is useful for tokens whose text is not wrapped in < and >. For tokens that are, you can use the text directly:

root ::= analysis response
analysis ::= <|channel|> "analysis" <|message|> (!<|end|>)* <|end|>
response ::= <|start|> "assistant" <|channel|> "final" <|message|> .*

Use Case: Reasoning Budget Enforcement

Assuming the model's vocab has unique tokens for its thinking tags, adopting a reasoning budget is fairly trivial via grammar:

root ::= analysis response
analysis ::= <|channel|> "analysis" <|message|> reasoning-with-budget
reasoning-with-budget ::= (!<|end|>){0,200} <|end|>
response ::= <|start|> "assistant" <|channel|> "final" <|message|> .*

# optionally, inject pieces to guide the model when it goes over
reasoning-with-budget ::= (!<|end|>){0,200} (<|end|> | "--I need to provide an immediate response" <|end|>)

Notes:

  • It is important that the grammar be unambiguous; otherwise, the model may find a way to continue thinking via other paths in the grammar.
  • gpt-oss may be a poor example since it has reasoning_effort, but the budget approach works pretty well.
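
The effect of `(!<|end|>){0,N} <|end|>` on sampling can be sketched in a few lines of Python. This is a simplified model, not the sampler's actual logic: `allowed` and `constrain` are hypothetical helpers, and a disallowed proposal is simply replaced by `<|end|>`, whereas a real sampler would mask the logits and pick among the tokens that remain valid.

```python
from typing import Iterable, List

END = 200007  # id of <|end|> in the by-id gpt-oss example


def allowed(token_id: int, used: int, budget: int, end_id: int = END) -> bool:
    """<|end|> is always valid; any other token only while budget remains."""
    return token_id == end_id or used < budget


def constrain(proposed: Iterable[int], budget: int, end_id: int = END) -> List[int]:
    """Apply the reasoning budget to a stream of proposed token ids."""
    emitted: List[int] = []
    for tok in proposed:
        if not allowed(tok, len(emitted), budget, end_id):
            tok = end_id  # budget exhausted: only <|end|> is still valid
        emitted.append(tok)
        if tok == end_id:
            break  # reasoning block is closed
    return emitted


# A model that wants five reasoning tokens is cut off after three.
print(constrain([1, 2, 3, 4, 5], budget=3))
```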

To Do

  • Implement token support in llama-grammar
  • Add special consideration for trigger_patterns by temporarily turning all token rules to literals (char rules).
  • Update grammar documentation and provide an example in grammars/

AI Disclosure: I used an LLM at the start to help dissect the code, but its understanding had some holes. I didn't use an LLM to write the code.

loci-review bot commented Dec 6, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #468 Token Matching Support

Overview

PR #468 introduces token matching support to llama-grammar, adding 327 lines across 4 files. The implementation enables grammars to match tokens by ID using <[id]> or !<[id]> syntax, addressing special token disambiguation and enabling reasoning budget enforcement. Performance analysis reveals localized regressions in grammar processing functions with negligible impact on inference throughput.

Key Findings

Grammar Processing Functions:

The most significant changes occur in llama_grammar_parser::parse_sequence and llama_grammar_accept_impl. The parse_sequence function shows a response time increase of 399,049 ns (from 160,387 ns to 559,436 ns), driven by the new parse_token function which adds token parsing logic and potential vocabulary tokenization calls. The throughput time increased modestly by 107 ns (from 2,557 ns to 2,664 ns), indicating most overhead resides in callees rather than the function itself.

The llama_grammar_accept_impl function exhibits a response time increase of 32,392 ns (from 41,335 ns to 73,727 ns). This stems from replacing llama_grammar_accept_str calls with the new llama_grammar_accept_token function, which implements nested loops for character matching and linear search for duplicate stack detection. The throughput time decreased by 21 ns, showing the overhead is primarily in downstream processing.

Structural Changes:

The llama_grammar_candidate structure gained a 4-byte llama_token id field, increasing memory footprint. New grammar element types LLAMA_GRETYPE_TOKEN and LLAMA_GRETYPE_TOKEN_NOT add two switch cases across multiple functions, increasing branching complexity. The grammar parser now requires vocabulary access for token text resolution.
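
The semantics of the two new element types can be summarized with a small Python sketch. The enum and the `element_matches` function are hypothetical stand-ins for illustration; they are not the C++ code from the PR.

```python
from enum import Enum, auto


class ElemType(Enum):
    TOKEN = auto()      # stands in for LLAMA_GRETYPE_TOKEN
    TOKEN_NOT = auto()  # stands in for LLAMA_GRETYPE_TOKEN_NOT


def element_matches(elem_type: ElemType, elem_token: int, candidate: int) -> bool:
    """Return True if a candidate token id satisfies a token grammar element."""
    if elem_type is ElemType.TOKEN:
        return candidate == elem_token   # <[id]> or <token>
    if elem_type is ElemType.TOKEN_NOT:
        return candidate != elem_token   # !<[id]> or !<token>
    raise ValueError(f"not a token element: {elem_type}")


print(element_matches(ElemType.TOKEN, 200007, 200007))
print(element_matches(ElemType.TOKEN_NOT, 200007, 200007))
```

Because the comparison is on token ids, the candidate has to carry its id alongside its text, which is consistent with the `id` field added to the candidate structure.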

Inference Impact:

Core inference functions llama_decode, llama_encode, and llama_tokenize show no changes in response time or throughput. Grammar processing occurs during constrained generation setup and token validation, not in the primary inference loop. The observed grammar function regressions do not affect tokens per second for standard inference workloads. Token matching is invoked only when grammars with token rules are active, leaving unconstrained generation unaffected.

Power Consumption:

Binary-level analysis shows build.bin.libllama.so increased power consumption by 0.319% (618.90 nJ increase from 193,963.96 nJ to 194,582.86 nJ). All other binaries including libggml.so, llama-bench, and llama-run show zero power consumption change. The increase correlates with cumulative throughput time changes in grammar functions.
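
The deltas quoted in the findings can be cross-checked with a short Python sketch. The before/after values are copied from this summary; the `delta` helper is only for illustration.

```python
from typing import Tuple


def delta(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute change, relative change in percent)."""
    change = after - before
    return change, 100.0 * change / before


measurements = {
    "parse_sequence response time (ns)": (160_387, 559_436),
    "parse_sequence throughput time (ns)": (2_557, 2_664),
    "llama_grammar_accept_impl response time (ns)": (41_335, 73_727),
    "libllama.so power consumption (nJ)": (193_963.96, 194_582.86),
}

for name, (before, after) in measurements.items():
    change, percent = delta(before, after)
    print(f"{name}: {change:+,.2f} ({percent:+.3f}%)")
```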

Code Implementation:

The changes implement legitimate functionality for token-level grammar matching. The parse_token function handles two formats: direct ID parsing <[1000]> (fast path) and token text tokenization <|end|> (slow path requiring vocabulary lookup). The llama_grammar_accept_token function provides unified token and character acceptance with branching logic for token rules versus character rules. The implementation maintains backward compatibility with existing character-based grammars.
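
The difference between the fast and slow paths can be sketched in Python as follows. This is an assumption-laden illustration: `resolve_token` is a hypothetical name, and the dictionary `TOY_VOCAB` stands in for the model vocabulary that the real parser consults to resolve token text. The ids are the ones implied by the gpt-oss example in the PR description.

```python
from typing import Dict

# Toy vocabulary (assumption): text -> id pairs inferred from the gpt-oss example.
TOY_VOCAB: Dict[str, int] = {
    "<|channel|>": 200005,
    "<|start|>": 200006,
    "<|end|>": 200007,
    "<|message|>": 200008,
}


def resolve_token(spec: str, vocab: Dict[str, int]) -> int:
    """Resolve a token spec to an id."""
    if spec.startswith("<[") and spec.endswith("]>"):
        # Fast path: the id is written in the grammar, no vocabulary needed.
        return int(spec[2:-2])
    # Slow path: the text must correspond to exactly one vocabulary token.
    try:
        return vocab[spec]
    except KeyError:
        raise ValueError(f"{spec!r} is not a single token in this vocabulary") from None


print(resolve_token("<[1000]>", {}))
print(resolve_token("<|end|>", TOY_VOCAB))
```

The fast path works with an empty vocabulary, which is why grammars written with explicit ids avoid the lookup cost at parse time.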

loci-dev force-pushed the main branch 25 times, most recently from a2add8a to 6d9272a (December 9, 2025 09:10)
loci-dev force-pushed the main branch 25 times, most recently from adf9533 to 7103504 (December 14, 2025 14:07)

loci-review bot commented Dec 17, 2025

Explore the complete analysis inside the Version Insights

loci-review bot commented Dec 18, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

PR #468: Token Matching Support in Grammar

This PR adds a single line to llama_grammar_accept_impl in src/llama-grammar.cpp, calling llama_grammar_accept_token() with the constrained string after regex trigger pattern matching. The change fixes a grammar state consistency issue in lazy evaluation mode.
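
A minimal Python sketch of the lazy-evaluation idea follows, assuming the grammar buffers output until a trigger pattern matches and then consumes the text from the trigger onward. The `LazyGrammar` class and its fields are hypothetical; the real logic lives in llama_grammar_accept_impl and handles details (such as which part of the match counts as constrained) that are omitted here.

```python
import re
from typing import List


class LazyGrammar:
    """Toy model of a grammar that only activates after a trigger pattern."""

    def __init__(self, trigger: str) -> None:
        self.trigger = re.compile(trigger)
        self.awaiting_trigger = True
        self.buffer = ""
        self.accepted: List[str] = []  # text the grammar state has consumed

    def accept(self, piece: str) -> None:
        if not self.awaiting_trigger:
            self.accepted.append(piece)  # grammar already active
            return
        self.buffer += piece
        match = self.trigger.search(self.buffer)
        if match:
            self.awaiting_trigger = False
            constrained = self.buffer[match.start():]
            self.buffer = ""
            # Feed the constrained string so the grammar state reflects
            # everything generated since the trigger started.
            self.accepted.append(constrained)


grammar = LazyGrammar(r"<tool>")
for piece in ["hello ", "<to", "ol>{", '"a"']:
    grammar.accept(piece)
print(grammar.accepted)
```

If the constrained string were not fed to the grammar at the moment of the trigger, later tokens would be validated against a state that had not seen the trigger text, which is the kind of inconsistency the summary describes.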

Performance Impact

The modification affects grammar-constrained generation, not the core inference pipeline. Analysis of the top 10 functions by response time change shows variations in STL container operations and grammar processing functions, but none are in the tokenization or inference hot path.

Core Inference Functions: No changes detected in llama_decode, llama_encode, or llama_tokenize. These functions remain unmodified, indicating zero impact on tokens per second for standard inference workloads.

Grammar Processing: The function llama_grammar_match_token shows a response time increase from 158 ns to 166 ns (+8 ns absolute). This function is only invoked during grammar-constrained generation, not during normal token sampling.

Power Consumption: The binary build.bin.libllama.so shows a reduction of 62 nJ (-0.033%), indicating negligible energy impact. All other binaries show no measurable change.

Affected Functions: The performance variations observed are in STL accessors (std::vector::cbegin, std::vector::end) and grammar validation functions. These show changes ranging from -135 ns to +133 ns in throughput, but occur outside the inference critical path. The added function call in the trigger pattern matching path adds approximately 10-25 ns per trigger match, which is infrequent.

Tokens Per Second: No impact expected. The change does not modify llama_decode, llama_encode, or tokenization functions. Grammar processing occurs in the sampling layer after logit generation, and only when grammar constraints are active with trigger patterns.
